Real-world scenes are incredibly complex and heterogeneous, yet we are able to identify and categorize them effortlessly. In humans, the ventral temporal Parahippocampal Place Area (PPA) has been implicated in scene processing, but scene information is contained in many visual areas, leaving their specific contributions unclear. While early theories of PPA emphasized its role in spatial processing, more recent reports of its function have emphasized semantic or contextual processing. Here, using functional imaging, we reconstructed the organization of scene representations across human ventral visual cortex by analyzing the distributed response to 96 diverse real-world scenes. We found that while individual scenes could be decoded in both PPA and early visual cortex (EVC), the structure of representations in these regions was vastly different. In both regions spatial rather than semantic factors defined the structure of representations. However, in PPA, representations were defined primarily by the spatial factor of expanse (open, closed) and in EVC primarily by distance (near, far). Further, independent behavioral ratings of expanse and distance correlated strongly with representations in PPA and pEVC, respectively. In neither region was content (manmade, natural) a major contributor to the overall organization. Further, the response of PPA could not be used to decode the high-level semantic category of scenes even when spatial factors were held constant, nor could category be decoded across different distances. These findings demonstrate, contrary to recent reports, that the response of PPA primarily reflects spatial, not categorical or contextual, aspects of real-world scenes.
Despite the complexity and heterogeneity of scenes, scene processing produces neural representations capable of supporting a variety of tasks including navigation, object identification, extraction of semantic information, and guidance of visual attention. Although much of visual cortex clearly contributes to scene processing, research has often focused on the Parahippocampal Place Area (PPA), which responds more strongly when people view scenes or buildings than individual objects or faces (Aguirre et al., 1998; Epstein and Kanwisher, 1998; Levy et al., 2001). While such scene selectivity suggests a specialized role in scene processing, the precise information extracted by PPA and the nature of the underlying neural representations remain unclear.
Some theories of PPA function suggest that it is primarily involved in encoding the spatial layout of scenes (Maguire et al., 1996; Epstein and Kanwisher, 1998; Park et al., 2011) and the retrieval of familiar scenes (Rosenbaum et al., 2004; Epstein and Higgins, 2007; Hayes et al., 2007). Consistent with these theories, there are anatomical projections from parietal into parahippocampal cortex (Kravitz et al., 2011) and anterograde amnesia for scene layouts has been reported following damage to regions encompassing PPA (Aguirre and D'Esposito, 1999; Barrash et al., 2000). However, more recent reports have proposed that PPA maintains representations of the contextual associations of individual objects rather than scenes, per se (Bar, 2004; Bar et al., 2008; Gronau et al., 2008; but see Epstein and Ward, 2009). Finally, it has been proposed that PPA is responsible for natural scene categorization, distinguishing among high-level conceptual categories of scenes (e.g. beaches, buildings) (Walther et al., 2009). Critically, however, other regions such as EVC and object-selective cortex evidenced equivalent categorization of scenes, making it difficult to determine the unique contribution of PPA.
The aim of the current study was to investigate, in a data-driven manner, the structure of scene representations across human ventral visual cortex using the distributed response patterns. We took advantage of the power of ungrouped event-related designs (Kriegeskorte et al., 2006; Kriegeskorte et al., 2008b; Kravitz et al., 2010) to test a broad array of scenes from different categories, evenly divided between manmade and natural scenes (Oliva and Torralba, 2001; Joubert et al., 2007). Critically, we further controlled and evaluated the contribution of spatial information, by choosing scenes to equally span differences in the expanse (open, closed) and relative distance (near, far) (Figure 1) (Oliva and Torralba, 2001; Torralba and Oliva, 2003; Loschky and Larson, 2008; Greene and Oliva, 2009b). Consistent with prior reports (Kay et al., 2008; Walther et al., 2009), the identity of individual scenes could be decoded in both EVC and PPA. However, PPA primarily grouped scenes based on their expanse, while grouping in EVC was generally weaker and based on relative distance. Further, the observed grouping in PPA and EVC correlated strongly with behavioral judgements of expanse and relative distance, respectively. Contrary to reports of contextual and category effects in PPA, there was no grouping by content, nor any ability to decode scene category either within or across spatial factors. Taken together, these findings indicate that representations in PPA primarily reflect spatial and not category information.
Ten participants (6 female), ages 21–35, participated in the fMRI experiment. For one participant there was insufficient time to collect the localizer for EVC. Six participants aged 21–28 participated in the independent behavioral experiment. All participants had normal or corrected-to-normal vision and gave written informed consent. The consent and protocol were approved by the National Institutes of Health Institutional Review Board.
During the 6 event-related runs of the fMRI experiment, participants were presented with 96 highly detailed and diverse real-world scenes (1024×768 pixels, 20×15°) in a randomized order for 500ms each. Interstimulus intervals (4–12s) were chosen to optimize the ability of the subsequent deconvolution to extract responses to each scene using the optseq function from AFNI/Freesurfer.
To ensure fixation, participants performed a shape-judgement task on the central fixation cross. Specifically, simultaneous with the presentation of each scene, one arm of the fixation cross grew slightly longer and participants indicated which arm grew via a button press. Which arm grew was counterbalanced across scenes between runs, such that both arms grew equally often with each scene. We used this task, which was orthogonal to the scenes, to measure the structure of scene representations without introducing any confounds or feedback effects caused by task.
The scenes were selected to span the stimulus domain as broadly as possible. Scenes were constrained to represent naturalistic (eye-level) views. The scenes were taken from 16 categories (6 exemplars each), divided evenly by Content (manmade, natural) (Oliva and Torralba, 2001; Joubert et al., 2007). To test for the relative importance of spatial information, scenes within these categories were chosen to equally span two spatial dichotomies thought to be important for scene perception: Expanse (open, closed -- the spatial boundary of the scene) and Relative Distance (near, far -- distance to the nearest foreground objects) (Ross and Oliva; Oliva and Torralba, 2001; Torralba and Oliva, 2003; Loschky and Larson, 2008; Greene and Oliva, 2009b) (Figure 1, See Supplemental Item 1 for full stimulus set). Scenes were identified as belonging to a particular level of a dichotomy (e.g. open, closed), based on agreement amongst the authors. In the case of open and closed scenes, which differed in their spatial boundaries, and content, which differed in their constituent objects, the differences were quite clear. Relative Distance was defined within each category, and thus exemplars differed considerably in vergence cues and the amount of space depicted, making attributions to either near or far simple. As each of the 16 categories had both near and far exemplars, each scene reflected one of eight possible classifications (e.g. manmade/closed/near top left 2 images in Figure 1). Note that all scenes differed from one another at an individual level in their spatial layout.
Four independent block-design scans were also collected in each participant to localize scene-, object-, and face-selective regions of interest (ROIs) as well as EVC ROIs. Each of these scans was an on/off design with alternating blocks of stimuli presented while participants either performed a one-back task (for object, face, scene localizers), or simply maintained fixation (EVC). Scene-selective cortex was localized with the contrast of scenes versus faces, object-selective cortex with the contrast of objects versus retinotopically matched scrambled objects (Kravitz et al., 2010), and face-selective cortex with the contrast of faces versus objects. Scene, object, and face images were grayscale photographs. Peripheral and central EVC were localized with the contrast of central (5°) and peripheral (6–15°) flickering (8 Hz) checkerboards.
Participants were scanned on a research dedicated GE 3-Tesla Signa scanner located in the Clinical Research Center on the National Institutes of Health campus in Bethesda. Partial volumes of the temporal and occipital cortices were acquired using an 8-channel head coil (22 slices, 2×2×2 mm, 0.2 mm inter-slice gap, TR = 2s, TE = 30ms, matrix size = 96×96, FOV = 192mm). In all scans, oblique slices were oriented approximately parallel to the base of the temporal lobe and generally covered the temporal lobe from its most inferior extent to the superior temporal sulcus and extended posteriorly through all of early visual cortex. Six event-related runs (263 TRs) and eight localizer scans (144 TRs) were acquired in each session.
Data were analyzed using the AFNI software package (http://afni.nimh.nih.gov/afni). Prior to statistical analysis, all of the images for each participant were motion-corrected to the first image of their first run after removal of the first and last 8 TRs from each run. Following motion correction, the localizer runs (but not the event-related runs) were smoothed with a 3 mm FWHM Gaussian kernel.
ROIs were created for each participant from the localizer runs. Significance maps of the brain were computed by performing a correlation analysis thresholded at a p-value of 0.0001 (uncorrected). ROIs were generated from these maps by taking the contiguous clusters of voxels that exceeded threshold and occupied the appropriate anatomical location based on previous studies (Sayres and Grill-Spector, 2008; Schwarzlose et al., 2008). In order to ensure that all ROIs were mutually exclusive, we employed the following precedence rules to remove overlapping voxels. First, if a voxel showed any position selectivity (center vs. periphery) it was deemed retinotopic and excluded from all the category-selective ROIs. Category-selectivity is, by necessity, always established by the contrast of two retinotopically distinct categories, and the demonstration that a voxel shows any position effects suggests that its selectivity is due to simple retinotopy. Second, any voxel that showed selectivity to faces or scenes, but did not differentially respond to central or peripheral checkerboards, was deemed selective for those categories. Third, any voxel which showed a stronger response to objects than scrambled objects, but did not respond differentially to the checkerboards and did not respond more to faces or scenes than objects, was included in the object-selective ROIs.
Further, all of the analyses presented below were also performed with all overlapping voxels removed from every ROI and no significant changes in the results reported below occurred. Finally, we also performed all of the analyses of PPA and pEVC with matching voxel sizes by randomly subsampling pEVC and found no qualitative differences in any of the reported results.
We conducted a standard general linear model using the AFNI software package to deconvolve the event-related responses. Our experiment combined a sparse event-related design with multi-voxel pattern analysis, allowing us to assess the response to each individual stimulus and not average across a priori categories of stimuli (ungrouped design). Response patterns in the event-related runs were created by performing t-tests between each condition and baseline. The t-values for each condition were then extracted from the voxels within each ROI and we then used an iterative variant (MacEvoy and Epstein, 2007; Chan et al., 2010; Kravitz et al., 2010) of split-half correlation analysis (Haxby et al., 2001; Williams et al., 2008) to establish the similarity between the response patterns of each pair of scenes, once the mean signal was independently removed from each half of the data. This yielded similarity matrices that represent the similarity in the spatial pattern of response across the ROI between each pair of conditions. T-values were used as they reduced the impact of noisy voxels on the patterns of response (Misaki et al., 2010); nearly equivalent results were obtained using the coefficients. Also, in order to rule out baseline activity differences as the source of any observed effects, all analyses were performed with and without the mean activity removed. With the mean activity removed, correlation will reflect only the difference in the pattern of activity across an ROI and not any simple activity differences between a pair of conditions. The main effect of the removal of the mean activity was a normalization of the data leading to an increase in the structure of resulting similarity matrices and reduction in the overall level of correlation. However, there were no qualitative or significant effects on any of the grouping or discrimination results.
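The split-half correlation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that "mean signal" refers to each voxel's mean across conditions within a half, it shows a single split rather than the iterative variant, and the function names (`pearson`, `remove_mean_pattern`, `split_half_similarity`) and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

def remove_mean_pattern(patterns):
    """Subtract each voxel's across-condition mean (assumed meaning of 'mean signal')."""
    n_cond, n_vox = len(patterns), len(patterns[0])
    mean = [sum(p[v] for p in patterns) / n_cond for v in range(n_vox)]
    return [[p[v] - mean[v] for v in range(n_vox)] for p in patterns]

def split_half_similarity(half_a, half_b):
    """S[i][j] = correlation of condition i in half A with condition j in half B.

    Each half is a list of per-condition response patterns (lists of t-values,
    one entry per voxel). The mean is removed independently in each half.
    """
    a = remove_mean_pattern(half_a)
    b = remove_mean_pattern(half_b)
    return [[pearson(pa, pb) for pb in b] for pa in a]

# Toy data: 3 conditions x 4 voxels per half (values are made up).
toy_a = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [2.0, 2.0, 5.0, 1.0]]
toy_b = [[1.2, 1.9, 3.1, 3.8], [3.9, 3.2, 2.1, 0.8], [2.1, 1.8, 5.2, 1.1]]
toy_similarity = split_half_similarity(toy_a, toy_b)
```

In the iterative variant cited in the text, the matrices from many different splits of the runs would presumably be averaged; that loop is omitted here.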
All analyses were also repeated after applying a Fisher transformation to the correlation values. No qualitative or significant effects on any of the results were observed, which is unsurprising given that none of the correlations approached either 1 or -1 and correlations near zero are approximately normally distributed.
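For reference, the standard form of the Fisher transformation and its series expansion make this point explicit: for small correlations the transformed value is close to the correlation itself, so the transformation changes little.

```latex
z = \operatorname{artanh}(r)
  = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right)
  = r + \frac{r^{3}}{3} + \frac{r^{5}}{5} + \cdots , \qquad |r| < 1
```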
To investigate the distribution of scene information throughout the whole volume we performed a novel selectivity analysis. Typical information-based mapping utilizes a searchlight, which determines what information is available in the response of a local cluster of voxels. While useful, this approach is forced to assume that information is present only in these local clusters, constrains the sort of information being searched for, and introduces non-independence between adjacent voxels. Our analysis avoids these problems and simply evaluates whether each individual voxel shows any consistent selectivity amongst our set of 96 stimuli across independent halves of the data.
To determine whether a particular voxel exhibits consistent selectivity amongst our set of stimuli, we smoothed the event-related data to 3mm to match our block-design localizers and divided the data into two independent halves, using the same iterative procedure we used for the similarity analysis. We then correlated the relative levels of activation to each of the 96 scenes across the two halves of the data. If a particular voxel is responsive to, but not selective amongst our set of scenes it will produce two sets of responses in the two halves of data that may have the same distribution (i.e. mean, standard deviation) but there will be no correlation between the rank ordering of the responses. Alternatively, a voxel that is both responsive and selective will produce a correlated pattern of selectivity between the two halves of the data. The correlation value assigned to each voxel therefore indicated its consistency of selectivity across our stimuli. These values were then averaged across all the voxels within a region within each participant.
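A minimal sketch of this voxel-wise consistency measure is given below. It assumes a Pearson correlation across the two halves (the text does not specify Pearson versus a rank correlation), and the function names and toy responses are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

def voxel_selectivity(responses_a, responses_b):
    """Consistency of one voxel's responses to the same stimuli across two halves."""
    return pearson(responses_a, responses_b)

def region_selectivity(voxels_a, voxels_b):
    """Average the per-voxel consistency over all voxels in a region."""
    values = [voxel_selectivity(a, b) for a, b in zip(voxels_a, voxels_b)]
    return sum(values) / len(values)

# Toy example with 5 stimuli (the study used 96).
half_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
consistent_half_2 = [1.1, 2.2, 2.9, 4.1, 5.0]   # same ordering: selective voxel
shuffled_half_2 = [3.0, 5.0, 1.0, 4.0, 2.0]     # same distribution, different ordering
selective_r = voxel_selectivity(half_1, consistent_half_2)
nonselective_r = voxel_selectivity(half_1, shuffled_half_2)
```

The second toy voxel has the same mean and standard deviation in both halves but no positive correlation, which is the responsive-but-not-selective case described in the text.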
To establish whether a cluster of voxels showed significant selectivity, we used a cluster threshold based on the following randomization procedure. First, we took the data from the independent halves of the data in each participant and then randomized the condition labels and correlated the selectivity. Importantly, the randomization was the same for every voxel, maintaining any non-stimulus specific relationships between voxels. We then searched the entire volume for the largest contiguous cluster of voxels with correlation values greater than r=0.168 (p<0.05). We repeated this procedure 10000 times for each participant and derived the minimum cluster size that occurred in less than 5% of the iterations. This cluster size served as a participant specific threshold for determining which clusters of voxels (r>0.168, p<0.05) were significant. The average threshold for cluster size was ~12.
Twelve new participants completed three sessions of 576 trials during which they judged which of a pair of scenes was either more open (Expanse), more natural (Content), or more distant (Distance). Importantly, no specific instructions were given to the participants about what defined each of the dimensions; they were left free to rate stimuli based on their intuitions about the labels given. Ideally we would have directly measured Relative Distance within each category of stimuli, but that would have required informing participants of the categories and/or limiting the trials to only comparisons within a category, both of which would have introduced task confounds into our measure of Distance.
On each trial participants were sequentially presented with two scenes from our set of 96 for 500ms each with a 1s blank screen between. Participants indicated their chosen scene via a button press. The order of these sessions (Expanse, Content, Distance) was counterbalanced across participants. Further, the trials were chosen such that no trial was ever repeated across participants, so that as many of the comparisons as possible were made.
As there were not enough trials available to probe every single possible comparison (4560) within a single participant, trials were concatenated across participants. To determine a ranking across our stimulus set for Expanse, Distance, and Content, Elo ratings (Elo, 1978) were derived in the following manner. Each scene was given an initial Elo rating of 1000. Each trial was treated as a match between the two scenes, and the loser's and winner's ratings were adjusted according to the standard Elo formula (Meng, M. Functional lateralization of face processing. Vision Sciences Society. 2010). The final ratings for each scene reflect their relative ranking along the dimension of interest. As the order of matches impacts the final Elo ratings, 10000 iterations of this procedure with different random trial orders were averaged together.
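The rating procedure can be sketched as follows, using the standard Elo update with a logistic expected score. The initial rating of 1000 and the averaging over random trial orders come from the text; the K-factor of 32, the 400-point scale, and all function names are assumptions, since the text does not report them.

```python
import random

def elo_ratings(trials, n_items, k=32.0, start=1000.0):
    """Run one pass of Elo updates over (winner, loser) index pairs.

    k=32 and the 400-point logistic scale are conventional defaults,
    assumed here rather than taken from the study.
    """
    ratings = [start] * n_items
    for winner, loser in trials:
        # Probability the eventual winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings

def averaged_elo(trials, n_items, n_orders=1000, seed=0, **kwargs):
    """Average Elo ratings over random trial orders to remove order dependence."""
    rng = random.Random(seed)
    order = list(trials)
    totals = [0.0] * n_items
    for _ in range(n_orders):
        rng.shuffle(order)
        for i, r in enumerate(elo_ratings(order, n_items, **kwargs)):
            totals[i] += r
    return [t / n_orders for t in totals]

# Toy example: scene 0 is chosen over scenes 1 and 2, scene 1 over scene 2.
toy_trials = [(0, 1), (0, 2), (1, 2)]
toy_ratings = averaged_elo(toy_trials, n_items=3, n_orders=200)
```

Because each update adds to the winner exactly what it subtracts from the loser, the ratings always sum to the number of items times the starting value.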
The purpose of this study was to perform a data-driven investigation of scene representations across the ventral visual cortex. We presented 96 highly detailed and diverse scenes chosen to broadly cover the stimulus domain. The scenes were balanced in such a way as to allow us to evaluate the relative contributions of non-spatial factors, like Content (manmade, natural) and high-level category (e.g. beaches, highways), and spatial factors, like Expanse (open, closed) and Relative Distance (near, far), to scene representations. None of these factors had any preferential status within any of the subsequent analyses, and there was no bias in our design for any, all, or none of these factors or categories to emerge.
In our first test of scene representations we independently localized scene-, object-, and face-selective regions as well as retinotopic early visual cortex (EVC) in both hemispheres. Given the limited acquisition volume possible at our high resolution (2×2×2mm), our scene-selective regions included both transverse occipital sulcus (TOS) (Epstein et al., 2007) and PPA, but not retrosplenial cortex (RSC) (Epstein and Higgins, 2007). We divided EVC into peripheral and central early visual cortex (pEVC, cEVC, respectively), given evidence for a peripheral bias in PPA (Levy et al., 2001; Hasson et al., 2002) and for the differential involvement of central and peripheral space in scene perception (Larson and Loschky, 2009). We will focus initially on comparing and contrasting PPA and pEVC, the regions that showed the strongest discrimination and most structured representations.
Within each region, we extracted the pattern of response across voxels to each of the 96 scenes. We then cross-correlated these response patterns to establish the similarity between the response patterns of each pair of scenes. This analysis yielded a 96×96 similarity matrix for each region (Figure 2a,b) wherein each point represents the correlation or similarity between a pair of scenes (Kriegeskorte et al., 2008a; Drucker and Aguirre, 2009b). These matrices can be decomposed into two components. First, the points along the main diagonal, from the upper left to lower right corner of the matrix, represent the consistency of the response patterns for the same scene across the two halves of the data (within-scene correlations). Second, the points off the diagonal are the correlations between pairs of different scenes (between-scene correlations). These two components can be used to provide information about both categorization and discrimination of scenes. Specifically, the between-scene correlations define how a region groups scenes together (categorization). In contrast, significantly greater within- than between-scene correlations indicate that the region can distinguish individual scenes from one another (discrimination).
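The decomposition into within- and between-scene correlations can be illustrated with a short sketch. It computes only the descriptive difference between the two components; the significance testing across participants is not shown, and the function names and toy matrix are hypothetical.

```python
def within_between(similarity):
    """Return (mean diagonal, mean off-diagonal) of a square similarity matrix."""
    n = len(similarity)
    within = sum(similarity[i][i] for i in range(n)) / n
    between = sum(
        similarity[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return within, between

def discrimination_index(similarity):
    """Within-scene minus between-scene correlation; positive values suggest
    that individual scenes can be told apart from their response patterns."""
    within, between = within_between(similarity)
    return within - between

# Toy 3x3 split-half matrix (rows: half A, columns: half B; values are made up).
toy_matrix = [
    [0.5, 0.1, 0.0],
    [0.2, 0.6, 0.1],
    [0.0, 0.1, 0.4],
]
toy_within, toy_between = within_between(toy_matrix)
toy_index = discrimination_index(toy_matrix)
```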
Given prior results on categorization in PPA, we first ordered the raw similarity matrices by scene category and divided scenes by Content into manmade and natural. For PPA and pEVC (Figure 2a,b), it is clear that the patterns of response contain rich information about the presented scenes. In both regions the within-scene correlations (diagonal) are on average stronger than the between-scene correlations (off-diagonal), indicating an ability to discriminate scenes. This effect is particularly prominent in pEVC (Figure 2b). However, there is very little structure to the between-scene correlations in pEVC and only mild grouping evident in PPA. Further, neither region shows any consistent grouping of manmade and natural scenes. To better visualize this structure we averaged the between-scene correlations by high-level category (Figure 2c,d). In these matrices the points along the main diagonal reflect the coherence of a scene category. Even within these average matrices there is only weak evidence for coherent scene categories in PPA (Figure 2c, e.g. high within-category correlations for Living Rooms and Ice Caves) and no obvious coherent categories in pEVC (Figure 2d). Further, even amongst the most coherent categories in PPA there are between category correlations that violate differences in Content. For example, Living Rooms and Ice Caves are well correlated despite vast differences in Content and low-level stimulus properties (e.g. color, spatial frequency, luminance, etc).
To better visualize the structure of scene representations in both regions, without assuming the importance of scene categories, we used multi-dimensional scaling (MDS) (Kriegeskorte et al., 2008a). Each scene was positioned on a two-dimensional plane, where the distance between any pair of scenes reflects the correlation between their response patterns (the higher the correlation the closer the distance) (Figure 3a,b). This visualization reveals a very striking structure not captured by scene categories in either PPA or pEVC. In PPA there is clear grouping by Expanse with open scenes to the right and closed scenes to the left. In pEVC grouping was weaker, but defined by Relative Distance. We verified the strength of these differential groupings between the two regions by reordering the raw similarity matrices (Figure 2) by these dichotomies (Figure 3c,d) rather than high-level category. Note that in some cases, the difference in the structure of scene representations between PPA and pEVC caused large shifts in the pairwise similarity of individual scenes. For example, a church image and a canyon image were similarly categorized by PPA (Figure 3a yellow boxes), reflecting their enclosed structure, whereas in pEVC they were categorized as dissimilar (Figure 3b yellow boxes), as they had different Relative Distances. In the following section we quantify these differences in representational structure between the regions.
We directly quantified the relative contributions of Expanse, Relative Distance, and Content by averaging the between-scene correlations (off-diagonal) across the 8 different combinations of the three dichotomies (Figure 4a,b). We then averaged each row of these matrices according to the correlation within and between the various levels of Expanse, Relative Distance, and Content (Figure 4c,d). The resulting correlations were then entered into a four-way repeated measures ANOVA with Expanse (same, different), Relative Distance (same, different), Content (same, different), and Region (PPA, pEVC) as factors.
Grouping was weaker in pEVC than PPA (see also discrimination analysis below) with lower between-scene correlations, resulting in a significant main effect of Region (F1,8 = 19.269, p < 0.01). Further, the contributions of Relative Distance and Expanse were different in the two regions, resulting in highly significant interactions between Region × Expanse (F1,8 = 33.709, p < 0.001) and Region × Relative Distance (F1,8 = 24.361, p < 0.01). Notably, Content was not a major contributor to grouping in either region and no main effects or interactions involving Content (all p>0.16) were observed.
To investigate the differential grouping in the two regions further, data from each region were entered independently in two repeated measures ANOVAs. In pEVC, Relative Distance was the only significant factor producing grouping (F1,8 = 30.554, p < 0.001), and no main effect of Expanse or Content (p>0.12) was observed. In contrast, in PPA Expanse was the primary factor producing grouping (F1,9 = 44.419, p < 0.001), though there was a smaller effect of Relative Distance (F1,9 = 18.152, p < 0.01). No interactions between Expanse × Relative Distance were found either within or across the ROIs (p > .2). Again, Content played no role in grouping, with no main effects or interactions involving Content (p>0.15). Further, even when the matrices were averaged by the semantic categories (e.g. beaches, mountains) used in previous studies (e.g. (Walther et al., 2009)), Expanse remained the dominant factor producing grouping (Supplemental Item 2; also see below).
Thus, neither PPA nor pEVC show effects of scene category or Content. Instead both regions group scenes by their spatial aspects, with pEVC showing grouping by Relative Distance, and PPA grouping primarily by Expanse. While the weaker categorization by Relative Distance in PPA may suggest that some aspects of scene categorization are inherited from pEVC, the absence of an effect of Expanse in pEVC implies that the structure of scene representations is transformed between pEVC and PPA.
Multivariate designs, by virtue of their large number of conditions, produce data that can be directly correlated with behavior at an individual item level (Kriegeskorte et al., 2008a; Drucker and Aguirre, 2009a). To assess whether the representational structure we observed in PPA and pEVC was reflected in behavior, we next directly tested whether the structure of scene representations we measured in PPA and pEVC agreed, at the level of individual scenes, with subjective behavioral ratings from a new set of six participants. The task and instructions used in collecting behavioral judgments will inevitably constrain the resulting data. Therefore we provided as little instruction as possible, simply asking participants to report which of a sequential pair of scenes was more open (Expanse), more natural (Content), or more distant (Distance). Participants were free to interpret these labels as they wanted. We used Elo ratings (see Methods) to derive a ranking for each individual scene for each of the three dichotomies. These rankings turn our dichotomies into dimensions, where the rating of a scene reflects its subjective openness, naturalness, or depth relative to the other 95 scenes.
First, we used the Elo ratings as independent confirmation of our dichotomies. The Content dichotomy was the most clearly reflected in the Elo ratings with 46/48 of the top ranked scenes being natural scenes. The Expanse dichotomy was similarly strong with 40/48 of the top ranked scenes being open scenes. To calculate the strength of the Relative Distance dichotomy we counted the number of times the far exemplars of a particular high-level category of scene were rated more highly than the near exemplars in that category, which was true for 40/48 scenes. To assess the reliability of the ratings we divided the 12 participants into two groups of six and calculated Elo ratings for each group separately. The ratings for all three dichotomies were highly correlated (Expanse: r = .92; Content: r = .94; Relative Distance: r = .86; all p < 0.0001) across the groups, verifying the reliability of the Elo ratings. Thus, independent ratings of the individual scenes by naïve observers reliably confirm our original classifications.
Next, we directly compared the Elo ratings with the scene representations we recovered with fMRI in PPA and pEVC. We calculated an fMRI grouping score from the average similarity matrices (Figure 3c,d) for each scene that reflected how strongly grouped that scene was within a particular dichotomy. For example, the Expanse score for a scene was calculated by subtracting its average correlation with the closed scenes from its average correlation with the open scenes. We then correlated these fMRI grouping scores with their respective Elo ratings to determine whether scene representations in each region reflected the behavioral rankings of scenes.
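As a rough illustration of the grouping score, the sketch below computes, for each scene, its mean correlation with scenes at one level of a dichotomy minus its mean correlation with scenes at the other level. Excluding the scene's correlation with itself is an assumption (the text describes the matrices as between-scene correlations), and the function name, labels, and toy values are hypothetical.

```python
def grouping_scores(similarity, is_first_level):
    """Per-scene grouping score for one dichotomy.

    similarity: square matrix of between-scene correlations.
    is_first_level: one boolean per scene, e.g. True for open and False for
    closed scenes when scoring Expanse.
    Score = mean correlation with first-level scenes
            minus mean correlation with second-level scenes,
    leaving out the scene's correlation with itself (assumed).
    """
    n = len(similarity)
    scores = []
    for i in range(n):
        first = [similarity[i][j] for j in range(n) if j != i and is_first_level[j]]
        second = [similarity[i][j] for j in range(n) if j != i and not is_first_level[j]]
        scores.append(sum(first) / len(first) - sum(second) / len(second))
    return scores

# Toy example: scenes 0 and 1 are "open", scenes 2 and 3 are "closed".
toy_similarity = [
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.5],
    [0.0, 0.1, 0.5, 1.0],
]
toy_open = [True, True, False, False]
toy_scores = grouping_scores(toy_similarity, toy_open)
```

In this toy matrix the open scenes receive positive scores and the closed scenes negative ones; these per-scene scores are what would then be correlated with the corresponding Elo ratings.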
For Expanse, we found a very strong correlation between the Elo ratings and Expanse scores in PPA (r = 0.67, p < 0.0001; Figure 5a) but not in pEVC (r = 0.08, p > 0.1; Figure 5b). This difference in correlation was significant (z = 2.16, p < 0.05), suggesting that the pattern of response in PPA more closely reflects behavioral judgments of Expanse than does the pattern in pEVC. For Content, we found no correlations in either PPA (r = 0.10, p > 0.1; Figure 5c) or pEVC (r = 0.07, p > 0.1; Figure 5d). Further, in PPA this correlation was significantly weaker than the correlation between Elo ratings and Expanse scores (z = 2.99, p < 0.05), demonstrating that there is a stronger relationship between scene representations in PPA and judgments of Expanse than judgments of Content. For Distance, we found equivalent correlations (p > 0.1) in both PPA (r = 0.54, p < 0.0001; Figure 5e) and pEVC (r = 0.31, p < 0.01), consistent with the grouping we observed in both regions. Based on our earlier analysis it might have been expected that the correlation with Distance would have been stronger in pEVC than PPA. Their equivalent correlations may reflect a weaker direct contribution of pEVC to conscious judgments about scenes than PPA.
These correlations between the structure of scene representations in fMRI and behavior suggest that the pattern of response in PPA much more strongly reflects subjective judgments about spatial aspects of scenes (Expanse, Distance), than the Content of those same scenes. In contrast, the pattern of response in pEVC reflected only judgments of the Distance of those scenes, providing converging evidence for the different scene information captured in pEVC and PPA. Further, these results show that, regardless of what visual statistics drive the responses of pEVC and PPA, the representations they contain directly reflect, and perhaps even contribute to, subjective judgements of high-level spatial aspects of complex scenes.
Our previous analyses confirmed that spatial factors have a greater impact on the structure of scene representations in PPA than non-spatial factors. To directly test whether there was any high-level category information independent of spatial factors, we next considered (i) whether scene category could be decoded when spatial factors were held constant, or whether scenes from different categories but with similar spatial properties elicit similar responses, and (ii) whether scene category could be decoded across spatial factors, or whether scenes from the same category but with different spatial properties elicit different responses. Since Expanse is largely confounded with category (e.g. all mountain scenes will be open), (ii) could only be tested across Relative Distance.
To perform these analyses we needed to consider the near and far exemplars of each of the 16 high-level categories separately (Figure 1), effectively doubling the number of categories to 32. We then averaged the off-diagonal correlation from the raw similarity matrix for PPA (Figure 3c) by scene category (Figure 6a). The points along the diagonal of this matrix represent the average correlation between exemplars of each category. The off-diagonal points represent the correlations between different scene categories or between the near and far exemplars of the same category (e.g. white ellipse (c) in Figure 6a).
To establish whether categories could be distinguished from one another when they shared both Expanse and Relative Distance, discrimination indices were calculated for each category within each combination of the spatial factors (Figure 6b). These discrimination indices were defined as the difference between the correlation of a category with itself (e.g. white ellipse (b) in Figure 6a) and the average correlation between that category and the other categories that shared Expanse and Relative Distance. These indices were entered into a one-way ANOVA with scene category (32) as a factor. No main effect of scene category was observed (p > 0.15), nor was there significant discrimination across the scene categories on average (p > 0.15), nor did any individual category evidence significant discrimination with a Bonferroni correction for multiple comparisons (p > 0.3). To apply the most liberal test for category information possible we conducted one-tailed t-tests for each scene category. We found only a single category (near cities) that evidenced any decoding (p < 0.05; uncorrected). Thus, even when spatial factors are held constant we found no strong evidence for scene category representations.
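As a rough illustration of this index, the sketch below computes, for each category, its self-correlation minus its average correlation with the other categories in the same Expanse × Relative Distance group. The category-averaged matrix, the group labels, and the function name `category_discrimination` are hypothetical, chosen only to make the arithmetic visible.

```python
def category_discrimination(cat_sim, group):
    """Discrimination index per category: the diagonal entry (category
    correlated with itself) minus the mean correlation with the other
    categories that share the same spatial-factor group label."""
    indices = []
    n = len(cat_sim)
    for i in range(n):
        peers = [cat_sim[i][j] for j in range(n)
                 if j != i and group[j] == group[i]]
        indices.append(cat_sim[i][i] - sum(peers) / len(peers))
    return indices

# Toy category-by-category matrix (hypothetical values).
cat_sim = [
    [0.30, 0.25, 0.05, 0.00],
    [0.25, 0.20, 0.02, 0.01],
    [0.05, 0.02, 0.40, 0.10],
    [0.00, 0.01, 0.10, 0.35],
]
# Group label = combination of Expanse and Relative Distance.
group = ["open_far", "open_far", "closed_near", "closed_near"]

indices = category_discrimination(cat_sim, group)
```

An index near zero, as for the first two toy categories, means a category is no more similar to itself than to spatially matched categories; the statistical tests reported in the text would then be run on such indices across participants.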
To establish whether high-level scene category could be decoded across variations in spatial factors we calculated discrimination indices for each category across the two levels of Relative Distance (Figure 6c). These discrimination indices were defined as the difference between the correlation of the near and far exemplars of a category with each other (e.g. white ellipse (c) in Figure 6a) and the average correlation between the near and far exemplars of that category and other categories. These indices were entered into a one-way ANOVA with scene category (16) as a factor. No main effect of scene category was observed (p > 0.375), nor was there significant discrimination across the scene categories on average (p > 0.15), nor did any individual category evidence significant discrimination with a Bonferroni correction for multiple comparisons (all p > 0.3). Again we applied the most liberal test for category information and conducted one-tailed t-tests for each scene category. We found only a single category (living rooms) that evidenced any decoding (p < 0.05; uncorrected).
In sum, in contrast to reports emphasizing the representation of scene category in PPA (Walther et al., 2009), we found no evidence for decoding of scene categories in PPA when spatial factors are controlled. We found no ability to decode high-level category across different levels of Relative Distance. We found no evidence for Content as a significant contributor to the overall structure of representations in PPA or pEVC. We also found no correlation between scene representations in PPA or pEVC and subjective judgements of Content, and significantly weaker behavioral correlations for Content than Expanse. While it is possible that these non-spatial factors do have some impact on scene representations in these regions, that impact is clearly minor in comparison to the spatial factors of Expanse and Relative Distance.
While the grouping of between-scene correlations provides insight into how these regions categorize scenes, the difference between within- and between-scene correlations provides an index of scene discrimination. For this analysis, it was critical that we consider only between-scene correlations that did not cross any grouping boundary. Otherwise, our discrimination measure would be implicitly confounded with grouping. Given the strong evidence for both Expanse and Relative Distance as categories we consider discrimination between scenes within the combinations of these factors separately (four white squares encompassing the main diagonal in Figure 2a,2b), collapsing across differences in Content.
Within- and between-scene correlations were extracted from each of the four combinations of Expanse and Relative Distance (Supplemental Item 3). These correlations were then averaged and subtracted from one another to yield discrimination scores (Figure 7a,b). There was a broad ability to discriminate scenes in both regions, with significant discrimination (p < 0.05) observed in every condition except for near, closed scenes in PPA. To investigate the pattern of discrimination between the two regions, discrimination scores were entered into a three-way repeated measures ANOVA with Expanse (open, closed), Relative Distance (near, far), and Region (PPA, pEVC) as factors. Discrimination was stronger in pEVC than PPA, resulting in a significant main effect of Region (F1,8 = 18.838, p < 0.01). Discrimination was also generally stronger for near than far scenes, resulting in a significant main effect of Relative Distance (F1,9 = 9.793, p < 0.05), though this effect was stronger in pEVC, resulting in a significant interaction between Region × Relative Distance (F1,8 = 8.898, p < 0.05). Separate ANOVAs within each region confirmed the larger effect of Relative Distance in pEVC (F1,8 = 15.477, p < 0.01) than in PPA (F1,9 = 5.328, p < 0.05) but revealed no additional effects (all p > 0.3). These results demonstrate that even within scenes that are grouped together, there is significant information about the individual scenes.
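The discrimination score can be sketched as follows, assuming the similarity matrix is built from two independent halves of the data, so that each diagonal entry is a within-scene correlation and each off-diagonal entry is a between-scene correlation. That split-half layout, the toy values, the cell labels, and the function name `discrimination_by_cell` are assumptions made for illustration.

```python
def discrimination_by_cell(split_sim, cell):
    """For each Expanse x Relative Distance cell, return the mean
    within-scene correlation (diagonal) minus the mean between-scene
    correlation, using only scene pairs inside that cell so that no
    pair crosses a grouping boundary."""
    scores = {}
    for label in sorted(set(cell)):
        idx = [i for i, c in enumerate(cell) if c == label]
        within = [split_sim[i][i] for i in idx]
        between = [split_sim[i][j] for i in idx for j in idx if i != j]
        scores[label] = (sum(within) / len(within)
                         - sum(between) / len(between))
    return scores

# Toy split-half matrix (hypothetical): rows are scenes in half 1,
# columns are the same scenes in half 2.
split_sim = [
    [0.50, 0.20, 0.00, 0.05],
    [0.10, 0.40, 0.02, 0.01],
    [0.00, 0.03, 0.30, 0.25],
    [0.04, 0.00, 0.15, 0.30],
]
cell = ["open_near", "open_near", "closed_far", "closed_far"]

disc = discrimination_by_cell(split_sim, cell)
```

Restricting the between-scene pairs to a single cell is what keeps the score from being confounded with grouping: a pair that straddles the open/closed boundary would lower the between-scene average for reasons unrelated to the identity of the individual scenes.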
The gross pattern of scene discrimination was very similar in both pEVC and PPA. To investigate the relationship between discriminability in the two regions in greater detail we calculated discrimination indices for each individual scene and then correlated them across pEVC and PPA (Figure 7c). The high correlation (r = .659, p < 0.001) between the discrimination indices suggests that the distinctiveness of the representation of a scene in PPA is directly related to its distinctiveness in pEVC.
Taken together, the results of the discrimination and categorization analyses suggest a transformation of scene representations between pEVC and PPA. Clearly the discriminability of scene representations in PPA reflects discriminability in pEVC. However, PPA sacrifices some scene discriminability, perhaps in order to better categorize scenes by their spatial expanse. Thus, PPA maintains less distinct representations of scenes that seem broadly organized to capture spatial aspects of scenes.
In addition to PPA and pEVC, we also investigated central EVC (cEVC), TOS, object-selective regions Lateral Occipital (LO) and Posterior Fusiform Sulcus (PFs), and the face-selective Occipital Face Area (OFA) and Fusiform Face Area (FFA).
cEVC was similar to pEVC in its pattern of discrimination (Supplemental Item 4) but showed no scene categorization. This difference in categorization between cEVC and pEVC led to a significant Relative Distance × Region interaction (F1,8 = 29.901, p < 0.01) when categorization averages were entered into a four-way ANOVA with Expanse, Relative Distance, Content, and Region (cEVC, pEVC) as factors. This suggests that pEVC contains more structured scene representations than cEVC, and highlights the likely importance of pEVC in scene processing (Levy et al., 2001; Hasson et al., 2002). However, it must be noted that cEVC represents the portion of space containing the fixation cross, on which the participants were performing the task. Though the cross was very small (~0.5°) relative to the central localizer (5°), it cannot be ruled out that this overlap impacted results in cEVC.
Scene representations in TOS had a structure similar to PPA, but were less categorical. Scene discrimination in TOS and PPA was similar (Supplemental Item 4), but categorization by Expanse was weaker. This weaker categorization led to a significant interaction between Expanse × Region (F1,9 = 11.714, p < 0.01) when categorization averages from TOS and PPA were entered into a four-way ANOVA. In TOS, as in PPA, there was a trend for weak categorization by Relative Distance (F1,9 = 4.548, p = 0.06) and no effects involving Content (all p > 0.25).
The object-selective regions (Supplemental Item 5) did not seem particularly involved in processing the scene stimuli. LO evidenced some weak discrimination of scenes and no categorization by any of the 3 dichotomies (all p > 0.1). PFs showed no scene discrimination and some categorization by Expanse, but far more weakly than that observed in PPA, resulting in a highly significant Region × Expanse interaction (F1,8 = 17.382, p < 0.01). It is likely that the short presentation times and the scenes we chose, which did not contain strong central objects, reduced the ability of object-selective cortex to extract individual objects from the scenes.
The results from the face-selective regions (Supplemental Item 6) confirmed they contribute little to scene processing (see also Selectivity Analysis below). Neither of the face-selective regions evidenced any categorization by the 3 dichotomies (all p>0.2). Neither region showed much ability to discriminate between scenes, with FFA showing significant discrimination only for far, closed scenes and OFA for far, open scenes.
Overall, at least some discrimination was possible based on the response of a number of cortical regions, although strongest discrimination was found in EVC, PPA, and TOS. In contrast, grouping was largely confined to PPA, EVC and TOS. Importantly, EVC grouped primarily by Relative Distance whereas PPA and TOS both grouped primarily by Expanse.
So far we have focused on examining scene categorization and discrimination within regions defined by their category selectivity. However, the contrast of a preferred and non-preferred stimulus class (Kanwisher et al., 1997; Epstein and Kanwisher, 1998) implies that a region might be identified as specialized for a particular stimulus class because of a difference in response between these conditions and not necessarily because the region maintains any fine-grained representation of that class. Here we took advantage of our ungrouped design and searched for regions that showed consistent selectivity amongst the set of 96 scenes. This analysis provides an alternate way to identify regions important in scene representation and allows us to investigate whether any other regions are also important.
The aim of this analysis was to identify voxels in a whole-volume search that show consistent selectivity for the set of scene images. Selectivity was defined by the response profile across all 96 scenes in a single voxel (Erickson et al., 2000). We computed the consistency of selectivity by calculating the correlation of the response profile between independent halves of the data. We then produced maps of the correlation values, deriving cluster thresholds using a randomization procedure to determine which voxels are significantly selective (see Experimental Procedures). Given the breadth of our scene stimuli, voxels which do not show at least a modicum of consistency in their selectivity are unlikely to be involved in scene processing.
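A minimal sketch of the consistency measure: for each voxel, the response profile across scenes from one half of the data is correlated with the profile from the other half. The toy data use 2 voxels and 5 scenes rather than a whole volume and 96 scenes, the function names are hypothetical, and the randomization-based cluster thresholding described in the text is not implemented here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def selectivity_consistency(half1, half2):
    """Per-voxel split-half correlation of the response profile
    across scenes. Each argument is a list of voxels, and each voxel
    is a list of responses, one per scene."""
    return [pearson(p1, p2) for p1, p2 in zip(half1, half2)]

# Toy responses (hypothetical): voxel 0 has a stable profile across
# halves, voxel 1 does not.
half1 = [[1.0, 2.0, 3.0, 4.0, 5.0],
         [0.5, 0.1, 0.4, 0.2, 0.3]]
half2 = [[1.1, 1.9, 3.2, 3.8, 5.1],
         [0.2, 0.4, 0.1, 0.5, 0.3]]

consistency = selectivity_consistency(half1, half2)
```

A voxel like the first one, whose profile replicates across halves, would contribute to a map of consistently selective voxels; one like the second would not.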
We found that the vast majority of the consistently selective voxels (~76%) lay within our pre-defined regions, indicating that these regions largely contain the core voxels involved in scene-processing in our volume (Figure 8a).
We next quantified the average selectivity within each of our predefined regions-of-interest (ROIs) (Figure 8b). As expected, significant selectivity (p<0.05) was observed only within scene-selective and EVC ROIs. In EVC there was significantly greater selectivity in pEVC than cEVC (F1,8 = 21.991, p < 0.01). To confirm there was greater selectivity in scene-selective cortex than in either object- or face-selective cortex, their selectivity scores were entered into a two-way ANOVA with Selectivity (scene, object, face) and Location (anterior, posterior) as factors. The only effect observed was a main effect of Selectivity (F2,16 = 6.769, p < 0.01; Greenhouse-Geisser corrected), owing to the greater selectivity observed in the scene-selective than in either the object- (F1,8 = 8.105, p < 0.05) or face-selective (F1,8 = 9.069, p < 0.05) ROIs.
Finally, we quantified the amount of overlap between each ROI and the significantly selective clusters derived from the whole volume search (Figure 8c). Again, significant overlap was present only between the EVC and scene-selective ROIs (p < 0.05). The advantage for pEVC over cEVC in both mean selectivity and overlap with selective voxels is in keeping with the theory that PPA has a bias for the peripheral visual field (Levy et al., 2001; Hasson et al., 2002). In combination, these two selectivity analyses suggest that our analysis of pEVC and PPA captured the majority of the scene processing voxels in the ventral visual pathway.
In sum, using a voxel-wise measure of scene selectivity, based only on responses to scenes, we found that our ROIs captured the vast majority of voxels with consistent scene selectivity. Further, selectivity was most stable in PPA, pEVC and TOS, consistent with our analyses of categorization and discrimination.
Real-world scenes are perhaps the most complex domain for which specialized cortical regions have been identified. Here, we demonstrated that while many visual areas contain information about real-world scenes, the structure of the underlying representations is vastly different. Critically, we were able to establish, without making prior assumptions, that expanse is the primary dimension reflected in PPA. Surprisingly, neither high-level scene category nor gross content (manmade, natural) seemed to play a major role in the structure of the representations. In contrast, pEVC grouped scenes by relative distance and maintained stronger discrimination of individual scenes than observed in PPA. Further, the structure of representations observed with fMRI corresponded closely with independent behavioral ratings of the scene stimuli, with high correlations in PPA for ratings of scene openness, but not content. This specific pattern of brain-behavior correlation suggests that subjective judgements of spatial but not non-spatial aspects of scenes are well captured by, and perhaps dependent on, the response of PPA. These findings provide critical insight into the nature of high-level cortical scene representations and highlight the importance of determining the structure of representations within a region beyond whether those representations are distinct enough to be decoded.
To date, the main obstacle to differentiating between competing accounts of PPA and determining the specific contributions of different visual areas to scene processing has been the complexity and heterogeneity of real-world scenes. First, typical fMRI studies contrast only a small set of pre-selected conditions or categories, presenting blocks of these conditions or averaging over event-related responses to individual exemplars. These designs are implicitly constrained to show differences only between the tested categories or conditions, potentially missing other more important differences. Second, the analysis of these studies also assumes that the response to each exemplar within a category is equivalent. While this assumption is justified in simple domains where there are minimal differences between stimuli, the heterogeneity of scenes makes it more tenuous. For example, the identity of individual scenes could be decoded even from the response of EVC (see also Kay et al., 2008). Thus, a difference between conditions might reflect bias in the study design, differences in exemplars, or differences in the homogeneity of stimuli within conditions (Thierry et al., 2007), rather than revealing a critical difference in scene representations. Finally, the paucity of conditions in standard designs also makes it difficult to establish the relative importance of different factors in scene representations in a single study (e.g. spatial vs. category differences). The strength of our approach is the ability to present a multitude of stimuli, evaluate the response to each stimulus individually, and establish the relative importance of various factors in defining the structure of representations.
Taking advantage of an ungrouped design (Supplemental Item 7), we were able to directly contrast the impact of spatial and non-spatial information on scene representations. Our results further support the theory, based on activation studies, that PPA is part of a network of regions specialized for processing the spatial layout of scenes (Epstein et al., 1999; Henderson et al., 2007; Epstein, 2008). The strong grouping of scenes by expanse (see also Park et al., 2011) and relative distance, paired with the absence of grouping by content, is inconsistent with theories suggesting the primary function of PPA is distinguishing scene categories (Walther et al., 2009) or, based on activation studies, representing non-spatial contextual associations between objects (Bar, 2004; Bar et al., 2008; Gronau et al., 2008). This is not to suggest that PPA contains no non-spatial scene information; it is possible that other methods which more directly measure the within-voxel selectivity (e.g. adaptation (Drucker and Aguirre, 2009a)) would reveal a different pattern of results. Our results simply show that the dominant factors in defining the macroscopic response of PPA are spatial. This finding is also consistent with reports of PPA activation during scene encoding (Epstein et al., 1999; Ranganath et al., 2004; Epstein and Higgins, 2007), adaptation studies showing viewpoint-specific representations in PPA (Epstein et al., 2003), and anterograde amnesia for novel scene layouts with damage to parahippocampal regions (Aguirre and D'Esposito, 1999; Barrash et al., 2000; Takahashi and Kawamura, 2002; Mendez and Cherrier, 2003).
Our findings contradict a recent study reporting categorization for “natural scene categories” (e.g. forests, mountains, industry) (Walther et al., 2009) in PPA. However, in this study there was no control for spatial factors, including relative distance and expanse. Therefore the ability to decode, for example, highways versus industry could partly reflect the different relative distances within each category or the fact that industry scenes were more likely to have a closed expanse. Similarly, the confusions of their classifier between beaches, highways and mountains could reflect their shared open expanse. This hypothesis is supported by our inability to decode category when spatial factors were held constant, or to decode category across variations in relative distance (Figure 6). Finally, the scenes in this prior study often contained prominent objects (e.g. cars), or even people, and this might explain the equivalent decoding accuracy between PPA and object- and face-selective regions, whereas we found only weak discrimination and no categorization within these areas.
In PPA, it is also possible that low-level features account for some of the observed grouping effects. In particular, there is a difference in the spatial frequency envelopes of closed and open scenes (Supplemental Item 8). Further, it is tempting to suggest that categorization by expanse might reflect the fact that the open scenes often contained sky, despite the absence of any such categorization in EVC. However, this explanation cannot account for the strong discrimination of open far scenes (which shared sky). Further, scene inversion, which should not change the effect of sky or differences in spatial frequencies, has been shown to have a strong impact on both decoding (Walther et al., 2009) and response (Epstein et al., 2006) in PPA. Nonetheless, there must be some visual statistic or combination thereof that is the basis for grouping by Expanse in PPA, as all visual representations, whether high- or low-level, must reflect some difference in the images. The key observation in this study is that the representations in PPA can properly be called spatial, as 1) they differ significantly from those observed in early visual cortex, 2) they primarily capture differences in spatial information across complex scenes, 3) their structure directly reflects independent behavioral judgements of the spatial and not non-spatial structure of the scenes, and 4) lesions of parahippocampal cortex lead to impairments in the spatial processing of scenes (Aguirre and D'Esposito, 1999).
Grouping in EVC likely reflects some low-level features present in the scenes. However, neither the pixel-wise similarity (Supplemental Item 9) from either the peripheral or central portion of the scenes nor the spatial frequency (Supplemental Item 8) across the entire image seem to individually account for the grouping of scenes by relative distance. Instead, this grouping likely reflects a complex combination of retinotopic, spatial frequency, and orientation information interacting with the structure of EVC (Kay et al., 2008).
There are two possible sources of spatial information in PPA. First, position information has been reported in PPA (Arcaro et al., 2009) (but see (MacEvoy and Epstein, 2007)) and other high-level visual areas (Schwarzlose et al., 2008; Kravitz et al., 2010), suggesting feedforward processing of spatial information. PPA might also receive spatial information from its connections with the retrosplenial cortex, posterior cingulate, and parietal cortex (Kravitz et al., 2011). Further research is needed to address this question, but ultimately, which factors contribute to the formation of a representation and its actual structure are distinct.
The push/pull relationship between discrimination and categorization observed in PPA and pEVC suggests low-level representations may be important in supporting quick discriminations of complex stimuli (Bacon-Mace et al., 2007; Greene and Oliva, 2009a), while high-level representations are specialized to support more abstract or specialized actions (e.g. navigation). Thus, discrimination of complex stimuli based on the response of EVC (e.g. Kay et al., 2008) must be interpreted with reference to the particular tasks that response is likely to support, especially given reports that the presence of stimulus information in a region is not necessarily reflected in behavior (Williams et al., 2007; Walther et al., 2009). Our results demonstrate that the critical factors that define high-level representations may not be present within or even predictable from the response of EVC. Nor can EVC be ignored, given the clear inheritance of many aspects of scene representation by PPA; rather, the response of both EVC and high-level cortex must be considered in any account of complex visual processing.
In conclusion, we have shown with a data-driven approach that spatial and not high-level category information is the dominant factor in how PPA categorizes scenes. While information about scenes was present in other visual regions, including EVC, the grouping of scenes varied enormously across regions. These results demonstrate the importance of understanding the structure of representations beyond whether individual presented items can be decoded.
This work was supported by the NIMH Intramural Research Program. Thanks to Marlene Behrmann, Assaf Harel, Alex Martin, Dale Stevens, and other members of the Laboratory of Brain and Cognition, NIMH for helpful comments and discussion.