The purpose of this study was to perform a data-driven investigation of scene representations across the ventral visual cortex. We presented 96 highly detailed and diverse scenes chosen to broadly cover the stimulus domain. The scenes were balanced so that we could evaluate the relative contributions to scene representations of non-spatial factors, such as Content (manmade, natural) and high-level category (e.g. beaches, highways), and spatial factors, such as Expanse (open, closed) and Relative Distance (near, far). None of these factors had any preferential status within any of the subsequent analyses, and there was no bias in our design for any, all, or none of these factors or categories to emerge.
Representational structure within cortical regions
In our first test of scene representations we independently localized scene-, object-, and face-selective regions as well as retinotopic early visual cortex (EVC) in both hemispheres. Given the limited acquisition volume possible at our high resolution (2×2×2 mm), our scene-selective regions included both transverse occipital sulcus (TOS) (Epstein et al., 2007) and PPA, but not retrosplenial cortex (RSC) (Epstein and Higgins, 2007). We divided EVC into peripheral and central early visual cortex (pEVC and cEVC, respectively), given evidence for a peripheral bias in PPA (Levy et al., 2001; Hasson et al., 2002) and for the differential involvement of central and peripheral space in scene perception (Larson and Loschky, 2009). We will focus initially on comparing and contrasting PPA and pEVC, the regions that showed the strongest discrimination and most structured representations.
Within each region, we extracted the pattern of response across voxels to each of the 96 scenes. We then cross-correlated these response patterns to establish the similarity between the response patterns of each pair of scenes. This analysis yielded a 96×96 similarity matrix for each region, wherein each point represents the correlation or similarity between a pair of scenes (Kriegeskorte et al., 2008a; Drucker and Aguirre, 2009b). These matrices can be decomposed into two components. First, the points along the main diagonal, from the upper left to lower right corner of the matrix, represent the consistency of the response patterns for the same scene across the two halves of the data (within-scene correlations). Second, the points off the diagonal are the correlations between pairs of different scenes (between-scene correlations). These two components can be used to provide information about both categorization and discrimination of scenes. Specifically, the between-scene correlations define how a region groups scenes together (categorization). In contrast, significantly greater within- than between-scene correlations indicate that the region can distinguish individual scenes from one another (discrimination).
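The split-half logic described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the study's pipeline: the function names, the number of voxels, and the noise model are all assumptions made for the example.

```python
import numpy as np

def split_half_similarity(half_a, half_b):
    """Correlate each scene's voxel pattern in one half of the data
    with every scene's pattern in the other half.

    half_a, half_b: arrays of shape (n_scenes, n_voxels).
    Returns an (n_scenes, n_scenes) matrix of Pearson correlations.
    """
    # z-score each scene's pattern across voxels, then average the products
    za = (half_a - half_a.mean(axis=1, keepdims=True)) / half_a.std(axis=1, keepdims=True)
    zb = (half_b - half_b.mean(axis=1, keepdims=True)) / half_b.std(axis=1, keepdims=True)
    return za @ zb.T / half_a.shape[1]

def within_between(sim):
    """Mean within-scene (diagonal) and between-scene (off-diagonal) correlation."""
    off_diagonal = ~np.eye(sim.shape[0], dtype=bool)
    return np.diag(sim).mean(), sim[off_diagonal].mean()

# Synthetic example (assumed sizes): 96 scenes, 200 voxels, shared signal plus noise
rng = np.random.default_rng(0)
n_scenes, n_voxels = 96, 200
signal = rng.normal(size=(n_scenes, n_voxels))
half_a = signal + rng.normal(size=(n_scenes, n_voxels))
half_b = signal + rng.normal(size=(n_scenes, n_voxels))

sim = split_half_similarity(half_a, half_b)
within, between = within_between(sim)
discrimination = within - between  # positive when individual scenes are distinguishable
```

In this sketch a positive `discrimination` value corresponds to within-scene correlations exceeding between-scene correlations; the grouping analyses operate only on the off-diagonal entries of `sim`.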
Similarity Matrices for PPA and pEVC
Given prior results on categorization in PPA, we first ordered the raw similarity matrices by scene category and divided scenes by Content into manmade and natural. For PPA and pEVC, it is clear that the patterns of response contain rich information about the presented scenes. In both regions the within-scene correlations (diagonal) are on average stronger than the between-scene correlations (off-diagonal), indicating an ability to discriminate scenes. This effect is particularly prominent in pEVC. However, there is very little structure to the between-scene correlations in pEVC and only mild grouping evident in PPA. Further, neither region shows any consistent grouping of manmade and natural scenes. To better visualize this structure we averaged the between-scene correlations by high-level category. In these matrices the points along the main diagonal reflect the coherence of a scene category. Even within these average matrices there is only weak evidence for coherent scene categories in PPA (e.g. high within-category correlations for Living Rooms and Ice Caves) and no obvious coherent categories in pEVC. Further, even amongst the most coherent categories in PPA there are between-category correlations that violate differences in Content. For example, Living Rooms and Ice Caves are well correlated despite vast differences in Content and low-level stimulus properties (e.g. color, spatial frequency, luminance).
To better visualize the structure of scene representations in both regions, without assuming the importance of scene categories, we used multi-dimensional scaling (MDS) (Kriegeskorte et al., 2008a). Each scene was positioned on a two-dimensional plane, where the distance between any pair of scenes reflects the correlation between their response patterns (the higher the correlation, the smaller the distance). This visualization reveals a striking structure not captured by scene categories in either PPA or pEVC. In PPA there is clear grouping by Expanse, with open scenes to the right and closed scenes to the left. In pEVC grouping was weaker, but defined by Relative Distance. We verified the strength of these differential groupings between the two regions by reordering the raw similarity matrices by these dichotomies rather than by high-level category. Note that in some cases, the difference in the structure of scene representations between PPA and pEVC caused large shifts in the pairwise similarity of individual scenes. For example, a church image and a canyon image were similarly categorized by PPA (yellow boxes), reflecting their enclosed structure, whereas in pEVC they were categorized as dissimilar (yellow boxes), as they had different Relative Distances. In the following section we quantify these differences in representational structure between the regions.
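As an illustration of how correlations can be turned into a two-dimensional layout, the sketch below uses classical (Torgerson) MDS on the dissimilarity 1 − r. The text does not state which MDS variant was used, so the algorithm, the symmetrization step, and the function name `classical_mds` are assumptions for the example only.

```python
import numpy as np

def classical_mds(sim, n_dims=2):
    """Embed items in n_dims dimensions from a similarity (correlation) matrix.

    Dissimilarity is taken as 1 - correlation, so more highly correlated
    scenes end up closer together. Classical MDS is an assumed choice here.
    """
    sim = (sim + sim.T) / 2.0            # split-half matrices need not be symmetric
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)          # an item is at zero distance from itself
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]          # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy check: distances between known 2-D points should be recovered
rng = np.random.default_rng(1)
points = rng.normal(size=(10, 2))
true_dist = np.linalg.norm(points[:, None] - points[None], axis=-1)
coords = classical_mds(1.0 - true_dist)
recovered = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
```

With real correlation matrices the dissimilarities are generally not exactly Euclidean, so a two-dimensional layout is only an approximation of the full similarity structure.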
Comparison of the Representational Structure in PPA and pEVC
We directly quantified the relative contributions of Expanse, Relative Distance, and Content by averaging the between-scene correlations (off-diagonal) across the 8 different combinations of the three dichotomies. We then averaged each row of these matrices according to the correlation within and between the various levels of Expanse, Relative Distance, and Content. The resulting correlations were then entered into a four-way repeated measures ANOVA with Expanse (same, different), Relative Distance (same, different), Content (same, different), and Region (PPA, pEVC) as factors.
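The same/different averaging can be sketched as follows. For brevity this example averages pairwise correlations directly into the eight same/different cells, collapsing the two averaging steps described above into one, so cell means may be weighted differently than in the original analysis. The function name `grouping_cells` and the toy matrix are hypothetical.

```python
import numpy as np
from itertools import product

def grouping_cells(sim, labels):
    """Average between-scene correlations by whether each pair of scenes
    shares a level of each dichotomy.

    sim: (n_scenes, n_scenes) similarity matrix.
    labels: dict mapping factor name -> array of 0/1 labels per scene.
    Returns a dict keyed by a tuple of 'same'/'different' (one per factor,
    in the order of `labels`) with the mean off-diagonal correlation.
    """
    n = sim.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    same = {name: np.equal.outer(v, v) for name, v in labels.items()}
    cells = {}
    for combo in product([True, False], repeat=len(labels)):
        mask = off_diagonal.copy()
        for name, is_same in zip(labels, combo):
            mask &= (same[name] == is_same)
        key = tuple("same" if s else "different" for s in combo)
        cells[key] = sim[mask].mean() if mask.any() else np.nan
    return cells

# Toy example: 16 scenes, two per combination of the three dichotomies,
# with correlations that depend only on Expanse
combos = np.repeat(np.array(list(product([0, 1], repeat=3))), 2, axis=0)
labels = {"expanse": combos[:, 0], "distance": combos[:, 1], "content": combos[:, 2]}
toy_sim = np.where(np.equal.outer(labels["expanse"], labels["expanse"]), 0.5, 0.1)
cells = grouping_cells(toy_sim, labels)
```

In the toy matrix every cell with Expanse "same" averages 0.5 and every cell with Expanse "different" averages 0.1, which is the pattern a pure Expanse grouping would produce; per-participant cell means of this kind are what would enter the ANOVA.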
Grouping was weaker in pEVC than PPA (see also discrimination analysis below) with lower between-scene correlations, resulting in a significant main effect of Region (F1,8 = 19.269, p < 0.01). Further, the contributions of Relative Distance and Expanse were different in the two regions, resulting in highly significant interactions between Region × Expanse (F1,8 = 33.709, p < 0.001) and Region × Relative Distance (F1,8 = 24.361, p < 0.01). Notably, Content was not a major contributor to grouping in either region and no main effects or interactions involving Content (all p>0.16) were observed.
To investigate the differential grouping in the two regions further, data from each region were entered independently in two repeated measures ANOVAs. In pEVC, Relative Distance was the only significant factor producing grouping (F1,8 = 30.554, p < 0.001), and no main effect of Expanse or Content (p>0.12) was observed. In contrast, in PPA Expanse was the primary factor producing grouping (F1,9 = 44.419, p < 0.001), though there was a smaller effect of Relative Distance (F1,9 = 18.152, p < 0.01). No interactions between Expanse × Relative Distance were found either within or across the ROIs (p > .2). Again, Content played no role in grouping, with no main effects or interactions involving Content (p>0.15). Further, even when the matrices were averaged by the semantic categories (e.g. beaches, mountains) used in previous studies (e.g. Walther et al., 2009), Expanse remained the dominant factor producing grouping (Supplemental Item 2; also see below).
Thus, neither PPA nor pEVC shows effects of scene category or Content. Instead, both regions group scenes by their spatial aspects, with pEVC showing grouping by Relative Distance, and PPA grouping primarily by Expanse. While the weaker categorization by Relative Distance in PPA may suggest that some aspects of scene categorization are inherited from pEVC, the absence of an effect of Expanse in pEVC implies that the structure of scene representations is transformed between pEVC and PPA.
Comparison of Behavior and Scene Representations in PPA and pEVC
Multivariate designs, by virtue of their large number of conditions, produce data that can be directly correlated with behavior at an individual item level (Kriegeskorte et al., 2008a; Drucker and Aguirre, 2009a). To assess whether the representational structure we observed in PPA and pEVC was reflected in behavior, we tested whether it agreed, at the level of individual scenes, with subjective behavioral ratings from a new set of six participants. The task and instructions used in collecting behavioral judgments will inevitably constrain the resulting data. Therefore we provided as little instruction as possible, simply asking participants to report which of a sequential pair of scenes was more open (Expanse), more natural (Content), or more distant (Distance). Participants were free to interpret these labels as they wanted. We used Elo ratings (see Methods) to derive a ranking for each individual scene for each of the three dichotomies. These rankings turn our dichotomies into dimensions, where the rating of a scene reflects its subjective openness, naturalness, or depth relative to the other 95 scenes.
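For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo update applied to pairwise judgments. The exact parameters used in the study are given in its Methods and are not reproduced here: the starting rating of 1500, the K-factor of 32, and the function name `elo_ratings` are assumptions for illustration.

```python
def elo_ratings(n_items, comparisons, k=32.0, start=1500.0):
    """Derive a rating per item from a sequence of pairwise judgments.

    comparisons: iterable of (winner, loser) index pairs, e.g. the scene
    judged "more open" and the scene it was compared against.
    k and start are assumed values, not taken from the study.
    """
    ratings = [start] * n_items
    for winner, loser in comparisons:
        # Expected probability that `winner` would be chosen, given current ratings
        expected = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)   # larger update for more surprising outcomes
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings

# Toy example with three scenes: scene 0 is judged more open than scenes 1 and 2,
# and scene 1 more open than scene 2
ratings = elo_ratings(3, [(0, 1), (0, 2), (1, 2)])
ranking = sorted(range(3), key=lambda i: ratings[i], reverse=True)
```

Because each update adds to one rating exactly what it subtracts from the other, the ratings form a relative scale: a scene's value is meaningful only in comparison with the other scenes rated in the same session.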
First, we used the Elo ratings as independent confirmation of our dichotomies. The Content dichotomy was the most clearly reflected in the Elo ratings with 46/48 of the top ranked scenes being natural scenes. The Expanse dichotomy was similarly strong with 40/48 of the top ranked scenes being open scenes. To calculate the strength of the Relative Distance dichotomy we counted the number of times the far exemplars of a particular high-level category of scene were rated more highly than the near exemplars in that category, which was true for 40/48 scenes. To assess the reliability of the ratings we divided the 12 participants into two groups of six and calculated Elo ratings for each group separately. The ratings for all three dichotomies were highly correlated (Expanse: r = .92; Content: r = .94; Relative Distance: r = .86; all p < 0.0001) across the groups, verifying the reliability of the Elo ratings. Thus, independent ratings of the individual scenes by naïve observers reliably confirm our original classifications.
Next, we directly compared the Elo ratings with the scene representations we recovered with fMRI in PPA and pEVC. We calculated an fMRI grouping score from the average similarity matrices for each scene that reflected how strongly grouped that scene was within a particular dichotomy. For example, the Expanse score for a scene was calculated by subtracting its average correlation with the closed scenes from its average correlation with the open scenes. We then correlated these fMRI grouping scores with their respective Elo ratings to determine whether scene representations in each region reflected the behavioral rankings of scenes.
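The grouping score described above amounts to a difference of two row averages. The following sketch computes it for Expanse and correlates it with behavioral ratings. Excluding a scene's correlation with itself is an assumption of this example, and the similarity matrix and Elo values are invented toy numbers, not data from the study.

```python
import numpy as np

def grouping_scores(sim, is_open):
    """Expanse grouping score per scene: mean correlation with open scenes
    minus mean correlation with closed scenes (the scene itself excluded)."""
    sim = np.asarray(sim, dtype=float)
    is_open = np.asarray(is_open, dtype=bool)
    n = sim.shape[0]
    scores = np.empty(n)
    for i in range(n):
        others = np.arange(n) != i
        scores[i] = sim[i, others & is_open].mean() - sim[i, others & ~is_open].mean()
    return scores

# Toy data: three open and three closed scenes; same-group pairs correlate at 0.6,
# different-group pairs at 0.2
is_open = np.array([True, True, True, False, False, False])
toy_sim = np.where(np.equal.outer(is_open, is_open), 0.6, 0.2)
scores = grouping_scores(toy_sim, is_open)

# Hypothetical Elo ratings for openness, one per scene
elo = np.array([1620.0, 1580.0, 1540.0, 1460.0, 1420.0, 1380.0])
r = np.corrcoef(scores, elo)[0, 1]
```

A positive score means a scene's response pattern resembles the open scenes more than the closed ones; the same function applies to the other dichotomies by swapping the label vector.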
For Expanse, we found a very strong correlation between the Elo ratings and Expanse scores in PPA (r = 0.67, p < 0.0001) but not in pEVC (r = 0.08, p > 0.1). This difference in correlation was significant (z = 2.16, p < 0.05), suggesting that the pattern of response in PPA more closely reflects behavioral judgments of Expanse than does the pattern in pEVC. For Content, we found no correlations in either PPA (r = 0.10, p > 0.1) or pEVC (r = 0.07, p > 0.1). Further, in PPA this correlation was significantly weaker than the correlation between Elo ratings and Expanse scores (z = 2.99, p < 0.05), demonstrating that there is a stronger relationship between scene representations in PPA and judgments of Expanse than judgments of Content. For Distance, we found equivalent correlations (p > 0.1) in both PPA (r = 0.54, p < 0.0001) and pEVC (r = 0.31, p < 0.01), consistent with the grouping we observed in both regions. Based on our earlier analysis it might have been expected that the correlation with Distance would have been stronger in pEVC than PPA. Their equivalent correlations may reflect a weaker direct contribution of pEVC to conscious judgments about scenes than PPA.
Comparison of behavioral and imaging data
These correlations between the structure of scene representations in fMRI and behavior suggest that the pattern of response in PPA much more strongly reflects subjective judgments about spatial aspects of scenes (Expanse, Distance), than the Content of those same scenes. In contrast, the pattern of response in pEVC reflected only judgments of the Distance of those scenes, providing converging evidence for the different scene information captured in pEVC and PPA. Further, these results show that, regardless of what visual statistics drive the responses of pEVC and PPA, the representations they contain directly reflect, and perhaps even contribute to, subjective judgements of high-level spatial aspects of complex scenes.
High-level Category Information in PPA Within and Across Spatial Factors
Our previous analyses confirmed that spatial factors have a greater impact on the structure of scene representations in PPA than non-spatial factors. To directly test whether there was any high-level category information independent of spatial factors, we next considered (i) whether scene category could be decoded when spatial factors were held constant, or whether scenes from different categories but with similar spatial properties elicit similar responses, and (ii) whether scene category could be decoded across spatial factors, or whether scenes from the same category but with different spatial properties elicit different responses. Since Expanse is largely confounded with category (e.g. all mountain scenes will be open), (ii) could only be tested across Relative Distance.
To perform these analyses we needed to consider the near and far exemplars of each of the 16 high-level categories separately, effectively doubling the number of categories to 32. We then averaged the off-diagonal correlations from the raw similarity matrix for PPA by scene category. The points along the diagonal of this matrix represent the average correlation between exemplars of each category. The off-diagonal points represent the correlations between different scene categories or between the near and far exemplars of the same category (e.g. white ellipse (c)).
No High-level category discrimination in PPA even when controlling for spatial factors
To establish whether categories could be distinguished from one another when they shared both Expanse and Relative Distance, discrimination indices were calculated for each category within each combination of the spatial factors. These discrimination indices were defined as the difference between the correlation of a category with itself (e.g. white ellipse (b)) and the average correlation between that category and the other categories that shared Expanse and Relative Distance. These indices were entered into a one-way ANOVA with scene category (32) as a factor. No main effect of scene category was observed (p > 0.15), nor was there significant discrimination across the scene categories on average (p > 0.15), nor did any individual category evidence significant discrimination with a Bonferroni correction for multiple comparisons (p > 0.3). To apply the most liberal test for category information possible we conducted one-tailed t-tests for each scene category. We found only a single category (near cities) that evidenced any decoding (p < 0.05; uncorrected). Thus, even when spatial factors are held constant we found no strong evidence for scene category representations.
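The discrimination index defined above can be sketched as a simple operation on a category-averaged similarity matrix. The function name, the toy 4×4 matrix, and the integer coding of spatial groups are hypothetical; the sketch also assumes each category has at least one other category sharing its spatial factors.

```python
import numpy as np

def category_discrimination(cat_sim, spatial_group):
    """Discrimination index per category: the category's correlation with
    itself minus its mean correlation with the other categories that share
    the same combination of Expanse and Relative Distance.

    cat_sim: (n_categories, n_categories) category-averaged similarity matrix.
    spatial_group: one label per category identifying its spatial combination.
    """
    cat_sim = np.asarray(cat_sim, dtype=float)
    spatial_group = np.asarray(spatial_group)
    n = cat_sim.shape[0]
    indices = np.empty(n)
    for c in range(n):
        peers = (spatial_group == spatial_group[c]) & (np.arange(n) != c)
        indices[c] = cat_sim[c, c] - cat_sim[c, peers].mean()
    return indices

# Toy example: four categories in two spatial groups
toy_cat_sim = np.array([
    [0.5, 0.3, 0.1, 0.1],
    [0.3, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.2],
    [0.1, 0.1, 0.2, 0.3],
])
toy_groups = np.array([0, 0, 1, 1])
indices = category_discrimination(toy_cat_sim, toy_groups)
```

An index near zero, as for the third toy category, means that exemplars of a category are no more similar to each other than to exemplars of other categories with the same spatial properties.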
To establish whether high-level scene category could be decoded across variations in spatial factors we calculated discrimination indices for each category across the two levels of Relative Distance. These discrimination indices were defined as the difference between the correlation of the near and far exemplars of a category with each other (e.g. white ellipse (c)) and the average correlation between the near and far exemplars of that category and other categories. These indices were entered into a one-way ANOVA with scene category (16) as a factor. No main effect of scene category was observed (p > 0.375), nor was there significant discrimination across the scene categories on average (p > 0.15), nor did any individual category evidence significant discrimination with a Bonferroni correction for multiple comparisons (all p > 0.3). Again we applied the most liberal test for category information and conducted one-tailed t-tests for each scene category. We found only a single category (living rooms) that evidenced any decoding (p < 0.05; uncorrected).
In sum, in contrast to reports emphasizing the representation of scene category in PPA (Walther et al., 2009), we found no evidence for decoding of scene categories in PPA when spatial factors are controlled. We found no ability to decode high-level category across different levels of Relative Distance. We found no evidence for Content as a significant contributor to the overall structure of representations in PPA or pEVC. We also found no correlation between scene representations in PPA or pEVC and subjective judgements of Content, and significantly weaker behavioral correlations for Content than Expanse. While it is possible that these non-spatial factors do have some impact on scene representations in these regions, that impact is clearly minor in comparison to the spatial factors of Expanse and Relative Distance.
Scene Discrimination in PPA and pEVC
While the grouping of between-scene correlations provides insight into how these regions categorize scenes, the difference between within- and between-scene correlations provides an index of scene discrimination. For this analysis, it was critical that we consider only between-scene correlations that did not cross any grouping boundary. Otherwise, our discrimination measure would be implicitly confounded with grouping. Given the strong evidence for both Expanse and Relative Distance as categories, we considered discrimination between scenes within the combinations of these factors separately (four white squares encompassing the main diagonal), collapsing across differences in Content.
Within- and between-scene correlations were extracted from each of the four combinations of Expanse and Relative Distance (Supplemental Item 3). These correlations were then averaged and subtracted from one another to yield discrimination scores. There was a broad ability to discriminate scenes in both regions, with significant discrimination (p<0.05) observed in every condition except for near, closed scenes in PPA. To investigate the pattern of discrimination between the two regions, discrimination scores were entered into a three-way repeated measures ANOVA with Expanse (open, closed), Relative Distance (near, far), and Region (PPA, pEVC) as factors. Discrimination was stronger in pEVC than PPA, resulting in a significant main effect of Region (F1,8 = 18.838, p < 0.01). Discrimination was also generally stronger for near than far scenes, resulting in a significant main effect of Relative Distance (F1,9 = 9.793, p < 0.05), though this effect was stronger in pEVC, resulting in a significant interaction between Region × Relative Distance (F1,8 = 8.898, p < 0.05). Separate ANOVAs within each region confirmed the larger effect of Relative Distance in pEVC (F1,8 = 15.477, p < 0.01) than in PPA (F1,9 = 5.328, p < 0.05) but revealed no additional effects (all p>0.3). These results demonstrate that even within scenes that are grouped together, there is significant information about the individual scenes.
The gross pattern of scene discrimination was very similar in both pEVC and PPA. To investigate the relationship between discriminability in the two regions in greater detail we calculated discrimination indices for each individual scene and then correlated them across pEVC and PPA. The high correlation (r = .659, p < 0.001) between the discrimination indices suggests that the distinctiveness of the representation of a scene in PPA is directly related to its distinctiveness in pEVC.
Taken together the results of the discrimination and categorization analyses suggest a transformation of scene representations between pEVC and PPA. Clearly the discriminability of scene representations in PPA reflects discriminability in pEVC. However, PPA sacrifices some scene discriminability, perhaps, in order to better categorize scenes by their spatial expanse. Thus, PPA maintains less distinct representations of scenes that seem broadly organized to capture spatial aspects of scenes.
Categorization and discrimination in other cortical regions
In addition to PPA and pEVC, we also investigated central EVC (cEVC), TOS, object-selective regions Lateral Occipital (LO) and Posterior Fusiform Sulcus (PFs), and the face-selective Occipital Face Area (OFA) and Fusiform Face Area (FFA).
cEVC was similar to pEVC in its pattern of discrimination (Supplemental Item 4) but showed no scene categorization. This difference in categorization between cEVC and pEVC led to a significant Relative Distance × Region interaction (F1,8 = 29.901, p < 0.01) when categorization averages were entered into a four-way ANOVA with Expanse, Relative Distance, Content, and Region (cEVC, pEVC) as factors. This suggests that pEVC contains more structured scene representations than cEVC, and highlights the likely importance of pEVC in scene processing (Levy et al., 2001; Hasson et al., 2002). However, it must be noted that cEVC represents the portion of space containing the fixation cross, on which the participants were performing the task. Though the cross was very small (~0.5°) relative to the central localizer (5°), it cannot be ruled out that this overlap impacted results in cEVC.
Scene representations in TOS had a structure similar to PPA, but were less categorical. Scene discrimination was similar in TOS and PPA (Supplemental Item 4), but categorization by Expanse was weaker. This weaker categorization led to a significant interaction between Expanse × Region (F1,9 = 11.714, p < 0.01) when categorization averages from TOS and PPA were entered in a four-way ANOVA. In TOS, as in PPA, there was a trend for weak categorization by Relative Distance (F1,9 = 4.548, p = 0.06) and no effects involving Content (all p>0.25).
The object-selective regions (Supplemental Item 5) did not seem particularly involved in processing the scene stimuli. LO evidenced some weak discrimination of scenes and no categorization by any of the 3 dichotomies (all p>0.1). PFs showed no scene discrimination and some categorization by Expanse, but far more weakly than that observed in PPA, resulting in a highly significant Region × Expanse interaction (F1,8 = 17.382, p < 0.01). It is likely that the short presentation times and the scenes we chose, which did not contain strong central objects, reduced the ability of object-selective cortex to extract individual objects from the scenes.
The results from the face-selective regions (Supplemental Item 6) confirmed they contribute little to scene processing (see also Selectivity Analysis below). Neither of the face-selective regions evidenced any categorization by the 3 dichotomies (all p>0.2). Neither region showed much ability to discriminate between scenes, with FFA showing significant discrimination only for far, closed scenes and OFA for far, open scenes.
Overall, at least some discrimination was possible based on the response of a number of cortical regions, although the strongest discrimination was found in EVC, PPA, and TOS. In contrast, grouping was largely confined to PPA, EVC, and TOS. Importantly, EVC grouped primarily by Relative Distance whereas PPA and TOS both grouped primarily by Expanse.
So far we have focused on examining scene categorization and discrimination within regions defined by their category selectivity. However, the contrast of a preferred and non-preferred stimulus class (Kanwisher et al., 1997; Epstein and Kanwisher, 1998) implies that a region might be identified as specialized for a particular stimulus class because of a difference in response between these conditions and not necessarily because the region maintains any fine-grained representation of that class. Here we took advantage of our ungrouped design and searched for regions that showed consistent selectivity amongst the set of 96 scenes. This analysis provides an alternate way to identify regions important in scene representation and allows us to investigate whether any other regions are also important.
The aim of this analysis was to identify voxels in a whole-volume search that show consistent selectivity for the set of scene images. Selectivity was defined by the response profile across all 96 scenes in a single voxel (Erickson et al., 2000). We computed the consistency of selectivity by calculating the correlation of the response profile between independent halves of the data. We then produced maps of the correlation values, deriving cluster thresholds using a randomization procedure to determine which voxels are significantly selective (see Experimental Procedures). Given the breadth of our scene stimuli, voxels which do not show at least a modicum of consistency in their selectivity are unlikely to be involved in scene processing.
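The voxel-wise consistency measure can be sketched as a per-voxel split-half correlation across scenes. The randomization step below is a deliberately simplified stand-in: it shuffles scene labels and uses a maximum-statistic threshold across voxels, which is not the cluster-threshold procedure referred to above. The function names, data sizes, and number of permutations are all assumptions.

```python
import numpy as np

def voxel_consistency(half_a, half_b):
    """Per-voxel Pearson correlation of the response profile across scenes
    between two independent halves of the data.

    half_a, half_b: arrays of shape (n_scenes, n_voxels).
    """
    za = (half_a - half_a.mean(axis=0)) / half_a.std(axis=0)
    zb = (half_b - half_b.mean(axis=0)) / half_b.std(axis=0)
    return (za * zb).mean(axis=0)

def permutation_threshold(half_a, half_b, n_perm=200, alpha=0.05, seed=0):
    """Simplified null: shuffle scene order in one half and record the maximum
    correlation over voxels (a max-statistic threshold, assumed for this sketch)."""
    rng = np.random.default_rng(seed)
    null_max = np.empty(n_perm)
    for p in range(n_perm):
        shuffled = half_b[rng.permutation(half_b.shape[0])]
        null_max[p] = voxel_consistency(half_a, shuffled).max()
    return np.quantile(null_max, 1.0 - alpha)

# Synthetic example: 96 scenes, 50 voxels, only the first 10 carry scene signal
rng = np.random.default_rng(0)
n_scenes, n_voxels = 96, 50
signal = np.zeros((n_scenes, n_voxels))
signal[:, :10] = rng.normal(size=(n_scenes, 10))
half_a = signal + 0.5 * rng.normal(size=(n_scenes, n_voxels))
half_b = signal + 0.5 * rng.normal(size=(n_scenes, n_voxels))

consistency = voxel_consistency(half_a, half_b)
threshold = permutation_threshold(half_a, half_b)
selective = consistency > threshold
```

In this synthetic case the voxels that carry scene signal have reproducible response profiles across halves and exceed the threshold, while pure-noise voxels have correlations near zero.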
We found that the vast majority of the consistently selective voxels (~76%) lay within our pre-defined regions, indicating that these regions largely contain the core voxels involved in scene processing in our volume.
We next quantified the average selectivity within each of our predefined regions-of-interest (ROIs). As expected, significant selectivity (p<0.05) was observed only within scene-selective and EVC ROIs. In EVC there was significantly greater selectivity in pEVC than cEVC (F1,8 = 21.991, p < 0.01). To confirm there was greater selectivity in scene-selective cortex than in either object- or face-selective cortex, their selectivity scores were entered into a two-way ANOVA with Selectivity (scene, object, face) and Location (anterior, posterior) as factors. The only effect observed was a main effect of Selectivity (F2,16 = 6.769, p < 0.01; Greenhouse-Geisser corrected), owing to the greater selectivity observed in the scene-selective than in either the object- (F1,8 = 8.105, p < 0.05) or face-selective (F1,8 = 9.069, p < 0.05) ROIs.
Finally, we quantified the amount of overlap between each ROI and the significantly selective clusters derived from the whole volume search. Again, significant overlap was present only between the EVC and scene-selective ROIs (p<0.05). The advantage for pEVC over cEVC in both mean selectivity and overlap with selective voxels is in keeping with the theory that PPA has a bias for the peripheral visual field (Levy et al., 2001; Hasson et al., 2002). In combination, these two selectivity analyses suggest that our analysis of pEVC and PPA captured the majority of the scene processing voxels in the ventral visual pathway.
In sum, using a voxel-wise measure of scene selectivity, based only on responses to scenes, we found that our ROIs captured the vast majority of voxels with consistent scene selectivity. Further, selectivity was most stable in PPA, pEVC and TOS, consistent with our analyses of categorization and discrimination.