Although our subjective impression is of a richly detailed visual world, numerous empirical results suggest that the amount of visual information observers can perceive and remember at any given moment is limited. How can our subjective impressions be reconciled with these objective observations? Here, we answer this question by arguing that, although we see more than the handful of objects claimed by prominent models of visual attention and working memory, we still see far less than we think we do. Taken together, we argue that these considerations resolve the apparent conflict between our subjective impressions and empirical data on visual capacity, while also illuminating the nature of the representations underlying perceptual experience.
The moment we open our eyes, we experience a vast, richly detailed visual world extending well into the periphery [1,2]. However, numerous experimental results indicate that the bandwidth of human perception is severely limited. Findings from change blindness and inattentional blindness demonstrate that much of the available visual information goes unnoticed. Direct estimates of the capacity of visual attention (see Glossary) and working memory reveal that surprisingly few items can be processed and maintained at once [4,5]. These results raise a natural question: why do we think we see so much when the scientific evidence suggests we see so little?
One answer to this question is that change blindness and inattentional blindness highlight the limits of mechanisms such as attention and working memory, rather than the limits of conscious perception. According to this view, perception ‘overflows’ and exceeds the capacity of the cognitive mechanisms needed to access that information. In other words, we consciously perceive more than we can attend, remember, report, or base decisions on [7–11]. Under this view, the neural processes associated with visual awareness are separate from those associated with attention, working memory, and explicit report. Recurrent processing in sensory cortex supports conscious perception, whereas the parietal and prefrontal cortices support the cognitive mechanisms involved in accessing those percepts. According to this framework, there is no tension between our subjective impression of the world and objective measures of human capacity limits because both of these are true. We have a rich experience of the world that cannot be fully captured by the capacity-limited cognitive mechanisms beyond the canonical visual system.
However, contrary to this view, many researchers argue that awareness is intrinsically linked to these cognitive functions and information is not consciously perceived until it is accessed by higher-order systems, such as attention, working memory, and decision-making [13–18]. Rather than link conscious perception with recurrent processing in sensory cortex, this view associates awareness with the parietal and prefrontal cortices. However, for those who endorse this view, the problem remains: how can our impression of a rich visual experience be supported by mechanisms that have strict capacity limits? Put another way, it has been claimed that ‘Introspectively, consciousness seems rich in content…From the third-person perspective of the behavioral scientist, however, consciousness is rather miserable’ (p. 205).
We argue here that, even though conscious perception is limited by cognitive mechanisms such as attention and working memory, it is not ‘rather miserable’, and the visual information observers have access to is not at all sparse. To make this argument, we discuss a variety of recent results demonstrating that people can encode and remember considerably more than just a few items. First, we examine empirical findings from a relatively new field of study: visual ensembles and summary statistics. The key idea here is that the visual system exploits the redundancy found in real-world scenes to represent a large amount of information, often extending into the visual periphery, as a single summary statistic. Critically, standard models of attention and working memory largely ignore ensemble representations, focusing instead on the representation of individual items [21–25]. Once ensembles and summary statistics are taken into consideration, it quickly becomes clear that observers have access to different aspects of the entire field of view, not just a handful of items.
In addition, we also discuss the idea that neural structures within the visual system involved in representing visual scenes and ensemble statistics [26,27] comprise a unique neural channel that is partially separate from other processing channels [28,29]. These results suggest that the visual system is functionally organized to allow for scene and ensemble representations to be efficiently formed somewhat independently of other object representations. In other words, there appear to be separate neural pathways for representing the forest and the trees.
Together, these findings help reconcile the apparent tension between our subjective impression of a rich visual world and empirical results highlighting the limits of visual cognition. We argue that the apparent richness of visual experience can be captured without having to dissociate consciousness from higher-level cognitive functions and without arguing that visual awareness overflows cognitive access.
Two paradigms that have had a major role in demonstrating the limits of visual cognition are change blindness and inattentional blindness. Change blindness is the inability to detect a change between two different pictures when a brief interruption occurs between the two images [30,31] or the change occurs so gradually that it does not automatically draw attention. By contrast, inattentional blindness is the failure to notice an otherwise visible stimulus when attention is directed elsewhere. In perhaps the most famous example, participants failed to notice a man in a gorilla costume walking through the middle of a scene when attention was focused on people passing a basketball. Perhaps more commonly, automobile accidents regularly occur because drivers fail to notice items on the road (e.g., another car or a pedestrian) when their attention is directed elsewhere (e.g., their cell phone conversation) [34,35]. Despite differences in methodologies, both change blindness and inattentional blindness arise because of observers’ limited ability to attend to and remember more than a few items at a time.
Although these paradigms clearly demonstrate the limits of visual cognition, more targeted studies have characterized the architecture and capacities of visual attention and working memory. Both of these processes are limited by a finite supply of some mental commodity. This commodity is often characterized as either a fixed number of ‘slots’ [4,22–24] or a fluid cognitive resource [21,37,38]. Despite the differences between these models, they both converge on the idea that observers can store around three or four items in working memory. In terms of visual attention, initial studies estimated that around three or four locations can be attended at once [39,40], but more recent efforts have pushed that number closer to around seven or eight [25,41]. However, even eight attended locations is still not sufficient to explain the richness of perception.
In isolation, these results seem to imply that awareness is limited to only a handful of items at a given moment. However, even when attention is entirely focused on a single item, no one has the impression that the rest of the world fades into darkness (Figure 1). Instead, observers believe they have a rich perceptual experience that spans the entire field of view. This belief has been experimentally verified by the fact that naïve observers systematically overestimate the capacities of attention and working memory [42,43]. At first blush, these results challenge the idea that the contents of visual awareness are the same as the contents of mechanisms such as attention and working memory [13–18]. How can such limited processes ever capture the richness of perceptual experience?
The visual world does not comprise random bits of uncorrelated information; it has structure, regularity, and redundancy [44–46]. The visual system takes advantage of this fact by representing groups of items as a statistic that summarizes different types of information (Box 1). These ensemble representations, or summary statistics, are formed by collapsing across the measurements of individual items to form a singular description of the group. Although items that are not focally attended are represented with poor resolution, averaging across these imprecise representations allows the system to obtain an accurate measure of the entire group.
What kind of information can be represented as an ensemble? Earlier studies focused on low-level visual dimensions, such as average orientation, brightness, speed of motion, and size. These findings were then extended into higher-level dimensions, such as facial emotion, gender, and identity, as well as eye gaze and biological motion. Many of these dimensions can be processed remarkably fast (i.e., with ~50 ms presentation time) [104,109] and formed by integrating representations over time [110,111], providing observers with a rapidly updated summary of a variety of dimensions across the visual world.
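The benefit of averaging over imprecise item representations can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not taken from the studies discussed here: each item is encoded with independent Gaussian noise, and the function name, noise level, and size range are hypothetical. Under those assumptions, the error of the ensemble mean shrinks roughly with the square root of the number of items.

```python
import random
import statistics


def simulate_ensemble_error(n_items, item_noise_sd, n_trials=2000, seed=0):
    """Compare the error of one noisy item with the error of the ensemble mean.

    Returns (mean absolute error of a single item,
             mean absolute error of the average across all items).
    """
    rng = random.Random(seed)
    single_errors = []
    ensemble_errors = []
    for _ in range(n_trials):
        # Hypothetical true sizes of the items in a display (arbitrary units).
        true_sizes = [rng.uniform(1.0, 3.0) for _ in range(n_items)]
        # Assumption: each item is encoded with independent Gaussian noise.
        noisy_sizes = [s + rng.gauss(0.0, item_noise_sd) for s in true_sizes]
        # Error when reporting one individual item.
        single_errors.append(abs(noisy_sizes[0] - true_sizes[0]))
        # Error when reporting the mean of the whole group.
        ensemble_errors.append(
            abs(statistics.fmean(noisy_sizes) - statistics.fmean(true_sizes))
        )
    return statistics.fmean(single_errors), statistics.fmean(ensemble_errors)


single_error, ensemble_error = simulate_ensemble_error(n_items=16, item_noise_sd=0.5)
print(f"single item error: {single_error:.3f}")
print(f"ensemble mean error: {ensemble_error:.3f}")
```

With 16 items, the ensemble estimate should be several times more precise than any single item, even though every individual measurement is equally noisy. Correlated noise, which the sketch ignores, would reduce this benefit.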
Ensemble perception is not limited to the laboratory and is pervasive throughout everyday life. Imagine walking down a busy street with a crowd of people moving towards you. It would be computationally taxing to examine every object on the street or proceed person by person to determine each individual’s facial expression, direction of motion, and gait. Given the inherent structure in the scene, a vast amount of this information spanning a wide expanse of visual space can be represented as an ensemble, or average. Representing information in this way allows observers to quickly determine whether the people are approaching in a threatening manner (e.g., moving quickly, in a converging direction, with angry facial expressions) or a nonthreatening manner (e.g., moving slower, in multiple directions, with neutral facial expressions). This is just one of many examples of how ensemble perception can enable efficient coding of relevant information as observers navigate the world despite processing limitations.
How does representing multiple items as an ensemble help resolve the tension between our subjective impression of a rich visual experience versus objective measurements of limited perception? We argue that items that are attended to and foveated are perceived at a higher resolution, while items that are unattended or in the periphery are primarily perceived as being part of an ensemble. Observers are aware not only of a handful of items but also of the entire scene, but they only perceive a subset of the scene at high resolution. Standard demonstrations of change blindness and inattentional blindness succeed because the critical change often preserves the summary statistics of the scene. When those statistics are violated, it is considerably easier to notice the changes in a scene [48–50]. Detecting changes in these statistics is easy because observers do not perceive a small subset of the items in a scene; they perceive some information about all of the items in the form of ensemble statistics across several dimensions (Figure 1).
One of the cleanest demonstrations of this idea comes from a change detection study in which one of 25 colored items could change color (e.g., from red to blue) and participants simply indicated whether a change occurred. Performance was measured as a function of the statistical regularity of the display, which varied across trials. If observers can attend to, perceive, and remember only a handful of items (around three or four), performance should remain constant regardless of the changes to the structural configurations of the display. However, if observers are able to perceive the overall structure of the display, performance on the task should vary as a function of higher-order regularities. The results from this study unambiguously supported the latter prediction. When the display had little structure, standard estimates of working memory capacity indicated that only around 4.5 items were successfully held in memory. However, when more structure was added to the display, an estimated 24 items were held in memory, six times the standard estimate of working memory capacity (Figure 2, top row). Furthermore, the authors modeled this change detection task using Bayesian inference. The results of this modeling suggest that observers encoded a few individual items along with a summary of the statistics of the entire display (see also [53–55]).
Ensemble statistics are useful not only for the simple stimulus displays typical of working memory paradigms, but also because they probably serve as the foundation of scene perception more broadly. Low-level features, such as luminance, orientation, and spatial frequency, are combined to form higher-order representations that are sufficient for the classification of scenes (e.g., mountain, highway, or beach). As long as certain statistics of the scene are preserved (e.g., spatial contours, texture densities, etc.), the category of a scene can be extracted even when the individual objects it contains can no longer be perceived (Figure 3). The importance of summary statistics has also been demonstrated with computational models that categorize scenes based solely on texture statistics. Finally, in addition to carrying information about the identity of a scene, these basic statistics are informative enough for observers to recognize other aspects of a scene, such as its openness, symmetry, complexity, and depth [60,61].
Ensemble statistics also likely have a key role in enabling observers to form scene representations extremely quickly. These representations are formed so fast that observers can perceive a great deal of information from a single eye fixation, without making saccades. When looking at a scene, fixations typically last 275–300 ms [63,64]. In that time, observers can extract the gist of the scene and a few larger objects in the scene. Even with an exposure duration of 50–100 ms, observers can still report the gist of a scene and extract a variety of properties, such as the depth, navigability, openness, and the temperature of the scene [67–70].
Together, these studies show that, within a single glance, observers do not merely have access to a small handful of isolated items in a sea of nothingness; they have access to a tremendous amount of information spanning the entire scene. The ability to extract an almost immediate sense of the visual world provides ecological benefits, such as guiding further action, especially saccades. Saccades are important to this discussion because one reason why observers may overestimate their perceptual experience is that saccades are so effortless that observers often do not even realize that they are making them. This gives people the false impression that they perceive more than they actually do in a given instant because they are not aware of the serial manner in which they accrue information (i.e., one saccade after another). Furthermore, observers do not move their eyes randomly across a scene; they systematically go to the parts of the scene that are most informative for the task at hand. This ability to select saccade targets intelligently is possible because observers are able to take advantage of the knowledge they have obtained about the scene from its global image statistics [73–75]. Thus, the use of summary statistics in scene perception not only gives observers an immediate sense of the visual world, but also provides a foundation for further exploration.
Finally, we speculate that, in addition to recognizing basic perceptual aspects of the world in a single glance, observers can also quickly perceive higher-level aspects of the scene, such as its physical, social, and action-based properties. For example, we do not see just a cup and a table, but a cup resting on a table, and we compute the reaching and grasping motion that would be required to take a sip from the cup. When seeing people in a scene, we quickly perceive their social characteristics and actions (e.g., are these two people interacting with each other or not?). If these inferences are made efficiently, it may be due to specialized cortical machinery that helps imbue real-world perception with rich semantic content far beyond the mere identities of objects and scenes. Whether and how these high-level inferences bypass the standard processing bottlenecks of vision is an important and largely unanswered question for future research (see Outstanding Questions).
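For readers unfamiliar with how capacity is estimated from change detection data, the sketch below shows one widely used estimator, Cowan's K, which scales guessing-corrected accuracy by the number of items in the display. Which estimator the study above used is not stated in this text, so the choice of Cowan's K is an assumption made for illustration, and the hit and false-alarm rates are made-up numbers, not data from the study.

```python
def cowan_k(set_size, hit_rate, false_alarm_rate):
    """Estimate the number of items held in memory with Cowan's K.

    set_size: number of items in the display (N)
    hit_rate: proportion of change trials correctly reported as changes (H)
    false_alarm_rate: proportion of no-change trials reported as changes (FA)
    """
    if set_size <= 0:
        raise ValueError("set_size must be positive")
    for rate in (hit_rate, false_alarm_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie between 0 and 1")
    # K = N * (H - FA): guessing-corrected accuracy scaled by display size.
    return set_size * (hit_rate - false_alarm_rate)


# Made-up rates for illustration only, not data from the study.
example_k = cowan_k(set_size=8, hit_rate=0.75, false_alarm_rate=0.25)
print(example_k)
```

Because K can never exceed the set size, estimates as high as 24 items can only emerge from large displays such as the 25-item arrays described above.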
How are ensemble statistics represented in the brain? Are they formed across the same circuits involved in representing individual items? Or do noisy representations of individual items have to be read out by a higher-order node that forms a new, ensemble representation?
What are the attentional requirements of ensemble perception? Is it possible for ensemble statistics to be rendered inattentionally blind or go unnoticed due to the attentional blink? What type of attention is needed for ensemble perception? Can multiple statistics, summarizing different dimensions of the scene (i.e., average color, orientation, size, etc.), be established in parallel?
To what extent is ensemble perception used across sensory modalities besides vision? Do the same principles discovered in vision apply to other modalities?
Are higher-level social, physical, and action-based properties of scenes extracted quickly? Is different neural tissue engaged in extracting each of these kinds of information? To what extent does the subjective richness of perception result from the overlay of these higher-level physical, social, and action-based properties of a scene?
While the speed and efficiency of natural scene and ensemble perception are well known and incorporated into many cognitive models [76,77], the neural mechanisms supporting this ability have only begun to be understood. A series of recent studies all converge on the idea that the gist and statistics of a scene are perceived so efficiently because we have neural structures specifically involved in representing those particular visual dimensions. The parahippocampal place area (PPA), retrosplenial cortex (RSC), and occipital place area (OPA) are selectively and causally involved in recognizing the identity, layout, and navigability of scenes [79–81]. In addition, some of these neural regions, particularly PPA, appear to have a prominent role in representing a variety of ensemble and statistical properties (e.g., texture) [27,82,83].
The neural structures that are sensitive to scenes and ensemble statistics appear to be at least partially separate from the structures involved in representing other visual categories, such as faces, bodies, and objects. For example, neuropsychological studies have shown preserved texture and scene perception, but impaired object perception, in a patient with bilateral damage to lateral occipital cortex, an area of the brain that responds selectively to shape and object stimuli [84,85]. Conversely, a patient with damage to parahippocampal cortex had impaired scene recognition abilities and could only recognize scenes because of a prominent visual object (e.g., a house). Furthermore, behavioral and neural evidence from normal observers suggests that both scene and ensemble perception draw upon pools of cognitive resources that are distinct from pools supporting object perception [28,29,87]. One recent study found that more information could be processed when it was distributed across multiple neural regions (e.g., two faces and two scenes) compared with when it relied on a single neural region (e.g., four faces) (Figure S1 in the supplemental information online). Thus, the visual system appears to be organized so that the representations of scenes, and potentially the representations of ensemble statistics, are formed with minimal interference from other objects. Having dedicated neural structures for these particular visual dimensions potentially has an important role in the ability to construct a richly detailed percept from a single glance.
Those who believe that visual awareness overflows the capacities of attention, working memory, and other higher-level cognitive processes [7–11] may claim that the ensemble statistics described here are in fact prominent examples of such overflow. In fact, a recent study claimed that one particular statistic, color diversity, could be perceived ‘cost free’ and required no attention or working-memory resources. Are ensemble statistics and scene representations truly ‘cost free’? If this is true, ensembles and scenes should be immune to all types of attentional interference. Performing a demanding attentional task (e.g., visual search or working memory) should have no impact on the ability to perceive ensemble statistics or a natural scene.
In reality, numerous pieces of evidence suggest that some type of attention is needed to process and perceive ensembles and scenes. It has repeatedly been shown that natural scenes can go unnoticed because of inattentional blindness or the attentional blink. Furthermore, the ability to classify the gist of a scene suffers in dual-task situations. In terms of ensemble statistics, one recent study found that processing the statistics of a group of objects requires as much attention as processing an individual object. This finding is consistent with earlier studies claiming that an ensemble takes up approximately the same amount of space in working memory as an individual object [91–93]. In addition, observers are more accurate at processing multiple ensembles when they are presented sequentially, rather than simultaneously [94,95], suggesting that ensembles compete for limited cognitive resources. Finally, the precision of ensemble representations varies as a function of the allocation of attention (i.e., focused versus distributed). Together, these results suggest that, although ensemble statistics are processed quickly and efficiently, they are not perceived ‘cost free.’ However, this is a relatively new research question and future work will need to examine this issue with different paradigms and tasks (see Outstanding Questions).
Those who believe that visual awareness overflows mechanisms, such as attention and working memory, might say that our expanded notion of what observers can access still does not account for the richness of experience. One classic kind of evidence in support of the overflow argument comes from the partial report paradigms [97,98]. In these studies, participants encode items into working memory and are then quickly cued as to which particular items they should report. When cued in this way, performance is near ceiling. However, when no cue occurs, and participants must report the entire set, performance is worse. This finding has been cited as evidence of information overflowing cognitive access [6,10]. However, other researchers argue that the cue simply elevates previously unconscious information to consciousness [15,16]. Currently, there is no clear, uncontroversial way to empirically distinguish these interpretations, and so we focus the rest of our discussion on other types of information that may overflow access.
Even after considering ensemble statistics and scene perception, some people may still have the intuition that observers can see more than attention and working memory can capture. However, without clear empirical evidence to support this claim, it appears to be based purely on intuition. Should we trust this intuition? It is well established that observers systematically overestimate the richness of their own perception. People often believe that detecting changes in a change blindness experiment will be easy, and are surprised to find out that it is not [42,43]. Similarly, people do not realize how poor their acuity and color perception are in the periphery. Box 2 presents two simple exercises that directly test the extent to which people are mistaken about their own perceptual experience.
First, ask a participant how close a playing card has to be to the center of his field of vision for him to determine the identity of the card (e.g., ten of clubs versus seven of spades). Have the participant hold his arms up to visualize an approximate estimate of his answer (Figure IA). Next, to show the participant the actual answer, tell him to extend his arm to the side (~90° from fixation). Put a card in his hand facing him, but make sure that he keeps looking forward and does not glance at the card. Then, tell him to keep the card at arm’s length and slowly move it in to the center of his field of view. Tell him to stop moving the card as soon as he is sure he can identify the card. If done correctly, it will become clear on the first trial that people wildly underestimate how close the card has to be to fixation for them to identify it. In many cases, participants will spontaneously laugh once they realize how far off they were with their prediction.
For the second demonstration, ask a participant to hold his fixation on a random object. Grab a colored object that fits in your hand (e.g., an orange). Slowly move your hand from the participant’s periphery to the center of his field of view (Figure IB). Make sure to emphasize that he keep staring straight ahead. As you move your hand towards the center, jiggle it a little bit. Tell the participant to say ‘Stop’ as soon as he detects any motion in his periphery. Really emphasize that, as soon as he senses any peripheral motion, he should tell you immediately. Once he says stop, jiggle the object for a moment longer and confirm that he sees peripheral motion. If you want to confirm that he sees something, move your hand up/down and ask the participant to say which direction your hand moved. Once you are both convinced he can see the peripheral motion, ask him to tell you the color of the object. He will almost certainly say he does not know and, if you force him to give you an answer, he will just guess. Do this multiple times and you will see that people are: (i) no better than chance at guessing the color of the object; and (ii) surprised by how unable they are to report the color.
What these exercises show is that it is easy to be wrong about the richness of perception, and scientists should be skeptical about claims that observers can see more than can be accessed. Instead, the claim that visual awareness overflows cognitive access must be supported by specific examples of visual input that can be consciously perceived without being attended, held in working memory, reported, or used to guide volitional action. Without specific evidence, there appears to be no good scientific reason to believe consciousness overflows cognition.
Many researchers have claimed that information cannot be consciously perceived without being accessed by higher-level cognitive functions, such as attention and working memory [13–18]. This view has been criticized for its inability to capture the richness of perceptual experience, given the strict capacity limits of these mechanisms evident in phenomena such as change blindness and inattentional blindness [7–11]. Critics of this view say that those who claim information must be accessed to be conscious believe that ‘conscious perception is limited to the contents of visual working memory, roughly three or four things at a time in many standard paradigms’ (p. 445).
To the contrary, we argue that observers have access to considerably more information than just three or four items at a time. Instead, a handful of items are perceived with high fidelity, while the remainder of the world is represented as an ensemble statistic (or set of statistics). Those who link consciousness with higher-level cognitive function [13–18] need not believe that perception is sparse, with observers seeing only a few items at a time. Perception is undoubtedly rich, but this richness can be easily captured by cognitive mechanisms, such as attention and working memory. The focus of consciousness research should be on the nature of the visual information that is captured beyond the few high-fidelity objects that can be held in visual working memory, the nature and number of such summary statistics, and the capacity limits entailed in their extraction and representation.
Numerous empirical results highlight the limits of visual perception, attention, and working memory. However, it intuitively feels as though we have a rich perceptual experience, leading many to claim that conscious perception overflows these limited cognitive mechanisms.
A relatively new field of study (visual ensembles and summary statistics) provides empirical support for the notion that perception is not limited to a handful of items and that observers have access to information across the entire visual world.
Ensemble statistics, and scene processing in general, also appear to be supported by neural structures that are distinct from those supporting object perception. These distinct mechanisms can work partially in parallel, providing observers with a broad perceptual experience.
Moreover, new demonstrations show that perception is not as rich as is intuitively believed. Thus, ensemble statistics appear to capture the entirety of perceptual experience.
Thanks to Tim Brady, Sid Kouider, Michael Pitts, and Ruth Rosenholtz for helpful discussions on the project. Thanks to George Alvarez, Jason Haberman, and Jordan Suchow for extensive discussions on ensemble representations. Thanks to Cameron Ellis for comments on an earlier version of the manuscript. Thanks to Jeremy Freeman for the stimuli used to create Figure 1 and Aude Oliva for the images in Figure 3. This research was supported by NIH-NRSA (F32EY024483) to M.A.C. and NIH (EY13455) to N.K.