Theories of visual attention argue that attention operates on perceptual objects, and thus that interactions between object formation and selective attention determine how competing sources interfere with perception. In auditory perception, theories of attention are less mature, and no comprehensive framework exists to explain how attention influences perceptual abilities. However, the same principles that govern visual perception can explain many seemingly disparate auditory phenomena. In particular, many recent studies of “informational masking” can be explained by failures of either auditory object formation or auditory object selection. This similarity suggests that the same neural mechanisms control attention and influence perception across different sensory modalities.
At a cocktail party, the sounds of clinking glasses and exuberant voices add acoustically before entering your ears. In order to appreciate your companion's anecdote, you must filter out extraneous sources and focus attention on her voice. At the same time, the sounds that you tune out are critical for maintaining awareness of your environment. Indeed, a source of interference (the pompous man on your right) may become the source you want to understand in the next moment (e.g., when you realize he is relaying a juicy story about your boss). In order to maneuver successfully in everyday settings, you need to be able to both focus and shift attention as the need arises.
Theories of visual attention explain many striking perceptual phenomena that arise when viewing complex scenes, from change blindness to performance on visual search tasks [1,2]. While there is much current interest in how central limitations interfere with auditory perception, there is no comprehensive framework to explain our ability to understand sound sources in complex acoustic scenes. Here, I argue that many auditory phenomena, including how we manage to converse in a cocktail party, can be understood by properly extending theories of visual attention. This commonality supports the idea that the same neural processes control visual and auditory attention.
Theories of visual attention argue that observers focus attention on an object in a complex scene. Unfortunately, just as in vision, it is difficult to define what constitutes an object in audition. This difficulty arises in part because there are few absolute rules governing auditory object formation. Audible sound in a mixture is not always allocated between the objects perceived in a scene, and can contribute to either multiple objects [5,6] or to no object. The state of the listener, from expectations about a scene's content to the level of analysis a listener undertakes (listening to a symphony versus to the English horn solo), influences the perceived content of an object [8,9]. Particularly for ambiguously structured stimuli, the perceptual organization of a scene evolves over time and/or is bistable [10,11].
Despite the lack of a precise definition, we have an intuitive understanding of what an auditory object is. In the cocktail party, we perceive the woman speaking on the left, the chiming doorbell, a shattering plate. Each of these auditory objects is an estimate of sound emanating from a discrete sound source: an “auditory object” is a perceptual entity that, correctly or not, is perceived as coming from one physical source.
In a visual scene, objects form locally based on contiguous geometric structure, such as edges, boundaries, and contours. Discrete local patches can be perceptually linked based on similarity of texture, color, and other features to form whole objects.
In a similar way, auditory objects form across different analysis scales. For sound elements with contiguous spectro-temporal structure, formation relies primarily on this local structure [12,13], including common onsets and offsets, harmonic structure, and continuity of frequency over time. Due to the physical constraints of how sound is produced, many ecological signals (particularly information-conveying communication signals, from birdsongs to speech) have a very rich spectro-temporal structure that supports robust short-term object formation (e.g., formation of syllables). Short-term objects are streamed (linked together over time) through continuity and similarity of higher-order perceptual features such as location, pitch, timbre, and even learned meaning (word identity, grammatical structure, semantics).
The relative influence of a particular cue or feature on object formation depends on the scale of the analysis. For instance, spatial auditory cues have a relatively weak influence over local time scales [13,14]. However, perceived location (as opposed to basic spatial cues such as interaural time differences) strongly influences how we link short-term auditory objects into a coherent stream.
Although the above description may seem to suggest that objects are constructed through a hierarchy of processing, first grouped based on local structure and then organized across longer spatial or temporal scales, the truth is more complex. Higher-order features and top-down attention can alter how objects form locally. Rather than a hierarchical processing structure, objects are formed through heterarchical interactions across different scales. The ultimate perceptual organization of the scene, at all scales, depends on the preponderance of all evidence.
Object formation directly influences how we perceive and process complex scenes. In all sensory modalities, the normal mode of analyzing a complex scene is to focus on one object while other objects are in the perceptual background [17,18]. In vision, this mode of perceiving is described as a biased competition between perceptual objects. Biased competition takes place automatically and ubiquitously when there are multiple objects in a scene. Which object wins the competition depends both on the inherent salience of the objects and the influence of volitional, top-down attention, which biases the competition to favor objects with desired perceptual features [2,19].
Even when observers select what to attend based on low-level features, attention operates on objects [2,20]. For instance, when attention is spatially focused, observers' sensitivity to other features that are part of the object at the attended location is also enhanced. Thus, object formation is intricately linked with selective attention: the perceptual unit of attention is the object.
Most work on attention and objects is in the visual literature [2,21], but similar principles govern auditory perception [22,23,24]. Evidence suggests that attention acts on auditory objects, much as it enhances visual objects [25,26,27]. Moreover, listeners appear to attend actively to one and only one auditory object at a time [28,29], consistent with the biased-competition model of visual attention (see Text Box 1).
Evidence suggests that we listen to only one object at a time. Listeners have difficulty making judgments of the relative timing of events across (but not within) streams. When listeners are asked to divide attention between two speech streams that are close together in space, they are able to report many of the words in the two streams, but intermingle words from the two messages. In contrast, when the two streams are spatially distinct, listeners are less likely to confuse words across streams, but also recall fewer words overall. These results hint that the more distinct competing streams are from one another, the more complete the suppression of the stream in the perceptual background.
How is it, then, that in everyday listening situations we seem to be able to understand multiple sources, especially in social settings where the flow of conversation is chaotic and unpredictable?
It is likely that we switch attention between objects in a complex setting, time-sharing attention between competing sources. Even if we don't perceive all of the content of one signal, we can fill in missing snippets (see also Text Box 2). In addition, we can use short-term sensory memory to help this filling-in process, mentally replaying the bit of the input signal that we didn't focus on initially.
Switching attention takes on the order of 100 to 200 ms, and sensory memory degrades with time. Thus, some of the information in a newly attended stream will be missed even after a listener switches attention. Moreover, auditory streams build up over time [8,10], which may enhance the ability to focus on the stream in the perceptual foreground and understand its content. Thus, if listeners switch attention between streams, performance is likely degraded both because of the direct cost of switching attention and because switching attention resets streaming, negating the benefit of object build-up.
Because attention is object based, competing sources in a complex scene can cause many different forms of perceptual interference, some of which are considered below. An overview of the interactions affecting auditory perception is shown in Figure 1.
The simplest form of perceptual interference occurs when a competing source renders portions of a target imperceptible. This kind of interference, known in auditory circles as energetic masking, occurs when the response on the sensory epithelium to the target is disrupted because the system is responding to a competing source. In the auditory domain, where the auditory nerve encodes sound in a time-frequency representation, energetic masking occurs when the masking signal overlaps in time and frequency with the target. In vision, an analogous form of interference occurs when a source near the observer obscures all or part of another source behind it. In such cases, the neural response to the target is distorted or imperceptible.
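The time-frequency overlap described above can be made concrete with a toy sketch. The Python below is only an illustration under stated assumptions: sounds are represented as small grids of levels in dB (rows are frequency channels, columns are time frames, `None` marks a cell with no energy), and a target cell is counted as masked when the masker level in the same cell is at least the target level plus a margin. The function name `masked_fraction`, the `margin_db` parameter, and the threshold rule are hypothetical choices made for this example; they are not a model of the auditory periphery and are not taken from the studies cited here.

```python
# Toy illustration only: a coarse time-frequency grid, not a peripheral auditory model.

def masked_fraction(target, masker, margin_db=0.0):
    """Return the fraction of target time-frequency cells covered by the masker.

    target, masker: lists of rows (frequency channels); each row holds one
    level in dB per time frame, or None where the source has no energy.
    A target cell counts as masked when the masker has energy in the same
    cell at a level of at least target level + margin_db (an assumed rule).
    """
    total = 0
    masked = 0
    for target_row, masker_row in zip(target, masker):
        for target_level, masker_level in zip(target_row, masker_row):
            if target_level is None:
                continue  # no target energy in this cell, so nothing to mask
            total += 1
            if masker_level is not None and masker_level >= target_level + margin_db:
                masked += 1
    return masked / total if total else 0.0


# A sparse target (4 cells with energy) and a masker that overlaps 2 of them.
target = [[60, None, 60, None],
          [None, 55, None, 55]]
masker = [[70, 70, None, None],
          [None, None, None, 70]]
print(masked_fraction(target, masker))
```

In this sketch, masker energy that falls in cells where the target is silent contributes nothing to the result, which mirrors the point that energetic masking requires overlap in both time and frequency.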
Current models can account for acoustic energetic masking effects; however, in some situations, performance is worse than predicted. Many natural sounds such as speech are spectro-temporally sparse, so energetic masking often affects only portions of the target, limited in both time and frequency. Moreover, we perceptually fill in inaudible portions based on glimpses we hear (see Text Box 2) [32,33,34]. As a result, energetic masking is often not the main factor limiting performance.
Picture yourself at a crowded bar with your friends. In this kind of setting, you can imagine a burst of laughter that momentarily masks your buddy's unending tale of romantic misfortune. Fortunately (or perhaps unfortunately), speech signals are redundant, and we can often understand a message even when we only hear glimpses of the speech signal. Moreover, we perceptually fill in missing bits of speech based on the glimpses we hear, so that we often don't even notice the interruption (an effect known as “phonemic restoration”). This ability depends on integrating all available evidence (including evidence for how to perceptually organize the scene) to make sense of the message we want to understand. Thus, to make sense of noisy signals we hear in everyday settings, we depend on signal redundancy (from continuity of spectro-temporal energy in the sound to lexical, linguistic, and semantic constraints) [45,47]. While phonemic restoration is particularly strong for speech signals, even non-speech signals can be perceptually completed based on low-level spectro-temporal structure.
Perhaps because there is no widely accepted theory to explain auditory interference beyond energetic masking, the catchall phrase “informational masking” is used to encompass all masking that is not energetic. Although there is a large and growing interest in informational masking, mechanistic explanations are lacking (see also Text Box 3). Here, I argue that results of many studies of informational masking can be explained by failures of object-based attention.
Many recent psychoacoustic studies link informational masking with stimulus similarity (i.e., similarity between target and maskers) and with stimulus uncertainty (e.g., randomness in the masker and/or target) [35,42]. While similarity and uncertainty affect informational masking, here I argue that they do so by affecting object formation and object selection.
Similarity between target and masker can cause either or both of the processes of object formation and object selection to fail. Similarity can cause the target and masker to be perceived as part of the same, larger perceptual object, which will result in poorer sensitivity to the content of the target (see the left side of Figure 2). Even if target and masker are perceptually segregated into distinct objects, similarity of these objects can interfere with the selection of the correct object in a scene.
Uncertainty can also interfere with object selection, either because the listener is unsure of how to direct top-down attention to select the target object, or because the salience of new events (e.g., randomly varying maskers) draws exogenous attention too strongly to be overcome by top-down attention.
Although stimulus similarity and uncertainty influence perception in a complex scene, the processes underlying these effects can be attributed to object-based auditory attention. Framed in this way, results from many different studies of informational masking can be understood and explained.
Failures in object formation can come about when local structure is insufficient to separate one source from the others, which can degrade perception. This can occur for a variety of reasons, including:
Figure 2 shows, by visual analogy, the kind of perceptual problems that can arise when local object formation breaks down. In the auditory domain, “double vowel” experiments demonstrate failures of object formation: when two vowels are played with common onsets and offsets, listeners have difficulty identifying either vowel (local object formation fails due to ambiguous spectro-temporal structure); however, when harmonic structure differentiates the competing vowels, identification improves.
Failures in streaming occur when there are multiple sources that have similar higher-order features, such as when a listener hears a mixture of multiple male voices or the target is a set of tones amidst similar tones. These failures can result in a target stream that is corrupted by sound elements from a masker or that is missing key elements (see Fig. 3), which can interfere with perception of the target.
Consistent with the theory of biased competition, volitional selection of an object occurs through top-down attention. If the target object has features that differentiate it from other objects in a scene and if the listener knows these distinctive features a priori, s/he can properly direct attention to select the target.
Failures in object selection can occur because a listener directs attention to the wrong object, either because they do not know what feature to attend, or because the target and masker features are not sufficiently distinct to ensure proper target selection (attending to the wrong male voice) [39,40]. Indeed, many studies of informational masking using speech signals demonstrate failures of object selection: listeners may perceive a properly formed stream of words (objects form properly), but report a masker rather than the target stream.
Even when the listener is sure of which object is the target, object selection may fail when a competing object is inherently more salient (e.g., much louder) than the target. In these cases, the top-down bias of attention is insufficient to override bottom-up salience and win the biased competition. Figure 4 illustrates the influence of bottom-up salience on attention, again by visual analogy. In the auditory domain, anything from an unexpected, loud sound (a door slamming) to a signal that has special significance (your name being spoken from across the room) can draw attention involuntarily through bottom-up salience.
The more unique and distinct the target features, the more effective top-down attention is in enhancing the target and suppressing any maskers. Thus, object selection is a probabilistic competition that depends on interactions between bottom-up and top-down biases.
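One way to picture a probabilistic competition between bottom-up and top-down biases is a toy calculation in which each object receives a score and the scores are normalized into selection probabilities. The Python sketch below is purely illustrative: the additive combination of salience and top-down bias, the softmax normalization, the `temperature` parameter, and the function name `selection_probabilities` are all assumptions introduced for this example. The biased-competition account summarized here does not specify this or any particular formula.

```python
import math

# Toy illustration only: additive scores and a softmax are assumed for
# convenience; they are not a model proposed in the text.

def selection_probabilities(salience, top_down_bias, temperature=1.0):
    """Return one selection probability per object.

    salience: bottom-up salience of each object (arbitrary units).
    top_down_bias: volitional bias toward each object (same units).
    temperature: higher values make the competition less decisive.
    """
    scores = [(s + b) / temperature for s, b in zip(salience, top_down_bias)]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Object 0 is the target; the listener biases attention toward it.
equal_salience = selection_probabilities([1.0, 1.0], [1.0, 0.0])
loud_masker = selection_probabilities([1.0, 3.0], [1.0, 0.0])
print(equal_salience[0] > 0.5)  # top-down bias favors the target
print(loud_masker[0] < 0.5)     # a much more salient masker can still win
```

With equal salience, the top-down bias tips the competition toward the target; when the masker is made much more salient, the same bias is no longer enough, which parallels the failures of object selection described above.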
In both vision and audition, we direct top-down attention to select desired objects from a complex scene. Because perceptual objects are the basic units of attention, proper object formation is critical to this ability. Stimulus structure determines how objects form locally, either in space-time (for visual objects) or time-frequency (for auditory objects). Higher-order perceptual attributes enable both object formation across larger scales and selection of a desired object from a complex scene. In complex settings, interactions between object formation and object selection are critical in enabling us to manage the flow of sensory information we receive. The similarities between auditory and visual perception in complex scenes suggest that common neural mechanisms control attention across modalities. Moreover, a framework based on auditory object formation and auditory object selection can help explain results of many recent psychoacoustic experiments.
Grants from NIDCD, AFOSR, ONR, and NSF supported this work. These ideas were developed through discussions with Gin Best, Antje Ihlefeld, Erick Gallun, Chris Mason, Gerald Kidd, Steve Colburn, and Nat Durlach.