HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Trends Cogn Sci. Author manuscript; available in PMC 2009 June 22.
Published in final edited form as:
PMCID: PMC2699558

Object-based auditory and visual attention


Theories of visual attention argue that attention operates on perceptual objects, and thus that interactions between object formation and selective attention determine how competing sources interfere with perception. In auditory perception, theories of attention are less mature, and no comprehensive framework exists to explain how attention influences perceptual abilities. However, the same principles that govern visual perception can explain many seemingly disparate auditory phenomena. In particular, many recent studies of “informational masking” can be explained by failures of either auditory object formation or auditory object selection. This similarity suggests that the same neural mechanisms control attention and influence perception across different sensory modalities.


At a cocktail party, the sounds of clinking glasses and exuberant voices add acoustically before entering your ears. In order to appreciate your companion's anecdote, you must filter out extraneous sources and focus attention on her voice. At the same time, the sounds that you tune out are critical for maintaining awareness of your environment. Indeed, a source of interference (the pompous man on your right) may become the source you want to understand in the next moment (e.g., when you realize he is relaying a juicy story about your boss). In order to maneuver successfully in everyday settings, you need to be able to both focus and shift attention as the need arises.

Theories of visual attention explain many striking perceptual phenomena that arise when viewing complex scenes, from change blindness to performance on visual search tasks [1,2]. While there is much current interest in how central limitations interfere with auditory perception, there is no comprehensive framework to explain our ability to understand sound sources in complex acoustic scenes. Here, I argue that many auditory phenomena, including how we manage to converse at a cocktail party, can be understood by properly extending theories of visual attention. This commonality supports the idea that the same neural processes control visual and auditory attention [3].


Theories of visual attention argue that observers focus attention on an object in a complex scene [2]. Unfortunately, just as in vision [4], it is difficult to define what constitutes an object in audition. This difficulty arises in part because there are few absolute rules governing auditory object formation. Audible sound in a mixture is not always allocated between the objects perceived in a scene, and can contribute to either multiple objects [5,6] or to no object [7]. The state of the listener, from expectations about a scene's content to the level of analysis a listener undertakes (listening to a symphony versus to the English horn solo), influences the perceived content of an object [8,9]. Particularly for ambiguously structured stimuli, the perceptual organization of a scene evolves over time and/or is bistable [10,11].

Despite the lack of a precise definition, we have an intuitive understanding of what an auditory object is. At the cocktail party, we perceive the woman speaking on the left, the chiming doorbell, a shattering plate. Each of these auditory objects is an estimate of sound emanating from a discrete sound source: an “auditory object” is a perceptual entity that, correctly or not, is perceived as coming from one physical source.


In a visual scene, objects form locally based on contiguous geometric structure, such as edges, boundaries, and contours [4]. Discrete local patches can be perceptually linked based on similarity of texture, color, and other features to form whole objects [4].

In a similar way, auditory objects form across different analysis scales. For sound elements with contiguous spectro-temporal structure, formation relies primarily on this local structure [12,13], including common onsets and offsets, harmonic structure, and continuity of frequency over time. Due to the physical constraints of how sound is produced, many ecological signals (particularly information-conveying communication signals, from birdsongs to speech) have a very rich spectro-temporal structure that supports robust short-term object formation (e.g., formation of syllables). Short-term objects are streamed (linked together over time) through continuity and similarity of higher-order perceptual features such as location, pitch, timbre, and even learned meaning (word identity, grammatical structure, semantics) [13].

The relative influence of a particular cue or feature on object formation depends on the scale of the analysis. For instance, spatial auditory cues have a relatively weak influence over local time scales [13,14]. However, perceived location (as opposed to basic spatial cues such as interaural time differences [15]) strongly influences how we link short-term auditory objects into a coherent stream [16].
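The linking of short-term objects into streams can be illustrated with a small sketch. It is a deliberately simplified, feed-forward toy and not a model proposed in the literature cited here: the function name `stream_segments`, the greedy nearest-neighbour rule, the feature weights, and the distance threshold are all assumptions chosen for illustration. Each short-term object is described by perceptual features (here pitch and azimuth), and it joins the existing stream whose most recent element is most similar, or starts a new stream if nothing is similar enough.

```python
def stream_segments(segments, weights, max_distance):
    """Greedily link short-term objects into streams by feature similarity.

    segments:     list of dicts holding numeric feature values
    weights:      dict mapping feature name -> relative influence (assumed)
    max_distance: largest weighted distance that still allows linking (assumed)
    """
    streams = []
    for seg in segments:
        best_stream, best_distance = None, None
        for stream in streams:
            last = stream[-1]
            # Weighted distance to the most recent element of this stream.
            distance = sum(w * abs(seg[f] - last[f]) for f, w in weights.items())
            if best_distance is None or distance < best_distance:
                best_stream, best_distance = stream, distance
        if best_stream is not None and best_distance <= max_distance:
            best_stream.append(seg)
        else:
            streams.append([seg])
    return streams


# Hypothetical features: pitch in semitones, azimuth in degrees.
weights = {"pitch": 1.0, "azimuth": 0.1}

# Two talkers who differ clearly in pitch and location, speaking alternately.
distinct_talkers = [
    {"talker": "A", "pitch": 0.0, "azimuth": -30.0},
    {"talker": "B", "pitch": 7.0, "azimuth": 30.0},
    {"talker": "A", "pitch": 1.0, "azimuth": -30.0},
    {"talker": "B", "pitch": 6.0, "azimuth": 30.0},
]

# Two talkers with nearly the same pitch at the same location.
similar_talkers = [
    {"talker": "A", "pitch": 0.0, "azimuth": 0.0},
    {"talker": "B", "pitch": 1.0, "azimuth": 0.0},
    {"talker": "A", "pitch": 0.5, "azimuth": 0.0},
    {"talker": "B", "pitch": 1.5, "azimuth": 0.0},
]

separated = stream_segments(distinct_talkers, weights, max_distance=4.0)
merged = stream_segments(similar_talkers, weights, max_distance=4.0)
```

With these toy values the distinct talkers are sorted into two streams, whereas the similar talkers collapse into a single stream that mixes elements from both. Changing the entries of `weights` is one way to explore how the relative influence of a cue might alter the resulting organization.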

Although the above description may seem to suggest that objects are constructed through a hierarchy of processing, first grouped based on local structure and then organized across longer spatial or temporal scales, the truth is more complex. Higher-order features and top-down attention can alter how objects form locally. Rather than a hierarchical processing structure, objects are formed through heterarchical interactions across different scales. The ultimate perceptual organization of the scene, at all scales, depends on the preponderance of all evidence [7].


Object formation directly influences how we perceive and process complex scenes. In all sensory modalities, the normal mode of analyzing a complex scene is to focus on one object while other objects are in the perceptual background [17,18]. In vision, this mode of perceiving is described as a biased competition between perceptual objects [2]. Biased competition takes place automatically and ubiquitously when there are multiple objects in a scene. Which object wins the competition depends both on the inherent salience of the objects and the influence of volitional, top-down attention, which biases the competition to favor objects with desired perceptual features [2,19].

Even when observers select what to attend based on low-level features, attention operates on objects [2,20]. For instance, when attention is spatially focused, observers' sensitivity to other features that are part of the object at the attended location is also enhanced [17]. Thus, object formation is intricately linked with selective attention: the perceptual unit of attention is the object.

Most work on attention and objects is in the visual literature [2,21], but similar principles govern auditory perception [22,23,24]. Evidence suggests that attention acts on auditory objects, much as it enhances visual objects [25,26,27]. Moreover, listeners appear to attend actively to one and only one auditory object at a time [28,29], consistent with the biased-competition model of visual attention (see Text Box 1).


Evidence suggests that we listen to only one object at a time. Listeners have difficulty making judgments of the relative timing of events across (but not within) streams [12]. When listeners are asked to divide attention between two speech streams that are close together in space, they are able to report many of the words in the two streams, but intermingle words from the two messages [44]. In contrast, when the two streams are spatially distinct, listeners are less likely to confuse words across streams, but also recall fewer words overall [44]. These results hint that the more distinct competing streams are from one another, the more complete the suppression of the stream in the perceptual background.

How is it, then, that in everyday listening situations we seem to be able to understand multiple sources, especially in social settings where the flow of conversation is chaotic and unpredictable?

It is likely that we switch attention between objects in a complex setting, time-sharing attention between competing sources. Even if we don't perceive all of the content of one signal, we can fill in missing snippets (see also Text Box 2). In addition, we can use short-term sensory memory to help this filling-in process, mentally replaying the bit of the input signal that we didn't focus on initially.

Switching attention takes on the order of 100–200 ms, and sensory memory degrades with time. Thus, some of the information in a newly attended stream will be missed even after a listener switches attention. Moreover, auditory streams build up over time [8,10], which may enhance the ability to focus on the stream in the perceptual foreground and understand its content. Thus, if listeners switch attention between streams, performance is likely degraded both because of the direct cost of switching attention and because switching attention resets streaming, negating the benefit of object build-up.


Because attention is object based, competing sources in a complex scene can cause many different forms of perceptual interference, some of which are considered below. An overview of the interactions affecting auditory perception is shown in Figure 1.

Fig 1
Conceptual model relating auditory object formation and its interactions with bottom-up salience and top-down attention, where arrow width denotes the strength of a signal or a connection. 1) Short-term segments initially form based on local spectro-temporal ...

Energetic masking

The simplest form of perceptual interference occurs when a competing source renders portions of a target imperceptible. This kind of interference, known in auditory circles as energetic masking, occurs when the response on the sensory epithelium to the target is disrupted because the system is responding to a competing source. In the auditory domain, where the auditory nerve encodes sound in a time-frequency representation, energetic masking occurs when the masking signal overlaps in time and frequency with the target. In vision, an analogous form of interference occurs when a source near the observer obscures all or part of another source behind it. In such cases, the neural response to the target is distorted or imperceptible.

Current models can account for acoustic energetic masking effects; however, in some situations, performance is worse than predicted [30]. Many natural sounds such as speech are spectro-temporally sparse, so energetic masking often affects only portions of the target, limited in both time and frequency [31]. Moreover, we perceptually fill in inaudible portions based on glimpses we hear (see Text Box 2) [32,33,34]. As a result, energetic masking is often not the main factor limiting performance.
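The idea that energetic masking affects only those time-frequency regions where masker and target overlap can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not an implementation of any model cited here: the function name `glimpse_fraction`, the 3 dB margin, the 0 dB floor, and the example levels are all hypothetical. It counts the proportion of target-bearing time-frequency cells in which the target level exceeds the masker level by the chosen margin.

```python
def glimpse_fraction(target_db, masker_db, floor_db=0.0, margin_db=3.0):
    """Fraction of target-bearing time-frequency cells that escape masking.

    target_db, masker_db: equally sized grids (frequency channel x time frame)
                          of levels in dB
    floor_db:             cells at or below this target level count as empty
    margin_db:            assumed local advantage the target needs over the masker
    """
    target_cells = 0
    glimpsed_cells = 0
    for target_row, masker_row in zip(target_db, masker_db):
        for t, m in zip(target_row, masker_row):
            if t <= floor_db:
                continue  # no target energy in this cell
            target_cells += 1
            if t - m > margin_db:
                glimpsed_cells += 1
    return glimpsed_cells / target_cells if target_cells else 0.0


# Toy grids: 3 frequency channels x 4 time frames, levels in dB.
# The target is sparse, so the masker overlaps it in only two cells.
target = [
    [60.0, 60.0, 0.0, 0.0],
    [0.0, 55.0, 55.0, 0.0],
    [0.0, 0.0, 50.0, 50.0],
]
masker = [
    [70.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 70.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

fraction = glimpse_fraction(target, masker)
```

In this example four of the six target-bearing cells remain available as glimpses, which is the sense in which a spectro-temporally sparse target is only partially affected by an overlapping masker.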


Picture yourself at a crowded bar with your friends. In this kind of setting, you can imagine a burst of laughter that momentarily masks your buddy's unending tale of romantic misfortune. Fortunately (or perhaps unfortunately), speech signals are redundant, and we can often understand a message even when we only hear glimpses of the speech signal [31]. Moreover, we perceptually fill in missing bits of speech based on the glimpses we hear, so that we often don't even notice the interruption (an effect known as “phonemic restoration”) [45]. This ability depends on integrating all available evidence (including evidence for how to perceptually organize the scene) [46] to make sense of the message we want to understand. Thus, to make sense of noisy signals we hear in everyday settings, we depend on signal redundancy (from continuity of spectro-temporal energy in the sound to lexical, linguistic, and semantic constraints) [45,47]. While phonemic restoration is particularly strong for speech signals, even non-speech signals can be perceptually completed based on low-level spectro-temporal structure [32].

Informational masking

Perhaps because there is no widely accepted theory to explain auditory interference beyond energetic masking, the catchall phrase “informational masking” is used to encompass all masking that is not energetic [30]. Although there is a large and growing interest in informational masking, mechanistic explanations are lacking (see also Text Box 3). Here, I argue that results of many studies of informational masking can be explained by failures of object-based attention.


Many recent psychoacoustic studies link informational masking with stimulus similarity (i.e., similarity between target and maskers) and with stimulus uncertainty (e.g., randomness in the masker and/or target) [35,42]. While similarity and uncertainty affect informational masking, here I argue that they do so by affecting object formation and object selection.

Similarity between target and masker can cause either or both of the processes of object formation and object selection to fail. Similarity can cause the target and masker to be perceived as part of the same, larger perceptual object, which will result in poorer sensitivity to the content of the target [36] (see the left side of Figure 2). Even if target and masker are perceptually segregated into distinct objects, similarity of these objects can interfere with the selection of the correct object in a scene.

Uncertainty can also interfere with object selection, either because the listener is unsure of how to direct top-down attention to select the target object [48], or because the salience of new events (e.g., randomly varying maskers) draws exogenous attention too strongly to be overcome by top-down attention [41].

Although stimulus similarity and uncertainty influence perception in a complex scene, the processes underlying these effects can be attributed to object-based auditory attention. Framed in this way, results from many different studies of informational masking can be understood and explained.

Failures of object formation

Failures in object formation can come about when local structure is insufficient to separate one source from the others [35], which can degrade perception [36]. This can occur for a variety of reasons, including:

  1. energetic masking may make all or part of the target imperceptible,
  2. the mixture may contain competing sources that have similar spectro-temporal structure and that tend to group with the target, or
  3. the target may not be structured enough to support object formation, for instance, if the mixture contains ambiguous or conflicting cues.

Figure 2 shows, by visual analogy, the kind of perceptual problems that can arise when local object formation breaks down. In the auditory domain, “double vowel” experiments demonstrate failures of object formation: when two vowels are played with common onsets and offsets, listeners have difficulty identifying either vowel (local object formation fails due to ambiguous spectro-temporal structure); however, when harmonic structure differentiates the competing vowels, identification improves [37].

Fig 2
Visual analogies of failed object formation. Left: the general similarity of the features and elements of the image make it difficult to segregate words, so viewers are likely to perceive the mixture as a connected mass that fails to represent any of ...

Failures in streaming occur when there are multiple sources that have similar higher-order features, such as when a listener hears a mixture of multiple male voices or the target is a set of tones amidst similar tones [38]. These failures can result in a target stream that is corrupted by sound elements from a masker or that is missing key elements (see Fig. 3), which can interfere with perception of the target.

Fig 3
Illustration of failure of auditory streaming. Two brothers address their mother simultaneously. Although the local spectro-temporal structure of the speech signals supports formation of words (local objects), the words are not properly sorted into streams, ...

Failures of object selection

Consistent with the theory of biased competition, volitional selection of an object occurs through top-down attention. If the target object has features that differentiate it from other objects in a scene and if the listener knows these distinctive features a priori, s/he can properly direct attention to select the target.

Failures in object selection can occur because a listener directs attention to the wrong object, either because the listener does not know which feature to attend to, or because the target and masker features are not sufficiently distinct to ensure proper target selection (e.g., attending to the wrong male voice) [39,40]. Indeed, many studies of informational masking using speech signals demonstrate failures of object selection: listeners may perceive a properly formed stream of words (objects form properly), but report a masker rather than the target stream.

Even when the listener is sure of which object is the target, object selection may fail when a competing object is inherently more salient (e.g., much louder) than the target [41]. In these cases, the top-down bias of attention is insufficient to override bottom-up salience and win the biased competition [41]. Figure 4 illustrates the influence of bottom-up salience on attention, again by visual analogy. In the auditory domain, anything from an unexpected, loud sound (a door slamming) to a signal that has special significance (your name being spoken from across the room) can draw attention involuntarily through bottom-up salience [41].

Fig 4
Visual analogy illustrating how object selection can be driven by bottom-up salience. In this example, objects form based primarily on the spatial proximity of the letters within, compared to across, words in the image. Thus, object formation is not at ...

The more distinctive the target features, the more effective top-down attention is in enhancing the target and suppressing any maskers [42]. Thus, object selection is a probabilistic competition that depends on interactions between bottom-up and top-down biases [43].
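One way to picture such a probabilistic competition is the sketch below. It is purely illustrative: the softmax rule, the function name `selection_probabilities`, the `temperature` parameter, and all numerical values are assumptions introduced here, not a model taken from the studies cited. Each object receives a score equal to its bottom-up salience plus any top-down bias, and the scores are converted into probabilities of winning the competition.

```python
import math


def selection_probabilities(salience, top_down_bias, temperature=1.0):
    """Convert per-object salience and top-down bias into win probabilities.

    salience:      bottom-up strength of each object (arbitrary units, assumed)
    top_down_bias: volitional boost given to each object (arbitrary units, assumed)
    temperature:   larger values make the competition less decisive (assumed)
    """
    scores = [(s + b) / temperature for s, b in zip(salience, top_down_bias)]
    peak = max(scores)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(x - peak) for x in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Objects: [target voice, competing sound, other voice].
bias_toward_target = [2.0, 0.0, 0.0]

# A quiet scene: all objects are equally salient, so the attended target wins.
quiet_scene = selection_probabilities([1.0, 1.0, 1.0], bias_toward_target)

# A door slams: the competitor's salience outweighs the top-down bias.
door_slam = selection_probabilities([1.0, 4.0, 1.0], bias_toward_target)
```

In the first scene the target has the highest probability of being selected; in the second, the highly salient competitor is most likely to capture attention despite the same top-down bias, mirroring the failure of object selection described above.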


In both vision and audition, we direct top-down attention to select desired objects from a complex scene. Because perceptual objects are the basic units of attention, proper object formation is critical to this ability. Stimulus structure determines how objects form locally, either in space-time (for visual objects) or time-frequency (for auditory objects). Higher-order perceptual attributes enable both object formation across larger scales and selection of a desired object from a complex scene. In complex settings, interactions between object formation and object selection are critical in enabling us to manage the flow of sensory information we receive. The similarities between auditory and visual perception in complex scenes suggest that common neural mechanisms control attention across modalities. Moreover, a framework based on auditory object formation and auditory object selection can help explain results of many recent psychoacoustic experiments.

Fig Box 1
Visual analogy illustrating glimpsing and phonemic restoration. A) Mixture of messages. Even though one message obstructs a portion of the other, the meaning of both messages is clear. Moreover, you undoubtedly perceive the full characters “the” ...


Grants from NIDCD, AFOSR, ONR, and NSF supported this work. These ideas were developed through discussions with Gin Best, Antje Ihlefeld, Erick Gallun, Chris Mason, Gerald Kidd, Steve Colburn, and Nat Durlach.


Energetic masking
perceptual interference present in the sensory epithelium
Informational masking
perceptual interference that cannot be explained by energetic masking
Object
a perceptual estimate of the content of a discrete physical source
Salience
the perceptual strength of an input based purely on stimulus attributes
Similarity
a putative explanation for auditory informational masking when a target and competing sources have similar perceptual features
Source
a discrete physical entity in the external world
Streaming
grouping of short-term auditory objects across longer time scales
Uncertainty
a putative explanation for auditory informational masking when properties of either the target or the masker change unpredictably from trial to trial


1. Simons DJ, Rensink RA. Change blindness: past, present, and future. Trends Cogn Sci. 2005;9:16–20.
2. Desimone R, Duncan J. Neural mechanisms of selective visual attention. Annu Rev Neurosci. 1995;18:193–222.
3. Serences JT, et al. Preparatory activity in visual cortex indexes distractor suppression during covert spatial orienting. J Neurophysiol. 2004;92:3538–3545.
4. Feldman J. What is a visual object? Trends Cogn Sci. 2003;7:252–256.
5. Whalen DH, Liberman AM. Limits on phonetic integration in duplex perception. Percept Psychophys. 1996;58:857–870.
6. Darwin CJ. Perceiving vowels in the presence of another sound: A quantitative test of the “Old-plus-New” heuristic. In: Sorin JC, Meloni H, Schoenigen J, editors. Levels in Speech Communication: Relations and Interactions: A Tribute to Max Wajskop. Elsevier; Amsterdam: 1995. pp. 1–12.
7. Shinn-Cunningham BG, et al. A sound element gets lost in perceptual competition. Proc Natl Acad Sci U S A. 2007;104.
8. Cusack R, et al. Effects of location, frequency region, and time course of selective attention on auditory scene analysis. J Exp Psychol Hum Percept Perform. 2004;30:643–656.
9. Sussman ES, et al. The role of attention in the formation of auditory streams. Percept Psychophys. 2007;69:136–152.
10. Carlyon RP, et al. Effects of attention and unilateral neglect on auditory stream segregation. J Exp Psychol Hum Percept Perform. 2001;27:115–127.
11. Pressnitzer D, Hupe JM. Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Curr Biol. 2006;16:1351–1357.
12. Bregman AS. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press; Cambridge, MA: 1990.
13. Darwin CJ, Carlyon RP. Auditory grouping. In: Moore BCJ, editor. Hearing. Academic Press; San Diego, CA: 1995. pp. 387–424.
14. Carlyon RP. How the brain separates sounds. Trends Cogn Sci. 2004;8:465–471.
15. Sach AJ, Bailey PJ. Some characteristics of auditory spatial attention revealed using rhythmic masking release. Percept Psychophys. 2004;66:1379–1387.
16. Darwin CJ, Hukin RW. Effectiveness of spatial cues, prosody, and talker characteristics in selective attention. J Acoust Soc Am. 2000;107:970–977.
17. Duncan J. EPS Mid-Career Award 2004: brain mechanisms of attention. Q J Exp Psychol (Colchester) 2006;59:2–27.
18. Shomstein S, Yantis S. Configural and contextual prioritization in object-based attention. Psychon Bull Rev. 2004;11:247–253.
19. Yantis S. How visual salience wins the battle for awareness. Nat Neurosci. 2005;8:975–977.
20. Serences JT, et al. Parietal mechanisms of switching and maintaining attention to locations, objects, and features. In: Itti L, Rees G, Tsotsos J, editors. Neurobiology of Attention. Academic Press; New York: 2005. pp. 35–41.
21. Knudsen EI. Fundamental components of attention. Annu Rev Neurosci. 2007;30:57–78.
22. Busse L, et al. The spread of attention across modalities and space in a multisensory object. Proc Natl Acad Sci U S A. 2005;102:18751–18756.
23. Shomstein S, Yantis S. Parietal cortex mediates voluntary control of spatial and nonspatial auditory attention. J Neurosci. 2006;26:435–439.
24. Serences JT, et al. Coordination of voluntary and stimulus-driven attentional control in human cortex. Psychol Sci. 2005;16:114–122.
25. Alain C, Arnott SR. Selectively attending to auditory objects. Front Biosci. 2000;5:D202–212.
26. Scholl BJ. Objects and attention: the state of the art. Cognition. 2001;80:1–46.
27. Best V, et al. Visually guided attention enhances target identification in a complex auditory scene. J Assoc Res Otolaryngol. 2007;8:294–304.
28. Vogel EK, Luck SJ. Delayed working memory consolidation during the attentional blink. Psychon Bull Rev. 2002;9:739–743.
29. Cusack R, et al. Neglect between but not within auditory objects. J Cogn Neurosci. 2000;12:1056–1065.
30. Durlach NI, et al. Note on informational masking. J Acoust Soc Am. 2003;113:2984–2987.
31. Cooke M. A glimpsing model of speech perception in noise. J Acoust Soc Am. 2006;119:1562–1573.
32. Ciocca V, Bregman AS. Perceived continuity of gliding and steady-state tones through interrupting noise. Percept Psychophys. 1987;42:476–484.
33. Warren RM, et al. Auditory induction: perceptual synthesis of absent sounds. Science. 1972;176:1149–1151.
34. Shinn-Cunningham BG, Wang D. Influences of auditory object formation on phonemic restoration. J Acoust Soc Am. 2008;123:295–301.
35. Kidd G, Jr., et al. Similarity, uncertainty, and masking in the identification of nonspeech auditory patterns. J Acoust Soc Am. 2002;111:1367–1376.
36. Best V, et al. Binaural interference and auditory grouping. J Acoust Soc Am. 2007;121:420–432.
37. Culling JF, Summerfield Q. Perceptual separation of concurrent speech sounds: absence of across-frequency grouping by common interaural delay. J Acoust Soc Am. 1995;98:785–797.
38. Kidd G, Jr., et al. Multiple bursts, multiple looks, and stream coherence in the release from informational masking. J Acoust Soc Am. 2003;114:2835–2845.
39. Kidd G, Jr., et al. The advantage of knowing where to listen. J Acoust Soc Am. 2005;118:3804–3815.
40. Darwin CJ, et al. Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers. J Acoust Soc Am. 2003;114:2913–2922.
41. Conway AR, et al. The cocktail party phenomenon revisited: the importance of working memory capacity. Psychon Bull Rev. 2001;8:331–335.
42. Durlach NI, et al. Informational masking: counteracting the effects of stimulus uncertainty by decreasing target-masker similarity. J Acoust Soc Am. 2003;114:368–379.
43. Buschman TJ, Miller EK. Top-down versus bottom-up control of attention in the prefrontal and posterior parietal cortices. Science. 2007;315:1860–1862.
44. Best V, et al. The influence of spatial separation on divided listening. J Acoust Soc Am. 2006;120:1506–1516.
45. Warren RM. Perceptual restoration of missing speech sounds. Science. 1970;167:392–393.
46. Shinn-Cunningham BG, Wang D. Auditory object formation influences speech understanding. J Acoust Soc Am. in press.
47. Warren RM, et al. Auditory induction: reciprocal changes in alternating sounds. Percept Psychophys. 1994;55:313–322.
48. Best V, et al. Spatial unmasking of birdsong in human listeners: Energetic and informational factors. J Acoust Soc Am. 2005;118:3766–3773.
49. Cusack R, Carlyon RP. Perceptual asymmetries in audition. J Exp Psychol Hum Percept Perform. 2003;29:713–725.
50. Winkowski DE, Knudsen EI. Top-down gain control of the auditory space map by gaze control circuitry in the barn owl. Nature. 2006;439:336–339.