Separating out a speaker of interest from other speakers in a noisy, crowded environment is a perceptual feat that we perform routinely. The ease with which we hear under these conditions belies the intrinsic complexity of this process, known as the cocktail party problem1–3,6: concurrent complex sounds, which are completely mixed upon entering the ear, are re-segregated and selected from within the auditory system. The resulting percept is that we selectively attend to the desired speaker while tuning out the others.
Although previous studies have described neural correlates of masking and selective attention to speech4,5,7–9, fundamental questions remain unanswered regarding the precise nature of speech representation at the juncture where competing signals are resolved. In particular, when attending to a speaker within a mixture, it is unclear what key aspects (for example, spectrotemporal profile, spoken words and speaker identity) are represented in the auditory system and how they compare to representations of that speaker alone; how rapidly a selective neural representation builds up when one attends to a specific speaker; and whether breakdowns in these processes can explain distinct perceptual failures, such as the inability to hear the correct words, or follow the intended speaker.
To answer these questions, we recorded cortical activity from human subjects implanted with customized high-density multi-electrode arrays as part of their clinical work-up for epilepsy surgery10. Although limited to this clinical setting, these recordings provide simultaneous high spatial and temporal resolution while sampling the population neural activity from the non-primary auditory speech cortex in the posterior superior temporal lobe. We focused our analysis on high gamma (75–150 Hz) local field potentials11, which have been found to correlate well with the tuning of multi-unit spike recordings12. In humans, the posterior superior temporal gyrus has been heavily implicated in speech perception13, and is anatomically defined as the lateral parabelt auditory cortex (including Brodmann areas 41, 42 and 22)14.
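As a rough illustration of the high gamma measure mentioned above, the sketch below extracts a 75–150 Hz amplitude envelope from a single recorded trace. The filtering method is an assumption: the text does not say how the band was isolated, so a sharp FFT mask stands in for whatever filter was actually used, and the function name `high_gamma_envelope` is hypothetical.

```python
import numpy as np

def high_gamma_envelope(signal, fs, band=(75.0, 150.0)):
    """Return the amplitude envelope of a 1-D signal within `band` (Hz).

    Assumption: an ideal FFT mask is used as the band-pass filter.
    """
    n = signal.size
    spectrum = np.fft.fft(signal)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    # Keep only positive frequencies inside the band. Doubling them and
    # discarding negative frequencies yields the analytic signal of the
    # band-passed trace, whose magnitude is the envelope.
    mask = (freqs >= band[0]) & (freqs <= band[1])
    analytic = np.fft.ifft(2.0 * spectrum * mask)
    return np.abs(analytic)
```

For a trace that contains a 100 Hz component of amplitude 2 and a 20 Hz component, the returned envelope follows only the 100 Hz component.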
Subjects listened to speech samples from a corpus commonly used in multi-talker communication research15,16. A typical sentence was “ready tiger go to red two now” where “tiger” is the call sign, and “red two” is the colour–number combination. One male and one female speaker were selected, each speaking the same 12 unique combinations of two call signs (ringo or tiger), three colours (red, blue or green) and three numbers (two, five or seven). Example acoustic spectrograms from two individual speakers are shown in . The two voices differ along several dimensions including pitch (male versus female), spectral profile (different vocal tract shapes) and temporal characteristics (speaking rate). Subjects first listened to each of the speakers alone and were able to report the colour and number with 100% accuracy. Subjects then listened to a monaural, simultaneous mixture of the two speakers’ phrases with different call signs, colours and numbers. The subjects were instructed to respond by indicating the colour and number spoken by the talker who uttered the target call sign. The target call sign (ringo or tiger) was fixed and shown visually on a monitor during each trial block, which contained 28 different mixture sounds. As the target speaker was changed randomly from trial to trial, the subjects were required to monitor both voices initially (divided attention) to identify the target speaker. The target call sign was switched after each block, turning the previous target speaker in each mixture into a masker. This resulted in two sets of behavioural and neural responses for each identical mixture sound, which differed only in the focus of attention. Subjects reported correct responses in 74.8% of trials.
Figure: Acoustic and neural reconstructed spectrograms for speech from a single speaker or a mixture of speakers.
The mixture spectrogram illustrates how difficult it is to tell which sound parts belong to one speaker versus the other. The energy for both speakers is distributed broadly across the spectral and temporal domains, with overlap in some areas and isolated sound parts in others, as shown in their difference spectrogram (; average spectrograms in Supplementary Fig. 1a).
To determine the spectrotemporal encoding of the attended speaker, the method of stimulus reconstruction was used17–19 to estimate the speech spectrogram represented by the population neural responses. Reconstructed spectrograms provide an intuitive way to examine how the population neural responses encode the spectrotemporal features of speech, and more importantly, can be compared with the original acoustic spectrograms as well as across attentional conditions. We first calculated the reconstruction filters from a passive listening task using a separate continuous speech corpus (TIMIT20) that consisted of 499 unique short sentences spoken by 402 different speakers. The filters were then fixed and applied to a novel set of population neural responses to the single and attended mixture speech for spectrogram reconstruction.
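The two-stage procedure described above (fit filters on one data set, then apply them unchanged to new responses) can be sketched as a linear mapping from time-lagged population responses to the spectrogram. This is a minimal sketch under stated assumptions, not the published method: ridge-regularised least squares is swapped in for whatever estimator the cited reconstruction work uses, and the function names, the number of lags and the `ridge` value are all hypothetical.

```python
import numpy as np

def lagged_design(responses, n_lags):
    """Stack responses at lags 0..n_lags-1 following each stimulus frame.

    responses: (time, electrodes). Returns (time, electrodes * n_lags).
    """
    n_t, n_e = responses.shape
    design = np.zeros((n_t, n_e * n_lags))
    for lag in range(n_lags):
        # Neural activity follows the stimulus, so stimulus frame t is paired
        # with responses at t + lag; rows past the end are left as zeros.
        design[: n_t - lag, lag * n_e:(lag + 1) * n_e] = responses[lag:]
    return design

def fit_reconstruction_filters(responses, spectrogram, n_lags, ridge=1.0):
    """Ridge least squares from lagged responses to a (time, freq) spectrogram."""
    X = lagged_design(responses, n_lags)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ spectrogram)

def reconstruct(responses, filters, n_lags):
    """Apply previously fitted (fixed) filters to new responses."""
    return lagged_design(responses, n_lags) @ filters
```

In this sketch the filters would be fitted once on the passive-listening responses and then passed, unchanged, to `reconstruct` for the single-speaker and attended-mixture responses, so that any difference between conditions comes from the neural data rather than from the decoder.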
When listening to a single speaker alone, the reconstructed spectrograms from population neural activity corresponded well to the spectrotemporal features of the original acoustic spectrograms ( compared to , respectively), exhibiting fairly precise temporal features and spectral selectivity (for example, correspondence between the high frequency bursts of energy in “tiger” and “two”, in ). The average and standard deviation of the correlation between reconstructed and original spectrograms over 24 sentences were 0.60 ± 0.034 (0.60 and 0.62 for the examples in ). When attending to each of the two speakers, the reconstructed spectrograms from the same speech mixture showed a marked difference depending upon which speaker was attended (). For each pair, the key temporal and spectral features of the target speaker are enhanced relative to the masker speaker ( compared to , respectively). To compare directly, the energy contours from these reconstructed spectrograms are overlaid in . Important spectrotemporal details of the attended speaker were extracted, while the masker speech was effectively suppressed.
Attentional modulation of the neural representation was quantified, separately for correct and error trials, by measuring the correlation of the reconstructed spectrograms from the mixture in two attended conditions with original acoustic spectrograms of the speakers alone (). During correct trials (), we observed a significant shift of average correlation values towards the target speaker representation. During error trials, in contrast, no significant shift was observed (). Furthermore, the correlations between the reconstructed mixture and the masker speaker were higher than the average intrinsic correlation between randomly chosen original acoustic speech phrases (, dashed lines), revealing a weak presence of the masker speaker in mixture reconstructions, even in correct trials.
Figure: Quantifying the attentional modulation of neural responses.
The difference in speaking rate of the two speakers, coupled with the stereotyped structure of the carrier phrases, results in specific average temporal modulation profiles for each speaker (the average spectrogram for each speaker is shown in Supplementary Fig. 1a, b). To investigate encoding of the distinct spectral profile and characteristic temporal rhythm of the target compared to the masker speaker, we estimated the average difference between reconstructed spectrograms of the two speakers, when presented alone and in the attended mixture (). The comparison between the two average difference reconstructed spectrograms reveals enhanced encoding of both temporal and spectral aspects of the attended speaker (Supplementary Fig. 1c, d). To study the time course of attention-induced modulation of reconstructed mixture spectrograms towards the attended speaker, we calculated an attentional modulation index (AMIspec), using a sliding window of 250 ms throughout the trial duration (equation (1)), where SP1alone and SP2alone are the original acoustic spectrograms of speakers one and two, respectively, and SP1attend and SP2attend are the spectrograms reconstructed from neural responses to the mixture with attended targets, speaker one and two, respectively. Positive values of this index reflect shifts towards the target, negative values reflect shifts to the masker representation, and values around zero reflect no shift (AMIspec = 0.58 for the example in ). An upper bound for the AMIspec was calculated by assuming that attention, at best, restores the single speaker reconstructions of the target speaker (replacing SP1attend and SP2attend in equation (1) with the corresponding single-speaker reconstructions; , grey line). The AMIspec from the mixture was first estimated from correct trials (, black line), and could resolve the time point at which the reconstructed spectrograms were modulated by attention. After the end of the call sign, which cues the speaker that should be attended, a rapid positive shift in the AMIspec was observed, implying the enhanced representation of the target speaker. In error trials, the index instead shifted towards the masker speaker, and this shift occurred far earlier in the time course. The neural response shift towards the masker, which occurs as early as the call sign, suggests that listeners had prematurely attended to the wrong speaker during those error trials.
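The windowed index described above can be sketched as follows. The exact form of equation (1) is not reproduced in this text, so the code assumes one form that matches the stated behaviour: for each attended condition, the correlation of the reconstruction with the target speaker alone minus its correlation with the masker alone, summed over the two conditions. The function names and the window length in frames are hypothetical; a 250 ms window corresponds to 25 frames only if the spectrograms are sampled at 100 frames per second.

```python
import numpy as np

def corr2(a, b):
    """Pearson correlation between two equally shaped arrays."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def ami_spec(sp1_attend, sp2_attend, sp1_alone, sp2_alone):
    """Assumed index: target-minus-masker correlation, summed over conditions."""
    return (corr2(sp1_attend, sp1_alone) - corr2(sp1_attend, sp2_alone)
            + corr2(sp2_attend, sp2_alone) - corr2(sp2_attend, sp1_alone))

def sliding_ami(sp1_attend, sp2_attend, sp1_alone, sp2_alone, win, step=1):
    """Evaluate the index in windows of `win` frames along time (axis 1).

    All spectrograms are (frequency, time) arrays of the same shape.
    """
    n_t = sp1_alone.shape[1]
    values = []
    for start in range(0, n_t - win + 1, step):
        sl = slice(start, start + win)
        values.append(ami_spec(sp1_attend[:, sl], sp2_attend[:, sl],
                               sp1_alone[:, sl], sp2_alone[:, sl]))
    return np.array(values)
```

Under this assumed form the index is positive when each reconstruction resembles its target, changes sign when the reconstructions resemble the maskers instead, and is zero when the two attended reconstructions are identical, which is the pattern the text attributes to the index.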
Although the reconstruction analyses showed clear attention-based spectrotemporal modulation, we wanted to determine explicitly whether the attended speech in a mixture could be decoded from a model of a single speaker. A regularized linear classifier21 was trained on neural responses to the single speakers and then used to decode both the spoken words and speaker identity of the attended speech mixture. To keep the chance performance at 50% across all comparisons, classification results were limited only to the choices that were present in each mixture. For correct trials, the colour and number of the attended speech were decoded with high accuracy (77.2% and 80.2%, P < 10 × 10⁻⁴, t-test; ). However, the decoding performance during error trials was significantly below chance (30.0%, 30.1%, P < 10 × 10⁻⁴, t-test; ), indicating a systematic bias towards decoding the words of the masker speaker. In addition, for correct trials, the call sign was classified at chance performance (). However, for incorrect trials the classifier detected the masker call sign significantly more often than the target call sign (34.1%, P < 10 × 10⁻⁴, t-test; ), which again shows errors due to an early selection of the masker (incorrect) speaker.
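The decoding scheme above, training on single-speaker responses and then choosing only between the two alternatives present in a mixture, might look like the sketch below. The specific classifier is an assumption: ridge regression onto one-hot class targets is used as one simple regularised linear classifier and is not claimed to be the cited method. The function names and the `ridge` value are hypothetical.

```python
import numpy as np

def fit_linear_classifier(features, labels, n_classes, ridge=1.0):
    """Ridge regression onto one-hot targets; returns (n_features + 1, n_classes)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # bias column
    Y = np.eye(n_classes)[labels]
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def decode_between(weights, feature_vec, candidates):
    """Return whichever candidate class scores higher for this response.

    Restricting the choice to the two classes present in the mixture keeps
    chance performance at 50% regardless of the total number of classes.
    """
    scores = np.append(feature_vec, 1.0) @ weights
    return max(candidates, key=lambda c: scores[c])
```

With three colours, for example, a mixture containing "red" and "green" would be decoded with `candidates=(red_index, green_index)`, so "blue" can never be returned. An accuracy below 50% under this restriction then means the classifier is systematically picking the masker's word.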
Figure: Decoding spoken words and the identity of the attended speaker.
For the speaker identification analyses, we divided the behavioural error types into two subsets. The first type occurred when the reported colour–number combination was incorrect for either speaker (‘incorrect’; 16.5% of trials). The second type occurred when subjects reported the correct colour–number for the masker instead of the target speaker (‘correct for masker’; 8.6% of trials).
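The three behavioural outcomes defined above can be assigned mechanically from each trial's response. The sketch below is only an illustration of that bookkeeping; the function names and the representation of a trial as three (colour, number) tuples are hypothetical.

```python
from collections import Counter

OUTCOMES = ("correct", "correct for masker", "incorrect")

def label_trial(response, target, masker):
    """Label one trial; each argument is a (colour, number) tuple."""
    if response == target:
        return "correct"
    if response == masker:
        return "correct for masker"
    # Anything else matches neither speaker's colour-number combination.
    return "incorrect"

def outcome_rates(trials):
    """Fraction of trials per outcome; trials are (response, target, masker)."""
    counts = Counter(label_trial(*trial) for trial in trials)
    return {name: counts[name] / len(trials) for name in OUTCOMES}
```

Because the two phrases in a mixture always differ in colour and number, the target and masker tuples are never equal, so the first two labels cannot collide.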
In correct trials, the classifier identified the target speaker 93.0% of the time (P < 10 × 10⁻⁴, t-test; ). During incorrect trials, the classifier performance was at chance. However, during ‘correct for masker’ trials, the classifier identified the masker rather than the target speaker (27.3%; P < 10 × 10⁻⁴, t-test; ). These classification results confirm the observed restoration seen in spectrotemporal reconstruction, without necessarily assuming a linear relationship between the neural responses and the stimulus. Furthermore, they extend recent findings using similar methods to decode speech sounds presented in isolation22 to full words and sentences under complex listening conditions.
We next asked whether the observed robust encoding of attended speech arises as an emergent property of the distributed population activity or is driven by a few spatially discrete sites. The cortical regions with reliable evoked responses to speech stimuli were found using a t-test between neural responses during speech and silence (P < 0.01), and were confined to the posterior superior and middle temporal gyri (). An example of the attentional response modulation at a single electrode is shown in . The spectrotemporal receptive field (STRF, estimated using the http://www.strflab.berkeley.edu package) of this electrode in passive listening to speech (TIMIT20) showed a strong preference for high frequency sounds () (STRFs for all electrodes of one subject are provided in Supplementary Fig. 2b). This tuning was also evident in the increased neural response at this electrode (, dashed lines) to each of the single speakers’ high frequency sound components (circled in , responses are delayed about 120 ms from the stimulus). However, the responses to the same speech mixture sound (, solid lines) were significantly modulated by attention. The responses to high frequency components were enhanced for the attended speaker, but suppressed for similar sounds in the masker speaker (, solid lines compared to dashed lines). This highly modulated yet fixed feature selectivity probably contributes to the constancy of the single speaker representation observed in our previous analyses. To quantify this effect for each individual electrode, we measured the correlation between the neural responses to the attended mixture and to those of the speakers in isolation (AMIelec, equation (2) in Methods). We found a varying degree of bias towards the attended speaker distributed across the population (Supplementary Fig. 3d; AMIelec = 0.28 for the example in ), which gradually builds up after the end of the call sign (Supplementary Fig. 3e). We did not observe any particular anatomical pattern for the attentional modulation across sites (Supplementary Fig. 3f). Rather, it appeared to be distributed over responsive sites, consistent with previous findings of higher-order sound processing23.
Figure: Attentional modulation of individual electrode sites.
In summary, we demonstrate that the human auditory system restores the representation of the attended speaker while suppressing irrelevant competing speech. Speech restoration occurs at a level where neural responses still show precise phase-locking to spectrotemporal features of speech. Population responses revealed the emergent representation of speech extracted from a mixture, including the moment-by-moment allocation of attentional focus.
These results have implications for models of auditory scene analysis. In agreement with recent studies, the cortical representation of speech in the posterior temporal lobe does not merely reflect the acoustical properties of the stimulus, but instead relates strongly to the perceived aspects of speech10. Although the exact mechanisms are not fully known, multiple processes in addition to attention are likely to enable this high-order auditory processing, including grouping of predictable regularities in speech acoustics24, feature binding3,25 and phonemic restoration26. Conversely, behavioural errors seem to result from degradation of the neural representation, a direct result of inherent sensory interference such as energetic masking16 (Supplementary Fig. 3g, h) and/or the allocation of attention27.
In speech, the end result represented in the posterior temporal lobe appears to be unaffected by perceptually irrelevant sounds, which is ideal for subsequent linguistic and cognitive processing. Following one speaker in the presence of another can be trivial for a normal human listener, but remains a major challenge for state-of-the-art automatic speech recognition algorithms28. Understanding how the brain solves this problem may inspire more efficient and generalizable solutions than current engineering approaches29. It will also shed light on how these processes become impaired during ageing and in disorders of speech perception in real-world hearing conditions7.