

J Neurosci. Author manuscript; available in PMC 2010 July 13.
Published in final edited form as:
PMCID: PMC2832933

Attentional Gain Control of Ongoing Cortical Speech Representations in a “Cocktail Party”


Normal listeners possess the remarkable perceptual ability to select a single speech stream among many competing talkers. However, few studies of selective attention have addressed the unique nature of speech as a temporally extended and complex auditory object. We hypothesized that sustained selective attention to speech in a multi-talker environment would act as gain control on the early auditory cortical representations of speech. Using high-density electroencephalography and a template-matching analysis method, we found selective gain to the continuous speech content of an attended talker, greatest at a frequency of 4–8 Hz, in auditory cortex. In addition, the difference in alpha power (8–12 Hz) at parietal sites across hemispheres indicated the direction of auditory attention to speech, as has been previously found in visual tasks. The strength of this hemispheric alpha lateralization, in turn, predicted an individual’s attentional gain of the cortical speech signal. These results support a model of spatial speech stream segregation, mediated by a supramodal attention mechanism, enabling selection of the attended representation in auditory cortex.

Keywords: auditory cortex, speech, selective attention, EEG, spatial attention, oscillations


Listening to one person speaking among many others serves as a model for selective attention in ecological environments, as most people depend upon it daily for social interaction and well-being (Pichora-Fuller and Singh, 2006; Shinn-Cunningham and Best, 2008). Commonly referred to as the “cocktail party effect” (Cherry, 1953), this perceptual feat has been studied extensively for decades using behavior (Broadbent, 1957; Treisman and Geffen, 1967; Driver, 2001). Neural evidence for auditory attentional modulation first arose in electroencephalography (EEG), where the primary approach has been to characterize differences in the transient, event-related potentials (ERPs) of attended vs. unattended sounds (Picton and Hillyard, 1974; Näätänen and Michie, 1979; Woods et al., 1993; Alcaini et al., 1995). While this approach has expanded our knowledge of auditory attention, with few exceptions (Teder et al., 1993; Coch et al., 2005; Nager et al., 2008) it tends to present sounds in isolation rather than concurrently as in a “cocktail party”, and therefore may not engage selective attention (Lavie, 1995). It is also limited by treating all sounds, phonemes, or words as discrete events with a stereotyped neural onset response.

Recent EEG and magnetoencephalography (MEG) studies have used novel methods to measure the continuous responses, rather than stereotyped onset components, to natural speech from early auditory cortex. These signals appear to be closely related to the slow (2–20 Hz) acoustic envelope of speech (Ahissar et al., 2001; Purcell et al., 2004; Abrams et al., 2008; Aiken and Picton, 2008; Lalor et al., 2009) and can differentiate responses to vowels, words, and sentences (Suppes et al., 1997; Suppes et al., 1998; Suppes et al., 1999; Luo and Poeppel, 2007; Bonte et al., 2009). Given its success in classifying natural speech when presented alone, a continuous neural measure should be especially suited to characterize how selective attention acts on concurrent streams of continuous speech.

Finally, it remains unclear which top-down neural signals are responsible for mediating attention to continuous speech in space. Recent evidence suggests visual spatial attention, in particular the suppression of distracting speakers’ faces, affects comprehension of an attended speaker (Senkowski et al., 2008). Visual spatial attention is known to be associated with relative contralateral alpha suppression at posterior sites, which has been attributed to parietal and/or occipital cortices (Worden et al., 2000; Gruber et al., 2005; Medendorp et al., 2007; Palva and Palva, 2007). This activity, in turn, predicts successful visual detection (Thut et al., 2006). Although it has been proposed that visual and auditory spatial attention share an overlapping mechanism, it is unknown whether contralateral posterior alpha suppression occurs or plays a role in auditory perception.

In the current study, we presented two different sentences to listeners, either one sentence at a time or simultaneously with one on each side. While listeners performed a comprehension task, we recorded high-density EEG, filtered into frequency bands ranging from very low (1–4 Hz) to ultra-high frequencies (120–160 Hz). We hypothesized that selective attention would act via a gain increase on the lower frequencies in auditory cortical activity for the attended sentence, as measured in single trials. We also tested whether lateralization of alpha activity at posterior sites could predict the gain of the attended signal across individuals, implying a mechanistic link between spatial and selective attention in a “cocktail party”.

Materials and Methods


Fourteen volunteers (8 female) between 18 and 36 years old participated in the experiment. All participants were right-handed native English speakers with normal hearing, no history of neurological problems, and no use of psychoactive medications or drugs in the past month. Participants gave written informed consent in accordance with procedures approved by the University of California Institutional Review Board and were paid for their participation. A single participant was removed from all analyses in the current study, based on near-chance overall behavioral performance (55% accuracy). This was more than three standard deviations below the accuracy of the group, qualifying as an outlier.

Speech Stimuli

All speech stimuli were recorded in a sound-dampened chamber from a 25-year-old male speaker. The speech stimuli consisted of two incomplete sentences (Sentence A: “Brandon’s wife seemed…”; Sentence B: “His other friend got the…”) of the same duration (1.36 seconds) and 128 ending words (64 adjectives and 64 nouns). Ending words were selected from the MRC Psycholinguistic Database. All adjectives were screened to have a familiarity score above 500 and all nouns a concreteness score above 500. Both nouns and adjectives were required to have 2 syllables. The words were then further selected by a native English speaker to ensure the adjectives and nouns were grammatically correct and semantically plausible as a final word for sentences A and B, respectively.

In order to perform an analysis of the fundamental frequency of speech, the speech stimuli were further processed in Praat to flatten and alter the fundamental frequency of all speech sounds uniformly. From an average fundamental frequency of 128 Hz for the original speaker, two sets of flattened speech stimuli with fundamentals at 123 Hz and 133 Hz were produced. The resulting sentences and words remained highly intelligible, yet sounded monotonic and lacking in prosody. This conditioning of the stimulus was intended to produce a frequency following response (FFR) at the fundamental frequency in the EEG waveform under all conditions. However, we had far fewer trials than previous studies of the FFR (Krishnan, 2002; Dajani et al., 2005; Musacchia et al., 2007), and the vowels of our ongoing speech stimuli were short and discontinuous, which may have led to our lack of observed FFR. This manipulation will therefore not be addressed further in this paper.

Head-related transfer functions (HRTF), recorded from AuSIM in-the-canal microphones at −45 degrees (left of midline), +45 degrees (right of midline) and 0 degrees (midline) along the horizontal azimuth, were obtained for each subject. Each individual’s HRTF was used to filter the speech stimuli so talkers were perceived in virtual external space (Langendijk and Bronkhorst, 2000). Finally, each speech stimulus was normalized to have equal root mean square amplitude, at a volume of ~70 dB HL. All speech stimuli were presented with Etymotic ER-4B insert earphones, shielded with grounded metallic tape to avoid transduction artifacts in the EEG recordings. Fifty percent of the sentence waveforms were randomly inverted, with no perceptual consequence, to further preclude any possibility of artifactual phase-locking in the EEG signal.

Presentation and Trial Structures

Each trial presented one of three conditions (Single Talker, Selective Attention, and Central Control), each with three possible cue instructions. The Single Talker condition presented a sentence immediately followed by an ending word. The ending word was equally likely to be grammatically congruent (e.g., Brandon’s wife seemed friendly.) or incongruent (e.g., Brandon’s wife seemed lizard.). Single Talker sentences were always presented to either the left or right of the participant. The Selective Attention condition presented different Single Talker sentences simultaneously to the left and right of the participant (always on opposite sides). The Central Control condition presented two sentences simultaneously in the same midline location.

Participants were instructed to attend to the subsequently presented speech stimulus based on one of three cues: a left arrow (“<”), a right arrow (“>”), or the numeral zero (“0”). For the Single Talker and Selective Attention conditions, the left or right arrows indicated the participant was to attend to the speech presented in that direction. In the Central Control condition, the participants were instructed to attend to the talker based on pitch: they were told to attend to the lower pitch when given the left cue and to the higher pitch when given the right cue.

For 7 participants, all speech stimuli in the Single Talker and Selective Attention conditions presented to the left had the fundamental frequency (f0) flattened to 123 Hz, and all speech stimuli presented to the right had f0 flattened to 133 Hz during the entire session. For the other 7 participants, the locations of flattening and the corresponding pitch cue instructions were reversed. In the Central Control condition, participants were required to segregate the speech stimuli using the very small pitch differences present in the Selective Attention condition, which, in contrast, also included strong spatial information. Thus, it served as a behavioral control for whether participants could be using the pitch information in the Selective Attention task, rather than the spatial information.

In all conditions cued with an arrow, participants were told to press the ‘1’ key on the response pad if the attended sentence was grammatically congruent, and the ‘2’ key if incongruent, regardless of its semantic likelihood. Participants were asked to respond as quickly and accurately as possible. For all subjects and sentence presentation types, participants were told that the “0” cue indicated a passive trial, when they should ignore all speech signals and give no response. For all conditions there were equal numbers of each cue. For Single Talker presentations, the arrow cues always pointed in the direction in which the speech stimulus would actually be presented.

Each trial started with a cue, replaced 1000 ms later by a crosshair that was maintained throughout the rest of the trial, followed 1000 ms later by the onset of the sentence. After the 1364 ms presentation of the sentence, a 1900 ms window was given in which the participant’s response could be counted as accurate. The next trial started after an interval jittered uniformly between 1000 and 2000 ms from the end of the response window (see Figure 1).

Figure 1. Trial Structure

Each participant completed 12 blocks of 80 trials, with a short break between each block, for a total of 960 trials, except for two participants, who completed 9 blocks, for a total of 720 trials. 40% of trials presented Single Talker sentences, 40% presented Selective Attention sentences, and 20% presented Central Control sentences. Trials were presented in random order from a full factorial design which counterbalanced side of presentation, sentence type, and grammar congruency.
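The trial counts implied by these proportions can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above; the function and dictionary names are hypothetical and chosen for illustration.

```python
# Trial bookkeeping from the figures in the text (names are illustrative).
SHARES_PERCENT = {
    "Single Talker": 40,
    "Selective Attention": 40,
    "Central Control": 20,
}


def trials_per_condition(blocks, trials_per_block=80):
    """Return the number of trials per condition for a session of `blocks` blocks."""
    total = blocks * trials_per_block
    return {name: total * pct // 100 for name, pct in SHARES_PERCENT.items()}


full_session = trials_per_condition(blocks=12)   # 960 trials in total
short_session = trials_per_condition(blocks=9)   # 720 trials in total
print(full_session)
print(short_session)
```

With equal numbers of each of the three cues, a full session would give 128 trials per cue in each of the Single Talker and Selective Attention conditions, before any rejection of artifacts or incorrect responses.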

Data Acquisition and Analysis

EEG was recorded from a 128-electrode cap continuously throughout trial presentation with the BioSemi ActiveTwo data acquisition system. All recordings were conducted in a sound-dampened, electrically shielded room. The data was originally recorded at a sampling rate of 2048 Hz and subsequently downsampled to a rate of 512 Hz. For some participants, some electrode channels were visibly and excessively noisy for the entire duration of recording and were therefore marked and interpolated from surrounding electrode sites (mean number of bad channels = 2.1; SD = 2.7).

All data analysis, except source localization, was performed in MATLAB using a combination of the FieldTrip MATLAB toolbox and custom MATLAB scripts. The continuous data was referenced to the average of all channels after interpolation and filtered with a high-pass, zero-phase Butterworth filter at 1 Hz. The data was then cut into epochs from 800 ms before to 2400 ms after speech onset, baseline correcting from −100 ms to 0 ms relative to speech onset. Independent component analysis (ICA) was performed on the epoched data, constrained by the top 50 principal component analysis (PCA) components. The topographic distributions of the top 20 ICA components were screened for eye movement artifacts. No more than 2 components, with far frontal distributions clearly indicative of either eye blinks or lateral eye movements, were removed from the data of each subject. Epochs with shifts greater than ±80 µV were rejected. Only trials associated with a correct behavioral response were included for EEG analysis. Because the Central Control condition was only included as a behavioral control, it was not included in any EEG analysis.

Epochs were sorted into separate bins for each participant based on the condition (Single Talker, Selective Attention) or on the content of the sentence (either Sentence A or Sentence B as presented in the Single Talker condition or as attended in the Selective Attention conditions). The Selective Attention passive conditions (cue “0”) could not be grouped based on attended content and were instead averaged only within presentation type. For some analyses, the Single Talker and Selective Attention conditions were further subdivided into bins based on the side of presentation (left or right) and direction of attention (left or right), respectively. For all bins, high and low pitch conditions were collapsed.

A model of two equivalent current dipoles (ECDs), based on the N1 component (100–150 ms) of the group grand-averaged activity time-locked to the onset of each sentence and collapsed across left and right Single Talker presentation, was created in Brain Electrical Source Analysis (BESA). The two dipoles were constrained to be symmetric in spatial location and allowed to fit freely to a single orientation for each dipole. The resulting fit placed vertically oriented dipoles in the left and right hemisphere with Talairach coordinates (x = 29, y = −31, z = 13 for the right dipole) consistent with sources in or near Heschl’s gyrus. The residual variance for the fitting time interval was 2.9%. The location and orientation of the dipoles are consistent with the dipole parameters for sentence stimuli reported by Aiken and Picton (2008). In a subsequent step the dipole model was exported to MATLAB and was used as a spatial filter on the individual waveforms. The individual dipole waveforms were then filtered with 8 different zero-phase band-pass Butterworth filters with frequencies of 1–4 Hz, 4–8 Hz, 8–12 Hz, 12–30 Hz, 30–50 Hz, 50–80 Hz, 80–120 Hz and 120–160 Hz. Each filtered waveform was split into 9 time windows, 341 ms long (1/4 of the incomplete sentence length), beginning with the window centered at the time of sentence onset (0 ms) and ending centered on sentence offset (1364 ms), shifting with 50% overlap. These waveforms, binned by participant, talker condition, sentence content, cue direction, side of presentation, hemisphere, frequency and time, will subsequently be referred to as trial waveforms.
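The window layout described above can be made concrete with a short sketch. It assumes windows are defined by their centers relative to sentence onset and that conversion to samples at the 512 Hz rate is done by simple rounding; both are illustrative assumptions, as are all names.

```python
# Sketch of the nine half-overlapping analysis windows (names are illustrative).
SENTENCE_MS = 1364.0
N_WINDOWS = 9
WINDOW_MS = SENTENCE_MS / 4   # 341 ms, one quarter of the sentence
STEP_MS = WINDOW_MS / 2       # 50% overlap gives a 170.5 ms step


def analysis_windows():
    """Return (start, center, end) in ms relative to sentence onset."""
    windows = []
    for i in range(N_WINDOWS):
        center = i * STEP_MS
        windows.append((center - WINDOW_MS / 2, center, center + WINDOW_MS / 2))
    return windows


def to_samples(ms, fs=512):
    """Convert milliseconds to a sample count at rate fs (rounding is an assumption)."""
    return round(ms * fs / 1000)


windows = analysis_windows()
for start, center, end in windows:
    print(f"{start:8.1f} {center:8.1f} {end:8.1f}")
```

The first window is centered on sentence onset and therefore extends 170.5 ms into the pre-stimulus period, and the last is centered on sentence offset at 1364 ms.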

Regression Analysis

In order to measure the frequency and time-course of the ongoing representation of speech in auditory cortex, we developed a template-matching method. For each participant, an N-minus-1 (N-1) group template waveform was created by averaging all trial waveforms not belonging to the current participant. There is thus no overlap in the data contained in the participant and group template waveforms and bins within trial and group template waveforms can be collapsed independently. For all comparisons, the linear least squares estimate between the template and all individual trial waveforms (τ) was calculated for each combination of comparison bins by the following equation:


where T is the template waveform, Y represents individual trial waveforms, and S and C represent the sentence content and condition type, respectively. The τ-value, or regression coefficient, serves as an estimate of the extent to which the template waveform is present in each individual trial. τ-values for within and across sentence comparisons were found by the equations:


where A and B represent the presented/attended waveforms for Sentence A and B, respectively. The Discrimination Index (DI) was calculated by subtracting the across τ-value from the within τ-value: DI = τ_within − τ_across.


A positive index reflects a shared signal between the trial and group EEG waveforms that distinguishes which of the two sentence waveforms was presented or attended on the individual trials (see Figure 2). A number of different comparisons between waveforms can be made, and the definition of the comparisons performed in this study can be found in Table 1. The method is conceptually similar to that devised by Luo and Poeppel (2007), but we chose to compare filtered waveforms rather than phase in order to be more sensitive to waveform phase changes within each analysis window and to quantify the magnitude and sign of enhancement and suppression.
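The template-matching procedure can be illustrated with a minimal sketch. It assumes a single-regressor least-squares fit with no intercept, and it assumes that the within and across τ-values are simple averages over the matching and mismatching template/trial pairings. Neither detail is confirmed by the text above, and all function names are hypothetical.

```python
# Minimal sketch of template matching; details flagged as assumptions in comments.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tau(template, trial):
    """Least-squares coefficient t minimizing ||trial - t * template||^2.
    Assumption: one regressor, no intercept term."""
    return dot(template, trial) / dot(template, template)


def mean_waveform(waveforms):
    n = len(waveforms)
    return [sum(samples) / n for samples in zip(*waveforms)]


def leave_one_out_template(trials_by_participant, held_out):
    """N-1 template: average of all trial waveforms from every other participant."""
    others = [w for p, ws in trials_by_participant.items() if p != held_out for w in ws]
    return mean_waveform(others)


def discrimination_index(trials_a, trials_b, template_a, template_b):
    """DI = tau_within - tau_across.
    Assumption: each term is a plain average over trials of both sentences."""
    within = [tau(template_a, y) for y in trials_a] + [tau(template_b, y) for y in trials_b]
    across = [tau(template_b, y) for y in trials_a] + [tau(template_a, y) for y in trials_b]
    return sum(within) / len(within) - sum(across) / len(across)


# Toy example with orthogonal templates.
template_a = [1.0, 0.0, -1.0, 0.0]
template_b = [0.0, 1.0, 0.0, -1.0]
di = discrimination_index([[2.0, 0.0, -2.0, 0.0]], [[0.0, 1.0, 0.0, -1.0]],
                          template_a, template_b)
loo = leave_one_out_template({"p1": [[1.0, 1.0]], "p2": [[3.0, 3.0]], "p3": [[5.0, 5.0]]}, "p1")
print(di, loo)
```

In the toy example each trial contains only its own sentence's template, so the across coefficients are zero and the index is positive.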

Figure 2. EEG Waveform Analysis

Table 1. Discrimination Index Comparisons

For visualization across the scalp, topographic maps of the Discrimination Index are shown in Figure 4 and Supplemental Figure 1. The methods used were identical to those described above, with the exception that waveforms were derived from each channel of the 128-channel array, rather than the two source waveforms. T-scores at each channel were derived from a one-sample test of the individual mean DI across participants, with the null hypothesis of µDI = 0.
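The per-channel statistic is a standard one-sample t-test against zero. The sketch below shows the computation for one channel; the function name and the input values are hypothetical and purely illustrative.

```python
import math


def one_sample_t(values, mu0=0.0):
    """Return (t, df) for a one-sample t-test of mean(values) against mu0."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased estimate
    standard_error = math.sqrt(variance / n)
    return (mean - mu0) / standard_error, n - 1


# Illustrative per-participant mean DI values for a single channel (made up).
t_value, df = one_sample_t([1.0, 2.0, 3.0])
print(t_value, df)
```

Applied to the 13 participants' mean DI values at each of the 128 channels, this would give one t-score per channel with 12 degrees of freedom.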

Figure 4. Spatial Attention and Sentence Content Selection

For the calculation of enhancement and suppression in the Generalized Attentional Gain comparison, the responses in the Generalized Passive condition were subtracted from the within and across τ-values to remove any signal that could be attributed to stimulus attributes:


where the subscripts ‘1’ and ‘2’ represent active attention Single and Dual Talker conditions, respectively, and the subscript ‘2P’ represents the Dual Talker Passive condition. The τ-within and τ-across now represent the enhancement and suppression of the attended signals, respectively, under the assumption that the time courses of A and B within any given analysis window are orthogonal. To remove any enhancement which could be an artifact of an increase in suppression, or vice-versa, we applied a final correction:


where a positive τenhance represents enhancement of the attended signal which cannot be attributed to suppression of the unattended signal and a negative τsuppress represents suppression of the unattended signal which cannot be attributed to enhancement of the attended signal.

Alpha Power Analysis

To measure changes in power in the alpha range, sensitive to the location of auditory spatial attention, the root mean square (rms) of the EEG signal across all 128 electrodes, filtered from 8–12 Hz, was measured for each of the nine time windows of the previous comparisons. The rms amplitude of every trial was averaged, and the data were then collapsed across sentence content, leaving the two directions of attention (left and right) as conditions within the two main talker conditions (Single Talker and Selective Attention). The power for left-cued trials (Pα(CuedLeft)) was then subtracted from right-cued trials (Pα(CuedRight)) within each talker condition, to examine the differential response between the left and right sentence presentation in the Single Talker condition, and left vs. right attention in the Selective Attention condition.

Based on previous visual studies showing alpha lateralization in posterior-parietal electrodes, we selected the 26 electrodes in the posterior-left quadrant of the electrode array as the posterior-left region of interest (ROIPL) and the 26 electrodes in the posterior-right quadrant as the posterior-right region of interest (ROIPR). We quantified this differential alpha power response across hemispheres to lateralized speech in a single measure: the Alpha Lateralization Index (ALI), similar to the index of the same name described by Thut et al. (2006). We defined the Alpha Lateralization Index with the following formula:
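The sketch below shows one way such an index could be computed; it is not the authors' exact formula. It assumes the index is the mean right-minus-left cue difference in alpha power over ROIPR minus the same quantity over ROIPL, with no normalization, so that relative ipsilateral enhancement and contralateral suppression yield a positive value. The function names and the four-channel toy data are hypothetical.

```python
import math


def rms(samples):
    """Root mean square amplitude of one channel within one time window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def cue_difference(power_cued_right, power_cued_left):
    """Per-channel alpha power for right-cued minus left-cued trials."""
    return [r - l for r, l in zip(power_cued_right, power_cued_left)]


def alpha_lateralization_index(power_cued_right, power_cued_left, roi_left, roi_right):
    """Assumed form: mean cue difference over the right ROI minus the left ROI."""
    diff = cue_difference(power_cued_right, power_cued_left)
    mean_right = sum(diff[ch] for ch in roi_right) / len(roi_right)
    mean_left = sum(diff[ch] for ch in roi_left) / len(roi_left)
    return mean_right - mean_left


# Toy data: channels 0-1 stand in for ROIPL, channels 2-3 for ROIPR.
cued_right = [1.0, 1.0, 3.0, 3.0]   # alpha higher over the right (ipsilateral) side
cued_left = [3.0, 3.0, 1.0, 1.0]    # alpha higher over the left (ipsilateral) side
ali = alpha_lateralization_index(cued_right, cued_left, roi_left=[0, 1], roi_right=[2, 3])
print(ali)
```

Under this assumed form, an index near zero would indicate that alpha power does not depend on the side of presentation or attention.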




Participants performed the speech comprehension task with a high level of accuracy (M = 93.2% correct, SD = 3.67%). A 2×2×2 ANOVA was performed on accuracy with the within-subject factors of talker condition (Single Talker, Selective Attention), direction of attention (left, right) and attended sentence content (Sentence A, Sentence B). A main effect of talker condition (F(1,12) = 8.44, p = 0.0132) was found, with significantly better accuracy for the Single Talker (M = 94.6%, SD = 3.63%) than the Selective Attention (M = 91.8%, SD = 4.44%) condition. A small but significant main effect of sentence content (F(1,12) = 7.98, p = 0.0153) was also found, with significantly better accuracy for the adjective sentence (M = 94.5%, SD = 3.33%) than the noun sentence (M = 91.9%, SD = 4.59%). A second ANOVA was performed based on the reaction time after the end of the sentence for correct trials, using the same factors. There was a main effect of sentence content (F(1,12) = 12.2, p = 0.0044) on reaction time, with the noun sentence (M = 1084 ms, SD = 169 ms) leading to slightly but significantly longer reaction times than the adjective sentence (M = 1028 ms, SD = 153 ms). There were no other significant main effects or interactions. Subject performance was therefore generally high and well-balanced across stimuli in the Single Talker and Selective Attention Conditions. In a further Central Control condition, participant performance, though above chance (one-sided t-test; t(12) = 2.41, p = 0.016), was extremely poor (M = 56.9%, SD = 10.2%), confirming that accurate performance of the Selective Attention task was dependent on spatial segregation.

Frequencies of Speech Representation in Auditory Cortex Consistent Across Individuals

We sought to identify which frequencies encode speech content consistently across individuals in auditory cortex. We therefore found the mean discrimination index value in the Speech Encoding comparison (see Table 1), and collapsed across all dimensions except EEG frequency. A positive discrimination index means that the EEG signal at a particular frequency distinguishes between the sentences A and B when presented alone, in a consistent manner across subjects. Three out of the eight frequencies had a discrimination index significantly above zero, based on one-tailed t-tests, Bonferroni corrected for eight comparisons: 1–4 Hz (t(12) = 7.54, p < 0.0001), 4–8 Hz (t(12) = 7.82, p < 0.0001), and 8–12 Hz (t(12) = 4.83, p = 0.0016). An additional three frequencies had discrimination indices greater than zero with a p-value below 0.05, without Bonferroni correction: 12–30 Hz (t(12) = 1.91, p = 0.0373), 30–50 Hz (t(12) = 2.74, p = 0.0089) and 80–120 Hz (t(12) = 2.21, p = 0.024) (see Figure 3). As expected, lower EEG frequencies were robust in discriminating which sentence was presented, even on individual trials. Unexpectedly, the 30–50 Hz and 80–120 Hz bands also showed positive discrimination indices, though not as strongly, which would suggest differential, phase-locked neural responses at very high frequencies that distinguished between sentences.
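The effect of the correction on the three uncorrected results can be checked directly. This sketch applies the standard Bonferroni adjustment (multiply each p-value by the number of comparisons, capped at 1) to the uncorrected p-values quoted above; the function and variable names are illustrative.

```python
def bonferroni(p_values, n_tests):
    """Bonferroni-adjusted p-values: p * n_tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# Uncorrected p-values reported for the three additional bands.
uncorrected = {"12-30 Hz": 0.0373, "30-50 Hz": 0.0089, "80-120 Hz": 0.024}
adjusted = dict(zip(uncorrected, bonferroni(list(uncorrected.values()), n_tests=8)))
survivors = [band for band, p in adjusted.items() if p < 0.05]
print(adjusted, survivors)
```

None of the three bands remains below 0.05 after adjustment for eight comparisons, which matches their description as significant only without correction.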

Figure 3. Gain Control of Speech Representation Across Frequency

Frequencies of Speech Representation in Auditory Cortex Modulated by Selective Attention

We then tested whether selective attention modulates these neural representations of speech. The same analysis steps were thus performed on the Attentional Gain comparison. Here, a positive discrimination index means that an individual’s attention to a sentence causes the EEG signal to better match the group response when attending to that sentence, despite no difference in the stimulus presented. The Attentional Gain comparison had a significant discrimination index with Bonferroni correction at 4–8 Hz (t(12) = 4.08, p = 0.0064), with frequencies from 1–4 Hz (t(12) = 2.15, p = 0.0266) and 12–30 Hz (t(12) = 1.80, p = 0.049) having uncorrected p-values less than 0.05. Thus, selective attention causes a consistent, phase-locked response across individuals at lower frequencies, especially at 4–8 Hz, that distinguishes which sentence is being attended.

Note that this comparison is indifferent to whether the speech EEG waveform for one talker presented alone generalizes to, or is qualitatively similar in, multi-talker situations. Rather, it only requires that attention consistently modulates the EEG signal when listeners selectively attend in the presence of multiple talkers. In order to test whether selective attention during competing speech acts through gain control on the neural responses that represent speech when heard alone, we performed the same analysis steps on the Generalized Attentional Gain comparison. Here, a positive discrimination index means that attending to a sentence causes an individual’s EEG signal to better match the group signal for that sentence presented alone, despite no differences in the stimuli presented. The Generalized Attentional Gain comparison had a significant discrimination index only at 4–8 Hz (t(12) = 3.57, p = 0.015, corrected), with no other significant frequency ranges. Importantly, this attentional gain could not be explained by a simple enhancement of the traditional N1 response to words, as shown by modeling sentence responses as a series of word-onset transients (see Supplemental Information). Nor can these results be explained by increased phase entrainment of an intrinsic 4–8 Hz oscillation (see Supplemental Information). Thus, attending to a speech signal increased its continuous neural representation compared to the unattended sentence, in the 4–8 Hz range.

Hemispheric Differences in Speech Representation

Further analysis of source hemisphere was performed only on the frequency ranges with a significant discrimination index in the previous analysis. Separate paired two-sample t-tests were performed on the 1–4 Hz, 4–8 Hz and 8–12 Hz waveforms from the Speech Encoding comparison. There was a significant main effect of source hemisphere only for the 4–8 Hz range (t(12) = 3.40, p = 0.0053), with the right hemisphere having a significantly greater discrimination index than the left hemisphere.

The same analysis was performed with the Attentional Gain comparison at the 4–8 Hz range, again finding greater discrimination in the right vs. left source (t(12) = 2.31, p = 0.040). In the Generalized Attentional Gain condition, the 4–8 Hz frequency range was compared, and though again the right hemisphere appeared to be more discriminative than the left hemisphere, this difference was not quite significant (t(12) = 2.13, p = 0.054). In general, the cortical response in the right hemisphere was more robust in predicting which sentences were presented or attended, in the frequency range of 4–8 Hz.

Enhancement of the Speech Representation by Selective Attention

While the discrimination index distinguishes EEG waveforms produced by attending to a particular sentence while ignoring the other, it cannot show whether discrimination results from enhancement of the attended sentence signal and/or suppression of the unattended sentence signal. To test each possibility, attended and unattended activity was compared to a passive listening “baseline”. Specifically, the passive listening template-matching regression coefficients were subtracted from the attended and unattended regression coefficients, forming an index of enhancement and suppression, respectively. This was performed for all participants with positive selective attention discrimination indices at 4–8 Hz (n = 10). These values were further corrected to discount any values resulting from non-orthogonality in the two template waveforms (see Methods). A positive value in this index means the attended signal was enhanced, while a negative value indicates that the unattended signal was suppressed.

At 4–8 Hz, the index of enhancement of the attended speech was significantly greater than zero, as revealed with a one-tailed t-test (t(9) = 1.90, p = 0.045), while the index of suppression of the unattended speech was not significantly different from zero, showing a weak negative trend in a one-tailed test (t(9) = −0.96, p = 0.18). Auditory selective attention to continuous speech therefore acts at least via an enhancement of the attended signal, as opposed to a strong suppression of the unattended signal.

Auditory Spatial Attention Results in Differential Hemispheric Alpha Power

Having indexed the content of selective attention to speech, we tested whether alpha power in posterior cortex could predict the location of speech presentation and selective attention. Based on studies in the visual domain, we expected relative contralateral alpha suppression and ipsilateral alpha enhancement when attention is focused laterally. As can be seen in Figure 4a, presenting a sentence to the left vs. right in virtual auditory space (Single Talker condition) induces a clear difference between the hemispheres at parietal channels due to stimulus location, with relative ipsilateral alpha enhancement and relative contralateral alpha suppression. We performed a Student t-test using the Alpha Lateralization Index for the Single Talker condition (ALIST) across all participants and found it was significantly greater than zero (t(12) = 3.14, p = 0.0043), meaning that lateralized alpha power was significantly predictive of the side of speech presentation. When both sentences are presented simultaneously (Selective Attention condition), the differential alpha response based solely on selective attention appears to have a similar time course and topography as the Single Talker condition (Figure 4a). A Student t-test of the Selective Attention Alpha Lateralization Index (ALISA) revealed the index was significantly greater than zero (t(12) = 2.37, p = 0.018), meaning lateralized alpha power predicted the direction of auditory selective attention in the absence of stimulus differences, in a manner similar to that of visual-spatial selective attention.

Signals of Speech Representation and Selection over Time

The neural mechanisms of allocating spatial attention and selecting a talker may not occur uniformly through time. We therefore examined the time course of both alpha lateralization and the discrimination index over the duration of the sentence, shown normalized in magnitude in Figure 4a,b. A one-way ANOVA of the ALIST across time (9 time windows), collapsed over all other bins, revealed a significant main effect of time (F = 7.34, p < 0.0001), starting at a value significantly above zero and peaking around 340 ms. There was also a significant main effect of time for the ALISA (F = 2.9, p < 0.0057), though peaking somewhat later at 682 ms.

A one-way ANOVA of the Speech Encoding discrimination index at 4–8 Hz across time (9 time windows), collapsed over all other bins, also revealed a significant main effect of time (F = 18.42, p < 0.0001), as did the Generalized Attentional Gain discrimination index (F = 3.24, p = 0.0026). The time course of the Generalized Attentional Gain index appears to be quite similar to that of the Speech Encoding index. This is not surprising, as the ability to discriminate sounds based on selective attention should depend on the degree to which their responses can be separated when presented alone. However, as with the Alpha Lateralization Index, the signal due purely to selective attention (Generalized Attentional Gain DI) peaks later than the signal evoked by external stimulus differences (Speech Encoding DI), suggesting there may be a buildup period for gain by selective attention. Consistent with this, the spatial alpha effect tends to dominate early in the sentence, while the attentional discrimination effect dominates later in the sentence.

Lateralized Alpha Power and Attentional Gain

Given that both phase-locked sentence-specific responses at 4–8 Hz and non-phase-locked power at 8–12 Hz are sensitive to attention among multiple talkers, we sought to test if there is a relationship between the two signals. We predicted that in the Selective Attention condition, participants with greater early alpha lateralization, which is linked to attending to a particular location, would also tend to have a greater Generalized Attentional Gain Discrimination Index, which is a measure of selecting the sentence content at the attended location. To test this, we calculated the Pearson’s correlation coefficient between the mean Selective Attention alpha lateralization index early in the sentence (0 to 512 ms) and the mean Generalized Attentional Gain discrimination index later in the sentence (682 to 1364 ms) for each participant (see Figure 5). A significant, positive correlation was found between the two indices (r = 0.495, t(12) = 1.89, p = 0.043).
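The correlation and its associated t statistic can be sketched as below. The function names are illustrative, and the sample size of 13 used in the example is an assumption inferred from the t(12) one-sample tests reported earlier; under that assumption, converting r = 0.495 gives a t value of about 1.89.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# Reported correlation with an assumed sample size of 13 participants.
t_value = t_from_r(0.495, 13)
```

In practice `pearson_r` would be applied to the per-participant early alpha lateralization values and late discrimination index values, one pair per participant.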

Figure 5
Early Alpha Lateralization Predicts Generalized Attentional Gain

Although the topographies of these signals were quite different, it is possible that this correlation reflected individual differences in factors independent of selective attention. However, we found no significant correlation between the early Speech Encoding DI and late ALISA (r = −0.148), ruling out general differences in an individual’s stimulus-driven response as the cause of the relationship. We could also rule out that individual differences in alpha activity over the entire time course caused this relationship, as there was no significant correlation when the ALISA was taken from the same time window as the late Generalized DI (r = 0.074), nor when the ALISA was replaced with the overall alpha power in the same time period (r = 0.17). Thus, the relationship between lateralized alpha activity and the gain of sentence content reflects an interaction between distinct mechanisms of spatial and selective attention.


We have shown that selective attention in a “cocktail party” modulates the early cortical representation of speech via a gain mechanism. Specifically, selective attention increases discrimination of the attended speech signal in auditory cortex in the range of 4–8 Hz, a frequency band strongly represented in the speech envelope and known to be important for speech comprehension. We demonstrate furthermore that this attentional gain is due to enhancement of the attended sentence, and possibly suppression of the unattended stream.

In addition to the neural gain in speech representation, our results establish that alpha power lateralization at parieto-occipital sites reflects the direction of auditory attention to continuous speech in space. Posterior parietal cortical involvement in attentional selection of speech-in-noise has been shown with high spatial resolution using fMRI, with greater bilateral superior parietal lobule (SPL) activity when participants select a talker based on spatial attributes rather than pitch (Hill and Miller, 2009) and for shifts vs. maintenance of auditory selective attention (Shomstein and Yantis, 2006). Furthermore, a recent MEG study found occipito-parietal alpha activity when subjects maintained lateralized sounds in working memory (Kaiser et al., 2007), and recent fMRI studies find sensitivity to auditory spatial attention in occipital as well as parietal cortex (Wu et al., 2007; Cate et al., 2009). Notably, the topography of our alpha lateralization is nearly identical to that in cued visuospatial attention (Worden et al., 2000; Sauseng et al., 2005; Kelly et al., 2006; Rihs et al., 2009) and intermodal attentional switching (Foxe et al., 1998). Although the arrow cue in the current experiment could have evoked pure visuo-spatial attention with alpha lateralization, this interpretation is unlikely as the cue onset occurred long before (2 sec) the ALI analysis window and no visual stimuli were ever co-localized with the voices. The overlap of alpha modulation at parieto-occipital sites for both auditory and visual spatial attention adds to growing behavioral and physiological evidence for a supramodal mechanism of attentional selection (Farah et al., 1989; Spence and Driver, 1996; Eimer et al., 2004).

Not only does alpha power at parieto-occipital sites reflect where the brain allocates selective attention to continuous speech in space, but it also predicts how well the auditory cortex distinguishes which sentence is attended. The correlation between alpha power lateralization and the strength of the selective attention response to continuous speech provides a mechanistic link between parieto-occipital alpha activity and selective enhancement of the attended auditory stimulus. The time courses of our chosen measures of selective attention also suggest an order to these effects. Alpha lateralization due to spatial attention peaks early and disappears before the end of the sentence, which implies that differential parieto-occipital activity is needed to select an auditory object in space, but is not required to maintain the auditory stream over time. This is consistent with fMRI evidence for greater activity in the SPL to auditory speech stream switching than maintenance (Shomstein and Yantis, 2006). The time course of the neural selection of speech content, as indexed by the Generalized Attentional Gain comparison, peaks later and is sustained throughout the sentence. This time course is more difficult to interpret, and could be well explained by differences in the stimulus envelope over time, such that the more the envelopes of two sentences differ, the easier it is to detect a difference in the neural response between them (Abrams et al., 2008; Aiken and Picton, 2008). However, other cognitive temporal effects, such as perceptual buildup in the streaming of the attended sentences (Shinn-Cunningham and Best, 2008), or the increased task relevance around sentence endings in our paradigm, may also play a role. Future experiments with a variety of longer and systematically varied sentence structures are needed to disambiguate the issue. 
Regardless, the attentional gain of sentence content clearly continues after the offset of alpha lateralization, implying that the alpha spatial selection mechanism is resolved before the entire sentence has been processed.

Though the current study shows clear attentional enhancement of a 4–8 Hz signal in auditory cortex, our approach is agnostic with respect to the representational nature of the signal itself. As in similar, previous studies (Suppes et al., 1998; Suppes et al., 1999; Luo and Poeppel, 2007), we find robust speech representations in these lower frequencies. Further analysis (see Supplemental Information) established that the low-frequency attentional gain is an ongoing phenomenon, as with ongoing attention to tone sequences (Elhilali et al., 2009), and cannot be explained by a traditional, transient word-onset N1 response. Several possibilities for the nature of this ongoing signal remain. Most likely, much of the initial 4–8 Hz signal reflects a response to the speech envelope, which has substantial power in the 2–20 Hz range (Purcell et al., 2004; Aiken and Picton, 2008) and is known to evoke a following response preferentially at the natural frequencies of the speech envelope (Ahissar et al., 2001). Furthermore, this low-frequency encoding is greater in the right than left hemisphere, also consistent with prior work on the speech envelope (Tremblay and Kraus, 2002; Abrams et al., 2008), as well as Asymmetric Sampling in Time (AST) theory, in which syllabic timescales are processed preferentially in the right hemisphere (Poeppel, 2003; Giraud et al., 2007; Abrams et al., 2008; Overath et al., 2008). An alternative view is that the 4–8 Hz signal is not a speech representation per se, but rather reflects intrinsic oscillatory neural activity that is phase-reset by an ongoing stimulus (Makeig et al., 2002; Luo and Poeppel, 2007; Lakatos et al., 2008; Bonte et al., 2009). Although this is possible, we found little evidence that the 4–8 Hz signal from auditory cortex before the stimulus maintained phase information for multiple cycles (see Supplemental Information), in line with recent papers in the visual domain (Mazaheri and Jensen, 2006; Risner et al., 2009).
A parsimonious interpretation of our results suggests that this speech encoding signal is closely related to the properties of stimulus acoustics, such as the speech envelope, and limited by the temporal resolution of auditory cortical networks.

While previous studies on the cortical response frequencies to natural continuous speech report no differentiating frequencies above 50 Hz (Suppes et al., 1998; Suppes et al., 1999; Ahissar et al., 2001; Bidet-Caulet et al., 2007; Luo and Poeppel, 2007; Buiatti et al., 2009), we found weak evidence of neural speech encoding above 80 Hz. At 80–120 Hz, the finding of a positive discrimination index requires a phase-locked response, shared among participants with latency differences of less than ~6 ms (1/2 the longest wavelength). Higher frequencies may reflect transient responses to particular plosives, fricatives, or transitions within the speech stream or possibly higher order processes, such as matching external auditory stimuli to templates in working memory (Kaiser et al., 2003; Kaiser et al., 2008; Lenz et al., 2008; Shahin et al., 2009). Further studies are required to determine whether phase-locked speech encoding responses are truly present in cortex at these frequencies.
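The ~6 ms figure follows from simple arithmetic: the longest period in the 80–120 Hz band is 1/80 s = 12.5 ms, and half of that is 6.25 ms. A minimal sketch of the calculation, with a function name chosen only for illustration:

```python
def half_period_ms(freq_hz):
    """Half of one oscillation period at freq_hz, in milliseconds."""
    period_ms = 1000.0 / freq_hz
    return 0.5 * period_ms

# Lowest frequency in the 80-120 Hz band has the longest period.
limit_at_band_floor = half_period_ms(80.0)
limit_at_band_ceiling = half_period_ms(120.0)
```

Latency differences larger than half a period would shift responses toward opposite phase across participants, so they would tend to cancel rather than add in a shared template.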

Our attentional gain analyses were performed using a dipole model aimed at maximizing signal originating in or near early auditory cortex. Such generators produce a canonical frontal-central and posterior scalp distribution and can be modeled to a large extent by one dipole in each hemisphere (Scherg et al., 2007; Aiken and Picton, 2008). We expected that most time-locked cortical activity in response to an extended, complex acoustic stimulus such as speech would come from this region of cortex, producing a similar pattern on the scalp. Indeed, the discrimination index at 4–8 Hz across all scalp sites is largely consistent with our two-dipole source model based on the early auditory onset response (N1) (see Supplemental Figure 1). This finding agrees with the MEG results of Luo and Poeppel, who found sentence discrimination greatest in the same channels as the auditory M100 (Luo and Poeppel, 2007). Nevertheless, this approach cannot exclude the possibility that other regions, particularly language-related areas such as superior temporal sulcus, may also have time-locked, differential responses between sentences that contribute to the source waveforms.

We should emphasize that many of our results rely on a novel application of EEG template matching to produce the discrimination index. This technique has several advantages when compared to traditional ERP or oscillatory power analysis. Notably, it allows for the measurement of a shared response pattern to an extended stimulus without requiring any prior knowledge of the pattern, and with very few assumptions about its time course and frequency content. This makes the method ideal for exploratory studies in which, unlike traditional ERP analysis, unexpected but consistently time-locked responses can be detected. It has some limitations, including the requirement that the same stimulus must be presented multiple times and that the responses must be phase-locked. But as a complement to more established methods, it offers substantial advantage in characterizing the multiple, temporally overlapping signals so common in naturalistic environments.
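To make the general idea of template matching concrete, the sketch below scores how much better a single-trial response matches a template built from other responses to the same sentence than a template built from responses to a different sentence. The specific choices here (Pearson correlation as the match measure, leave-one-out averaging for the templates, and a mean difference as the index) are assumptions made for illustration and do not reproduce the exact index defined in our Methods; the waveforms are hypothetical.

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length waveforms."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_waveform(trials):
    """Sample-by-sample average across trials."""
    return [sum(samples) / len(samples) for samples in zip(*trials)]

def discrimination_index(trials_a, trials_b):
    """Mean (match to own-sentence template - match to other-sentence template)."""
    template_b = mean_waveform(trials_b)
    scores = []
    for i, trial in enumerate(trials_a):
        # Leave the current trial out so it cannot match itself.
        others = trials_a[:i] + trials_a[i + 1:]
        template_a = mean_waveform(others)
        scores.append(correlation(trial, template_a) - correlation(trial, template_b))
    return sum(scores) / len(scores)

# Hypothetical band-limited responses to two sentences (not study data).
sentence_a = [[0.0, 1.0, 0.0, -1.0], [0.0, 2.0, 0.0, -2.0], [0.0, 1.5, 0.0, -1.5]]
sentence_b = [[1.0, 0.0, -1.0, 0.0], [2.0, 0.0, -2.0, 0.0]]
di = discrimination_index(sentence_a, sentence_b)
```

An index near zero would indicate that responses are no more similar within a sentence than across sentences, whereas a positive index indicates a consistent, phase-locked, sentence-specific response.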

We propose that auditory selective attention in a cluttered, realistic environment begins with allocating spatial attention. This is evidenced by increased contralateral alpha suppression, which may reflect networks in the contralateral posterior parietal cortex shifting from a passive to active state, or active suppression of the unattended space. The posterior parietal cortex may then bind the auditory object to a location in space in order to assist in the selection and streaming of the auditory content. Once the object is successfully streaming, supramodal spatial activity reduces and stream selection continues based on non-spatial as well as spatial cues. In contrast to the suppression of non-phase-locked alpha in the parieto-occipital cortex, phase-locked early auditory cortical representation of the attended stream in the temporal lobe is then enhanced via a gain mechanism, leading to successful comprehension.

Supplementary Material



The authors would like to thank Kevin Hill, Kristina Backer and Chris Bishop for advice and support in data collection, as well as Terry Picton, Ali Mazaheri, Tom Campbell and Risa Sawaki for advice and technical expertise. This research was supported by a grant from the NIH/NIDCD (R01-DC008171).


  • Abrams DA, Nicol T, Zecker S, Kraus N. Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. J Neurosci. 2008;28:3958–3965. [PMC free article] [PubMed]
  • Ahissar E, Nagarajan S, Ahissar M, Protopapas A, Mahncke H, Merzenich MM. Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proc Natl Acad Sci U S A. 2001;98:13367–13372. [PubMed]
  • Aiken SJ, Picton TW. Human cortical responses to the speech envelope. Ear Hear. 2008;29:139–157. [PubMed]
  • Alcaini M, Giard MH, Echallier JF, Pernier J. Selective auditory attention effects in tonotopically organized cortical areas: A topographic ERP study. Hum Brain Mapp. 1995;2:159–169.
  • Bidet-Caulet A, Fischer C, Besle J, Aguera PE, Giard MH, Bertrand O. Effects of selective attention on the electrophysiological representation of concurrent sounds in the human auditory cortex. J Neurosci. 2007;27:9252–9261. [PubMed]
  • Bonte M, Valente G, Formisano E. Dynamic and task-dependent encoding of speech and voice by phase reorganization of cortical oscillations. J Neurosci. 2009;29:1699–1706. [PubMed]
  • Broadbent DE. A mechanical model for human attention and immediate memory. Psychol Rev. 1957;64:205–215. [PubMed]
  • Buiatti M, Peña M, Dehaene-Lambertz G. Investigating the neural correlates of continuous speech computation with frequency-tagged neuroelectric responses. Neuroimage. 2009;44:509–519. [PubMed]
  • Cate AD, Herron TJ, Yund EW, Stecker GC, Rinne T, Kang X, Petkov CI, Disbrow EA, Woods DL. Auditory attention activates peripheral visual cortex. PLoS ONE. 2009;4:e4645. [PMC free article] [PubMed]
  • Cherry EC. Some experiments on the recognition of speech, with one and with two ears. J Acoust Soc Am. 1953;25:975–979.
  • Coch D, Sanders LD, Neville HJ. An event-related potential study of selective auditory attention in children and adults. J Cogn Neurosci. 2005;17:605–622. [PubMed]
  • Dajani HR, Purcell D, Wong W, Kunov H, Picton TW. Recording human evoked potentials that follow the pitch contour of a natural vowel. IEEE Trans Biomed Eng. 2005;52:1614–1618. [PubMed]
  • Driver J. A selective review of selective attention research from the past century. Br J Psychol. 2001;92:53–78. [PubMed]
  • Eimer M, van Velzen J, Driver J. ERP evidence for cross-modal audiovisual effects of endogenous spatial attention within hemifields. J Cogn Neurosci. 2004;16:272–288. [PubMed]
  • Elhilali M, Xiang J, Shamma SA, Simon JZ. Interaction between attention and bottom-up saliency mediates the representation of foreground and background in an auditory scene. PLoS Biol. 2009;7:e1000129. [PMC free article] [PubMed]
  • Farah MJ, Wong AB, Monheit MA, Morrow LA. Parietal lobe mechanisms of spatial attention: modality-specific or supramodal? Neuropsychologia. 1989;27:461–470. [PubMed]
  • Foxe JJ, Simpson GV, Ahlfors SP. Parieto-occipital ~10Hz activity reflects anticipatory state of visual attention mechanisms. Neuroreport. 1998;9:3929–3933. [PubMed]
  • Giraud A-L, Kleinschmidt A, Poeppel D, Lund TE, Frackowiak RSJ, Laufs H. Endogenous cortical rhythms determine cerebral specialization for speech perception and production. Neuron. 2007;56:1127–1134. [PubMed]
  • Gruber WR, Klimesch W, Sauseng P, Doppelmayr M. Alpha phase synchronization predicts P1 and N1 latency and amplitude size. Cereb Cortex. 2005;15:371–377. [PubMed]
  • Hill KT, Miller LM. Auditory attentional control and selection during cocktail party listening. Cereb Cortex. 2009 in press. [PMC free article] [PubMed]
  • Kaiser J, Heidegger T, Wibral M, Altmann CF, Lutzenberger W. Alpha synchronization during auditory spatial short-term memory. Neuroreport. 2007;18:1129–1132. [PubMed]
  • Kaiser J, Heidegger T, Wibral M, Altmann CF, Lutzenberger W. Distinct gamma-band components reflect the short-term memory maintenance of different sound lateralization angles. Cereb Cortex. 2008;18:2286–2295. [PubMed]
  • Kaiser J, Ripper B, Birbaumer N, Lutzenberger W. Dynamics of gamma-band activity in human magnetoencephalogram during auditory pattern working memory. Neuroimage. 2003;20:816–827. [PubMed]
  • Kelly SP, Lalor EC, Reilly RB, Foxe JJ. Increases in alpha oscillatory power reflect an active retinotopic mechanism for distracter suppression during sustained visuospatial attention. J Neurophysiol. 2006;95:3844–3851. [PubMed]
  • Krishnan A. Human frequency-following responses: representation of steady-state synthetic vowels. Hear Res. 2002;166:192–201. [PubMed]
  • Lakatos P, Karmos G, Mehta AD, Ulbert I, Schroeder CE. Entrainment of neuronal oscillations as a mechanism of attentional selection. Science. 2008;320:110–113. [PubMed]
  • Lalor EC, Power AJ, Reilly RB, Foxe JJ. Resolving precise temporal processing properties of the auditory system using continuous stimuli. J Neurophysiol. 2009;102:349–359. [PubMed]
  • Langendijk EHA, Bronkhorst AW. Fidelity of three-dimensional-sound reproduction using a virtual auditory display. J Acoust Soc Am. 2000;107:528–537. [PubMed]
  • Lavie N. Perceptual load as a necessary condition for selective attention. J Exp Psychol Hum Percept Perform. 1995;21:451–468. [PubMed]
  • Lenz D, Jeschke M, Schadow J, Naue N, Ohl FW, Herrmann CS. Human EEG very high frequency oscillations reflect the number of matches with a template in auditory short-term memory. Brain Res. 2008;1220:81–92. [PubMed]
  • Luo H, Poeppel D. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron. 2007;54:1001–1010. [PMC free article] [PubMed]
  • Makeig S, Westerfield M, Jung TP, Enghoff S, Townsend J, Courchesne E, Sejnowski TJ. Dynamic brain sources of visual evoked responses. Science. 2002;295:690–694. [PubMed]
  • Mazaheri A, Jensen O. Posterior α activity is not phase-reset by visual stimuli. Proc Natl Acad Sci U S A. 2006;103:2948–2952. [PubMed]
  • Medendorp WP, Kramer GF, Jensen O, Oostenveld R, Schoffelen JM, Fries P. Oscillatory activity in human parietal and occipital cortex shows hemispheric lateralization and memory effects in a delayed double-step saccade task. Cereb Cortex. 2007;17:2364–2374. [PubMed]
  • Musacchia G, Sams M, Skoe E, Kraus N. Musicians have enhanced subcortical auditory and audiovisual processing of speech and music. Proc Natl Acad Sci U S A. 2007;104:15894–15898. [PubMed]
  • Näätänen R, Michie PT. Early selective-attention effects on the evoked potential: a critical review and reinterpretation. Biol Psychol. 1979;8:81–136. [PubMed]
  • Nager W, Dethlefsen C, Münte TF. Attention to human speakers in a virtual auditory environment: brain potential evidence. Brain Res. 2008;1220:164–170. [PubMed]
  • Overath T, Kumar S, von Kriegstein K, Griffiths TD. Encoding of spectral correlation over time in auditory cortex. J Neurosci. 2008;28:13268–13273. [PMC free article] [PubMed]
  • Palva S, Palva JM. New vistas for α-frequency band oscillations. Trends Neurosci. 2007;30:150–158. [PubMed]
  • Pichora-Fuller MK, Singh G. Effects of age on auditory and cognitive processing: implications for hearing aid fitting and audiologic rehabilitation. Trends Amplif. 2006;10:29–59. [PubMed]
  • Picton TW, Hillyard SA. Human auditory evoked potentials. II: Effects of attention. Electroencephalogr Clin Neurophysiol. 1974;36:191–200. [PubMed]
  • Poeppel D. The analysis of speech in different temporal integration windows: cerebral lateralization as 'asymmetric sampling in time'. Speech Communication. 2003;41:245–255.
  • Purcell DW, John SM, Schneider BA, Picton TW. Human temporal auditory acuity as assessed by envelope following responses. J Acoust Soc Am. 2004;116:3581–3593. [PubMed]
  • Rihs TA, Michel CM, Thut G. A bias for posterior α-band power suppression versus enhancement during shifting versus maintenance of spatial attention. Neuroimage. 2009;44:190–199. [PubMed]
  • Risner ML, Aura CJ, Black JE, Gawne TJ. The visual evoked potential is independent of surface alpha rhythm phase. Neuroimage. 2009;45:463–469. [PubMed]
  • Sauseng P, Klimesch W, Stadler W, Schabus M, Doppelmayr M, Hanslmayr S, Gruber WR, Birbaumer N. A shift of visual spatial attention is selectively associated with human EEG alpha activity. Eur J Neurosci. 2005;22:2917–2926. [PubMed]
  • Scherg M, Vajsar J, Picton TW. A source analysis of the late human auditory evoked potentials. J Cogn Neurosci. 2007;1:336–355. [PubMed]
  • Senkowski D, Saint-Amour D, Gruber T, Foxe JJ. Look who's talking: the deployment of visuo-spatial attention during multisensory speech processing under noisy environmental conditions. Neuroimage. 2008;43:379–387. [PMC free article] [PubMed]
  • Shahin AJ, Picton TW, Miller LM. Brain oscillations during semantic evaluation of speech. Brain Cogn. 2009;70:259–266. [PMC free article] [PubMed]
  • Shinn-Cunningham BG, Best V. Selective attention in normal and impaired hearing. Trends Amplif. 2008;12:283–299. [PMC free article] [PubMed]
  • Shomstein S, Yantis S. Parietal cortex mediates voluntary control of spatial and nonspatial auditory attention. J Neurosci. 2006;26:435–439. [PubMed]
  • Spence C, Driver J. Audiovisual links in endogenous covert spatial attention. J Exp Psychol Hum Percept Perform. 1996;22:1005–1030. [PubMed]
  • Suppes P, Han B, Epelboim J, Lu ZL. Invariance between subjects of brain wave representations of language. Proc Natl Acad Sci U S A. 1999;96:12953–12958. [PubMed]
  • Suppes P, Lu ZL, Han B. Brain wave recognition of words. Proc Natl Acad Sci U S A. 1997;94:14965–14969. [PubMed]
  • Suppes P, Han B, Lu ZL. Brain-wave recognition of sentences. Proc Natl Acad Sci U S A. 1998;95:15861–15866. [PubMed]
  • Teder W, Kujala T, Näätänen R. Selection of speech messages in free-field listening. Neuroreport. 1993;5:307–309. [PubMed]
  • Thut G, Nietzel A, Brandt SA, Pascual-Leone A. α-band electroencephalographic activity over occipital cortex indexes visuospatial attention bias and predicts visual target detection. J Neurosci. 2006;26:9494–9502. [PubMed]
  • Treisman A, Geffen G. Selective attention: perception or response? Q J Exp Psychol. 1967;19:1–17. [PubMed]
  • Tremblay KL, Kraus N. Auditory training induces asymmetrical changes in cortical neural activity. J Speech Lang Hear Res. 2002;45:564–572. [PubMed]
  • Woods DL, Alho K, Algazi A. Intermodal selective attention: evidence for processing in tonotopic auditory fields. Psychophysiology. 1993;30:287–295. [PubMed]
  • Worden MS, Foxe JJ, Wang N, Simpson GV. Anticipatory biasing of visuospatial attention indexed by retinotopically specific α-band electroencephalography increases over occipital cortex. J Neurosci. 2000;20:RC63. [PubMed]
  • Wu CT, Weissman DH, Roberts KC, Woldorff MG. The neural circuitry underlying the executive control of auditory spatial attention. Brain Res. 2007;1134:187–198. [PMC free article] [PubMed]