Speech comprehension relies on temporal cues contained in the speech envelope, and the auditory cortex has been implicated as playing a critical role in encoding this temporal information. We investigated auditory cortical responses to speech stimuli in subjects undergoing invasive electrophysiological monitoring for pharmacologically refractory epilepsy. Recordings were made from multi-contact electrodes implanted in Heschl’s gyrus (HG). Speech sentences, time-compressed from 0.75 to 0.20 of natural speaking rate, elicited average evoked potentials (AEPs) and increases in event-related band power (ERBP) of cortical high frequency (70–250 Hz) activity. Cortex of posteromedial HG, the presumed core of human auditory cortex, represented the envelope of speech stimuli in the AEP and ERBP. Envelope-following in ERBP, but not in AEP, was evident in both language dominant and non-dominant hemispheres for relatively high degrees of compression where speech was not comprehensible. Compared to posteromedial HG, responses from anterolateral HG — an auditory belt field — exhibited longer latencies, lower amplitudes and little or no time locking to the speech envelope. The ability of the core auditory cortex to follow the temporal speech envelope over a wide range of speaking rates leads us to conclude that such capacity in itself is not a limiting factor for speech comprehension.
The temporal envelope of human speech reflects amplitude fluctuations ranging from about 2 Hz to 50 Hz, which correspond to phonemic and syllabic transitions critically important for comprehension (Rosen, 1992). Speech recognition can be achieved when the spectral information is severely limited but temporal envelope cues are preserved (Shannon et al., 1995). Comprehension of speech auditory chimaeras (in which the envelope of one stimulus is used to modulate the fine structure of another) is based primarily on envelope cues (Smith et al., 2002). Distorting the speech envelope by temporal smearing (Drullman et al., 1994) or compression (Ahissar & Ahissar, 2005) impairs comprehension.
Understanding how and where speech envelope information is represented within human auditory cortex continues to be a major challenge (Luo & Poeppel, 2007). Ahissar et al. (2001), using magnetoencephalography (MEG), observed that degraded comprehension of time-compressed speech correlated with a decline in temporal synchrony between auditory cortical responses and the speech envelope. They placed this processing mechanism “approximately on Heschl’s gyrus” and concluded that temporal locking of activity in this cortical area to the speech envelope was a prerequisite for comprehension. The modal frequency of the most compressed speech signal used by Ahissar et al. (2001) was around 14 Hz, which, in humans, is well within the limits of phase locking to the envelope of sinusoidal amplitude-modulated tones and noise (Kuwada et al., 1986; Rees et al., 1986; Roß et al., 2000; Liégeois-Chauvel et al., 2004; Nourski et al., 2009). These findings support an alternative hypothesis, that the auditory cortex of Heschl’s gyrus (HG) can temporally encode the speech envelope even at high modulation rates beyond speech comprehension. We tested this hypothesis by recording directly from HG, in human neurosurgical subjects, activity evoked by speech stimuli that were essentially identical to those used by Ahissar et al. (2001). We were able to accurately localize the evoked activity within the gyrus anatomically and physiologically.
Using the intracortical recording approach, primary and primary-like auditory cortices (the auditory core) have been localized to posteromedial HG (Liégeois-Chauvel et al., 1991; Howard et al., 1996, 2000; Brugge et al., 2008). Average evoked potentials (AEPs) (Donchin & Lindsey, 1969) recorded there have a relatively short latency and feature phase-locked responses to periodic stimuli. These properties distinguish the core from an auditory field on anterolateral HG, which exhibits AEPs having longer latency with little evidence of phase locking to the stimulus. This laterally positioned field has been interpreted as an auditory cortical belt system.
High frequency cortical activity (above ~70 Hz) has been shown to be a prominent component of auditory cortical responses in human and monkey (Crone et al., 2001; Steinschneider et al., 2008). In this study, we employed time-frequency analysis of single-trial response waveforms to capture event-related band power (ERBP) of the electrocorticogram (ECoG) within the frequency range of 70–250 Hz. We explored the relationship between stimulus temporal envelope and cortical activity, measured as the AEP as well as ERBP, at multiple recording sites within HG of both the language-dominant and non-dominant hemispheres.
The six subjects (two males, four females; 22–45 years old) that participated in this study were neurosurgical patients diagnosed with pharmacologically refractory epilepsy and were undergoing chronic invasive electroencephalography monitoring to identify a seizure focus prior to surgical treatment. Written informed consent was obtained from each subject. Research protocols were approved by The University of Iowa Human Subjects Review Board.
All six subjects were right-handed. One subject (L162) had mixed language dominance, while the other five had left-hemisphere language dominance, as determined by Wada test results. In three of the six subjects (L156, L162, L173) the electrodes were implanted on the left side, while in the other three subjects (R152, R153, R154) recordings were made from the right hemisphere. All subjects underwent audiometric and neuropsychological evaluation prior to the study, and none were found to have hearing or cognitive deficits that would impact the findings presented in this study. All subjects were native English speakers. Analysis of intracranial recordings indicated that HG was not involved in the generation of epileptic activity in the subjects.
Experimental stimuli were speech sentences, digitized at a sampling rate of 24414 Hz. The stimuli were time-compressed to ratios 0.75, 0.50, 0.40, 0.30, and 0.20 of the natural speaking rate (Fig. 1) using an algorithm that preserved the spectral content of the stimuli, as implemented in Sound Designer II software.
Evaluation of comprehension of time-compressed speech sentences was performed following the approach of Ahissar et al. (2001). The psychophysical experiment was carried out in five out of six subjects (all except R152) and in a control group of 20 healthy volunteers (13 males, 7 females, 19–35 years old, all native English speakers). The following set of ten sentences was used (T indicates a true statement, F indicates a false statement):
| No. | Sentence | Truth value |
|---|---|---|
| 1 | Black cars can all park | T |
| 2 | Black dogs can all bark | T |
| 3 | Black cars cannot bark | T |
| 4 | Black dogs cannot park | T |
| 5 | Playing cards cannot park | T |
| 6 | Black cars cannot park | F |
| 7 | Black dogs cannot bark | F |
| 8 | Black cars can all bark | F |
| 9 | Black dogs can all park | F |
| 10 | Playing cards can all park | F |
Each sentence was presented at five compression ratios in random order, and each sentence was presented twice at each compression ratio, yielding a total of 100 trials (10 sentences × 5 compression ratios × 2 repetitions) in the psychophysical experiment. The subjects were instructed to respond to the sentences by pressing one of three buttons, corresponding to “True”, “False” or “I don’t know”. Comprehension was quantified using a comprehension index (CI) (Ahissar et al., 2001; Ahissar and Ahissar, 2005), calculated as follows:
$$\mathrm{CI} = \frac{N_{\mathrm{correct}} - N_{\mathrm{incorrect}}}{N_{\mathrm{total}}}$$
where $N_{\mathrm{correct}}$ is the number of correct responses (i.e., “True” statements identified as “True”, and “False” statements identified as “False”); $N_{\mathrm{incorrect}}$ is the number of incorrect responses (i.e., “True” statements identified as “False”, and “False” statements identified as “True”); and $N_{\mathrm{total}}$ is the total number of trials, including correct responses, incorrect responses and trials to which the subjects responded with “I don’t know”.
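The comprehension index can be illustrated with a minimal sketch. It assumes the index is the difference between correct and incorrect responses divided by the total number of trials, which is consistent with the statement in the Results that a CI of 0.6 corresponds to 80% correct identification. The function name and the response encoding (`"T"`, `"F"`, `"?"`) are hypothetical and chosen only for illustration.

```python
def comprehension_index(responses):
    """Compute CI from (truth, answer) pairs.

    truth is "T" or "F"; answer is "T", "F" or "?" ("I don't know").
    Assumed formula: (n_correct - n_incorrect) / n_total.
    """
    n_total = len(responses)
    n_correct = sum(1 for truth, answer in responses if answer == truth)
    n_incorrect = sum(
        1 for truth, answer in responses
        if answer in ("T", "F") and answer != truth
    )
    return (n_correct - n_incorrect) / n_total


# 20 trials at one compression ratio: 16 correct, 4 incorrect (80% correct).
ci_80 = comprehension_index([("T", "T")] * 16 + [("T", "F")] * 4)

# A subject who always answers "I don't know" scores zero.
ci_unknown = comprehension_index([("T", "?")] * 20)
```

Under this definition, random guessing between “True” and “False” yields an expected CI of zero, because correct and incorrect responses cancel on average.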
The electrophysiological experiment, carried out in the six subjects, employed a set of six time-compressed speech stimuli. Five of the stimuli were time-compressed versions of a sentence “Black cars cannot park”, presented at compression ratios of 0.75, 0.50, 0.40, 0.30, and 0.20 (see Fig. 1). The sixth stimulus, “Black dogs can all bark,” presented at a compression ratio of 0.75, was used as a target in an oddball detection task to maintain the subject in an alert state. The subjects were instructed to press a button whenever the oddball stimulus was detected. The output of the response box was monitored with an oscilloscope during the recording sessions. The sounds were delivered binaurally via insert earphones (ER4B, Etymotic Research, Elk Grove Village, IL, USA) mounted in subject-specific custom made earmolds. Each stimulus was presented 50 times in random order at a comfortable level (45–55 dB above hearing threshold). The duration of the speech stimuli ranged from 0.29 to 1.05 seconds (at compression ratios of 0.20 and 0.75, respectively). The interval between stimulus onsets was fixed at 3 seconds. Stimulus delivery and data acquisition were controlled by a TDT RX5 or RZ2 processor.
Details of electrode implantation have been described previously (Howard et al., 1996, 2000; Brugge et al., 2008; Reddy et al., 2009). In brief, custom-designed hybrid depth electrode (HDE) arrays were implanted stereotactically into HG, along its anterolateral to posteromedial axis. HDEs included six platinum macro-contacts, spaced 10 mm apart, which were used to record clinical data. Fourteen platinum micro-contacts (diameter 40 µm, impedance 0.08–0.7 MΩ) were distributed at 2–4 mm intervals between the macro-contacts and were used to record intracortical electrocorticogram (ECoG). The reference for the micro-contacts was either a sub-galeal contact or one of the two most lateral macro-contacts near the lateral surface of the superior temporal gyrus. Reference electrodes, including those near the lateral surface of the superior temporal gyrus, were relatively inactive compared to the large amplitude activity recorded from more medial portions of HG. Recording electrodes remained in place for approximately 2–3 weeks under the direction of the clinical epileptologists.
Each subject underwent whole-brain MRI and CT scanning prior to electrode implantation. To locate recording contacts on the HDEs, high-resolution T1-weighted structural MRIs (in-plane resolution 0.78×0.78×1.0 mm) were obtained both before and after electrode implantation. Pre- and post-implantation MRIs were co-registered using a 3D rigid fusion algorithm (Analyze version 8.1 software, Mayo Clinic, MN, USA). Coordinates for each electrode contact obtained from post-implantation MRI volumes were transferred to pre-implantation MRI volumes. Serial MR cross-sectional images containing the recording contacts were obtained perpendicular to the trajectory of the HDEs. The coordinates of the electrode shaft were determined using custom-designed software written in the MATLAB programming environment.
ECoG signals were recorded simultaneously from the intracranial HDE contacts, amplified, filtered (1.6–6000 Hz bandpass, 12 dB/octave rolloff), digitized at a sampling rate of 12207 Hz, and stored for subsequent offline analysis.
Envelopes of the speech stimuli were obtained by calculating the magnitude of the Hilbert transform of the speech signal waveform and low-pass filtering at 50 Hz using a 4th order Butterworth filter. ECoG signals obtained from each recording site were down-sampled to a sampling rate of 4069 Hz for computational efficiency. Trials that might be contaminated with noise (movement artifacts or electrical interference), and whose maximum amplitude deviated by more than 2.5 SD above the mean, were excluded from the analysis. Data analysis was performed using custom software (MATLAB version 7.7.0).
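The envelope extraction step can be sketched as follows. This is not the original MATLAB code: it is a standard-library Python illustration in which the analytic signal is built with a naive discrete Fourier transform, and the 4th order Butterworth low-pass is approximated by applying the Butterworth magnitude response in the frequency domain with zero phase (an assumption, since the filtering direction is not specified above). All function names and the test signal are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform, adequate for short demo signals."""
    n = len(x)
    sign = 1 if inverse else -1
    tw = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(x[m] * tw[(k * m) % n] for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def hilbert_envelope(x):
    """Magnitude of the analytic signal (negative frequencies set to zero)."""
    n = len(x)
    spec = dft(x)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue  # DC and Nyquist bins are kept unchanged
        spec[k] = 2 * spec[k] if k < (n + 1) // 2 else 0
    return [abs(v) for v in dft(spec, inverse=True)]


def butterworth_lowpass(x, fs, cutoff_hz=50.0, order=4):
    """Zero-phase low-pass using the Butterworth magnitude response."""
    n = len(x)
    spec = dft(x)
    for k in range(n):
        f = min(k, n - k) * fs / n  # frequency of bin k in Hz
        spec[k] *= 1.0 / math.sqrt(1.0 + (f / cutoff_hz) ** (2 * order))
    return [v.real for v in dft(spec, inverse=True)]


# Demo: a 100 Hz carrier modulated at 5 Hz, sampled at 1 kHz for 0.4 s.
fs = 1000.0
signal = [
    (1.0 + 0.5 * math.cos(2 * math.pi * 5.0 * i / fs))
    * math.cos(2 * math.pi * 100.0 * i / fs)
    for i in range(400)
]
envelope = butterworth_lowpass(hilbert_envelope(signal), fs)
```

In this demo the recovered envelope follows the 5 Hz modulator, 1 + 0.5 cos(2π·5t), while the 100 Hz carrier is discarded.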
In the time domain, stimulus-related phase-locked activity in the ECoG was characterized by the average evoked potential (AEP). The AEP estimates the most likely response waveform that would result from a single stimulus presentation, if stationary random noise were removed from the recorded voltage measurements. The rationale for this simple averaging approach is the explicit model that this response waveform (i.e. the AEP) is invariant (in amplitude values and onset latency) for all presentations of an identical stimulus. Therefore, in this homogeneous population of response waveforms, the AEP can be said to be 'phase-locked' to the stimulus. An alternative model is that response waveforms constitute an inhomogeneous set and are not invariant across identical stimulus trials. Simple averaging is not appropriate to estimate a most likely response waveform under this model and some form of single-trial analysis must be employed (Woody, 1967; Knuth et al., 2006; Crone et al., 1998). Such inhomogeneity may arise because the assumption of stationary independent noise is insufficient to characterize the physiological recordings and/or because of systematic variability in response waveforms due to unobserved covariates (e.g. adaptation, habituation, learning, etc.). In this single-trial analysis approach, the response waveform is said to be 'time-locked' to the stimulus given an operational definition of a response-time window.
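A minimal sketch of trial rejection followed by across-trial averaging is given below. It assumes that "maximum amplitude" means the largest absolute voltage in a trial and that the mean and SD are taken over these per-trial maxima using the sample standard deviation; neither detail is specified above. Function names and the toy data are hypothetical.

```python
import statistics


def reject_noisy_trials(trials, n_sd=2.5):
    """Drop trials whose peak absolute amplitude exceeds mean + n_sd * SD.

    The mean and SD are computed over the per-trial peak amplitudes
    (an assumed reading of the rejection criterion).
    """
    peaks = [max(abs(v) for v in trial) for trial in trials]
    limit = statistics.mean(peaks) + n_sd * statistics.stdev(peaks)
    return [trial for trial, peak in zip(trials, peaks) if peak <= limit]


def average_evoked_potential(trials):
    """Sample-by-sample mean across trials (the AEP)."""
    return [sum(column) / len(trials) for column in zip(*trials)]


# Toy data: ten identical clean trials and one trial with a large artifact.
clean = [0.0, 1.0, 0.0, -1.0]
trials = [list(clean) for _ in range(10)] + [[0.0, 50.0, 0.0, -1.0]]

kept = reject_noisy_trials(trials)
aep = average_evoked_potential(kept)
```

Because the clean trials are identical here, the AEP reproduces the single-trial waveform exactly; with real recordings, averaging attenuates activity that is not phase-locked to the stimulus.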
Time-domain waveform averaging minimizes the contribution of time-locked but non-phase-locked (NPL) activity, which may be an important component of the neural activity evoked by speech. This is especially relevant for higher frequencies in the ECoG (Crone et al., 1998; Steinschneider et al., 2008). Thus, in addition to computing the AEP, the power in selected frequency bands in the ECoG signal was computed to obtain measures of the time-locked but not phase-locked response. This event-related band power (ERBP) reflects the increase or decrease in total power in a given frequency band with reference to the ongoing background ECoG (Crone et al., 1998; Pfurtscheller et al., 1999). Thus the ERBP will include both phase-locked (often termed ‘evoked’) power (Pantev, 1995) as well as non-phase-locked, yet time-locked (often termed ‘induced’) power (Kalcher et al., 1995; Pantev, 1995; Crone et al., 2001).
Time-frequency analysis of the ECoG was performed using wavelet transforms based on complex Morlet wavelets following the approach of Oya et al. (2002). Center frequencies ranged from 10 to 250 Hz in 10 Hz steps, and the constant ratio was defined as 2πf0σ = 7, where f0 is the center frequency and σ defines the wavelet width. Power measurements were done on a trial-by-trial basis and then averaged across trials. To quantify power changes as ERBP, mean power values were calculated at each center frequency within a reference period of 300 ms prior to the onset of the stimuli. ERBP values were then calculated at each center frequency and each time point in dB relative to mean power over the reference period. An advantage of such an approach is that power is normalized independently in each frequency band, thus ensuring that the 1/f statistical behavior of the ECoG power spectra does not impact the analysis.
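The ERBP computation for a single center frequency can be sketched as follows. This is an illustrative standard-library implementation, not the original analysis code. It assumes that σ is the temporal standard deviation of the Gaussian window, so that σ = 7/(2πf0), and it truncates the wavelet at ±3σ (an arbitrary choice). The function names and the synthetic trials are hypothetical.

```python
import cmath
import math


def morlet_power(x, fs, f0, ratio=7.0):
    """Single-trial power at center frequency f0 using a complex Morlet wavelet."""
    sigma_t = ratio / (2.0 * math.pi * f0)   # from 2*pi*f0*sigma = ratio
    half = int(round(3.0 * sigma_t * fs))    # truncate at +/- 3 sigma
    kernel = [
        cmath.exp(-0.5 * (k / fs / sigma_t) ** 2 - 2j * math.pi * f0 * k / fs)
        for k in range(-half, half + 1)
    ]
    n = len(x)
    power = []
    for t in range(n):
        acc = 0j
        for j, w in enumerate(kernel):
            i = t + j - half
            if 0 <= i < n:  # samples outside the trial are treated as zero
                acc += x[i] * w
        power.append(abs(acc) ** 2)
    return power


def erbp_db(trials, fs, f0, baseline_s=0.300):
    """Trial-averaged power in dB relative to the mean pre-stimulus power."""
    per_trial = [morlet_power(trial, fs, f0) for trial in trials]
    mean_power = [sum(col) / len(per_trial) for col in zip(*per_trial)]
    n_base = int(round(baseline_s * fs))
    base = sum(mean_power[:n_base]) / n_base
    return [10.0 * math.log10(p / base) for p in mean_power]


# Synthetic trials: 300 ms baseline, then a tenfold amplitude increase of a
# 100 Hz component starting 50 ms after stimulus onset. The phase differs
# from trial to trial, so the activity is time-locked but not phase-locked.
fs = 1000.0
trials = [
    [
        (1.0 if i >= 350 else 0.1)
        * math.sin(2 * math.pi * 100.0 * i / fs + 0.7 * trial)
        for i in range(700)
    ]
    for trial in range(5)
]
erbp = erbp_db(trials, fs, f0=100.0)
```

A tenfold increase in amplitude corresponds to a hundredfold increase in power, so the ERBP in the response window should be close to 20 dB. Repeating the call for center frequencies from 10 to 250 Hz in 10 Hz steps would give the full time-frequency representation.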
While most time-frequency analyses presented in this study measured total power, we also estimated NPL cortical activity in a limited data set. In this estimation procedure, the contribution of phase-locked response components was minimized using the approach of Crone et al. (2001), by subtracting the AEP from each individual trial waveform prior to the wavelet transformation.
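The subtraction step used to estimate NPL activity can be sketched as follows, as a standard-library illustration with hypothetical function names and synthetic data. Each trial is modeled as a phase-locked 10 Hz component plus a 100 Hz component whose phase varies across trials; subtracting the AEP removes the former and leaves the latter for subsequent time-frequency analysis.

```python
import math


def average_evoked_potential(trials):
    """Sample-by-sample mean across trials."""
    return [sum(column) / len(trials) for column in zip(*trials)]


def remove_phase_locked(trials):
    """Subtract the AEP from every trial, leaving the non-phase-locked residual."""
    aep = average_evoked_potential(trials)
    return [[v - a for v, a in zip(trial, aep)] for trial in trials]


fs = 1000.0
n_trials, n_samples = 8, 200

# Phase-locked part: identical on every trial.
evoked = [math.sin(2 * math.pi * 10.0 * i / fs) for i in range(n_samples)]

# Non-phase-locked part: phases evenly spread over one cycle across trials,
# so this component cancels in the across-trial average.
induced = [
    [
        0.5 * math.sin(2 * math.pi * 100.0 * i / fs + 2 * math.pi * k / n_trials)
        for i in range(n_samples)
    ]
    for k in range(n_trials)
]

trials = [[e + v for e, v in zip(evoked, ind)] for ind in induced]
aep = average_evoked_potential(trials)
residuals = remove_phase_locked(trials)
```

In this idealized case the AEP equals the phase-locked component and each residual equals the induced component of its trial. With real data the separation is only approximate, since any trial-to-trial variability of the evoked response remains in the residual.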
ERBP envelopes were calculated as log-transformed power changes, normalized and averaged, over the range of frequencies between 70 and 130 Hz in subject R153, and between 70 and 250 Hz in the other five subjects. The range used for data collected from subject R153 differed from the others due to noise contamination of unknown origin that affected the recorded ECoG at frequencies above 130 Hz.
Representation of the temporal stimulus envelope in the cortical activity was quantified in the time domain using cross-correlation analysis (Bieser and Müller-Preuss, 1996; Abrams et al., 2008) and, in the frequency domain, as modal frequency matching (Ahissar and Ahissar, 2005). Peaks of cross-correlograms were found between lags of 0 and 150 ms. Ninety-five percent confidence intervals of the cross-correlation peaks were calculated based on 1000 bootstrapped samples.
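The cross-correlation measure can be sketched as follows. The sketch makes several assumptions that are not stated above: the cross-correlogram is computed as a Pearson correlation at each lag, the bootstrap resamples trials with replacement before averaging, and the confidence interval is a simple percentile interval. Function names, the sampling rate and the synthetic data are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def xcorr_peak(stim, resp, fs, max_lag_s=0.150):
    """Largest correlation for response lags between 0 and max_lag_s."""
    best_r, best_lag = -2.0, 0
    for lag in range(int(round(max_lag_s * fs)) + 1):
        r = pearson(stim[:len(stim) - lag], resp[lag:])
        if r > best_r:
            best_r, best_lag = r, lag
    return best_r, best_lag / fs


def bootstrap_peak_ci(stim, trials, fs, n_boot=1000, seed=0):
    """95% percentile interval of the peak, resampling trials with replacement."""
    rng = random.Random(seed)
    peaks = []
    for _ in range(n_boot):
        sample = [rng.choice(trials) for _ in trials]
        mean_resp = [sum(col) / len(sample) for col in zip(*sample)]
        peaks.append(xcorr_peak(stim, mean_resp, fs)[0])
    peaks.sort()
    return peaks[int(0.025 * n_boot)], peaks[int(0.975 * n_boot) - 1]


# Synthetic example: the response is the stimulus envelope delayed by 50 ms
# plus independent noise on each of 20 trials (1 s at 200 Hz).
fs = 200.0
stim = [
    1.0 + 0.6 * math.sin(2 * math.pi * 3.0 * i / fs)
    + 0.4 * math.sin(2 * math.pi * 7.0 * i / fs)
    for i in range(200)
]
noise = random.Random(1)
trials = [
    [stim[max(i - 10, 0)] + noise.gauss(0.0, 0.1) for i in range(200)]
    for _ in range(20)
]
mean_resp = [sum(col) / len(trials) for col in zip(*trials)]
peak_r, peak_lag = xcorr_peak(stim, mean_resp, fs)
ci_low, ci_high = bootstrap_peak_ci(stim, trials, fs, n_boot=200)
```

The demo uses 200 bootstrap samples only to keep the run time short; the default of 1000 matches the number stated above.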
Power spectra of time-compressed speech stimulus envelopes, ECoG single trial waveforms and ERBP envelopes were estimated using the Thomson multitaper approach (Thomson, 1982) as implemented in MATLAB version 7.7.0. The spectrum estimation algorithm was applied with a time-bandwidth product of 1.5 following removal of the linear trend. The power spectra of the stimulus envelopes were characterized by their modal frequencies, which ranged from 3.7 to 14 Hz (at compression ratios of 0.75 and 0.20, respectively) (Supplementary Figure 1). Modal frequencies of ECoG averaged power spectra and ERBP spectra were defined as maximal spectral peaks at frequencies above the reciprocal of the stimulus duration. Peaks below this frequency were ignored because they were likely to represent artifacts of zero-padding and detrending in the context of a DC offset in the ERBP.
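As a consistency check on the reported modal frequencies, one can assume that uniform time compression to a fraction $c$ of the natural duration scales every envelope frequency by $1/c$. This scaling is an assumption introduced here for illustration, as are the symbols $c$ and $f_{\mathrm{nat}}$ (the modal frequency at the natural speaking rate). The two reported endpoints are then related by

$$f_{\mathrm{modal}}(c) = \frac{f_{\mathrm{nat}}}{c}, \qquad f_{\mathrm{nat}} \approx 0.75 \times 3.7\ \mathrm{Hz} \approx 2.8\ \mathrm{Hz}, \qquad f_{\mathrm{modal}}(0.20) \approx \frac{0.75}{0.20} \times 3.7\ \mathrm{Hz} \approx 13.9\ \mathrm{Hz},$$

which agrees with the reported value of about 14 Hz at the highest degree of compression.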
Stimulus-response frequency matching was evaluated for the raw ECoG signal as well as for ERBP envelope from their power spectra. In the former case, frequency matching was measured as the difference between modal frequency of the stimulus envelope and the local maximum of the averaged spectrum of ECoG, and in the latter case, as the difference between the modal frequency of the stimulus envelope and the local maximum of the ERBP envelope.
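The frequency-matching measure can be sketched as follows. For brevity this sketch replaces the Thomson multitaper estimator with a plain periodogram evaluated on a fine frequency grid, so it is not the estimator used in the study. The 25 Hz upper limit, the 0.05 Hz grid step, the function names and the synthetic envelopes are all assumptions made for illustration.

```python
import cmath
import math


def detrend_linear(x):
    """Remove the least-squares straight line (mean and slope) from x."""
    n = len(x)
    t_mean = (n - 1) / 2.0
    x_mean = sum(x) / n
    slope = sum((i - t_mean) * (v - x_mean) for i, v in enumerate(x)) / sum(
        (i - t_mean) ** 2 for i in range(n)
    )
    return [v - (x_mean + slope * (i - t_mean)) for i, v in enumerate(x)]


def modal_frequency(x, fs, f_max=25.0, df=0.05):
    """Frequency of the largest periodogram value above 1/duration."""
    y = detrend_linear(x)
    f_min = fs / len(y)  # reciprocal of the signal duration
    best_f, best_p = f_min, -1.0
    for k in range(1, int((f_max - f_min) / df) + 1):
        f = f_min + k * df
        step = cmath.exp(-2j * math.pi * f / fs)
        acc, phasor = 0j, 1 + 0j
        for v in y:
            acc += v * phasor
            phasor *= step
        p = abs(acc) ** 2
        if p > best_p:
            best_f, best_p = f, p
    return best_f


# Synthetic envelopes lasting 1.05 s, sampled at 200 Hz. The response has
# a different offset, a slow drift, a smaller amplitude and a 50 ms delay.
fs = 200.0
t = [i / fs for i in range(210)]
stim_env = [1.0 + math.cos(2 * math.pi * 3.7 * ti) for ti in t]
resp_env = [
    2.0 + 0.5 * ti + 0.8 * math.cos(2 * math.pi * 3.7 * (ti - 0.05)) for ti in t
]

f_stim = modal_frequency(stim_env, fs)
f_resp = modal_frequency(resp_env, fs)
mismatch = f_resp - f_stim
```

A mismatch near zero indicates that the response spectrum peaks at the modal frequency of the stimulus envelope. With signals this short the spectral resolution is about 1 Hz, so small deviations between the two estimates are expected.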
Intelligibility of time-compressed speech sentences was evaluated in a psychophysical experiment, the results of which are presented in Fig. 2. At compression ratios of 0.75, 0.50 and 0.40, comprehension index values were relatively high (≥0.6) in all tested subjects, corresponding to correct identification of at least 80% of the sentences. This indicates that speech sentences presented at these compression ratios were intelligible. At a compression ratio of 0.30, speech comprehension deteriorated, and comprehension of sentences compressed to 0.20 of the original duration was at or below chance level (dashed line in Fig. 2), indicating that the most compressed speech sentences were unintelligible.
The neurosurgical patients (symbols in Fig. 2) did not differ considerably from the group of healthy volunteers (lines with error bars in Fig. 2) in their ability to comprehend time-compressed speech. A two-factor repeated measures analysis of variance was conducted to evaluate the effect of subject-population and compression ratio on comprehension index. In this repeated measures design, the between-subject factor was subject-population with two levels (patients and volunteers) and the within-subject factor was compression ratio with five levels (0.75, 0.50, 0.40, 0.30, 0.20). The α level was set at 0.05. A significant main effect was found for compression ratio, F(4,20)=112.42, p<0.0001. Neither the main effect for subject-population, F(1, 23) = 0.66, p < 0.42, nor the interaction between the factors, F(4, 20) = 0.072, p < 0.58, was significant. The results of this psychophysical test are consistent with speech comprehension data reported previously by Ahissar et al., obtained using essentially the same experimental paradigm (cf. Fig. 3C in Ahissar et al., 2001).
Time-compressed speech stimuli elicited robust AEPs in HG, with responses having the shortest latencies and highest amplitudes in the posteromedial portion of the gyrus (Fig. 3). Here temporal synchrony to the speech envelope was evident at moderate compression ratios (0.75-0.40) as a series of peaks in the AEP waveform (Fig. 3B, contacts 3–8). At compression ratios that affected comprehension (0.30-0.20), however, responses were dominated by a relatively large waveform complex that was time-locked to the stimulus onset. Synchrony to the temporal envelope of the stimulus was not apparent. In contrast, AEPs recorded from anterolateral HG (contacts 9–12) had longer latencies, lower amplitudes and little or no evidence of envelope following.
The AEP waveforms are useful in evaluating the response waveform that is phase-locked to the stimulus waveform and largely invariant across trials. Response activity that is time-locked but not phase-locked to the stimulus waveform would necessarily be markedly attenuated in the across-trial averaging process (Woody, 1967; Glaser & Ruchkin, 1976). To explore this component of speech-evoked activity, we performed spectral analyses of the ECoG data recorded from each of the HG recording sites on a trial-by-trial basis and measured changes in ERBP across a range of frequencies that extended from 10 to 250 Hz (see Methods). Figure 4 shows the results of such an analysis applied to the data set introduced in Figure 3. Within posteromedial HG, cortical activity exhibited increases of ERBP that spanned a wide range of frequencies and were most prominent in the high frequency (70 Hz and above) range. ERBP was not constant in magnitude throughout the duration of the stimulus but appeared to be modulated by the temporal envelope of the speech stimulus. This pattern of ERBP changes, seemingly driven by the stimulus temporal envelope, was observed even in responses to the most compressed (0.30-0.20) stimuli at some (contacts 3–6 in Fig. 4) but not at all recording sites within HG. This finding is in contrast to the AEP, which failed to reveal following of the temporal envelope at these high compression ratios (cf. Fig. 3B).
Additional differences between the ERBP and the AEP include an abrupt decline lateral to contact 6 in the magnitude of the ERBP at all compression ratios, despite the presence of strong temporal synchrony in the AEP at contacts 7 and 8. ERBP data obtained from sites more anterolateral on HG showed some responses throughout the duration of sound stimulus extending to contact 12. These power changes, which were clearly seen between about 50 and 150 Hz, were relatively modest and did not exhibit modulation by the stimulus temporal envelope seen in posteromedial HG, even under the least compressed (0.75) condition.
ERBP measures shown in Figure 4 show stimulus-related changes in both phase-locked and non-phase-locked power. Abrupt-onset components of complex acoustic stimuli such as speech are likely to trigger cortical activity with a relatively high degree of temporal synchrony. We hypothesized that different components of the speech envelope (such as syllable onsets and vowel nuclei) might be differentially represented by high frequency cortical activity. To address this question, we attempted to minimize the contribution of phase-locked power by subtracting the AEP from each ECoG trial waveform prior to time-frequency ERBP analysis (Crone et al., 2001; Pulvermüller et al., 1997). We found that NPL activity in core auditory cortex exhibited modulation by the stimulus envelope (Supplementary Figure 2). On the other hand, syllable onsets were emphasized in the cortical response by phase-locked high frequency activity, as can be seen from a comparison between plots of total and NPL ERBP in Supplementary Figure 2. Although detailed comparison of phase-locked and NPL high frequency auditory cortical activity is beyond the scope of this study and currently is under further investigation, we note that phase-locked and NPL ERBP may differentially represent rising and steady-state components of the speech envelope, respectively.
A question remains as to the extent temporal synchrony to the speech envelope is represented by posteromedial HG bilaterally (Liégeois-Chauvel et al., 1999; 2004). We could not address this question directly, as simultaneous recording from the left and right hemispheres from the same subjects was not possible due to clinical considerations. We were, however, able to compare data recorded from left (language-dominant) and right (non-dominant) hemispheres across the studied group of subjects.
Results obtained from HG of the right (non-dominant) hemisphere in a representative subject are shown in Figure 5 and Figure 6. As with recordings from HG of the language-dominant hemisphere shown previously (see Fig. 3 and Fig. 4), robust AEPs were recorded in posteromedial HG at all compression ratios. Synchrony to the stimulus envelope at compression ratios of 0.75 to 0.50 was most evident at several adjacent recording sites (contacts 3, 4, 5 in Fig. 5) located in the central portion of posteromedial HG. There appeared to be a shift in location of the envelope-following response in the AEP (contacts 8, 9, 10) at compression ratios of 0.40 to 0.30. ERBP exhibited modulation by the stimulus envelope at compression ratios extending to 0.20 (Fig. 6). Again, the most prominent temporal modulation of ERBP was in posteromedial HG, and, again, the spatial distribution of ERBP modulation was not entirely coextensive with that of AEP envelope following. A transition seems to have occurred around contact 10, both in the AEP and ERBP. Further anterolaterally on HG (contacts 11 and 12), the AEP was of low magnitude and showed little or no sign of envelope following. ERBP, on the other hand, revealed a faint representation of the stimulus envelope at the lowest compression ratios (0.75-0.50).
Although envelope-following was recorded in posteromedial HG in all subjects studied, there was considerable inter-subject variability. This is illustrated in Figure 7, which presents AEPs and ERBP envelopes in response to the compressed speech stimuli, recorded at sites of maximal ERBP change within posteromedial HG for all six subjects. In three subjects, recordings were made from the right (R), language non-dominant hemisphere, while in the other three, data were obtained from the left (L), language-dominant hemisphere. In all cases, and at all degrees of compression, stimulus-evoked activity was robust within the high frequency range (70 Hz and up), peaking at about 3–6 dB re prestimulus baseline. Envelope following was also exhibited by all subjects, but the strength of response modulation varied considerably among them even at the lowest degree (0.75) of compression. At the most compressed conditions (0.30-0.20), where intelligibility of speech considerably deteriorated (see Fig. 2), envelope following was still present in four subjects (L156, R154, R153, and L173).
To quantify the representation of the temporal stimulus envelope in the cortical activity, we utilized two approaches. In the time domain, the accuracy of envelope following was estimated using cross-correlation analysis (Abrams et al., 2008) and, in the frequency domain, using analysis of stimulus-response modal frequency matching (Ahissar and Ahissar, 2005).
First, envelope following by the AEP and ERBP within core auditory cortex was quantified by measuring peaks of cross-correlograms between speech envelopes and AEPs, and between speech envelopes and the high frequency ERBP envelope (70–250 Hz; see Methods), respectively. Figure 8 presents the results of this analysis performed on data obtained from the six subjects at the same core auditory cortex locations as those shown in Figure 7. In four subjects out of six (L156, L173, R154 and R153), correlation between the stimulus envelope and the high frequency ERBP envelope remained consistently high across compression ratios, including the most compressed (unintelligible) condition. This applies to both total and non-phase-locked ERBP (open squares and triangles, respectively, in Fig. 8). In contrast, envelope following by the AEP (filled circles in Fig. 8) deteriorated with compression, consistent with a decrease in comprehension of highly compressed sentences (cf. Fig. 2). Similarly, the ERBP envelope in lower frequency bands (<50 Hz) did not reliably follow the temporal envelope of speech across the range of compression ratios (not shown).
Next, we sought to examine the extent of frequency matching between the temporal stimulus envelope and recorded cortical activity. As the spectral profiles of the speech envelopes were dominated by modal frequencies ranging from 3.7 to 14 Hz (see Supplementary Fig. 1), power spectra of cortical activity were estimated within relatively low (up to 25 Hz) frequency bands. An example of averaged power spectra of ECoG waveforms recorded from multiple HG sites (see Fig. 3A) is shown as blue lines in Figure 9. We also characterized modulation of ERBP by plotting its power spectra across locations and compression ratios (red lines in Fig. 9). The two power measures of the cortical response were compared with the power spectrum of the stimulus envelope (grey lines in Fig. 9).
Power spectra of the ECoG recorded from the posteromedial HG (contacts 3–9) featured peaks that matched the modal frequency of the stimulus envelope at moderate degrees of compression (0.75, 0.50 and 0.40), where speech stimuli were intelligible. This frequency matching was not present, however, in responses to the more compressed stimuli (0.30 and 0.20). In contrast, power spectra of ERBP envelopes exhibited peaks that matched the modal frequency of the stimulus envelopes even in the most compressed condition (0.20) (contacts 3–6 in Fig. 9).
Stimulus-response frequency matching was measured in the six subjects as a difference between the modal frequency of the stimulus envelope and a local peak of the response spectrum (Fig. 10). The low-frequency components of the ECoG exhibited frequency matching with the stimulus envelope of sentences compressed to ratios of 0.75, 0.50 and 0.40, and lack of frequency matching to more compressed stimuli (0.30-0.20). This finding is consistent with the MEG data reported by Ahissar et al. (2001). In contrast, the envelope of ERBP exhibited a more accurate frequency matching than low-frequency ECoG components and featured local spectral peaks matching the modal frequency of the stimulus envelope even at compression ratios of 0.30-0.20 in four subjects out of six (L156, L173, R154 and R153).
We also sought to establish a relationship between comprehension of time-compressed speech and the ability of the core auditory cortex to follow its temporal envelope either by phase-locking of low frequency ECoG components, or by amplitude modulation of high frequency activity. For this purpose, we computed correlation coefficients between speech comprehension (measured as comprehension index; see Fig. 2), on the one hand, and the accuracy of cortical envelope tracking (measured in the time domain as peaks of cross-correlations and in the frequency domain as frequency matching; see Fig. 8 and Fig. 10), on the other. The results are presented in Fig. 11. It can be observed that envelope following by low-frequency ECoG components exhibited strong positive correlations with speech comprehension (r=0.55 and r=0.66 for time- and frequency domain measures of envelope-following responses). On the other hand, modulation of high frequency cortical activity produced a generally more faithful representation of the stimulus envelope across speech comprehension, and exhibited only weak positive correlations with it (r=0.13 and r=0.19). This suggests that high frequency activity within the core auditory cortex can follow the envelope of speech stimuli by and large regardless of their intelligibility.
The results of the present study provide evidence that human auditory cortex resolves the temporal envelope of speech stimuli presented at natural speaking rates, as well as at degrees of time compression that make speech unintelligible. It does so by employing mechanisms operating over a wide range of ECoG frequencies, at least as high as 250 Hz. This temporal representation is most prominently featured within a restricted region of posteromedial HG, the presumed core auditory cortex, in both dominant and non-dominant hemispheres.
Despite selecting only those data from electrodes confirmed to be in posteromedial HG grey matter, modulation of both the AEP and the high-frequency ERBP varied among subjects at all rates of utterance. Moreover, in two subjects out of six there was no evidence of envelope following in the ERBP at the most compressed conditions (0.30–0.20) (see Fig. 8, Fig. 10). We note that standard audiometric test results showed speech reception scores in the normal range for all six subjects. Furthermore, all subjects were able to comprehend the speech stimuli when presented at compression ratios between 0.75 and 0.40 (see Fig. 2). While cortical high frequency activity can be influenced by selective attention (Ray et al., 2008), it is unlikely that the attentional load or the degree of arousal contributed to the inter-subject variability in cortical response locking in the present study, which employed an active-listening task.
Inter-subject variability is more likely associated with the functional organization within the core auditory cortex. What we have considered so far to be the core auditory cortex in our human subjects is not expected to be uniform in its cytoarchitecture, nor is it constant relative to gross anatomical landmarks (Galaburda & Sanides, 1980; Rademacher et al., 1993; Leonard et al., 1998; Hackett et al., 2003; Fullerton & Pandya, 2007). It is possible that we obtained the data shown from different primary or ‘primary-like’ fields making up the human auditory cortical core. In monkey, three fields within the core have been shown to exhibit demonstrably different capacities to encode temporal information (Bendor & Wang, 2008). Each of the core fields may exhibit tonotopy, which we did not map and which may have influenced temporal synchrony to our speech utterances. For example, the magnitude of the evoked response to a stop consonant is influenced by the onset spectra of the stimulus and by where the recording electrode is located within the tonotopic map of the primary auditory cortex (Steinschneider et al., 1995). Other functional organizations may be operating here as well to give rise to inter-subject variability (Read et al., 2002).
At relatively moderate time compression, where the speech utterances were typically intelligible, the AEP showed clear temporal following of the speech envelope. However, when the utterance was accelerated further, envelope following declined, and when the utterance was no longer intelligible, envelope following was no longer in evidence in the AEP. These results are consistent with the findings of Luo and Poeppel (2007), who demonstrated that low frequency (4–8 Hz) cortical activity measured by MEG was phase-locked to the speech signal and that this mechanism correlated with speech intelligibility. Our results are also consistent with the findings of Ahissar et al. (2001) showing that averaged evoked cortical activity measured by MEG was temporally locked to the envelope of moderately compressed speech sentences but failed to synchronize to the envelope of severely accelerated and unintelligible speech. From this they hypothesized that cortical envelope following was a prerequisite for speech comprehension. However, when we analyzed high frequency ERBP, using the same stimulus paradigm as Ahissar et al. (2001), a more complex picture emerged. Here we observed the ability of the core auditory cortex to synchronize to the speech envelope in the high frequency range of the ECoG even at rates that made the speech utterance unintelligible. We can therefore conclude that the ability of the auditory core cortex to follow low frequency fluctuations in the speech envelope is not per se a limiting factor for speech comprehension.
The relationships between response metrics (e.g., AEP, ERBP) derived from ECoG recordings and the intracortical electrodynamics representing those physiological and behavioral variables assumed to mediate these responses are complex. The literature, spanning more than half a century, is a testament to the importance attached to the resolution of issues that define these relationships (Li & Jasper, 1953; Vaughan & Costa, 1964; Morrell, 1967; Lopes da Silva, 1970; Mitzdorf, 1985; Barth & Di, 1990; Kandel & Buzsáki, 1997; Mukamel et al., 2005; Liu & Newsome, 2006; Ray et al., 2008; Steinschneider et al., 2008; Edwards et al., 2009). Although much progress has been made, further study is needed to arrive at a definitive and comprehensive explanation. Regardless of whether precise quantitative relationships can be established at this time, we subscribe to the belief that such metrics comprise correlates of time-delimited physiological processes reflecting changes in brain function caused by different stimulus attributes, learning history, and future expectancies, whether induced by manifestations of incoming activations, output processes, or memory readouts.
Liégeois-Chauvel et al. (2004) reported greater sensitivity of primary auditory cortex to 4 Hz AM noise in the left hemisphere compared to the right hemisphere. Abrams et al. (2008), however, measured scalp EEG responses to slow temporal features of speech corresponding to the syllable rate and found that envelope-following responses were larger over the right hemisphere. Clinical considerations prevented us from recording directly from the left and right hemispheres in the same subject. Nonetheless, taking into account inter-subject differences and small sample size, we have no compelling evidence for across-hemisphere differences in the representation of the temporal envelope within HG. Combining functional magnetic resonance imaging and ECoG measures of auditory cortical activity in the same subjects may be helpful in addressing this issue.
The speech sentences used by Ahissar et al. (2001) and in the present study can be characterized in terms of the modal frequencies of their temporal envelopes, which correspond to the average syllabic rate. The range of the modal frequencies, from 3.7 to 14 Hz (corresponding to compression ratios of 0.75 and 0.20 of a normal speaking rate), is within the envelope-following capacity of auditory cortex for sinusoidal AM acoustic stimuli, as has been shown previously using fMRI (Giraud et al., 2000) as well as time-domain averaged electroencephalogram (EEG), MEG, and ECoG recordings (Kuwada et al., 1986; Rees et al., 1986; Roß et al., 2000; Liégeois-Chauvel et al., 2004; Nourski et al., 2009). Compressing the utterance reduced its duration from 1.05 to 0.29 s. In the case of the shortest-duration stimulus, a large AEP to stimulus onset may have obscured a phase-locked response. The onset AEP response, however, does not mask high frequency activity revealed by ERBP measures even with the shortest-duration stimulus.
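The relation between compression ratio, modal envelope frequency, and stimulus duration quoted above can be checked with a short calculation. The sketch assumes that compression is uniform in time, so that envelope frequencies scale with the inverse of the compression ratio and duration scales with the ratio itself. That assumption, and the function names, are ours for illustration. Only the 0.75-ratio values (3.7 Hz, 1.05 s) are taken from the text.

```python
# Reference values reported in the text for the least compressed condition.
REF_RATIO = 0.75
REF_MODAL_HZ = 3.7
REF_DURATION_S = 1.05

def modal_frequency_at(ratio):
    """Modal envelope frequency (Hz), assuming it scales as 1 / ratio."""
    return REF_MODAL_HZ * REF_RATIO / ratio

def duration_at(ratio):
    """Stimulus duration (s), assuming it scales in proportion to ratio."""
    return REF_DURATION_S * ratio / REF_RATIO

# Tabulate the five compression ratios used in the study.
for ratio in (0.75, 0.50, 0.40, 0.30, 0.20):
    print(f"{ratio:.2f}: {modal_frequency_at(ratio):.1f} Hz, {duration_at(ratio):.2f} s")
```

Under this assumption the 0.20 condition comes out at about 13.9 Hz and 0.28 s, close to the reported 14 Hz and 0.29 s.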
Periodic acoustic stimuli presented at repetition rates between 3.7 and 14 Hz can elicit percepts of discrete events at the low end and flutter at the high end (Rosen, 1992; Bendor & Wang, 2007). We note that whereas speech comprehension may be lost at high speech compression ratios, the perception of acoustic flutter associated with envelope frequency of the speech signal is indeed retained. The discrepancy between the AEP and the high frequency ERBP ability to follow the accelerated speech envelope may provide a physiological counterpart of the difference between speech comprehension and perception of its acoustic features at a relatively early level of cortical speech processing within the core auditory cortex.
The current data reveal that the primary auditory cortex is capable of resolving low-frequency (up to at least 15 Hz) envelope information of time-compressed speech, which represents segmental cues. However, cortical evoked activity may not represent information on a shorter time scale (milliseconds to tens of milliseconds) that corresponds to voice onset times and formant transitions and is critical to speech perception. Based on the results of lesion studies, it has been suggested that human primary auditory cortex plays a specialized role in processing speech information in the milliseconds to tens-of-milliseconds time frame (Phillips & Farmer, 1990). The speech compression method used in the study of Ahissar et al. (2001) and in the current study preserves spectral features of natural speech, but does alter the timing of voice onset and formant transitions. Experiments are underway to examine whether the inability to comprehend compressed speech correlates with a loss of the ability of the primary auditory cortex to temporally represent stimulus features such as voice onsets and formant transitions, occurring on a time scale of tens of milliseconds.
We thank Mitchell Steinschneider, Christopher Turner and Paul Poon for advice and comments, Ehud Ahissar for providing experimental stimuli, Chandan Reddy and Fangxiang Chen for help during data collection, and Carol Dizack for graphic artwork. This study was supported by NIDCD RO1-DC04290, MO1-RR-59 General Clinical Research Centers Program, the Hoover Fund, and the Carver Trust.