Words and sentences from different English speakers were presented aurally to 15 patients undergoing neurosurgical procedures for epilepsy or brain tumor. All patients in this study had normal language capacity as determined by neurological exam. Cortical surface field potentials were recorded from non-penetrating multi-electrode arrays placed over the lateral temporal cortex (red circles), including the pSTG. We investigated the nature of auditory information contained in temporal cortex neural responses using a stimulus reconstruction approach (see Materials and Methods). The reconstruction procedure is a multi-input, multi-output predictive model fit to stimulus-response data. It constitutes a mapping from neural responses to a multi-dimensional stimulus representation. This mapping can be estimated using a variety of learning algorithms. In this study, a regularized linear regression algorithm was used to minimize the mean-square error between the original and reconstructed stimulus (see Materials and Methods). Once the model was fit to a training set, it could then be used to predict the spectro-temporal content of any arbitrary sound, including novel speech not used in training.
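The regularized linear regression at the core of this procedure can be sketched in a few lines. The following Python example is illustrative only: it uses synthetic data rather than the study's recordings, and the function name `fit_reconstruction`, the array shapes, and the ridge parameter `lam` are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def fit_reconstruction(R, S, lam=1.0):
    """Ridge (L2-regularized) linear reconstruction: find weights W
    minimizing the mean-square error ||S - R @ W||^2 + lam * ||W||^2.

    R : (T, E) array of neural responses (T time bins, E electrodes)
    S : (T, F) array of stimulus features (F spectrogram channels)
    Returns W : (E, F) reconstruction weights.
    """
    E = R.shape[1]
    return np.linalg.solve(R.T @ R + lam * np.eye(E), R.T @ S)

# Synthetic demonstration (not the study's data): responses that are
# a noisy linear function of a hypothetical 4-channel spectrogram.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 4))
R = rng.normal(size=(500, 8))                  # 500 time bins, 8 electrodes
S = R @ W_true + 0.1 * rng.normal(size=(500, 4))
W = fit_reconstruction(R, S, lam=0.1)
S_hat = R @ W                                  # reconstructed spectrogram
```

Once fit, the same weight matrix maps held-out neural responses to a predicted spectro-temporal representation, which is how out-of-sample prediction is evaluated in the cross-validated analyses described in the text.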
The key component of the reconstruction algorithm is the choice of stimulus representation, as this choice encapsulates a hypothesis about the neural coding strategy under study. Previous applications of stimulus reconstruction in non-human auditory systems have focused primarily on linear models that reconstruct the auditory spectrogram. The spectrogram is a time-varying representation of the amplitude envelope at each acoustic frequency (bottom left). The spectrogram envelope of natural sounds is not static but fluctuates across both frequency and time. These envelope fluctuations are referred to as modulations and play an important role in the intelligibility of speech. Temporal modulations occur at different temporal rates and spectral modulations occur at different spectral scales. For example, slow and intermediate temporal modulation rates (<4 Hz) are associated with syllable rate, while fast modulation rates (>16 Hz) correspond to syllable onsets and offsets. Similarly, broad spectral modulations relate to vowel formants, while narrow spectral structure characterizes harmonics. In the linear spectrogram model, modulations are represented implicitly as fluctuations of the spectrogram envelope, and neural responses are assumed to be linearly related to that envelope.
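The distinction between the spectrogram envelope and its modulation content can be made concrete with a short sketch. This Python example is ours, not from the study: a minimal STFT-magnitude spectrogram and a per-channel FFT of the envelope, applied to a synthetic 1 kHz tone amplitude-modulated at a syllable-like 3 Hz rate. Function names, window sizes, and the hop length are illustrative choices.

```python
import numpy as np

def spectrogram(x, win=256, hop=128):
    """Amplitude-envelope spectrogram: |STFT| over Hann windows.
    Rows are acoustic frequencies, columns are time frames."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq, time)

def temporal_modulations(spec, frame_rate):
    """Temporal modulation content: FFT of each frequency channel's
    envelope over time. Slow rates (<4 Hz) index syllable rate; fast
    rates (>16 Hz) index onsets/offsets."""
    M = np.abs(np.fft.rfft(spec, axis=1))
    rates = np.fft.rfftfreq(spec.shape[1], d=1.0 / frame_rate)
    return rates, M

# Synthetic example: a 1 kHz tone amplitude-modulated at 3 Hz.
fs = 8000
t = np.arange(fs) / fs                              # 1 s of audio
x = (1 + np.cos(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(x)
rates, M = temporal_modulations(spec, frame_rate=fs / 128)
```

In this toy case, the strongest acoustic channel sits near 1 kHz, and its envelope's dominant (non-DC) modulation rate recovers the 3 Hz amplitude modulation.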
For stimulus reconstruction, we first applied the linear spectrogram model to human pSTG responses using a stimulus set of isolated words from an individual speaker. We used a leave-one-out cross-validation fitting procedure in which the reconstruction model was trained on stimulus-response data from isolated words and evaluated by directly comparing the original and reconstructed spectrograms of the out-of-sample word. Reconstruction accuracy was quantified as the correlation coefficient (Pearson's r) between the original and reconstructed stimulus. The reconstruction procedure is illustrated for one participant with a high-density (4 mm) electrode grid placed over posterior temporal cortex. For different words, the linear model yielded accurate spectrogram reconstructions at the level of single-trial stimulus presentations (see Figure S7 and Supporting Audio File S1 for example audio reconstructions). The reconstructions captured major spectro-temporal features such as energy concentration at vowel harmonics (purple bars) and high-frequency components during fricative consonants ([z] and [s], green bars). The anatomical distribution of weights in the fitted reconstruction model revealed that the most informative electrode sites within temporal cortex were largely confined to pSTG.
Across the sample of participants (N = 15), cross-validated reconstruction accuracy for single trials was significantly greater than zero in all individual participants (p<0.001, randomization test). At the population level, mean accuracy averaged over all participants and stimulus sets (including different word sets and continuous sentences from different speakers) was highly significant (one-sample t test, df = 14). Mean accuracy also varied as a function of acoustic frequency.
Individual participant and group average reconstruction accuracy.
We observed that overall reconstruction quality was influenced by a number of anatomical and functional factors, as described below. First, informative temporal electrodes were primarily localized to pSTG. To quantify this, we defined “informative” electrodes as those associated with parameters with a high signal-to-noise ratio in the reconstruction models (p<0.05, false discovery rate (FDR) correction). The anatomical distribution of informative electrodes, pooled across participants and plotted in standardized anatomical coordinates (Montreal Neurological Institute, MNI), was centered in the pSTG (Brodmann area 42) and was dispersed along the anterior-posterior axis.
Factors influencing reconstruction quality.
Second, significant predictive power (r>0) was largely confined to neural responses in the high gamma band (~70–170 Hz; p<0.01, one-sample t test, df = 14, Bonferroni correction). Predictive power for the high gamma band was significantly better than for other neural frequency bands (p<0.05, Bonferroni-adjusted pair-wise comparisons between frequency bands, following a significant one-way repeated measures analysis of variance (ANOVA)). This is consistent with robust speech-induced high gamma responses reported in previous intracranial studies and with observed correlations between high gamma power and local spike rate.
Third, increasing the number of electrodes used in the reconstruction improved overall reconstruction accuracy. Overall prediction quality was relatively low for participants with five or fewer responsive STG electrodes (n = 6 participants) and was robust for cases with high-density grids (n = 4 participants, mean of 37 responsive STG electrodes per participant).
What neural response properties allow the linear model to find an effective mapping to the stimulus spectrogram? There are two major requirements, as described in the following paragraphs. First, individual recording sites must exhibit reliable frequency selectivity (e.g., right column; Figures S1B). An absence of frequency selectivity (i.e., equal neural response amplitudes to all stimulus frequencies) would imply that neural responses do not encode frequency and could not be used to differentiate stimulus frequencies. To quantify frequency tuning at individual electrodes, we used estimates of standard spectro-temporal receptive fields (STRFs) (see Materials and Methods). The STRF is a forward modeling approach commonly used to estimate neural tuning to a wide variety of stimulus parameters in different sensory systems. We found that different electrodes were sensitive to different acoustic frequencies important for speech sounds, ranging from low (~200 Hz) to high (~7,000 Hz). The majority of individual sites exhibited a complex tuning profile with multiple peaks (e.g., rows 2 and 3; Figure S2B). The full range of the acoustic speech spectrum was encoded by responses from multiple electrodes in the ensemble, although coverage of the spectrum varied by participant. Across participants, total reconstruction accuracy was positively correlated with the proportion of spectrum coverage.
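A minimal sketch of forward STRF estimation may help here. The example below is our own ridge-regularized least-squares fit on synthetic data; the lag structure, the `n_lags` parameter, and the toy tuning are illustrative assumptions, not the study's exact estimator.

```python
import numpy as np

def fit_strf(spec, resp, n_lags=5, lam=1.0):
    """Forward STRF: predict one electrode's response from the lagged
    stimulus spectrogram via ridge-regularized least squares.

    spec : (T, F) spectrogram; resp : (T,) neural response.
    Returns strf : (n_lags, F) weights; the profile of |weight| along
    the frequency axis summarizes the site's frequency tuning.
    """
    T, F = spec.shape
    X = np.zeros((T, n_lags * F))
    for lag in range(n_lags):                  # stack delayed copies
        X[lag:, lag * F:(lag + 1) * F] = spec[:T - lag]
    w = np.linalg.solve(X.T @ X + lam * np.eye(n_lags * F), X.T @ resp)
    return w.reshape(n_lags, F)

# Synthetic site tuned to frequency channel 2 with a one-bin delay.
rng = np.random.default_rng(2)
spec = rng.normal(size=(600, 4))
resp = np.roll(spec[:, 2], 1)                  # responds to channel 2, lagged
resp[0] = 0.0
strf = fit_strf(spec, resp, n_lags=3, lam=0.1)
```

For this toy site, the fitted STRF concentrates its weight at the correct frequency channel and lag, which is the sense in which the STRF recovers a site's frequency tuning.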
A second key requirement of the linear model is that the neural response must rise and fall reliably with fluctuations in the stimulus spectrogram envelope, because the model assumes a linear mapping between the response and the envelope. This requirement for “envelope-locking” reveals a major limitation of the linear model, which is most evident at fast temporal modulation rates. This limitation is illustrated by plotting reconstruction accuracy as a function of modulation rate (blue curve). A one-way repeated measures ANOVA indicated that accuracy was significantly higher for slow modulation rates (≤4 Hz) than for faster modulation rates (>8 Hz) (p<0.05, post hoc pair-wise comparisons, Bonferroni correction). Accuracy for slow and intermediate modulation rates (≤8 Hz) was significantly greater than zero (r ~0.15 to 0.42; one-sample paired t test, df = 14, Bonferroni correction), indicating that the high gamma response faithfully tracks the spectrogram envelope at these rates. However, accuracy was not significantly greater than zero at fast modulation rates (>8 Hz; r ~0.10; one-sample paired t test, df = 14, Bonferroni correction), indicating a lack of reliable envelope-locking to rapid temporal fluctuations.
Comparison of linear and nonlinear coding of temporal fluctuations.
Given the failure of the linear spectrogram model to reconstruct fast modulation rates, we evaluated competing models of auditory neural encoding. We investigated an alternative, nonlinear model based on modulation. Speech sounds are characterized by both slow and fast temporal modulations (e.g., syllable rate versus onsets) as well as narrow and broad spectral modulations (e.g., harmonics versus formants). The modulation model represents these multi-resolution features explicitly through a complex wavelet analysis of the auditory spectrogram. Computationally, the modulation representation is generated by a population of modulation-selective filters that analyze the two-dimensional spectrogram and extract modulation energy (a nonlinear operation) at different temporal rates and spectral scales. Conceptually, this transformation is similar to the modulus of a 2-D Fourier transform of the spectrogram, localized at each acoustic frequency. The modulation model and its applications to speech processing are described in detail elsewhere.
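The phase invariance at the heart of the modulation model can be demonstrated in a few lines. The sketch below is ours: it stands in for the localized modulation filter bank with a plain 2-D Fourier modulus, the conceptually similar operation noted above. A temporal shift of the envelope changes the spectrogram but leaves the modulation energy unchanged.

```python
import numpy as np

def modulation_energy(spec):
    """Modulation energy as the modulus of the 2-D Fourier transform of
    the spectrogram; the two axes correspond to spectral scale and
    temporal rate. A crude, global stand-in for a bank of localized
    modulation-selective filters."""
    return np.abs(np.fft.fft2(spec))

# Phase invariance: shifting the spectrogram envelope in time alters the
# transform's phase spectrum but not its magnitude (the energy).
rng = np.random.default_rng(3)
spec = rng.random((16, 64))                    # (frequency, time) envelope
shifted = np.roll(spec, 7, axis=1)             # circular temporal shift
E1 = modulation_energy(spec)
E2 = modulation_energy(shifted)
```

This invariance is what permits an energy-based (rather than envelope-locked) temporal code, as discussed next.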
Schematic of nonlinear modulation model.
The nonlinear component of the model is phase invariance to the spectrogram envelope. A fundamental difference from the linear spectrogram model is that phase invariance permits a nonlinear temporal coding scheme, whereby envelope fluctuations are encoded by response amplitude rather than by envelope-locking. Such amplitude-based coding schemes are broadly referred to as “energy models”. The modulation model therefore represents an auditory analog of the classical energy model of complex cells in the visual system, which are invariant to the spatial phase of visual stimuli.
Reconstructing the modulation representation proceeds similarly to the spectrogram, except that individual reconstructed stimulus components now correspond to modulation energy at different rates and scales instead of spectral energy at different acoustic frequencies (see Materials and Methods, Stimulus Reconstruction). We next compared reconstruction accuracy using the nonlinear modulation model to that of the linear spectrogram model (Figure S3). In the group data, the nonlinear model yielded significantly higher accuracy than the linear model (two-way repeated measures ANOVA; main effect of model type). This included significantly better accuracy for fast temporal modulation rates than the linear spectrogram model (4–32 Hz; red versus blue curves; model type by modulation rate interaction effect, p<0.01; post hoc pair-wise comparisons, Bonferroni correction).
The improved performance of the modulation model suggested that this representation provided better neural sensitivity to fast modulation rates than the linear spectrogram. To further investigate this possibility, we estimated modulation rate tuning curves at individual STG electrode sites (n = 195) using linear and nonlinear STRFs, which are based on the spectrogram and modulation representations, respectively (Figure S4). Consistent with prior recordings from lateral temporal human cortex, average envelope-locked responses exhibit prominent tuning to low rates (1–8 Hz) with a gradual loss of sensitivity at higher rates (>8 Hz). In contrast, the average modulation-based tuning curves preserve sensitivity to much higher rates, approaching 32 Hz.
Sensitivity to fast modulation rates at single STG electrodes is illustrated for one participant. In this example (the word “waldo”), the spectrogram envelope (blue curve, top) fluctuates rapidly between the two syllables (“wal” and “do,” ~300 ms). The linear model assumes that neural responses (high gamma power, black curves, left) are envelope-locked and directly track this rapid change. However, robust tracking of such rapid envelope changes was not generally observed, in violation of the linear model's assumptions; this is illustrated for several individual electrodes (compare black curves, left, with blue curve, top). In contrast, the modulation representation encodes this fluctuation nonlinearly as an increase in energy at fast rates (>8 Hz, dashed red curves, ~300 ms, bottom two rows), allowing the model to capture energy-based modulation information in the neural response. Modulation energy encoding at these sites is quantified by the corresponding nonlinear rate tuning curves (right column), which show neural sensitivity to a range of temporal modulations with a single peak rate. For illustrative purposes, modulation energy at the peak temporal rate (dashed red curves, left) is compared with the neural responses (black curves) at each individual site, illustrating the ability of the modulation model to account for a rapid decrease in the spectrogram envelope without a corresponding decrease in the neural response.
Example of nonlinear modulation coding and reconstruction.
The effect of sensitivity to fast modulation rates can also be observed when the modulation reconstruction is viewed in the spectrogram domain (middle; see Materials and Methods, Reconstruction Accuracy). The result is that dynamic spectral information (such as the upward frequency sweep at ~400–500 ms, top) is better resolved than in the linear spectrogram-based reconstruction (bottom). These combined results support the idea of an emergent population-level representation of temporal modulation energy in primate auditory cortex. In support of this notion, subpopulations of neurons that exhibit both envelope- and energy-based response properties have been found in primary auditory cortex of non-human primates. This has led to the suggestion of a dual coding scheme in which slow fluctuations are encoded by synchronized (envelope-locked) neurons, while fast fluctuations are encoded by non-synchronized (energy-based) neurons.
While these results indicate that a nonlinear model is required to reliably reconstruct fast modulation rates, psychoacoustic studies have shown that slow and intermediate modulation rates (~1–8 Hz) are most critical for speech intelligibility. These slow temporal fluctuations carry essential phonological information such as formant transitions and syllable rate. The linear spectrogram model, which also yielded good performance within this range (Figure S3), therefore appears sufficient to reconstruct the essential range of temporal modulations. To examine this issue, we further assessed reconstruction quality by evaluating the ability to identify isolated words from the linear spectrogram reconstructions. We analyzed a participant implanted with a high-density electrode grid (4 mm spacing), which provided a large set of pSTG electrodes. Compared to lower-density grid cases, data for this participant included ensemble frequency tuning that covered the majority of the (speech-related) acoustic spectrum (180–7,000 Hz), a factor we found was critical for accurate reconstruction. Spectrogram reconstructions were generated for each of 47 words, using neural responses either from single trials or averaged over 3–5 trials per word (same word set and cross-validated fitting procedure as described above). To identify individual words from the reconstructions, a simple speech recognition algorithm based on dynamic time warping was used to temporally align words of variable duration. For a target word, a similarity score (correlation coefficient) was then computed between the target reconstruction and the actual spectrograms of each of the 47 words in the candidate set. The 47 similarity scores were sorted, and word identification rank was quantified as the percentile rank of the correct word (1.0 indicates the target reconstruction matched the correct word best among all candidate words; 0.0 indicates it was least similar to the correct word). At chance level, the expected mean of the distribution of identification ranks is 0.5.
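The identification procedure can be sketched as follows. This Python example is illustrative only: it uses a plain DTW alignment cost in place of the study's correlation-based similarity score after warping, and toy random "words" rather than real spectrograms; `dtw_distance` and `identification_rank` are our own names.

```python
import numpy as np

def dtw_distance(A, B):
    """Dynamic time warping cost between two spectrograms of possibly
    different durations ((T, F) arrays), with Euclidean frame costs."""
    Ta, Tb = len(A), len(B)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb]

def identification_rank(recon, candidates, target):
    """Percentile rank of the correct word: the fraction of candidate
    words the correct word beats (1.0 = best match, 0.0 = worst)."""
    d = np.array([dtw_distance(recon, c) for c in candidates])
    others = np.delete(d, target)
    return np.mean(d[target] < others)       # lower distance wins

# Toy example: 5 candidate "words" of different durations; the
# reconstruction is a noisy, time-stretched copy of word 2.
rng = np.random.default_rng(4)
cands = [rng.random((10 + 2 * i, 3)) for i in range(5)]
recon = np.repeat(cands[2], 2, axis=0) + 0.05 * rng.normal(size=(28, 3))
rank = identification_rank(recon, cands, target=2)
```

DTW handles the variable word durations (here, a 2x time stretch) that would defeat a rigid frame-by-frame comparison.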
The median identification rank using averaged trials was substantially higher than chance (p<0.0001; randomization test), with correctly identified words exhibiting accurate reconstructions and poorly identified words exhibiting inaccurate reconstructions. For single trials, identification performance declined slightly but the median rank remained significantly above chance (p<0.0001; randomization test). In addition, for each possible word pair, we computed the similarity between the two original spectrograms and compared this to the similarity between the reconstructed and actual spectrograms (using averaged trials; Figure S5). Acoustic and reconstruction word similarities were correlated (df = 45), suggesting that acoustic similarity of the candidate words is likely to influence identification performance (i.e., identification is more difficult when the word set contains many acoustically similar sounds).