Auditory objects, like their visual counterparts, are perceptually defined constructs, but nevertheless must arise from underlying neural circuitry. We review studies that use magnetoencephalography (MEG) recordings of the neural responses of human subjects listening to complex auditory scenes to demonstrate that auditory objects are indeed neurally represented in auditory cortex. The studies use neural responses obtained from different experiments in which subjects selectively listen to one of two competing auditory streams embedded in a variety of auditory scenes. The auditory streams overlap spatially and often spectrally. In particular, the studies demonstrate that selective attentional gain does not act globally on the entire auditory scene, but rather acts differentially on the separate auditory streams. This stream-based attentional gain is then used as a tool to individually analyze the different neural representations of the competing auditory streams.
The neural representation of the attended stream, located in posterior auditory cortex, dominates the neural responses. Critically, when the intensities of the attended and background streams are separately varied over a wide intensity range, the neural representation of the attended speech adapts only to the intensity of that speaker, irrespective of the intensity of the background speaker. This demonstrates object-level intensity gain control in addition to the above object-level selective attentional gain.
Overall, these results indicate that concurrently streaming auditory objects, even if spectrally overlapping and not resolvable at the auditory periphery, are individually neurally encoded in auditory cortex, as separate objects.
auditory object; MEG; neural representation; cortical representation; speech
Auditory cortical activity is entrained to the temporal envelope of speech, which corresponds to the syllabic rhythm of speech. Such entrained cortical activity can be measured from subjects naturally listening to sentences or spoken passages, providing a reliable neural marker of online speech processing. A central question remains unanswered: is cortical entrained activity more closely related to speech perception or to non-speech-specific auditory encoding? Here, we review a few hypotheses about the functional roles of cortical entrainment to speech, e.g., encoding acoustic features, parsing syllabic boundaries, and selecting sensory information in complex listening environments. It is likely that speech entrainment is not a homogeneous response and that these hypotheses apply separately to speech entrainment generated from different neural sources. The relationship between entrained activity and speech intelligibility is also discussed. A tentative conclusion is that theta-band entrainment (4–8 Hz) encodes speech features critical for intelligibility while delta-band entrainment (1–4 Hz) is related to the perceived, non-speech-specific acoustic rhythm. To further understand the functional properties of speech entrainment, a splitter’s approach will be needed to investigate (1) not just the temporal envelope but what specific acoustic features are encoded and (2) not just speech intelligibility but what specific psycholinguistic processes are encoded by entrained cortical activity. Similarly, the anatomical and spectro-temporal details of entrained activity need to be taken into account when investigating its functional properties.
auditory cortex; entrainment of rhythms; speech intelligibility; speech perception in noise; speech envelope; cocktail party problem
Speech recognition is robust to background noise. One underlying neural mechanism is that the auditory system segregates speech from the listening background and encodes it reliably. Such robust internal representation has been demonstrated in auditory cortex by neural activity entrained to the temporal envelope of speech. A paradox, however, then arises, as the spectro-temporal fine structure rather than the temporal envelope is known to be the major cue to segregate target speech from background noise. Does the reliable cortical entrainment in fact reflect a robust internal “synthesis” of the attended speech stream rather than direct tracking of the acoustic envelope? Here, we test this hypothesis by degrading the spectro-temporal fine structure while preserving the temporal envelope using vocoders. Magnetoencephalography (MEG) recordings reveal that cortical entrainment to vocoded speech is severely degraded by background noise, in contrast to the robust entrainment to natural speech. Furthermore, cortical entrainment in the delta-band (1–4 Hz) predicts the speech recognition score at the level of individual listeners. These results demonstrate that reliable cortical entrainment to speech relies on the spectro-temporal fine structure, and suggest that cortical entrainment to the speech envelope is not merely a representation of the speech envelope but a coherent representation of multiscale spectro-temporal features that are synchronized to the syllabic and phrasal rhythms of speech.
envelope entrainment; auditory cortex; auditory scene analysis; MEG
Natural sensory inputs, such as speech and music, are often rhythmic. Recent studies have consistently demonstrated that these rhythmic stimuli cause the phase of oscillatory, i.e. rhythmic, neural activity, recorded as local field potential (LFP), electroencephalography (EEG) or magnetoencephalography (MEG), to synchronize with the stimulus. This phase synchronization, when not accompanied by any increase of response power, has been hypothesized to be the result of phase resetting of ongoing, spontaneous, neural oscillations measurable by LFP, EEG, or MEG. In this article, however, we argue that this same phenomenon can be easily explained without any phase resetting, with the stimulus-synchronized activity generated independently of background neural oscillations. It is demonstrated with a simple (but general) stochastic model that, purely due to statistical properties, phase synchronization, as measured by ‘inter-trial phase coherence’, is much more sensitive to stimulus-synchronized neural activity than is power. These results question the usefulness of analyzing the power and phase of stimulus-synchronized activity as separate and complementary measures, particularly when attempting to determine whether stimulus-phase-locked neural activity is generated by phase resetting of ongoing neural oscillations.
Phase resetting; Neural oscillations; Phase coherence; Entrainment
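The statistical argument in the abstract above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the stochastic model from the article: each trial is reduced to a single complex Fourier coefficient at the stimulus frequency, formed by adding a weak, phase-locked evoked component to independent complex Gaussian background activity. No phase resetting takes place. The function names, the evoked amplitude (0.5), the noise level (1.0), and the trial count (2000) are all hypothetical choices made for illustration.

```python
import random


def simulate_trials(n_trials, evoked_amplitude, noise_sd, rng):
    """Return one complex Fourier coefficient per trial.

    Each coefficient is an additive evoked component (fixed phase of zero)
    plus independent complex Gaussian background activity. The background
    is never reset by the stimulus (assumed toy model).
    """
    trials = []
    for _ in range(n_trials):
        background = complex(rng.gauss(0.0, noise_sd), rng.gauss(0.0, noise_sd))
        trials.append(evoked_amplitude + background)
    return trials


def mean_power(trials):
    """Average squared magnitude across trials."""
    return sum(abs(z) ** 2 for z in trials) / len(trials)


def inter_trial_phase_coherence(trials):
    """Magnitude of the mean unit-length phase vector across trials."""
    unit_vectors = [z / abs(z) for z in trials]
    return abs(sum(unit_vectors) / len(unit_vectors))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
baseline = simulate_trials(2000, 0.0, 1.0, rng)    # no stimulus-locked activity
stimulated = simulate_trials(2000, 0.5, 1.0, rng)  # weak additive evoked activity

power_ratio = mean_power(stimulated) / mean_power(baseline)
itc_baseline = inter_trial_phase_coherence(baseline)
itc_stimulated = inter_trial_phase_coherence(stimulated)

print(f"power ratio (stimulated / baseline): {power_ratio:.3f}")
print(f"ITC baseline: {itc_baseline:.3f}, ITC stimulated: {itc_stimulated:.3f}")
```

With these assumed parameters the evoked component adds only a small fraction to the total power, while the phase coherence rises well above its chance level, so a phase-coherence increase without a clear power increase appears even though the model is purely additive.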
Speech recognition is remarkably robust to the listening background, even when the energy of background sounds strongly overlaps with that of speech. How the brain transforms the corrupted acoustic signal into a reliable neural representation suitable for speech recognition, however, remains elusive. Here, we hypothesize that this transformation is performed at the level of auditory cortex through adaptive neural encoding, and we test the hypothesis by recording, using magnetoencephalography (MEG), the neural responses of human subjects listening to a narrated story. Spectrally matched stationary noise, which has maximal acoustic overlap with the speech, is mixed in at various intensity levels. Despite the severe acoustic interference caused by this noise, it is here demonstrated that low-frequency auditory cortical activity is reliably synchronized to the slow temporal modulations of speech, even when the noise is twice as strong as the speech. Such a reliable neural representation is maintained by intensity contrast gain control, and by adaptive processing of temporal modulations at different time scales, corresponding to the neural delta and theta bands. Critically, the precision of this neural synchronization predicts how well a listener can recognize speech in noise, indicating that the precision of the auditory cortical representation limits the performance of speech recognition in noise. Taken together, these results suggest that, in a complex listening environment, auditory cortex can selectively encode a speech stream in a background insensitive manner, and this stable neural representation of speech provides a plausible basis for background-invariant recognition of speech.
How speech signals are analyzed and represented remains a foundational challenge both for cognitive science and neuroscience. A growing body of research, employing various behavioral and neurobiological experimental techniques, now points to the perceptual relevance of both phoneme-sized (10–40 Hz modulation frequency) and syllable-sized (2–10 Hz modulation frequency) units in speech processing. However, it is not clear how information associated with such different time scales interacts in a manner relevant for speech perception. We report behavioral experiments on speech intelligibility employing a stimulus that allows us to investigate how distinct temporal modulations in speech are treated separately and whether they are combined. We created sentences in which the slow (~4 Hz; S_low) and rapid (~33 Hz; S_high) modulations (corresponding to ~250 and ~30 ms, the average duration of syllables and certain phonetic properties, respectively) were selectively extracted. Although S_low and S_high have low intelligibility when presented separately, dichotic presentation of S_high with S_low results in supra-additive performance, suggesting a synergistic relationship between low- and high-modulation frequencies. A second experiment desynchronized presentation of the S_low and S_high signals. Desynchronizing signals relative to one another had no impact on intelligibility when delays were less than ~45 ms. Longer delays resulted in a steep intelligibility decline, providing further evidence of integration or binding of information within restricted temporal windows. Our data suggest that human speech perception uses multi-time resolution processing. Signals are concurrently analyzed on at least two separate time scales, the intermediate representations of these analyses are integrated, and the resulting bound percept has significant consequences for speech intelligibility, a view compatible with recent insights from neuroscience implicating multi-timescale auditory processing.
speech perception; speech segmentation; temporal processing; modulation spectrum; auditory processing; syllable; phoneme
Humans routinely segregate a complex acoustic scene into different auditory streams, through the extraction of bottom-up perceptual cues and the use of top-down selective attention. To determine the neural mechanisms underlying this process, neural responses obtained through magnetoencephalography (MEG) were correlated with behavioral performance in the context of an informational masking paradigm. In half the trials, subjects were asked to detect frequency deviants in a target stream, consisting of a rhythmic tone sequence, embedded in a separate masker stream composed of a random cloud of tones. In the other half of the trials, subjects were exposed to identical stimuli but asked to perform a different task—to detect tone-length changes in the random cloud of tones. In order to verify that the normalized neural response to the target sequence served as an indicator of streaming, we correlated neural responses with behavioral performance under a variety of stimulus parameters (target tone rate, target tone frequency, and the “protection zone”, that is, the spectral area with no tones around the target frequency) and attentional states (changing task objective while maintaining the same stimuli). In all conditions that facilitated target/masker streaming behaviorally, MEG normalized neural responses also changed in a manner consistent with the behavior. Thus, attending to the target stream caused a significant increase in power and phase coherence of the responses in recording channels correlated with an increase in the behavioral performance of the listeners. Normalized neural target responses also increased as the protection zone widened and as the frequency of the target tones increased. Finally, when the target sequence rate increased, the buildup of the normalized neural responses was significantly faster, mirroring the accelerated buildup of the streaming percepts. 
Our data thus support close links between the perceptual and neural consequences of the auditory stream segregation.
Most ecologically natural sensory inputs are not limited to a single modality. While it is possible to use real ecological materials as experimental stimuli to investigate the neural basis of multi-sensory experience, parametric control of such tokens is limited. By using artificial bimodal stimuli composed of approximations to ecological signals, we aim to observe the interactions between putatively relevant stimulus attributes. Here we use MEG as an electrophysiological tool and employ as a measure the steady-state response (SSR), an experimental paradigm typically applied to unimodal signals. In this experiment we quantify the responses to a bimodal audio-visual signal with different degrees of temporal (phase) congruity, focusing on stimulus properties critical to audiovisual speech. An amplitude modulated auditory signal (‘pseudo-speech’) is paired with a radius-modulated ellipse (‘pseudo-mouth’), with the envelope of low-frequency modulations occurring in phase or at offset phase values across modalities. We observe (i) that it is possible to elicit an SSR to bimodal signals; (ii) that bimodal signals exhibit greater response power than unimodal signals; and (iii) that the SSR power at specific harmonics and sensors differentially reflects the congruity between signal components. Importantly, we argue that effects found at the modulation frequency and second harmonic reflect differential aspects of neural coding of multisensory signals. The experimental paradigm facilitates a quantitative characterization of properties of multi-sensory speech and other bimodal computations.
Audio-visual; Cross-modal; Magnetoencephalography; Speech; Multi-sensory
The ability to focus on and understand one talker in a noisy social environment is a critical social-cognitive capacity, whose underlying neuronal mechanisms are unclear. We investigated the manner in which speech streams are represented in brain activity and the way that selective attention governs the brain’s representation of speech using a ‘Cocktail Party’ Paradigm, coupled with direct recordings from the cortical surface in surgical epilepsy patients. We find that brain activity dynamically tracks speech streams using both low frequency phase and high frequency amplitude fluctuations, and that optimal encoding likely combines the two. In and near low level auditory cortices, attention ‘modulates’ the representation by enhancing cortical tracking of attended speech streams, but ignored speech remains represented. In higher order regions, the representation appears to become more ‘selective,’ in that there is no detectable tracking of ignored speech. This selectivity itself seems to sharpen as a sentence unfolds.
Diffusion Kurtosis Imaging (DKI) provides quantifiable information on the non-Gaussian behavior of water diffusion in biological tissue. Changes in water diffusion tensor imaging (DTI) parameters and DKI parameters in several white and grey matter regions were investigated in a mild controlled cortical impact (CCI) injury rat model at both the acute (2 hours) and the sub-acute (7 days) stages following injury. Mixed-model ANOVA revealed significant changes in temporal patterns of both DTI and DKI parameters in the cortex, hippocampus, external capsule and corpus callosum. Post-hoc tests indicated acute changes in mean diffusivity (MD) in the bilateral cortex and hippocampus (p < 0.0005) and fractional anisotropy (FA) in ipsilateral cortex (p < 0.0005), hippocampus (p = 0.014), corpus callosum (p = 0.031) and contralateral external capsule (p = 0.011). These changes returned to baseline by the sub-acute stage. However, mean kurtosis (MK) was significantly elevated at the sub-acute stage in all ipsilateral regions and scaled inversely with the distance from the impacted site (cortex and corpus callosum: p < 0.0005; external capsule: p = 0.003; hippocampus: p = 0.011). Further, at the sub-acute stage increased MK was also observed in the contralateral regions compared to baseline (cortex: p = 0.032; hippocampus: p = 0.039) while no change was observed with MD and FA. An increase in mean kurtosis was associated with increased reactive astrogliosis from immunohistochemistry analysis. Our results suggest that DKI is sensitive to microstructural changes associated with reactive astrogliosis which may be missed by standard DTI parameters alone. Monitoring changes in MK allows the investigation of molecular and morphological changes in vivo due to reactive astrogliosis and may complement information available from standard DTI parameters.
To date, the use of diffusion tensor imaging has been limited to studying changes in white matter integrity following traumatic insults. The sensitivity of DKI to microstructural changes in vivo, even in the grey matter, allows the technique to be extended to the study of patho-morphological changes in the whole brain following a traumatic insult.
Magnetic Resonance Imaging; diffusion tensor imaging; diffusion kurtosis imaging; traumatic brain injury; astrogliosis; rat brain
A biologically detailed model of the binaural avian nucleus laminaris is constructed, as a two-dimensional array of multicompartment, conductance-based neurons, along tonotopic and interaural time delay (ITD) axes. The model is based primarily on data from chick nucleus laminaris. Typical chick-like parameters perform ITD discrimination up to 2 kHz, and enhancements for barn owl perform ITD discrimination up to 6 kHz. The dendritic length gradient of NL is explained concisely. The response to binaural out-of-phase input is suppressed well below the response to monaural input (without any spontaneous activity on the opposite side), implicating active potassium channels as crucial to good ITD discrimination.
The auditory systems of birds and mammals use timing information from each ear to detect interaural time difference (ITD). To determine whether the Jeffress-type algorithms that underlie sensitivity to ITD in birds are an evolutionarily stable strategy, we recorded from the auditory nuclei of crocodilians, who are the sister group to the birds. In alligators, precisely timed spikes in the first-order nucleus magnocellularis (NM) encode the timing of sounds, and NM neurons project to neurons in the nucleus laminaris (NL) that detect interaural time differences. In vivo recordings from NL neurons show that the arrival time of phase-locked spikes differs between the ipsilateral and contralateral inputs. When this disparity is nullified by their best ITD, the neurons respond maximally. Thus NL neurons act as coincidence detectors. A biologically detailed model of NL with alligator parameters discriminated ITDs up to 1 kHz. The range of best ITDs represented in NL was much larger than in birds, however, and extended from 0 to 1000 μs contralateral, with a median ITD of 450 μs. Thus, crocodilians and birds employ similar algorithms for ITD detection, although crocodilians have larger heads.
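The coincidence-detection idea in the abstract above can be sketched with a toy array of detectors. This is an illustrative simplification, not the biologically detailed NL model: spike trains are binary sequences on a hypothetical discrete time grid, each detector delays one input by a different number of samples, and the detector whose internal delay cancels the interaural disparity counts the most coincidences. The function names, the spike period (25 samples), and the disparity (7 samples) are assumptions.

```python
def spike_train(n_samples, period, offset):
    """Binary spike train with one spike every `period` samples, starting at `offset`."""
    return [1 if t >= offset and (t - offset) % period == 0 else 0
            for t in range(n_samples)]


def coincidence_counts(left, right, max_delay):
    """Count coincident spikes for internal delays of 0..max_delay samples.

    Each entry models one detector that delays the left input by a fixed
    number of samples before comparing it with the right input.
    """
    counts = []
    for delay in range(max_delay + 1):
        delayed_left = [0] * delay + left[: len(left) - delay]
        counts.append(sum(l & r for l, r in zip(delayed_left, right)))
    return counts


# Phase-locked inputs: the right input lags the left by 7 samples (assumed ITD).
left = spike_train(2000, 25, 0)
right = spike_train(2000, 25, 7)

# Delays stay below the spike period, so the best delay is unambiguous.
counts = coincidence_counts(left, right, 12)
best_delay = counts.index(max(counts))
print("best internal delay (samples):", best_delay)
```

Keeping the range of internal delays shorter than the spike period avoids the phase ambiguity that would otherwise let several detectors respond equally.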
Studies in all sensory modalities have demonstrated amplification of early brain responses to attended signals, but less is known about the processes by which listeners selectively ignore stimuli. Here we use MEG and a new paradigm to dissociate in time the effects of selectively attending and ignoring. Two different tasks were performed successively on the same acoustic stimuli: triplets of tones (A, B, C) with noise-bursts interspersed between the triplets. In the COMPARE task subjects were instructed to respond when tones A and C were of the same frequency. In the PASSIVE task they were instructed to respond as fast as possible to noise-bursts. COMPARE requires attending to A and C and actively ignoring tone B, but PASSIVE involves neither attending to nor ignoring the tones. The data were analyzed separately for frontal and auditory-cortical channels to independently address attentional effects on low-level sensory versus putative control processing. We observe the earliest attend/ignore effects as early as 100 ms post stimulus onset in auditory cortex. These appear to be generated by modulation of exogenous (stimulus-driven) sensory evoked activity. Specifically related to ignoring, we demonstrate that active-ignoring-induced input inhibition involves early selection. We identified a sequence of early (<200 ms post onset) auditory cortical effects, comprising onset response attenuation and the emergence of an inhibitory response, and provide new, direct evidence that listeners actively ignoring a sound can reduce their stimulus-related activity in auditory cortex by 100 ms after onset when this is required to execute specific behavioral objectives.
Attention; suppression; MEG; auditory evoked response; M100; auditory cortex; frontal cortex; gain modulation
We present a method for removing unwanted components of biological origin from neurophysiological recordings such as magnetoencephalography (MEG), electroencephalography (EEG), or multichannel electrophysiological or optical recordings. A spatial filter is designed to partition recorded activity into stimulus-related and stimulus-unrelated components, based on a criterion of stimulus-evoked reproducibility. Components that are not reproducible are projected out to obtain clean data. In experiments that measure stimulus-evoked activity, typically about 80% of noise power is removed with minimal distortion of the evoked response. Signal-to-noise ratios of better than 0 dB (50% reproducible power) may be obtained for the single most reproducible spatial component. The spatial filters are synthesized using a blind source separation method known as Denoising Source Separation (DSS), which allows the measure of interest (here proportion of evoked power) to guide the source separation. That method is of greater general use, allowing data denoising beyond the classical stimulus-evoked response paradigm.
MEG; Magnetoencephalography; EEG; Electroencephalography; noise reduction; artifact removal; Principal Component Analysis
Bottom-up (stimulus-driven) and top-down (attentional) processes interact when a complex acoustic scene is parsed. Both modulate the neural representation of the target in a manner strongly correlated with behavioral performance.
The mechanism by which a complex auditory scene is parsed into coherent objects depends on poorly understood interactions between task-driven and stimulus-driven attentional processes. We illuminate these interactions in a simultaneous behavioral–neurophysiological study in which we manipulate participants' attention to different features of an auditory scene (with a regular target embedded in an irregular background). Our experimental results reveal that attention to the target, rather than to the background, correlates with a sustained (steady-state) increase in the measured neural target representation over the entire stimulus sequence, beyond auditory attention's well-known transient effects on onset responses. This enhancement, in both power and phase coherence, occurs exclusively at the frequency of the target rhythm, and is only revealed when contrasting two attentional states that direct participants' focus to different features of the acoustic stimulus. The enhancement originates in auditory cortex and covaries with both behavioral task and the bottom-up saliency of the target. Furthermore, the target's perceptual detectability improves over time, correlating strongly, within participants, with the target representation's neural buildup. These results have substantial implications for models of foreground/background organization, supporting a role of neuronal temporal synchrony in mediating auditory object formation.
Attention is the cognitive process underlying our ability to focus on specific aspects of our environment while ignoring others. By its very definition, attention plays a key role in differentiating foreground (the object of attention) from unattended clutter, or background. We investigate the neural basis of this phenomenon by engaging listeners to attend to different components of a complex acoustic scene. We present a spectrally and dynamically rich, but highly controlled, stimulus while participants perform two complementary tasks: to attend either to a repeating target note in the midst of random interferers (“maskers”), or to the background maskers themselves. Simultaneously, the participants' neural responses are recorded using the technique of magnetoencephalography (MEG). We hold all physical parameters of the stimulus fixed across the two tasks while manipulating one free parameter: the attentional state of listeners. The experimental findings reveal that auditory attention strongly modulates the sustained neural representation of the target signals in the direction of boosting foreground perception, much like known effects of visual attention. This enhancement originates in auditory cortex, and occurs exclusively at the frequency of the target rhythm. The results show a strong interaction between the neural representation of the attended target with the behavioral task demands, the bottom-up saliency of the target, and its perceptual detectability over time.
Auditory objects are detected if they differ acoustically from the ongoing background. In simple cases, the appearance or disappearance of an object involves a transition in power, or frequency content, of the ongoing sound. However, it is more realistic that the background and object possess substantial non-stationary statistics, and the task is then to detect a transition in the pattern of ongoing statistics. How does the system detect and process such transitions? We use magnetoencephalography (MEG) to measure early auditory cortical responses to transitions between constant tones, regularly alternating, and randomly alternating tone-pip sequences. Such transitions embody key characteristics of natural auditory temporal edges. Our data demonstrate that the temporal dynamics and response polarity of the neural temporal-edge-detection processes depend in specific ways on the generalized nature of the edge (the context preceding and following the transition) and suggest that distinct neural substrates in core and non-core auditory cortex are recruited depending on the kind of computation (discovery of a violation of regularity, vs. the detection of a new regularity) required to extract the edge from the ongoing fluctuating input entering a listener’s ears.
Cognitive and Behavioral Neuroscience; auditory regularity; change detection; M100; M50; Magnetoencephalography; MEG; MMN; scene analysis
We present a method to remove the effects of sensor-specific noise in multiple-channel recordings such as magnetoencephalography (MEG) or electroencephalography (EEG). The method assumes that every source of interest is picked up by more than one sensor, as is the case with systems with spatially dense sensors. To reduce noise, each sensor signal is projected on the subspace spanned by its neighbors and replaced by its projection. In this process, components specific to the sensor (typically wide-band noise and/or ‘glitches’) are eliminated, while sources of interest are retained. Evaluation with real and simulated MEG signals shows that the method removes sensor-specific noise effectively, without removing or distorting signals of interest. It complements existing noise-reduction methods that target environmental or physiological noise.
MEG; EEG; Magnetoencephalography; Electroencephalography; Noise suppression; Artifact suppression; PCA; Subspace methods; Projection Methods
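The projection step described in the abstract above (replace each sensor by its projection on the subspace spanned by its neighbors) can be sketched as follows. This is a simplified illustration under assumptions, not the published implementation: every other sensor is treated as a neighbor, the projection is computed by ordinary least squares through the normal equations, and the data are assumed to be approximately zero-mean. The synthetic data (one sinusoidal source, four sensors, one glitch) and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def project_on_neighbors(data):
    """Replace each channel by its least-squares projection on the other channels."""
    cleaned = []
    for i, target in enumerate(data):
        neighbors = [ch for j, ch in enumerate(data) if j != i]
        gram = [[dot(a, b) for b in neighbors] for a in neighbors]
        cross = [dot(a, target) for a in neighbors]
        weights = solve_linear_system(gram, cross)
        cleaned.append([sum(w * ch[t] for w, ch in zip(weights, neighbors))
                        for t in range(len(target))])
    return cleaned


# Synthetic recording: one shared source seen by four sensors with different gains.
rng = random.Random(1)
n_samples = 500
source = [math.sin(2 * math.pi * t / 50) for t in range(n_samples)]
gains = [1.0, 0.8, 1.2, 0.9]
data = [[g * s + rng.gauss(0.0, 0.05) for s in source] for g in gains]
data[0][100] += 5.0  # sensor-specific glitch on the first sensor only

cleaned = project_on_neighbors(data)


def rms_error(channel, gain):
    """Root-mean-square deviation of a channel from the noise-free source."""
    return math.sqrt(sum((x - gain * s) ** 2 for x, s in zip(channel, source)) / n_samples)


raw_error = rms_error(data[0], gains[0])
clean_error = rms_error(cleaned[0], gains[0])
print(f"RMS error before: {raw_error:.3f}, after: {clean_error:.3f}")
```

Because the glitch appears on one sensor only, it has no counterpart in the neighbors and is absent from the projection, whereas the shared source is reconstructed from the neighbors' copies of it.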
We present an algorithm for removing environmental noise from neurophysiological recordings such as magnetoencephalography (MEG). Noise fields measured by reference magnetometers are optimally filtered and subtracted from brain channels. The filters (one per reference/brain sensor pair) are obtained by delaying the reference signals, orthogonalizing them to obtain a basis, projecting the brain sensors onto the noise-derived basis, and removing the projections to obtain clean data. Simulations with synthetic data suggest that distortion of brain signals is minimal. The method surpasses previous methods by synthesizing, for each reference/brain sensor pair, a filter that compensates for convolutive mismatches between sensors. The method enhances the value of data recorded in health and scientific applications by suppressing harmful noise, and reduces the need for deleterious spatial or spectral filtering. It should be applicable to a wider range of physiological recording techniques, such as EEG, local field potentials, etc.
MEG; Magnetoencephalography; EEG; Electroencephalography; noise reduction; artifact removal; Principal Component Analysis; artifact rejection; regression
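The processing chain in the abstract above (delay the reference signals, orthogonalize them, project the brain sensor on the resulting basis, remove the projection) can be sketched for a single reference and a single brain channel. This is a minimal sketch under assumptions: Gram-Schmidt orthogonalization is used here for brevity and is not claimed to be the orthogonalization used by the authors, and the synthetic signals, the mixing weights (0.6 and 0.3), the number of lags, and the function names are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def delayed(signal, lag):
    """Shift a signal right by `lag` samples, zero-padding the start."""
    return [0.0] * lag + signal[: len(signal) - lag]


def orthonormal_basis(vectors):
    """Gram-Schmidt orthonormalization (assumed stand-in for the orthogonalization step)."""
    basis = []
    for v in vectors:
        residual = v[:]
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * x for r, x in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > 1e-12:  # skip vectors that add no new direction
            basis.append([r / norm for r in residual])
    return basis


def remove_reference_noise(brain, reference, n_lags):
    """Project the brain channel on delayed reference copies and subtract the projection."""
    regressors = [delayed(reference, lag) for lag in range(n_lags)]
    clean = brain[:]
    for b in orthonormal_basis(regressors):
        coeff = dot(clean, b)
        clean = [c - coeff * x for c, x in zip(clean, b)]
    return clean


# Synthetic data: the brain channel sees the noise through a short filter
# (a convolutive mismatch relative to the reference sensor).
rng = random.Random(2)
n_samples = 1000
brain_signal = [math.sin(2 * math.pi * t / 40) for t in range(n_samples)]
reference = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
reference_lag1 = delayed(reference, 1)
brain = [s + 0.6 * r0 + 0.3 * r1
         for s, r0, r1 in zip(brain_signal, reference, reference_lag1)]

clean = remove_reference_noise(brain, reference, n_lags=3)


def rms_difference(a, b):
    """Root-mean-square difference between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


raw_error = rms_difference(brain, brain_signal)
clean_error = rms_difference(clean, brain_signal)
print(f"RMS error before: {raw_error:.3f}, after: {clean_error:.3f}")
```

Including delayed copies of the reference is what lets the regression absorb the filtered version of the noise; a regression on the undelayed reference alone would leave the lag-1 component in the brain channel.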