Auditory object analysis requires two fundamental perceptual processes: the definition of the boundaries between objects, and the abstraction and maintenance of an object's characteristic features. While it is intuitive to assume that the detection of the discontinuities at an object's boundaries precedes the subsequent precise representation of the object, the specific underlying cortical mechanisms for segregating and representing auditory objects within the auditory scene are unknown. We investigated the cortical bases of these two processes for one type of auditory object, an ‘acoustic texture’, composed of multiple frequency-modulated ramps. In these stimuli we independently manipulated the statistical rules governing a) the frequency-time space within individual textures (comprising ramps with a given spectrotemporal coherence) and b) the boundaries between textures (adjacent textures with different spectrotemporal coherence). Using functional magnetic resonance imaging (fMRI), we show mechanisms defining boundaries between textures with different coherence in primary and association auditory cortex, while texture coherence is represented only in association cortex. Furthermore, participants' superior detection of boundaries across which texture coherence increased (as opposed to decreased) was reflected in a greater neural response in auditory association cortex at these boundaries. The results suggest a hierarchical mechanism for processing acoustic textures that is relevant to auditory object analysis: boundaries between objects are first detected as a change in statistical rules over frequency-time space, before a representation that corresponds to the characteristics of the perceived object is formed.
Natural sounds are both spectrally and temporally complex, and a fundamental question in auditory perception is how the brain represents and segregates distinct auditory objects that contain many individual, spectrotemporally varying components. Such auditory object analysis requires at least two fundamental perceptual processes: first, the detection of boundaries between adjacent objects, which necessitates mechanisms that identify changes in the statistical rules governing object regions in frequency-time space (Kubovy and Van Valkenburg, 2001; Chait et al., 2007; 2008); and second, the abstraction and maintenance of an object's characteristic features, while ignoring local stochastic variation within an object (Griffiths and Warren, 2004; Nelken, 2004).
While it is intuitive to assume that the detection of statistical changes at object boundaries precedes the subsequent precise representation of the object (Ohl et al., 2001; Chait et al., 2007; 2008; Scholte et al., 2008), the specific underlying cortical mechanisms for segregating and representing auditory objects within the auditory scene have not been directly addressed. Zatorre and colleagues (2004) demonstrated a parametric increase in activity within right superior temporal sulcus (STS) as a function of object distinctiveness. However, whether this effect was due to object distinctiveness as such or due to a change percept between objects is unclear. That is, as the distinctiveness between objects increased, participants were also more likely to hear a change at object boundaries.
Other studies focussed on the neural correlates of boundary or ‘auditory edge’ detection without investigating in detail processes necessary for object formation (Chait et al., 2007; 2008). Schönwiesner and colleagues (2007) investigated the perception of different levels of changes in acoustic duration. They found a cortical hierarchy for processing duration changes as indicated by three distinct stages: an initial change detection mechanism in primary auditory cortex, followed by a more detailed analysis in association cortex and attentional mechanisms originating in frontal cortex.
The present study used a form of spectrotemporal coherence to create object regions and object boundaries in frequency-time space. The stimulus, an ‘acoustic texture’, was based on randomly distributed linear FM ramps with varying trajectories, where the overall coherence between ramps was controlled. Conceptually, the stimulus is similar to the coherent visual motion paradigm (Newsome and Paré, 1988; Rees et al., 2000; Braddick et al., 2001). In both, the coherence of constituent elements can be parametrically controlled. The analysis of ‘acoustic textures’ comprising different spectrotemporal coherence requires perceptual mechanisms that can assess common statistical properties of the stimulus irrespective of local stochastic variation within an object, and detect transitions when these properties change.
Thus, this stimulus enables a direct and orthogonal assessment of the neural correlates of (i) detecting boundaries between acoustic textures with different spectrotemporal coherence, and (ii) representing spectrotemporal coherence within a texture. We hypothesised that the detection of a change in spectrotemporal coherence between textures would engage auditory areas including primary cortex (Schönwiesner et al., 2007), while spectrotemporal coherence within textures would be encoded in higher-level auditory areas only (Zatorre et al., 2004).
Twenty-three right-handed participants (aged 18-31 years, mean age = 25.04, 12 female) with normal hearing and no history of audiological or neurological disorders provided written consent prior to the study. The study was approved by the Institute of Neurology Ethics Committee, London.
The ‘acoustic texture’ stimulus was based on randomly distributed linear frequency-modulated (FM) ramps with varying trajectories (Figure 1). The percentage of coherent spectrotemporal modulation, i.e. the proportion of ramps with identical direction (slope-sign) and trajectory (slope-value), can be systematically controlled, creating acoustic textures with different levels of spectrotemporal coherence. Boundaries were created and their magnitude varied by juxtaposing acoustic textures of different coherence levels. In the example given in Figure 1, a 3.5-s stimulus segment with 100% spectrotemporal coherence (all ramps move upwards and with the same trajectory) is followed by a 4.5-s stimulus segment of 0% coherence (ramps move in different directions and with different trajectories) and so forth (Figure 1; see also Supplementary soundfile S1). The associated change in coherence at the boundaries between acoustic textures with different spectrotemporal coherence is also shown.
All stimuli were created digitally in the frequency domain using Matlab 6.5 software (MathWorks) at a sampling frequency of 44.1 kHz and 16 bit resolution. Stimuli consisted of a dense texture of linear FM ramps; each ramp had a duration of 300 ms and started at a random time and frequency (passband 250-6000 Hz), with a density of 80 glides per second, roughly equalling one ramp per critical band (see Figure S2). For ramps that extended beyond the passband, i.e. went below 250 Hz or beyond 6000 Hz, we implemented a wraparound such that the ramps ‘continued’ at the other extreme of the frequency band, i.e. at 6000 Hz or 250 Hz, respectively. Stimuli differed in terms of the coherent movement of the ramps: six different coherence conditions were created, where the percentage of ramps moving in the same direction for a given sound segment was systematically varied from 0% coherence to 100% coherence in 20% increments. Thus, for a given sound segment with 40% coherence, 40% of the ramps increased (or decreased) in frequency with an excursion traversing 2.5 octaves / 300 ms; the direction and excursion of the remaining 60% of the FM ramps were randomised. Crucially, the only difference between the six levels is the degree of coherence or ‘common fate’ of the ramps; the total number of ramps, the number of ramps in a critical band as well as the passband of each stimulus did not differ systematically across the levels (Figure S2).
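The synthesis described above can be sketched as follows. This is an illustrative reconstruction in Python (the original stimuli were generated in Matlab, and the original code is not available); parameter names such as `density` and `excursion_oct` are our own, and the randomisation details are assumptions rather than the original implementation.

```python
import numpy as np

def make_texture(coherence, dur=2.0, fs=44100, density=80,
                 ramp_dur=0.3, band=(250.0, 6000.0), excursion_oct=2.5,
                 direction=+1, seed=0):
    """Sketch of one acoustic-texture segment: a dense set of linear FM
    ramps, a `coherence` fraction of which share a common trajectory
    (direction and slope); the remainder are randomised. Ramps that leave
    the passband wrap around to the other extreme, as in the text."""
    rng = np.random.default_rng(seed)
    n_ramps = int(density * dur)
    n_coh = int(round(coherence * n_ramps))
    t = np.arange(int(fs * ramp_dur)) / fs
    lo, hi = np.log2(band[0]), np.log2(band[1])   # passband edges in log2(Hz)
    sig = np.zeros(int(fs * dur) + len(t))
    for i in range(n_ramps):
        onset = int(rng.uniform(0, fs * dur))     # random start time
        f0 = rng.uniform(lo, hi)                  # random start frequency
        if i < n_coh:
            # coherent ramps: 'common fate', fixed excursion of 2.5 oct / 300 ms
            slope = direction * excursion_oct / ramp_dur
        else:
            # incoherent ramps: randomised direction and excursion
            slope = rng.choice([-1, 1]) * rng.uniform(0.5, 1.0) * excursion_oct / ramp_dur
        # linear trajectory in log-frequency, with wraparound inside the passband
        logf = (f0 - lo + slope * t) % (hi - lo) + lo
        phase = 2 * np.pi * np.cumsum(2.0 ** logf) / fs
        sig[onset:onset + len(t)] += np.sin(phase)
    return sig[:int(fs * dur)] / np.max(np.abs(sig))
```

Note that only `coherence` differs between conditions; ramp count, density and passband are identical across levels, mirroring the control described above.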
Prior to scanning, participants were familiarised with and trained on the stimuli, and then performed two-interval two-alternative forced choice (2I2AFC) psychophysics, distinguishing a non-random sound from a random reference (0% coherence) sound. Stimuli were two seconds long and the direction of the FM glides (up versus down) was counterbalanced. There were 30 pairs for each of the six levels (0%-100% coherence in 20% steps). Participants had to reach at least 90% correct performance for the last level (100% coherence) to be included in the fMRI study.
During the scanning session, stimuli were presented in blocks of sound with an average duration of 16 seconds (range: 11 to 18 seconds). The blocks contained four contiguous segments with a given absolute spectrotemporal coherence (0%, 20%, 40%, 60%, 80%, or 100%). Within a block, the direction (up versus down) of the coherent ramps was maintained. The length of the segments varied (1.5, 3, 3.5, 4.5, 5, or 6.5 seconds) and was randomised within a block. Thus, a given block might have, for example, [100% 0% 40% 80%] contiguous coherence segments with durations [3.5 4.5 6.5 1.5] seconds. The associated change in coherence between the four segments within this block of sound is [−100 +40 +40]. Note that a +40% change in coherence, for example, can be obtained in a number of ways by suitably arranging adjacent pairs of acoustic textures with certain absolute coherence levels (0% to 40% and 40% to 80%, in the example in Figure 1). Stimuli were presented in one of six pseudorandom permutations which orthogonalised absolute coherence and change in coherence (average correlation between absolute coherence and change in coherence across the six permutations: r = 0.06, p > 0.1).
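The within-block bookkeeping described above, together with a quick check of the orthogonalisation between absolute coherence and change in coherence, can be sketched as follows (an illustrative Python sketch, not the original Matlab permutation code):

```python
import numpy as np

def change_vector(block):
    """Signed change in coherence at the three within-block boundaries,
    e.g. [100, 0, 40, 80] -> [-100, +40, +40], as in the text."""
    return [b - a for a, b in zip(block, block[1:])]

def coherence_change_correlation(blocks):
    """Correlation between the absolute coherence of the segment after each
    boundary and the signed change at that boundary -- a sketch of the
    orthogonalisation check described in the text (the reported average
    across the six permutations was r = 0.06)."""
    abs_coh, change = [], []
    for b in blocks:
        abs_coh.extend(b[1:])
        change.extend(change_vector(b))
    return float(np.corrcoef(abs_coh, change)[0, 1])
```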
The task of participants was to detect a change in coherence within the block, regardless of whether that change was from less coherent to more coherent or vice versa. Participants were required to press a button whenever they heard such a change and were instructed that the frequency of perceptual changes within one block likely ranged from no perceptual change (e.g. a block consisting of [0% 20% 40% 20%] coherence segments, since here the changes are likely to be too small to be detected) to a maximum of three changes (e.g. a sound block consisting of [0% 100% 20% 80%] segments). Sound blocks were separated by a silent period of 6 seconds, in which participants were told to relax.
In each of three experimental scanning sessions, each coherence level was presented 30 times, amounting to a total of 7.2 minutes of presentation time per coherence level. The number of times each of the six levels of change in coherence (regardless of direction) occurred could consequently not be perfectly balanced; however, permutations were created such that the most frequent change occurred less than three times as often as the least frequent change.
Stimuli were presented via NordicNeuroLab electrostatic headphones at a sound pressure level of 85 dB. Participants saw a cross at the centre of the screen and were asked to look at this cross during the experiment.
Participants' button presses were recorded and analysed with respect to the onset of each segment within a block. Responses were only counted if they occurred within three seconds of the onset of a segment (and within 1.5 seconds of the onset of the shortest segments). The average percentage of correct responses was then computed by comparing the number of responses to a given change in spectrotemporal coherence with the actual number of such changes. ‘Responses’ to 0% changes served as a ‘false alarm’ chance baseline.
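The scoring rule can be sketched as follows; this is an illustrative reconstruction (variable names are our own), interpreting the response window as three seconds truncated to the segment's own duration for short segments:

```python
def score_responses(seg_onsets, seg_durs, press_times, window=3.0):
    """Return, for each segment, whether any button press fell within the
    response window after its onset: `window` seconds, truncated to the
    segment duration (so 1.5 s for the shortest segments). A sketch of
    the scoring rule described in the text, not the original code."""
    hits = []
    for onset, dur in zip(seg_onsets, seg_durs):
        limit = min(window, dur)
        hits.append(any(onset <= p < onset + limit for p in press_times))
    return hits
```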
Gradient weighted echo planar images (EPI) were acquired on a 3 Tesla Siemens Allegra system (Erlangen, Germany), using a continuous imaging design with 42 contiguous slices per volume (repetition time/echo time = 2730/30 ms). The volume was tilted forward such that slices were parallel to and centred on the superior temporal gyrus. Participants completed three sessions of 372 volumes each, resulting in a total of 1116 volumes. To correct for geometric distortions in the EPI due to B0 field variations, Siemens fieldmaps were acquired for each subject, usually after the second session (Hutton et al., 2002; Cusack et al., 2003). A structural T1 weighted scan was acquired for each participant (Deichmann et al., 2004).
Imaging data were analysed using Statistical Parametric Mapping software (SPM5, http://www.fil.ion.ucl.ac.uk/spm). The first four volumes in each session were discarded to control for saturation effects. The resulting 1104 volumes were realigned to the first volume and unwarped using the fieldmap parameters, spatially normalised to stereotactic space (Friston et al., 1995a) and smoothed with an isotropic Gaussian kernel of 8 mm full-width-at-half-maximum (FWHM). Statistical analysis used a random-effects model within the context of the general linear model (Friston et al., 1995b). A region of interest (ROI) in auditory cortex was derived from a functional contrast that modelled the BOLD response to the onset of each sound block (see below); the ROI was based on a significance threshold of p < 0.01 (corrected for false discovery rate, FDR, Genovese et al., 2002). Activations within this ROI revealed by the contrasts of interest (see below) were thresholded at p < 0.001 (uncorrected) for display purposes and only local maxima surviving a FDR-corrected threshold of p < 0.01 within the ROI are reported.
The design matrix for each participant consisted of 18 regressors. All regressors collapsed across the direction of the coherent ramps, i.e. 100% coherent segments in which the ramps moved up were collapsed with 100% coherent segments in which the ramps moved down. The first regressor modelled the haemodynamic response to the onset of each sound block as a stick function (i.e. delta function with 0 sec. duration). Regressors 2-7 modelled the onset and duration of the segments within a block corresponding to one of the six levels of spectrotemporal coherence (0%, 20%, 40%, 60%, 80%, and 100%). Regressors 8-18 modelled the response to changes in coherence as stick functions, with the eighth regressor modelling 0% changes (i.e. all consecutive percentage coherence pairs of 0-0, 20-20, 40-40, 60-60, 80-80, 100-100), while the subsequent regressors modelled ‘positive’ and ‘negative’ changes of a given magnitude (+20%, −20%, +40%, −40%, +60%, −60%, +80%, −80%, +100%, −100%). The 6-s silent baseline epochs between sound blocks were not modelled explicitly in the design matrix.
The following planned contrasts were performed. To delineate the response to sound onset, a simple contrast of the sound-onset regressor against the silent baseline was performed. This contrast was used to derive an ROI of bilateral auditory cortex and is orthogonal to the contrasts of interest. The bilateral ROI had a size of 10124 voxels (4934 voxels in left auditory cortex, 5190 voxels in right auditory cortex). To probe for an effect of increasing activity with increasing absolute coherence, the corresponding regressors 2-7 (i.e. 0% to 100% coherence) were linearly weighted. To probe for an effect of increasing change in coherence, the corresponding regressors 8-18 (i.e. 0%, +20%, −20% … +100%, −100%) were linearly weighted according to change magnitude. In both cases, the weights were mean-centred on zero.
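The two parametric contrasts can be written as mean-centred linear weight vectors. The exact scaling used in the original analysis is not reported (any zero-sum linear weighting is equivalent up to scale); a minimal sketch:

```python
import numpy as np

def linear_contrast(levels):
    """Mean-centred linear weights over a set of regressor levels, as used
    for the absolute-coherence and change-magnitude contrasts. Subtracting
    the mean makes the weights sum to zero, so the contrast is insensitive
    to the overall level of activity."""
    w = np.asarray(levels, dtype=float)
    return w - w.mean()

# Regressors 2-7: absolute coherence (0% ... 100%)
coh = linear_contrast([0, 20, 40, 60, 80, 100])
# Regressors 8-18: change magnitude (0, then +/-20 ... +/-100, collapsed over sign)
chg = linear_contrast([0, 20, 20, 40, 40, 60, 60, 80, 80, 100, 100])
```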
The behavioural performance for detecting the various types of changes (i.e. 0%, +20%, −20% … +100%, −100%) is shown in Figure 2. To probe for an effect of relative texture salience that reflects the behavioural asymmetry whereby ‘positive’ changes are more readily detected than ‘negative’ changes (Figure 2), we computed the difference between the detection rates for ‘positive’ and ‘negative’ changes. The differences between detecting +20% and −20%, +40% and −40%, +60% and −60%, +80% and −80%, and +100% and −100% changes were [6.35, 10.7, 15.42, 15.37, 14.12] percentage points, respectively; the corresponding contrast weighted the five ‘positive’ and ‘negative’ change pairs by these differences. This contrast is also mean-centred on zero and reflects the magnitude of the perceptual difference between ‘positive’ and ‘negative’ changes.
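Using the behavioural differences quoted above, a zero-sum version of this salience contrast can be sketched as follows. Weighting each ‘positive’/‘negative’ pair by ±d/2 is one natural construction that is mean-centred within each pair; the precise construction in the original analysis may differ in scaling:

```python
import numpy as np

# Behavioural detection-rate differences (positive minus negative changes)
# for the 20% ... 100% pairs, taken from the text.
diffs = np.array([6.35, 10.7, 15.42, 15.37, 14.12])

def salience_contrast(diffs):
    """Sketch of a mean-centred salience contrast: each 'positive' change
    regressor is weighted +d/2 and its 'negative' counterpart -d/2, so each
    pair sums to zero while the weight magnitude tracks the behavioural
    asymmetry (an assumed construction, not the original code)."""
    w = []
    for d in diffs:
        w.extend([+d / 2.0, -d / 2.0])
    return np.array(w)
```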
Behavioural results obtained during scanning for the detection of a change in spectrotemporal coherence between textures are shown in Figure 2. Performance increased with the magnitude of change (both for boundaries across which coherence increased or decreased) and was significantly better than chance or false alarm performance corresponding to 0% changes: two separate repeated-measures ANOVAs with factor ChangeLevel (0%, 20%, 40%, 60%, 80%, 100%) for changes across which coherence either increased or decreased revealed main effects of ChangeLevel (increase), F(5,110) = 58.0, p < 0.001, and ChangeLevel (decrease), F(5,110) = 23.04, p < 0.001, respectively. Pairwise comparisons (two-tailed t-tests) with performance for 0% changes were all significant (p < 0.05) for change levels from 20-100% (increase) and 60-100% (decrease), indicating that participants performed above chance for these changes in spectrotemporal coherence. Furthermore, performance was better for changes across which coherence increased: a repeated-measures ANOVA with factors ChangeLevel (0% - 100%) and ChangeType (increase vs. decrease) revealed main effects for ChangeLevel (F(5,110) = 52.05, p < 0.001) and ChangeType (F(1,22) = 52.32, p < 0.001), as well as a significant interaction (F(5,110) = 7.87, p < 0.001).
We carried out an analysis to identify areas in auditory cortex that parametrically varied in activity as a function of the absolute magnitude of change in coherence (i.e. for both ‘positive’ and ‘negative’ changes in coherence) at the boundaries between adjacent segments. The analysis revealed BOLD signal increases in Heschl's gyrus (HG), planum temporale (PT), temporo-parietal junction (TPJ) and superior temporal sulcus (STS) as a function of absolute change magnitude (Figure 3 and Table 1). The bar charts in Figure 3 show (in red) the contrast estimates for the different degrees of change in coherence at the boundaries between textures in all these areas of auditory cortex.
Next, we sought areas within auditory cortex that were increasingly activated as a function of increasing spectrotemporal coherence within textures. Bilateral areas in auditory association cortex, including PT and its extension into TPJ, showed a contrast estimate increase with increasing absolute spectrotemporal coherence (Figure 3, blue bars; Table 1). In contrast, activity in HG and STS did not differ significantly across the six levels of spectrotemporal coherence. To determine whether the absence of an effect of increasing spectrotemporal coherence in HG and STS was merely an artefact of the statistical threshold used, or instead reflected a response that was qualitatively different from these areas' response to the magnitude of change, we performed repeated-measures analyses of variance (ANOVAs) on the mean contrast estimates for the six levels of change in coherence, taken from a sphere of 10 mm radius around the local maxima in HG and STS. Two separate 2 Hemisphere (left vs. right) × 2 Condition (coherence vs. change) × 6 Level (1-6) repeated-measures ANOVAs for HG and STS revealed significant Condition × Level interactions: F(5,110) = 4.54, p = 0.001 for HG, and F(5,110) = 7.64, p < 0.001 for STS, respectively.
The experimental design also enabled us to investigate in more detail an effect of absolute spectrotemporal coherence by way of changes in relative coherence. Behavioural results (see Figure 2) demonstrated that boundaries across which coherence increased (‘positive’ changes) were generally more salient than those across which coherence decreased (‘negative’ changes), suggesting a perceptual asymmetry where texture salience increases with spectrotemporal coherence. We tested whether this perceptual asymmetry (Cusack and Carlyon, 2003) was also reflected at the neural level (see Experimental Procedures). Figure 4 shows that this was the case in TPJ, which showed stronger responses to increases as opposed to decreases in spectrotemporal coherence between textures (see also Table 1).
The results demonstrate a specific mapping of abstracted object properties, as represented by spectrotemporal coherence, and object boundaries, as represented by changes in spectrotemporal coherence, to distinct regions of auditory cortex. First, activity in auditory cortex including HG, PT, TPJ and STS increased as a function of the magnitude of the change in spectrotemporal coherence at boundaries between textures. Second, activity as a function of the absolute spectrotemporal coherence within textures increased in auditory association cortex in PT and in TPJ. Finally, increases in spectrotemporal coherence at segment boundaries were more perceptually salient than decreases in spectrotemporal coherence at object boundaries, and this was reflected by stronger neural activity at such boundaries.
While the observed parametric responses to absolute spectrotemporal coherence within textures and change in coherence between textures show some overlap in cortical resources (in PT and TPJ), they are indeed separable processes, since the experimental design orthogonalised the absolute coherence of acoustic textures and changes in coherence at boundaries. This indicates that the overlapping representations of change in coherence and absolute coherence in the non-primary auditory areas in PT and TPJ represent a distinct mapping of these two processes in similar cortical areas; these mappings could be subserved by activity within distinct units or networks in those areas (Price et al., 2005; Nelken, 2008; Nelken and Bar-Yosef, 2008). Furthermore, the results are unlikely to be confounded by the behavioural task (detection of a change in spectrotemporal coherence), since a pilot study in which the absolute spectrotemporal coherence of the sounds was task-relevant yielded very similar results (Figure S1).
A number of neuronal mechanisms might underlie the response to boundaries that we demonstrate across auditory cortex (including primary cortex) and acoustic texture coherence that we demonstrate in association cortex. Computationally, both boundary detection and acoustic texture analysis within boundaries must depend on the statistical properties of the stimulus over frequency-time space, since low-level acoustic features such as spectral density over time were kept constant. For the present stimuli, boundary detection requires mechanisms that do not need to assess large spectrotemporal regions but still need to assess a ‘local’ statistical rule change in the absence of any physical ‘edge’ (as would be the case in the perception of an object arising out of silence, for example, where a boundary could be defined by a discontinuity in intensity). Acoustic texture coherence analysis necessarily involves larger spectrotemporal regions than boundary detection, and the analysis of boundary before texture that we demonstrate here is consistent with the idea that more extended segments of spectrotemporal space are analysed in areas further from primary cortex. This notion is further supported by studies focusing on the time domain that suggest that the analysis of sound occurs over longer time-windows in non-primary than in primary cortex (Boemio et al., 2005; Overath et al., 2008). In terms of the underlying neuronal mechanism for the present stimulus, we are not aware of any studies of coherent FM in either primary or non-primary cortex. Neurons that are sensitive to the direction of single-FM sweeps have been demonstrated in the auditory cortex of rats (Ricketts et al., 1998), cats (Mendelson and Cynader, 1985; Heil et al., 1992), and rhesus monkeys (Tian and Rauschecker, 2004) (for a review see Rees and Malmierca, 2005). 
In our study, the analysis of boundaries (in primary cortex and association cortex) and texture (in association cortex) could be subserved by ensembles of such units tuned to similar sweeps in different regions of frequency-time space. Alternatively, if such neurons were ever shown to exist, boundaries and acoustic texture could also be analysed by single neurons that were sensitive to coherent FM over spectrotemporal regions. In the case of both ensemble mechanisms and single-neuron mechanisms, the ‘receptive field’ of the mechanism would need to be larger for texture analysis in association areas than for boundary detection in primary areas.
The present study provides a contrasting yet complementary approach to change detection mechanisms from the classical mismatch negativity (MMN) paradigm, which is thought to reflect the violation of a previously established regularity (Näätänen and Winkler, 1999). Both paradigms require mismatch or change detection processes in auditory cortex. However, our results suggest that, in the current stimulus paradigm, the emergence of regularity (or coherence) has a different representation to its disappearance or violation. For example, the transition from noise to a regular interval sound with pitch has a different cortical representation than the reverse transition (Krumbholz et al., 2003). Recently, Chait and colleagues (2007; 2008) demonstrated distinct cortical mechanisms for the detection of auditory ‘edges’ based on statistical properties, where the detection of a statistical regularity (in violation of a previous irregularity) had a different cortical signature than the detection of a violation of statistical regularity. The current results support the existence of such neural and perceptual asymmetries. We propose that the degree of spectrotemporal coherence is encoded in a continuous manner, with neurons tuned to coherence levels that are equal or greater in coherence than the neurons' thresholds. Such a cumulative neural code contains an inherent asymmetry (Treisman and Gelade, 1980; Cusack and Carlyon, 2003): transitions to more coherent sounds excite a larger neural population, rendering them more perceptually salient. This is reflected in the neural response (Figure 4).
The stimulus manipulation employed in this study addresses generic processes underlying complex auditory object analysis, but it is not intended to represent all possible auditory object classes. In speech perception, the spectrotemporal analysis necessarily spans a large frequency range over multiple temporal scales, where coherent acoustic properties (or those with ‘common fate’) need to be abstracted across the frequency-time axis; for example, formant transitions often display such coherent acoustic spectrotemporal properties (Stevens, 1998). At the same time, there is no one ecological sound that the acoustic texture stimulus represents, but we argue here that its generic nature ensures applicability to a variety of ecological sound properties. While coherent FM is arguably a relatively weak grouping cue compared to simultaneous onset and harmonicity (Carlyon, 1991; Summerfield and Culling, 1992; Darwin and Carlyon, 1995), it is nevertheless one basis upon which figure-ground selection can occur (McAdams, 1989). It is important to note, however, that these studies generally used simple sinusoidal FM in which ‘FM coherence’ was defined as either in- or out-of-phase. The stimulus employed here is more complex in the sense that spectrotemporal coherence can only be detected as a whole, irrespective of low-level features such as phase, since FM ramps were randomly distributed in frequency-time space.
The visual depiction of the stimulus (Figure 1) evokes the coherent visual motion paradigm using random dot kinematograms (Newsome and Paré, 1988; Britten et al., 1992; Rees et al., 2000; Braddick et al., 2001). However, direct comparisons with the visual system based on superficial similarities are often not straightforward and need to be treated with caution (King and Nelken, 2009). For example, objects in the visual stimulus are defined spatially while space plays a relatively minor role in the definition of auditory objects. Further, in the present case, the perceptual effect is more subtle than in the visual domain.
The data reported here move beyond the analysis of simple FM sounds to the analysis of auditory object patterns within stochastic stimuli; such auditory object analysis is dependent on mechanisms that are fundamental for the analysis of ecologically valid sounds in a dynamic auditory environment. We demonstrate a mechanism for the assessment of acoustic texture boundaries that is already present in primary auditory cortex, based on recognising changing higher-order statistical properties governing frequency-time space. Such a mechanism precedes the encoding of the absolute properties of acoustic textures in higher-level auditory association cortex.
The authors would like to thank two anonymous reviewers for their helpful comments.
Funding. This work was funded by the Wellcome Trust, UK.