The fluency and reliability of speech production suggest a mechanism that links motor commands and sensory feedback. Here, we examine the neural organization supporting such links by using fMRI to identify regions in which activity during speech production is modulated according to whether auditory feedback matches the predicted outcome or not, and by examining the overlap with the network recruited during passive listening to speech sounds. We use real-time signal processing to compare brain activity when participants whispered a consonant-vowel-consonant word (‘Ted’) and either heard this clearly, or heard voice-gated masking noise. We compare this to when they listened to yoked stimuli (identical recordings of ‘Ted’ or noise) without speaking. Activity along the superior temporal sulcus (STS) and superior temporal gyrus (STG) bilaterally was significantly greater if the auditory stimulus was (a) processed as the auditory concomitant of speaking and (b) did not match the predicted outcome (noise). The network exhibiting this Feedback type by Production/Perception interaction includes an STG/MTG region that is activated more when listening to speech than to noise. This is consistent with speech production and speech perception being linked in a control system that predicts the sensory outcome of speech acts, and that processes an error signal in speech-sensitive regions when this prediction and the sensory data do not match.
In the study of human communication, the relationship between speech perception and speech production has long been controversial (e.g. Liberman, 1967). That the two sides of spoken language are linked somehow is not in dispute - prelingual deafness without immediate cochlear implantation severely hinders the development of normal speech in children (Geers et al., 2003; Schauwers et al., 2004), and hearing impairment in adults affects aspects of speech production such as the control of fundamental frequency and intensity (Cowie et al., 1982). Talking, like all motor control, must require sensory feedback to maintain the accuracy and stability of movement (Levelt, 1983). However, the nature of the link between perception and action in speech, its cognitive structure and its neural organization, are not well understood. Whether the same neural system subserves auditory perception both for speech comprehension and for the purpose of articulatory control is not known.
Clinical data indicate that, if auditory feedback is eliminated, speech fluency is affected (Cowie et al., 1982; Geers et al., 2003; Schauwers et al., 2004), and if feedback is delayed, fluency is disrupted (Black, 1951; Fukawa et al., 1988; Howell & Powell, 1987; Mackay, 1968; Siegel et al., 1982). Given the rapidity of speech, it would be difficult for acoustic feedback to contribute to control as part of a servomechanism (Lashley, 1951), since the articulation of speech sounds would be finished by the time acoustic feedback could be processed. Alternatively, feedforward mechanisms, capable of predictive motor planning, have been proposed (Kawato, 1989). Such “internal models” of the motor system (Jordan and Rumelhart, 1992) are hypothesized to learn through sensory (e.g., auditory) feedback and thus involve representations of actions and their consequences.
One source of evidence for internal models in speech comes from auditory perturbation studies, in which talkers are observed to alter their speech production in response to altered acoustic feedback. For example, if the fundamental frequency of auditory feedback is shifted or if frequencies of vowel formants are altered, talkers accommodate rapidly as if to normalize their production so that it is acoustically closer to a desired output (Chen et al., 2007; Elman, 1981; Houde & Jordan, 1998; Jones & Munhall, 2000; Purcell & Munhall, 2006a,b). Such studies yield two important conclusions. First, that subjects rapidly compensate for the perturbation in subsequent trials indicates the presence of an error detection and correction mechanism (e.g., Houde & Jordan, 1998; Jones & Munhall, 2000). Second, the compensation persists for a short period of time after acoustic feedback is returned to normal (e.g., Purcell & Munhall, 2006a,b), suggesting that the error detection mechanism is producing learning that tunes the speech motor controller.
The neural network supporting this acoustic error detection and correction has been investigated in both nonhuman primates and humans. A recent study in marmosets reveals that vocalization-induced suppression of neurons in the auditory cortex is markedly reduced (firing rates increase) if feedback is altered by frequency shifting (Eliades & Wang, 2008). Recent functional neuroimaging studies in humans reveal increases in activity in superior temporal areas when vocalization with altered feedback is compared to a condition with normal feedback (Christoffels et al., 2007; Hashimoto & Sakai, 2003; Fu et al., 2006; Tourville et al., 2008). This signal increase has been interpreted as evidence for an error signal that could be used to tune ongoing speech production (Guenther, 2006).
The engagement of the superior temporal region in processing speech during listening has been extensively documented. Superior temporal activation is observed for the processing of speech stimuli (see Table 1) when compared to: noise (Binder et al., 1994; Jancke et al., 2002; Rimol et al., 2005; Obleser et al., 2006; Zatorre et al., 1992), tones (Ashtari et al., 2004; Binder et al., 1996; Benson et al., 2001; Binder et al., 2000; Burton & Small, 2006; Burton et al., 2000; Joanisse & Gati, 2003; O’Leary et al., 1996; Poeppel et al., 2004; Specht & Reul, 2003; Vouloumanos et al., 2001; Zaehle et al., 2004), and other nonspeech sounds (Belin et al., 2002; Benson et al., 2006; Gandour et al., 2003; Giraud & Price, 2001; Meyer et al., 2005; Thierry et al., 2003; Uppenkamp et al., 2006). In recent functional imaging studies in which speech and nonspeech conditions are very closely matched acoustically, speech still elicits greater signal change in this region, suggesting that it is the processing of speech qua speech, and not the acoustic structure of speech, that is responsible for the signal change (Dehaene-Lambertz et al., 2005; Desai et al., 2008; Liebenthal et al., 2005; Mottonen et al., 2006; Narain et al., 2003; Scott et al., 2000; Scott et al., 2006; Uppenkamp et al., 2006).
Despite the evidence for the involvement of the superior temporal region in the processing of both error signals and speech, whether and how these two functions are linked has never been investigated in the same context. In the experiment presented here, we examine whether the same regions within subjects process the speaker’s own auditory feedback and also subserve the perception of speech. We argue that simply observing a change in auditory cortical activity during altered, compared to normal, feedback is not sufficient evidence of an error detection mechanism. A defining characteristic of a speech error signal is that it is generated uniquely during speech production, and not during listening. Accordingly, it is best identified as an interaction, in which the same acoustic difference between altered and normal feedback provokes a different pattern of activity in the context of talking than when the identical stimuli are presented in the context of listening. In addition, we will examine whether this error detection mechanism recruits regions that are also active – albeit in a different way – when listeners passively hear speech.
Accordingly, we will compare speech production to passive listening, with identical acoustic stimuli present in both conditions. On production trials, participants hear either their own voice as normal (unaltered feedback), or hear voice-gated, signal correlated noise (manipulated feedback). On listening trials, participants hear either recordings of their own voice or noise yoked to previous production trials. We identify brain regions that respond more to noise feedback than unaltered feedback during speech production (consistent with an error signal), but that are not sensitive to speech during passive listening, or that respond more to unaltered feedback than to noise feedback during passive listening. By assessing the interaction between speaking condition and feedback, we will examine whether the same brain regions contribute to speech perception for the purpose of ongoing articulatory control, and for the purpose of comprehension.
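In contrast terms, this error-signal criterion can be summarized (using the condition labels defined below) as a positive Feedback type by Production/Perception interaction: (Production-masked − Production-clear) − (Listen-masked − Listen-clear) > 0. That is, the activity difference between masked and clear feedback should be larger when the sounds are heard as the consequence of one’s own speech than when the identical sounds are heard passively.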
Written informed consent was obtained from twenty-one individuals (mean age 23 years, range 18 – 45 years; 16 females). All were right-handed, without any history of neurological or hearing disorder, and spoke English as their first language. Each participant received $15 to compensate them for their time. Procedures were approved by the Queen’s Health Sciences Research Ethics Board.
We adopted a 2×2 factorial design (4 experimental conditions) and a low-level silence/rest control condition. The four experimental conditions were: Production-clear: Producing whispered speech (‘Ted’) and hearing this through headphones; Production-masked: Producing whispered speech (‘Ted’) and hearing voice-gated, signal-correlated masking noise (Schroeder, 1968), which is created by applying the amplitude envelope of the utterance to white noise; Listen-clear: Listening to the stimuli of Production-clear trials (without production); and Listen-masked: Listening to the stimuli of Production-masked trials (without production). Whispered speech was used to minimize bone-conducted auditory feedback (Barany, 1938) and to ensure that the noise would effectively mask speech (Houde & Jordan, 2002).
Functional magnetic resonance imaging data were collected on a 3T Siemens Trio MRI system, using a rapid sparse-imaging procedure (Orfanidou et al., 2006). So that spoken responses and auditory stimuli could be heard without acoustic interference from the scanner, stimuli were presented and responses recorded (i.e., a single trial occurred) in a 1400 ms silent period between successive 1600 ms scans (EPI; 26 slices; voxel size 3.3 × 3.3 × 4.0 mm). A high-resolution T1-weighted MPRAGE structural scan was also acquired for each subject.
In the two production conditions, participants spoke into an optical microphone (Phone-Or, Magmedix, Fitchburg, MA) and their utterances were digitized at 10 kHz with 16-bit precision using a National Instruments PXI-6052E input/output board. Real-time analysis was achieved using a National Instruments PXI-8176 embedded controller. Processed signals were converted back to analogue by the input/output board at 10 kHz with 16-bit precision and played over high-fidelity magnet-compatible headphones (NordicNeuroLab, Bergen, Norway) in real time. The processing delays were short enough (iteration delay less than 10 ms) that they would not be noticeable to listeners. The processed signals were also recorded and stored on a computer to be used for the yoked trials of the listen-only conditions. Sound was played at a comfortable listening level (approximately 85 dB). See Figure 1a.
In the Production-clear condition, the whispered utterance was simply digitized, recorded, and played out unaltered. The masking noise in the Production-masked condition was produced by applying the amplitude envelope of the utterance to uniform Gaussian white noise, so that the resulting noise had the envelope of the original speech signal (signal-correlated noise; Schroeder, 1968). The masking noise was also temporally gated with the onset and offset of speaking. All subjects reported in exit interviews that, when speaking in the Production-masked condition, they could hear only the masking noise and not their own speech through the headphones.
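To make the feedback manipulation concrete, the following sketch shows one way to construct signal-correlated noise offline from a recorded utterance. It is not the real-time National Instruments implementation used in the experiment; the envelope-extraction method and the cutoff value are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def signal_correlated_noise(speech, fs, env_cutoff_hz=30.0, seed=0):
    """Shape Gaussian white noise with the amplitude envelope of an utterance
    (signal-correlated noise; Schroeder, 1968).

    speech: 1-D float array containing the recorded (whispered) utterance.
    fs: sampling rate in Hz.
    env_cutoff_hz: low-pass cutoff for envelope extraction (illustrative value).
    """
    # Amplitude envelope: rectify the waveform, then low-pass filter it.
    b, a = butter(4, env_cutoff_hz / (fs / 2.0), btype="low")
    envelope = np.clip(filtfilt(b, a, np.abs(speech)), 0.0, None)

    # Gaussian white noise scaled sample-by-sample by the envelope, so the
    # noise is effectively gated on and off with the speech.
    rng = np.random.default_rng(seed)
    noise = envelope * rng.standard_normal(len(speech))

    # Match the overall RMS of the original utterance.
    rms_speech = np.sqrt(np.mean(speech ** 2))
    rms_noise = np.sqrt(np.mean(noise ** 2)) + 1e-12
    return noise * (rms_speech / rms_noise)
```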
Participants were scanned in three functional runs, each lasting 9 min and comprising 180 trials (36 of each of the five conditions). Conditions were presented in pseudorandom order, with the constraints that transitional probabilities were approximately equal and that all five conditions were presented once in each block of five trials (conditions could repeat at the transition from one block of five trials to the next). The stimuli for listening trials were taken from the production trials in the preceding block, except in the first run, when the stimuli for listening trials were taken from the production trials in the same block. Trials were 3000 ms long, comprising a 1400 ms period without scanning followed by a 1600 ms whole-brain EPI acquisition (see Figure 1b). Each trial began with a fixation cross appearing in the middle of a black screen (viewed through mirrors placed on the head coil and in the scanner bore) 100 ms before the offset of the previous scan; this signaled the beginning of the trial, and the color of the cross indicated whether the volunteer should whisper ‘Ted’ (if green) or remain silent (if red). Of the five condition types, the two production conditions were cued with a green cross, and the two listening conditions and the rest condition were cued with a red cross (see Figure 1b). Pilot testing in 5 subjects revealed that it took volunteers at least 200 ms to respond to the green or red cue, so we could present the cue 100 ms before the end of the scan and still ensure that subjects (on production trials) would not commence speaking during the scan, thereby using our 1400 ms silent period effectively.
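As an illustration of the blocked pseudorandom ordering described above, the sketch below builds one run of 180 trials from 36 blocks of five, with each block containing every condition exactly once. It is a simplified stand-in for the actual stimulus-delivery script: it does not enforce the additional balancing of transitional probabilities, and the function and variable names are illustrative.

```python
import random

CONDITIONS = ["Production-clear", "Production-masked",
              "Listen-clear", "Listen-masked", "Rest"]

def make_run_order(n_blocks=36, seed=None):
    """Return one run's trial order: n_blocks blocks of five trials, each
    block a random permutation of the five conditions, so a condition can
    repeat only across a block boundary."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        block = CONDITIONS[:]
        rng.shuffle(block)
        blocks.append(block)  # block structure preserved for yoking stimuli
    return [trial for block in blocks for trial in block]

# Example: one run's order of 180 trials
order = make_run_order(seed=1)
```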
Subjects practiced the whispering-on-cue task before scanning commenced, and we monitored their behaviour during scanning. Subjects’ vocal production and the auditory feedback signal were routed to separate channels and so could be monitored and recorded simultaneously. Performance on each trial was monitored in real time from the control room for possible errors. In addition, we inspected the recordings of both production and auditory feedback for each trial afterwards to ensure that every incorrect trial was identified and properly counted. The auditory feedback was not picked up by the microphone, so there was no danger of the recorded utterances being masked by it. Runs in which the error rate exceeded 5% were excluded from analysis, as explained below.
SPM2 (www.fil.ion.ucl.ac.uk/spm/spm2.html) was used for data analysis and visualization. Data were first realigned, within subjects, to the first true functional scan of the session (after discarding 2 dummy scans), and each individual’s structural image was coregistered to the mean fMRI image. The coregistered structural image was spatially normalized to the ICBM 152 T1 template, and the realigned functional data were normalized using the same deformation parameters. The fMRI data were then smoothed using a Gaussian kernel of 10 mm (FWHM).
Data from each subject were entered into a fixed-effects general linear model using an event-related analysis procedure (Josephs & Henson, 1999). Four event types, corresponding to the four experimental conditions, were modeled for each run. We included the six parameters from the motion correction (realignment) stage of preprocessing as regressors in our model to ensure that variability due to head motion was properly accounted for. The canonical hemodynamic response function (HRF) was used as the basis function. A high-pass filter (cut-off 128 s) and an AR(1) correction for serial autocorrelation were applied. Contrast images assessing main effects, simple effects, and interactions were created for each subject, and these were entered into random-effects analyses (one-sample t-tests) comparing the mean parameter-estimate difference over subjects to zero. Clusters were deemed significant if they exceeded a statistical threshold of p < 0.05 after correction for multiple comparisons at the cluster level.
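For readers unfamiliar with this type of event-related model, the sketch below illustrates in simplified form how a condition regressor is built: event onsets are convolved with a canonical double-gamma HRF. This is not the SPM2 code used for the analysis; the parameter values are commonly cited defaults, sparse-sampling timing and motion regressors are omitted, and all names are illustrative.

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr, duration=32.0):
    """Double-gamma canonical HRF sampled every `tr` seconds
    (shape parameters 6 and 16 with a 1/6 undershoot ratio are the
    commonly cited defaults; values here are illustrative)."""
    t = np.arange(0.0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return hrf / hrf.sum()

def condition_regressor(onset_scans, n_scans, tr):
    """Convolve a stick function of event onsets (in scan units) with the
    HRF to form one column of the design matrix."""
    sticks = np.zeros(n_scans)
    sticks[np.asarray(onset_scans, dtype=int)] = 1.0
    return np.convolve(sticks, canonical_hrf(tr))[:n_scans]

# Example: a regressor for events on scans 2, 7 and 12 of a 20-scan run,
# with an effective sampling interval of 3 s (1400 ms gap + 1600 ms scan).
reg = condition_regressor([2, 7, 12], n_scans=20, tr=3.0)
```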
Errors occurred when volunteers (1) spoke after a red cross, (2) remained silent after a green cross, or (3) spoke so quietly that the gated noise was not triggered on Production-masked trials. We excluded any trial in which an error occurred. In addition, we excluded from analysis any run in which errors exceeded 5% of the trials; this happened in 3 runs, one run from each of 3 individuals. Performance exceeded 95% correct in all remaining runs. For the three individuals with missing data, contrast images were computed at the single-subject level from the remaining two runs, and these were included in the random-effects analyses across subjects.
We analyzed main effects of our two factors (production vs listening, and speech feedback vs noise feedback) and interactions between these factors at the group level. When interactions were significant, we also analyzed simple effects. We start by reporting interactions because they are the main focus of the study and because their presence influences the interpretation of the main effects.
We reasoned that regions involved in processing auditory feedback during talking ought to exhibit a greater increase in activity for noise compared to normal speech, when these are heard as the auditory concomitants of one’s own utterances compared to when one is simply listening to them, without talking. We observed such an interaction in the posterior superior temporal gyrus (STG) bilaterally (see Table 2 and Figure 2). In order to better understand how differences among conditions produced this significant interaction, we explored the simple effects that constitute this interaction. We observed that, within the bilateral STG regions where this interaction yielded significant activity, activation for Production-masked trials was significantly greater than for Production-clear trials. In the left hemisphere, there was a significant cluster (see Table 3) in the posterior STG, extending into the middle temporal gyrus (MTG) and posteriorly into the supramarginal gyrus. In the right hemisphere, there was one cluster in the STG. The contrast of Listen-clear − Listen-masked yielded activation that was strongly lateralized to the left hemisphere, with a peak cluster in the left MTG extending into the middle superior temporal sulcus (STS), and a cluster in the right anterior STG (see Table 4). The left MTG activation peak for this contrast is also in the neighborhood of areas shown in previous studies that contrasted speech stimuli with nonspeech sounds (see Table 1 and Figure 3). The overlap between regions exhibiting a significant Feedback type by Production/Perception interaction and regions sensitive to speech (activated more by speech than noise during passive listening) is shown in Figure 4. As can be seen, a region of the STG/MTG is sensitive to speech and exhibits increased activation during production when masking noise is heard.
The opposite interaction contrast revealed areas where normal feedback yielded more activation than masking noise during talking, compared to passive listening. These regions included the left inferior frontal gyrus, left superior parietal lobule, left anterior cingulum, left middle occipital gyrus, left caudate, and right fusiform gyrus (see Table 5). Again, in order to understand this interaction, we explored the simple effects. The contrast of Production-clear − Production-masked yielded no significant activation. However, the contrast of Listen-masked − Listen-clear revealed significant clusters in regions highlighted by the interaction (see Table 5). Thus, the interaction appears to be due to a greater difference in these areas when passively listening to noise compared to speech, relative to hearing these sounds while talking.
Production − Listening activated an extremely large region involving both hemispheres, centered on the left inferior frontal gyrus. Much greater activity during production than listening is consistent with a number of previous studies of speech motor control (Blank et al., 2002; Dhanjal et al., 2008; Riecker et al., 2005; Wildgruber et al., 1996; Wise et al., 1999). For example, Wilson et al. (2004) thresholded activation maps at p < 10^−4 for listening conditions and p < 10^−12 for speech production conditions in order to achieve comparable levels of activity, which suggests that speech production yielded much more activation than listening in their study. When we increased the threshold to p < 10^−11, we observed clusters in the left inferior frontal gyrus, left postcentral gyrus, and right thalamus (see Figure 5 and Table 6), consistent with previous studies investigating vocal production (Kleber et al., 2007; Moser et al., 2009; Riecker et al., 2008). The reverse contrast, in which activity during production conditions was subtracted from that during listening conditions, did not reveal any significant activation.
We did not observe significant activation at the whole-brain level when comparing speech with noise. This is probably due to the strong Feedback type by Production/Perception interaction in speech-sensitive regions described earlier; the effect of this crossover interaction is to markedly attenuate the main effect, rendering it nonsignificant. As noted above, during passive listening, signal increases in response to speech as opposed to noise were statistically significant in a left MTG region that many other studies have identified as speech sensitive.
Brain regions more responsive to noise than to speech included the left inferior frontal gyrus, left postcentral gyrus, left putamen, left cerebellum, and right supramarginal gyrus (see Table 7). Significant activation in auditory regions was not observed for this contrast.
In the bilateral STG, greater activity is elicited during speech production when the auditory concomitants of one’s own speech are masked with noise, compared with when they are heard clearly. These regions, together with more inferior STS/MTG regions, also exhibit greater activity when listening to speech compared to noise, consistent with them being speech-sensitive. That the same STG/MTG region is recruited both for the perception of speech and for processing an error signal during production when the predictive signal and auditory concomitant do not match implies that speech-sensitive regions are implicated in an online articulatory control system.
The signal modulation that we observed here was an increase for masked speech compared to clear speech during production but not listening. Although the relationship between hemodynamic response and neural activity is uncertain, this pattern is consistent with neurophysiological work, which demonstrates a release from neural suppression when feedback is altered during production. For example, Eliades & Wang (2008) have observed that, in marmosets, the majority of auditory cortex neurons exhibited suppression during vocalization. However, this vocalization-induced suppression of neural activity was significantly reduced when auditory feedback was altered during vocal production but not during passive listening (Eliades & Wang, 2005, 2008). The attenuation of suppression implies changes in balance of excitatory and inhibitory processes, which would affect regional metabolic energy demands and alter vascular response (Logothetis, 2008). This could in principle manifest as changes in the blood oxygen level that would be detectable as fMRI BOLD signal. Our observation of activity in the bilateral posterior STG in response to altered feedback in production is consistent with such a pattern of neural activity. Models of speech motor control (e.g., Guenther 2006) suggest that vocal motor centers generate specific predictions about the expected sensory consequences of articulatory gestures, which are compared with the actual sensory outcome. Elevated activity in the STG in the presence of mismatched auditory feedback may reflect the involvement of this region in an error detection mechanism when vocalization occurs (Bays et al., 2006; Blakemore et al., 1998; Matsuzawa et al., 2005; Sommer and Wurtz, 2006).
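In the schematic terms of such models, this comparison can be expressed as e(t) = s(t) − ŝ(t), where s(t) is the incoming auditory signal, ŝ(t) is the prediction generated from the current motor commands, and e(t) is the residual; a large residual, as when speech feedback is replaced by masking noise, would constitute the error signal to which the elevated STG response may correspond.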
The locations of the bilateral STG regions observed in the interaction contrast (and in particular the left posterior STG) are in accordance with a number of previous studies of speech processing (see Table 1), including studies in which speech and nonspeech stimuli are matched acoustically [e.g., Dehaene-Lambertz et al., 2005: (−60, −24, +4); Mottonen et al., 2006: (−61, −39, +2); Narain et al., 2003: (−52, −54, +14); and Scott et al., 2006: (−60, −44, +10)]. This lends support to the idea that this area is sensitive, if not specific, to speech sounds.
In the macaque monkey, the extreme capsule (EmC) interconnects the rostral part of the lateral and medial frontal cortex with the mid-part of the superior temporal gyrus and the cortex of the superior temporal sulcus. These frontal regions are connected with inferior parietal cortex through the middle longitudinal fasciculus (MdLF) (Petrides & Pandya, 1988, 2006; Makris & Pandya, 2009). This EmC-MdLF pathway, spanning frontal-parietal-temporal cortex, has been suggested to play a crucial role in language functions (Schmahmann et al., 2007; Makris & Pandya, 2009). Another fiber system, the superior longitudinal fasciculus–arcuate fasciculus (SLF-AF), courses between the caudal-dorsal prefrontal cortex and the caudal part of the superior temporal gyrus. This pathway has been implicated in sensory-motor processing (Makris & Pandya, 2009). A recent study using diffusion tensor imaging based tractography in humans has found that sublexical repetition of speech is mediated by the SLF-AF pathway, whereas auditory comprehension is subserved by the EmC-MdLF system (Saur et al., 2008). Therefore, the anatomical location of the posterior STG demonstrating signal increases for masked speech over clear speech during production, and the opposite during passive listening (i.e., the interaction), is such that it may be involved in both of these processing pathways, thereby serving both functions of speech processing and sensory-motor integration (Buchsbaum et al., 2005; Hickok et al., 2003; Okada & Hickok, 2006; Okada et al., 2003; Scott & Johnsrude, 2003; Warren et al., 2005).
A number of previous neuroimaging studies have reported bilateral STG activation in response to manipulated auditory feedback compared with normal feedback during reading aloud (Christoffels et al., 2007; Fu et al., 2006; Hashimoto & Sakai, 2003; McGuire et al., 1996; Toyomura et al., 2007). One feature shared by our study and these studies is that we all manipulated auditory feedback in a way that caused it to be different from what was expected. What distinguishes our study is that we have crossed the feedback-type manipulation common to these studies (normal vs altered feedback) with a task manipulation (production vs listening). It is precisely the specificity of this interaction that indicates that activity in brain regions involved in speech perception is modulated by speech production.
We used whispered speech in order to control the bone-conducted acoustic signals in the masking noise condition more effectively (Houde & Jordan, 2002; Paus, et al., 1996). Previous studies indicate that the auditory regions activated by whispered speech are very similar to those activated by vocalized speech, although the level of activation could be somewhat different (Haslinger et al., 2005; Schulz et al., 2005). This suggests that the use of whispered speech did not qualitatively affect our results.
Vocal production invariably entails both auditory and somatosensory goals (Nasir & Ostry, 2008). We did not manipulate somatosensory signals, owing to the complexity of such experimental setups (e.g., Tremblay et al., 2003). In addition, since speech production in our study involved only the repeated production of a single CVC word (‘Ted’), what we have observed is undoubtedly an underestimate of the functional network subserving speech production and speech perception. Future work could investigate how speech production and perception are coupled using somatosensory perturbations and more naturalistic speech stimuli.
We have observed enhanced activity in the STG region bilaterally during speech production when auditory feedback and the predicted auditory consequences of speaking do not match. The same region is sensitive to speech during listening. This suggests a self-monitoring/feedback system at work, presumably involved in controlling online articulatory planning. Furthermore, the network supporting speech perception appears to overlap with this self-monitoring system, in the STG at least, highlighting the intimate link between perception and production.
We thank the referees for their valuable comments. This work is supported by an R-01 operating grant from the US National Institutes of Health, NIDCD grant DC08092 (KM), and by grants from the Canadian Institutes of Health Research (IJ), the Natural Sciences and Engineering Research Council of Canada (IJ, KM), the Ontario Ministry of Research and Innovation, and Queen’s University (IJ and ZZ). IJ is supported by the Canada Research Chairs Program.