J Cogn Neurosci. Author manuscript; available in PMC 2010 August 1.
Published in final edited form as:
PMCID: PMC2862116

Functional overlap between regions involved in speech perception and in monitoring one’s own voice during speech production


The fluency and reliability of speech production suggests a mechanism that links motor commands and sensory feedback. Here, we examine the neural organization supporting such links by using fMRI to identify regions in which activity during speech production is modulated according to whether auditory feedback matches the predicted outcome or not, and examining the overlap with the network recruited during passive listening to speech sounds. We use real-time signal processing to compare brain activity when participants whispered a consonant-vowel-consonant word (‘Ted’) and either heard this clearly, or heard voice-gated masking noise. We compare this to when they listened to yoked stimuli (identical recordings of ‘Ted’ or noise) without speaking. Activity along the superior temporal sulcus (STS) and superior temporal gyrus (STG) bilaterally was significantly greater if the auditory stimulus was a) processed as the auditory concomitant of speaking and b) did not match the predicted outcome (noise). The network exhibiting this Feedback type by Production/Perception interaction includes an STG/MTG region that is activated more when listening to speech than to noise. This is consistent with speech production and speech perception being linked in a control system that predicts the sensory outcome of speech acts, and that processes an error signal in speech-sensitive regions when this and the sensory data do not match.


In the study of human communication, the relationship between speech perception and speech production has long been controversial (e.g. Liberman, 1967). That the two sides of spoken language are linked somehow is not in dispute - prelingual deafness without immediate cochlear implantation severely hinders the development of normal speech in children (Geers et al., 2003; Schauwers et al., 2004), and hearing impairment in adults affects aspects of speech production such as the control of fundamental frequency and intensity (Cowie et al., 1982). Talking, like all motor control, must require sensory feedback to maintain the accuracy and stability of movement (Levelt, 1983). However, the nature of the link between perception and action in speech, its cognitive structure and its neural organization, are not well understood. Whether the same neural system subserves auditory perception both for speech comprehension and for the purpose of articulatory control is not known.

Clinical data indicate that, if auditory feedback is eliminated, speech fluency is affected (Cowie et al., 1982; Geers et al., 2003; Schauwers et al., 2004), and if feedback is delayed, fluency is disrupted (Black, 1951; Fukawa et al., 1988; Howell & Powell, 1987; Mackay, 1968; Siegel et al., 1982). Given the rapidity of speech, it would be difficult for acoustic feedback to contribute to control as part of a servomechanism (Lashley, 1951), since the articulation of speech sounds would be finished by the time acoustic feedback could be processed. Alternatively, feedforward mechanisms, capable of predictive motor planning, have been proposed (Kawato, 1989). Such “internal models” of the motor system (Jordan and Rumelhart, 1992) are hypothesized to learn through sensory (e.g., auditory) feedback and thus involve representations of actions and their consequences.

One source of evidence for internal models in speech comes from auditory perturbation studies, in which talkers are observed to alter their speech production in response to altered acoustic feedback. For example, if the fundamental frequency of auditory feedback is shifted or if frequencies of vowel formants are altered, talkers accommodate rapidly as if to normalize their production so that it is acoustically closer to a desired output (Chen et al., 2007; Elman, 1981; Houde & Jordan, 1998; Jones & Munhall, 2000; Purcell & Munhall, 2006a,b). Such studies yield two important conclusions. First, that subjects rapidly compensate for the perturbation in subsequent trials indicates the presence of an error detection and correction mechanism (e.g., Houde & Jordan, 1998; Jones & Munhall, 2000). Second, the compensation persists for a short period of time after acoustic feedback is returned to normal (e.g., Purcell & Munhall, 2006a,b), suggesting that the error detection mechanism is producing learning that tunes the speech motor controller.

The neural network supporting this acoustic error detection and correction has been investigated in both nonhuman primates and humans. A recent study in marmosets reveals that vocalization-induced suppression within neurons in the auditory cortex is markedly reduced (firing rate increases) if feedback is altered by frequency shifting (Eliades & Wang, 2008). Recent functional neuroimaging studies in humans reveal increases in activity in the superior temporal areas when vocalization with altered feedback is compared to a condition with normal feedback (Christoffels et al, 2007; Hashimoto & Sakai, 2003; Fu et al, 2006; Tourville et al, 2008). This signal increase has been interpreted as evidence for an error signal that could be used to tune ongoing speech production (Guenther, 2006).

The engagement of the superior temporal region in processing speech during listening has been extensively documented. Superior temporal activation is observed for processing of speech stimuli (see Table 1), when compared to: noise (Binder et al., 1994; Jancke et al., 2002; Rimol et al., 2005; Obleser et al., 2006; Zatorre et al., 1992), tones (Ashtari et al., 2004; Binder et al., 1996; Benson et al., 2001; Binder et al., 2000; Burton & Small, 2006; Burton et al., 2000; Joanisse & Gati, 2003; O’Leary et al., 1996; Poeppel et al., 2004; Specht & Reul, 2003; Vouloumanos et al., 2001; Zaehle et al., 2004), and other nonspeech sounds (Belin et al., 2002; Benson et al., 2006; Gandour et al., 2003; Giraud & Price, 2001; Meyer et al., 2005; Thierry et al., 2003; Uppenkamp et al., 2006). Recent functional imaging studies in which speech and nonspeech conditions are very closely matched acoustically also report signal change in this region, suggesting that it is the processing of speech qua speech, and not the acoustic structure of speech, that is responsible for the signal change (Dehaene-Lambertz et al., 2005; Desai et al., 2008; Liebenthal et al., 2005; Mottonen et al., 2006; Narain et al., 2003; Scott et al., 2000; Scott et al., 2006; Uppenkamp et al., 2006).

Table 1
The activation peaks within the left temporal lobe reported in previously published neuroimaging studies in which listening to speech is compared to listening to a nonspeech stimulus are listed. The contrast for the present study listed in the table represents ...

Despite the evidence for the involvement of the superior temporal region in the processing of both error signal and speech, whether and how these two functions are linked has never been investigated in the same context. In the experiment presented here, we examine whether the same regions within subjects process the speaker’s own auditory feedback and also subserve the perception of speech. We argue that simply observing a change in auditory cortical activity during altered, compared to normal, feedback is not sufficient evidence of an error detection mechanism. A defining characteristic of a speech error signal is that it is generated uniquely during speech production, and not during listening. Accordingly, it is best identified as an interaction, in which the same acoustic difference between altered and normal feedback provokes a different pattern of activity in the context of talking than when the identical stimuli are presented in the context of listening. In addition, we will examine whether this error detection mechanism recruits regions that are also active – albeit in a different way – when listeners passively hear speech.

Accordingly, we will compare speech production to passive listening, with identical acoustic stimuli present in both conditions. On production trials, participants hear either their own voice as normal (unaltered feedback), or hear voice-gated, signal correlated noise (manipulated feedback). On listening trials, participants hear either recordings of their own voice or noise yoked to previous production trials. We identify brain regions that respond more to noise feedback than unaltered feedback during speech production (consistent with an error signal), but that are not sensitive to speech during passive listening, or that respond more to unaltered feedback than to noise feedback during passive listening. By assessing the interaction between speaking condition and feedback, we will examine whether the same brain regions contribute to speech perception for the purpose of ongoing articulatory control, and for the purpose of comprehension.

Materials and Methods


Written informed consent was obtained from twenty-one individuals (mean age 23 years, range 18 – 45 years; 16 females). All were right-handed, without any history of neurological or hearing disorder, and spoke English as their first language. Each participant received $15 to compensate them for their time. Procedures were approved by the Queen’s Health Sciences Research Ethics Board.

Procedure and functional image acquisition

We adopted a 2×2 factorial design (4 experimental conditions) and a low-level silence/rest control condition. The four experimental conditions were: Production-clear: Producing whispered speech (‘Ted’) and hearing this through headphones; Production-masked: Producing whispered speech (‘Ted’) and hearing voice-gated, signal-correlated masking noise (Schroeder, 1968), which is created by applying the amplitude envelope of the utterance to white noise; Listen-clear: Listening to the stimuli of Production-clear trials (without production); and Listen-masked: Listening to the stimuli of Production-masked trials (without production). Whispered speech was used to minimize bone conducted auditory feedback (Barany, 1938) and to make sure that noise would effectively mask speech (Houde & Jordan, 2002).

Functional magnetic resonance imaging data were collected on a 3T Siemens Trio MRI system, using a rapid sparse-imaging procedure (Orfanidou et al., 2006). In order to hear responses and minimize acoustic interference, stimuli were presented and responses recorded (i.e., a single trial occurred) in a 1400 ms silent period between successive 1600 ms scans (EPI; 26 slices; voxel size 3.3 × 3.3 × 4.0 mm). A high-resolution T1-weighted MPRAGE structural scan was also acquired on each subject.

In the two production conditions, participants spoke into the optical microphone (Phone-Or, Magmedix, Fitchburg, MA) and their utterances were digitized at 10 kHz with 16-bit precision using a National Instruments PXI-6052E input/output board. Real-time analysis was achieved using a National Instruments PXI-8176 embedded controller. Processed signals were converted back to analogue by the input/output board at 10 kHz with 16-bit precision and played over high-fidelity magnet-compatible headphones (NordicNeuroLab, Bergen, Norway) in real time. The processing delays were short enough (iteration delay less than 10 ms) that they would not be noticeable to listeners. The processed signals were also recorded and stored on a computer to be used for the yoked trials of the listen-only conditions. Sound was played at a comfortable listening level (approximately 85 dB). See Figure 1a.

Figure 1
a) Schematic diagram of the hardware used for the experiment, and its interconnections. b) Schematic diagram of three trials from a functional run. The trials are (in order) production-clear, listen-masked and rest. The cross appeared on the screen 200 ...

In the Production-clear condition, the whispered utterance was simply digitized and recorded, and played out unaltered. The masking noise in the Production-masked condition was produced by applying the amplitude envelope of the utterance to Gaussian white noise, so that the resulting noise had the envelope of the original speech signal, or signal-correlated noise (Schroeder, 1968). The masking noise was also temporally gated with the onset and offset of speaking. All our subjects reported in exit interviews that, while speaking in the Production-masked condition, they could hear only the masking noise through the headphones and not their own speech.
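As a sketch, the signal-correlated-noise transformation described above can be implemented offline as follows. The envelope extractor (rectify, then smooth with a moving average) and the 20 ms window are illustrative assumptions; the paper does not specify how the envelope was computed.

```python
import numpy as np

def signal_correlated_noise(speech, fs, env_ms=20.0, seed=0):
    """Replace the fine structure of a speech waveform with Gaussian white
    noise while preserving its amplitude envelope (signal-correlated
    noise, after Schroeder, 1968). The moving-average envelope extractor
    and 20 ms window are illustrative choices, not from the paper."""
    win = max(1, int(fs * env_ms / 1000.0))
    # Amplitude envelope: rectify, then smooth with a moving average
    envelope = np.convolve(np.abs(speech), np.ones(win) / win, mode="same")
    noise = np.random.default_rng(seed).standard_normal(speech.shape)
    scn = noise * envelope              # white noise carrying the envelope
    # Match RMS so the masker plays at a level comparable to the speech
    speech_rms = np.sqrt(np.mean(speech ** 2))
    scn_rms = np.sqrt(np.mean(scn ** 2))
    return scn * (speech_rms / scn_rms) if scn_rms > 0 else scn
```

Because the noise inherits the utterance’s envelope, it starts and stops with the voice, which is what makes gating it on speech onset and offset straightforward in the real-time system.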

Participants were scanned in three functional runs, each lasting 9 min and comprising 180 trials; 36 of each of the five conditions. Conditions were presented in pseudorandom order, with the limitation that transitional probabilities were approximately equal and all five conditions were presented once in a block of five trials (conditions could repeat at the transition from one block of five trials to the next). The stimuli for listening trials were taken from the production trials in the preceding block, except in the first run, when the stimuli for listening trials were from the production trials in the same block. Trials were 3000 ms long, comprising a 1400 ms period without scanning, followed by a 1600 ms whole-brain EPI acquisition (see Figure 1b). Each trial began with a fixation cross appearing in the middle of a black screen (viewed through mirrors placed on the head coil and in the scanner bore) 100 ms before the offset of the previous scan; this signaled the beginning of the trial, and the color of the cross indicated whether the volunteer should whisper ‘Ted’ (if green) or remain silent (if red). Of the five condition types, the two production conditions were cued with a green cross, and the two listening conditions and the rest condition were cued with a red cross (see Figure 1b). Pilot testing in 5 subjects revealed that it took volunteers at least 200 ms to respond to the green or red cue, so we could present it 100 ms before the end of the scan and still ensure that subjects (on production trials) would not commence speaking during the scan, thereby using our 1400 ms silent period effectively.
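The ordering constraint described above (every block of five trials contains each condition exactly once, with repeats possible only across block boundaries) amounts to a permuted-block schedule. A minimal sketch, not the authors’ actual randomization code:

```python
import random

CONDITIONS = ["production-clear", "production-masked",
              "listen-clear", "listen-masked", "rest"]

def make_run_schedule(n_blocks=36, seed=1):
    """Permuted-block trial order for one 180-trial run: each block of
    five trials holds every condition exactly once, so a condition can
    repeat only across a block boundary. The seed and condition labels
    are illustrative."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        block = CONDITIONS[:]
        rng.shuffle(block)
        blocks.append(block)
    return [trial for block in blocks for trial in block]
```

With 36 blocks of five trials, each run contains 180 trials and each condition appears exactly 36 times, matching the run structure described above.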

Subjects practiced the whispering-on-cue task before scanning commenced, and we monitored their behaviour during scanning. Subjects’ vocal production and the auditory feedback signal were segregated into two different channels, and could therefore be monitored and recorded simultaneously. Performance on each trial was monitored in real time in the control room for possible errors. In addition, we inspected the recordings of both production and auditory feedback for each trial afterwards to ensure that every incorrect trial was identified and properly counted. The auditory feedback was not picked up by the microphone, so there was no danger of the recorded utterances being masked by it. Sessions in which the error rate exceeded 5% were excluded from analysis, as explained below.


SPM2 was used for data analysis and visualization. Data were first realigned, within subjects, to the first true functional scan of the session (after discarding 2 dummy scans), and each individual’s structural image was coregistered to the mean fMRI image. The coregistered structural image was spatially normalized to the ICBM 152 T1 template, and the realigned functional data were normalized using the same deformation parameters. The fMRI data were then smoothed using a Gaussian kernel of 10 mm (FWHM).

Data from each subject were entered into a fixed-effects general linear model using an event-related analysis procedure (Josephs & Henson, 1999). Four event types were modeled for each run. We included six parameters from the motion correction (realignment) stage of preprocessing as regressors in our model to ensure that variability due to head motion was properly accounted for. We chose the hemodynamic response function (HRF) as the basis function. A high-pass filter (cut-off 128 sec) and AR(1) correction for serial autocorrelation were applied. Contrast images assessing main effects, simple effects, and interactions were created for each subject and these were entered into random-effects analyses (one-sample t-tests) comparing the mean parameter-estimate difference over subjects to zero. Clusters were deemed significant if they exceeded a statistical threshold of p<0.05 after correction for multiple comparisons at the cluster level.
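In a GLM of this kind, the main effects and the interaction reported below reduce to weight vectors over the four condition regressors. The regressor ordering and the beta values in this sketch are illustrative assumptions, not taken from the authors’ SPM batch:

```python
import numpy as np

# Assumed regressor order: [PC, PM, LC, LM]
# (Production-clear, Production-masked, Listen-clear, Listen-masked).
contrasts = {
    # Interaction: (PM - PC) - (LM - LC)
    "feedback_by_task": np.array([-1.0, 1.0, 1.0, -1.0]),
    # Main effect of task: (PC + PM) - (LC + LM)
    "production_vs_listening": np.array([1.0, 1.0, -1.0, -1.0]),
    # Main effect of feedback type: (PC + LC) - (PM + LM)
    "speech_vs_noise": np.array([1.0, -1.0, 1.0, -1.0]),
}

def contrast_value(beta, weights):
    """Per-voxel contrast estimate: dot product of the condition
    parameter estimates (betas) with the contrast weight vector."""
    return float(np.dot(weights, beta))

# Illustrative betas for one voxel: masked feedback raises activity
# during production more than during listening.
beta = np.array([2.0, 3.5, 1.0, 1.2])
interaction = contrast_value(beta, contrasts["feedback_by_task"])  # 1.3
```

Each contrast sums to zero, so it tests a difference among conditions rather than overall activation; the per-subject images of these values are what enter the one-sample t-tests described above.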



Errors occurred when volunteers: 1) spoke after a red cross; 2) remained silent after a green cross; or 3) spoke so quietly that the gated noise was not triggered on production-masked trials. We excluded any trial in which an error occurred. In addition, we excluded from analysis any run in which errors exceeded 5% of the trials; this happened in 3 runs: one run from each of 3 individuals. Performance exceeded 95% correct in all remaining runs. For the three individuals with missing data, contrast images were computed at the single-subject level from the remaining two runs and these were included in the random-effects analyses across subjects.

Functional data

We analyzed main effects of our two factors (production vs listening, and speech feedback vs noise feedback) and interactions between these factors at the group level. When interactions were significant, we also analyzed simple effects. We start by reporting interactions because they are the main focus of the study and because their presence influences the interpretation of the main effects.


(Production-masked − Production-clear) vs (Listen-masked − Listen-clear)

We reasoned that regions involved in processing auditory feedback during talking ought to exhibit a greater increase in activity for noise compared to normal speech, when these are heard as the auditory concomitants of one’s own utterances compared to when one is simply listening to them, without talking. We observed such an interaction in the posterior superior temporal gyrus (STG) bilaterally (see Table 2 and Figure 2). In order to better understand how differences among conditions produced this significant interaction, we explored the simple effects that constitute this interaction. We observed that, within the bilateral STG regions where this interaction yielded significant activity, activation for Production-masked trials was significantly greater than for Production-clear trials. In the left hemisphere, there was a significant cluster (see Table 3) in the posterior STG, extending into the middle temporal gyrus (MTG) and posteriorly into the supramarginal gyrus. In the right hemisphere, there was one cluster in the STG. The contrast of Listen-clear − Listen-masked yielded activation that was strongly lateralized to the left hemisphere, with a peak cluster in the left MTG extending into the middle superior temporal sulcus (STS), and a cluster in the right anterior STG (see Table 4). The left MTG activation peak for this contrast is also in the neighborhood of areas shown in previous studies that contrasted speech stimuli with nonspeech sounds (see Table 1 and Figure 3). The overlap between regions exhibiting a significant Feedback type by Production/Perception interaction and regions sensitive to speech (activated more by speech than noise during passive listening) is shown in Figure 4. As can be seen, a region of the STG/MTG is sensitive to speech and exhibits increased activation during production when masking noise is heard.

Figure 2
Areas in which the difference in activity between the production-masked and production-clear conditions is significantly greater than the difference in activity between listen-masked and listen-clear conditions. Results are shown at p<0.001, uncorrected. ...
Figure 3
The location of peak activity in the left temporal lobes from the studies listed in Table 1. We used squares and circles to differentiate studies exploring speech perception under passive listening mode and using active tasks such as target detection ...
Figure 4
Overlap between areas that are speech sensitive and areas supporting feedback processing. Regions that are activated more by speech than by noise during passive listening are shown in green. Areas in which the difference in signal between masked and clear ...
Table 2
Areas in which the difference in activity between the Production-masked (PM) and Production-clear (PC) conditions is significantly greater than the difference in activity between Listen-masked (LM) and Listen-clear (LC) conditions. Cluster peaks are reported ...
Table 3
Areas that show increased activation for Production-masked (PM) relative to Production-clear (PC).
Table 4
Areas that are more activated for Listen-clear (LC) than for Listen-masked (LM).

(Production-clear − Production-masked) vs (Listen-clear − Listen-masked)

The opposite interaction contrast revealed areas where normal feedback yielded more activation than masking noise during talking, compared to passive listening. These regions included left inferior frontal gyrus, left superior parietal lobule, left anterior cingulum, left middle occipital gyrus, left caudate, and right fusiform (see Table 5). Again, in order to understand this interaction, we explored the simple effects. The contrast of Production-clear − Production-masked yielded no significant activation. However, the contrast of Listen-masked − Listen-clear revealed significant clusters in regions highlighted by the interaction (see Table 5). Thus, the interaction appears to be due to a greater difference in these areas when passively listening to noise compared to speech, relative to hearing these sounds while talking.

Table 5
Areas in which the difference in activity between the Production-clear (PC) and Production-masked (PM) conditions is significantly greater than the difference in activity between Listen-clear (LC) and Listen-masked (LM) conditions.

Main Effects

Production vs Listening: (Production-clear + Production-masked) vs (Listen-clear + Listen-masked)

Production − Listening activated an extremely large region involving both hemispheres, centered on the left inferior frontal gyrus. Much greater activity during production than listening is consistent with a number of previous studies of speech motor control (Blank et al., 2002; Dhanjal et al., 2008; Riecker et al., 2005; Wildgruber et al., 1996; Wise et al., 1999). For example, Wilson et al. (2004) thresholded activation maps at p < 10−4 for listening conditions and p < 10−12 for speech production conditions in order to achieve comparable levels of activity, which suggests that speech production yielded much more activation relative to listening in their study. When we increased the threshold to p < 10−11, we observed clusters in the left inferior frontal gyrus, left postcentral gyrus, and right thalamus (see Figure 5 and Table 6), consistent with previous studies investigating vocal production (Kleber et al., 2007; Moser et al., 2009; Riecker et al., 2008). The reverse contrast, in which activity during production conditions was subtracted from that during listening conditions, did not reveal any significant activation.

Figure 5
Areas that are activated more for Production than for Listening conditions. Results are shown at p<10−11, uncorrected.
Table 6
Areas that are more activated for Production (PC + PM) than for Listening (LC + LM). Cluster peaks are reported if they exceeded a statistical threshold of p<10−11 after correction for multiple comparisons at the cluster level.

Speech vs Noise: (Production-clear + Listen-clear) vs (Production-masked + Listen-masked)

We did not observe significant activation at the whole brain level when comparing speech with noise. This is probably due to the strong Feedback type by Production/Perception interaction in speech-sensitive regions described earlier; the effect of this crossover interaction is to markedly attenuate the main effect, rendering it nonsignificant. As noted above, during passive listening, signal increases in response to speech as opposed to noise were statistically significant in a left MTG region that many other studies have identified as speech sensitive.

Noise vs Speech: (Production-masked + Listen-masked) vs (Production-clear + Listen-clear)

Brain regions more responsive to noise than to speech include left inferior frontal gyrus, left postcentral gyrus, left putamen, left cerebellum, and right supramarginal gyrus (see Table 7). Significant activation in auditory regions for this contrast was not observed.

Table 7
Areas that are more responsive to Production-masked (PM) + Listen-masked (LM) than to Production-clear (PC) + Listen-clear (LC).


In the bilateral STG, greater activity is elicited during speech production when the auditory concomitants of one’s own speech are masked with noise, compared with when they are heard clearly. These regions, together with more inferior STS/MTG regions, also exhibit greater activity when listening to speech compared to noise, consistent with them being speech-sensitive. That the same STG/MTG region is recruited both for the perception of speech and for processing an error signal during production when the predictive signal and auditory concomitant do not match implies that speech-sensitive regions are implicated in an online articulatory control system.

The signal modulation that we observed here was an increase for masked speech compared to clear speech during production but not listening. Although the relationship between hemodynamic response and neural activity is uncertain, this pattern is consistent with neurophysiological work, which demonstrates a release from neural suppression when feedback is altered during production. For example, Eliades & Wang (2008) have observed that, in marmosets, the majority of auditory cortex neurons exhibited suppression during vocalization. However, this vocalization-induced suppression of neural activity was significantly reduced when auditory feedback was altered during vocal production but not during passive listening (Eliades & Wang, 2005, 2008). The attenuation of suppression implies changes in balance of excitatory and inhibitory processes, which would affect regional metabolic energy demands and alter vascular response (Logothetis, 2008). This could in principle manifest as changes in the blood oxygen level that would be detectable as fMRI BOLD signal. Our observation of activity in the bilateral posterior STG in response to altered feedback in production is consistent with such a pattern of neural activity. Models of speech motor control (e.g., Guenther 2006) suggest that vocal motor centers generate specific predictions about the expected sensory consequences of articulatory gestures, which are compared with the actual sensory outcome. Elevated activity in the STG in the presence of mismatched auditory feedback may reflect the involvement of this region in an error detection mechanism when vocalization occurs (Bays et al., 2006; Blakemore et al., 1998; Matsuzawa et al., 2005; Sommer and Wurtz, 2006).

The bilateral STG regions observed in the interaction contrast (and in particular the left posterior STG) are consistent with a number of previous studies of speech processing (see Table 1), including studies in which speech and nonspeech stimuli are matched acoustically [e.g., Dehaene-Lambertz et al., 2005: (−60, −24, +4), Mottonen et al., 2006: (−61, −39, +2), Narain et al., 2003: (−52, −54, +14), and Scott et al., 2006: (−60, −44, +10)]. This lends support to the idea that this area is sensitive, if not specific, to speech sounds.

In the macaque monkey, the extreme capsule (EmC) interconnects the rostral part of the lateral and medial frontal cortex with the mid-part of the superior temporal gyrus and the cortex of the superior temporal sulcus. These frontal regions are connected with inferior parietal cortex through the middle longitudinal fasciculus (MdLF) (Petrides & Pandya, 1988, 2006; Makris & Pandya, 2009). This EmC-MdLF pathway, spanning frontal-parietal-temporal cortex, has been suggested to play a crucial role in language functions (Schmahmann et al., 2007; Makris & Pandya, 2009). Another fiber system, the superior longitudinal fasciculus–arcuate fasciculus (SLF-AF), courses between the caudal-dorsal prefrontal cortex and the caudal part of the superior temporal gyrus. This pathway has been implicated in sensory-motor processing (Makris & Pandya, 2009). A recent study using diffusion tensor imaging based tractography in humans has found that sublexical repetition of speech is mediated by the SLF-AF pathway, whereas auditory comprehension is subserved by the EmC-MdLF system (Saur et al., 2008). Therefore, the anatomical location of the posterior STG demonstrating signal increases for masked speech over clear speech during production, and the opposite during passive listening (i.e., the interaction), is such that it may be involved in both of these processing pathways, thereby serving both functions of speech processing and sensory-motor integration (Buchsbaum et al., 2005; Hickok et al., 2003; Okada & Hickok, 2006; Okada et al., 2003; Scott & Johnsrude, 2003; Warren et al., 2005).

A number of previous neuroimaging studies have reported bilateral STG activation in response to manipulated auditory feedback compared with normal feedback during reading aloud (Christoffels et al., 2007; Fu et al., 2006; Hashimoto & Sakai, 2003; McGuire et al., 1996; Toyomura et al., 2007). One feature shared by our study and these studies is that we all manipulated auditory feedback in a way that caused it to be different from what was expected. What distinguishes our study is that we have crossed the feedback-type manipulation common to these studies (normal vs altered feedback) with a task manipulation (production vs listening). It is precisely the specificity of this interaction that indicates that activity in brain regions involved in speech perception is modulated by speech production.

We used whispered speech in order to control the bone-conducted acoustic signals in the masking noise condition more effectively (Houde & Jordan, 2002; Paus, et al., 1996). Previous studies indicate that the auditory regions activated by whispered speech are very similar to those activated by vocalized speech, although the level of activation could be somewhat different (Haslinger et al., 2005; Schulz et al., 2005). This suggests that the use of whispered speech did not qualitatively affect our results.

Vocal production invariably entails both auditory and somatosensory goals (Nasir & Ostry, 2008). We did not manipulate somatosensory signals, owing to the complexity of such experimental setups (e.g., Tremblay et al., 2003). In addition, since speech production in our study involved only the repeated whispering of a single CVC word (‘Ted’), what we have observed is undoubtedly an underestimate of the functional network subserving speech production and perception. Future work could investigate how speech production and perception are coupled using somatosensory perturbations and more naturalistic speech stimuli.


We have observed enhanced activity in the STG region bilaterally during speech production when auditory feedback and the predicted auditory consequences of speaking do not match. The same region is sensitive to speech during listening. This suggests a self-monitoring/feedback system at work, presumably involved in controlling online articulatory planning. Furthermore, the network supporting speech perception appears to overlap with this self-monitoring system, in the STG at least, highlighting the intimate link between perception and production.


We thank the referees for their valuable comments. This work is supported by an R-01 operating grant from the US National Institutes of Health, NIDCD grant DC08092 (KM), and by grants from the Canadian Institutes of Health Research (IJ), the Natural Sciences and Engineering Research Council of Canada (IJ, KM), the Ontario Ministry of Research and Innovation, and Queen’s University (IJ and ZZ). IJ is supported by the Canada Research Chairs Program.


  • Ashtari M, Lencz T, Zuffante P, Bilder R, Clarke T, Diamond A, Kane J, Szeszko P. Left middle temporal gyrus activation during a phonemic discrimination task. Neuroreport. 2004;15:389–393.
  • Barany E. A contribution to the physiology of bone conduction. Acta Oto-Laryngologica. 1938;26:1–228.
  • Bays PM, Flanagan JR, Wolpert DM. Attenuation of self-generated tactile sensations is predictive, not postdictive. PLoS Biology. 2006;4:e28.
  • Belin P, Zatorre RJ, Ahad P. Human temporal-lobe response to vocal sounds. Cognitive Brain Research. 2002;13:17–26.
  • Benson RR, Richardson M, Whalen DH, Lai S. Phonetic processing areas revealed by sinewave speech and acoustically similar non-speech. NeuroImage. 2006;31:342–353.
  • Benson RR, Whalen DH, Richardson M, Swainson B, Clark VP, Lai S, Liberman AM. Parametrically dissociating speech and nonspeech perception in the brain using fMRI. Brain and Language. 2001;78:364–396.
  • Binder JR, Frost JA, Hammeke TA, Bellgowan PS, Springer JA, Kaufman JN, Possing ET. Human temporal lobe activation by speech and nonspeech sounds. Cerebral Cortex. 2000;10:512–528.
  • Binder JR, Frost JA, Hammeke TA, Rao SM, Cox RW. Function of the left planum temporale in auditory and linguistic processing. Brain. 1996;119:1239–1247.
  • Binder JR, Rao SM, Hammeke TA, Yetkin FZ, Jesmanowicz A, Bandettini PA, Wong EC, Estkowski LD, Goldstein MD, Haughton VM. Functional magnetic resonance imaging of human auditory cortex. Annals of Neurology. 1994;35:662–672.
  • Black JW. The effect of delayed side-tone upon vocal rate and intensity. Journal of Speech and Hearing Disorders. 1951;16:56–60.
  • Blakemore SJ, Wolpert DM, Frith CD. Central cancellation of self-produced tickle sensation. Nature Neuroscience. 1998;1:635–640.
  • Blank SC, Scott SK, Murphy K, Warburton E, Wise RJ. Speech production: Wernicke, Broca and beyond. Brain. 2002;125:1829–1838.
  • Buchsbaum BR, Olsen RK, Koch PF, Kohn P, Kippenhan JS, Berman KF. Reading, hearing, and the planum temporale. NeuroImage. 2005;24:444–454.
  • Burton MW, Small SL. Functional neuroanatomy of segmenting speech and nonspeech. Cortex. 2006;42:644–651.
  • Burton MW, Small SL, Blumstein S. The role of segmentation in phonological processing: an fMRI investigation. Journal of Cognitive Neuroscience. 2000;12:679–690.
  • Chen SH, Liu H, Xu Y, Larson CR. Voice F0 responses to pitch-shifted voice feedback during English speech. Journal of the Acoustical Society of America. 2007;121:1157–1163.
  • Christoffels IK, Formisano E, Schiller NO. Neural correlates of verbal feedback processing: An fMRI study employing overt speech. Human Brain Mapping. 2007;28:868–879.
  • Cowie R, Douglas-Cowie E, Kerr AG. A study of speech deterioration in post-lingually deafened adults. Journal of Laryngology and Otology. 1982;96:101–112.
  • Dehaene-Lambertz G, Pallier C, Serniclaes W, Sprenger-Charolles L, Jobert A, Dehaene S. Neural correlates of switching from auditory to speech perception. NeuroImage. 2005;24:21–33.
  • Desai R, Liebenthal E, Waldron E, Binder JR. Left posterior temporal regions are sensitive to auditory categorization. Journal of Cognitive Neuroscience. 2008;20:1174–1188.
  • Dhanjal NS, Handunnetthi L, Patel MC, Wise RJ. Perceptual systems controlling speech production. Journal of Neuroscience. 2008;28:9969–9975.
  • Eliades SJ, Wang X. Dynamics of auditory-vocal interaction in monkey auditory cortex. Cerebral Cortex. 2005;15:1510–1523.
  • Eliades SJ, Wang X. Neural substrates of vocalization feedback monitoring in primate auditory cortex. Nature. 2008;453:1102–1106.
  • Elman JL. Effects of frequency-shifted feedback on the pitch of vocal productions. Journal of the Acoustical Society of America. 1981;70:45–50.
  • Fu CH, Vythelingum GN, Brammer MJ, Williams SC, Amaro E, Jr, Andrew CM, et al. An fMRI study of verbal self-monitoring: Neural correlates of auditory verbal feedback. Cerebral Cortex. 2006;16:969–977.
  • Fukawa T, Yoshioka H, Ozawa E, Yoshida S. Difference of susceptibility to delayed auditory feedback between stutterers and nonstutterers. Journal of Speech and Hearing Research. 1988;31:475–479.
  • Gandour J, Xu Y, Wong D, Dzemidzic M, Lowe M, Li XJ, Tong YX. Neural correlates of segmental and tonal information in speech perception. Human Brain Mapping. 2003;20:185–200.
  • Geers AE, Nicholas JG, Sedey AL. Language skills of children with early cochlear implantation. Ear and Hearing. 2003;24:46S–58S.
  • Giraud AL, Price CJ. The constraints functional neuroimaging places on classical models of auditory word processing. Journal of Cognitive Neuroscience. 2001;13:754–765.
  • Guenther FH. Cortical interactions underlying the production of speech sounds. Journal of Communication Disorders. 2006;39:350–365.
  • Hashimoto Y, Sakai KL. Brain activations during conscious self-monitoring of speech production with delayed auditory feedback: an fMRI study. Human Brain Mapping. 2003;20:22–28.
  • Haslinger B, Erhard P, Dresel C, Castrop F, Roettinger M, Ceballos-Baumann AO. “Silent event-related” fMRI reveals reduced sensorimotor activation in laryngeal dystonia. Neurology. 2005;65:1562–1569.
  • Hickok G, Buchsbaum B, Humphries C, Muftuler T. Auditory-motor interaction revealed by fMRI: speech, music and working memory in area Spt. Journal of Cognitive Neuroscience. 2003;15:673–682.
  • Houde JF, Jordan MI. Sensorimotor adaptation in speech production. Science. 1998;279:1213–1216.
  • Houde JF, Jordan MI. Sensorimotor adaptation of speech I: Compensation and adaptation. Journal of Speech, Language, and Hearing Research. 2002;45:295–310.
  • Howell P, Powell DJ. Delayed auditory feedback with delayed sounds varying in duration. Perception and Psychophysics. 1987;42:166–172.
  • Jancke L, Wustenberg T, Scheich H, Heinze HJ. Phonetic perception and the temporal cortex. NeuroImage. 2002;15:733–746.
  • Joanisse MF, Gati JS. Overlapping neural regions for processing rapid temporal cues in speech and nonspeech signals. NeuroImage. 2003;19:64–79.
  • Jones JA, Munhall KG. Perceptual calibration of F0 production: evidence from feedback perturbation. Journal of the Acoustical Society of America. 2000;108:1246–1251.
  • Jordan MI, Rumelhart DE. Forward models: supervised learning with a distal teacher. Cognitive Science. 1992;16:307–354.
  • Josephs O, Henson RN. Event-related functional magnetic resonance imaging: modelling, inference and optimization. Philosophical Transactions of the Royal Society B: Biological Sciences. 1999;354:1215–1228.
  • Kawato M. Motor theory of speech perception revisited from the minimum torque-change neural network model. 8th Symposium on Future Electron Devices; Tokyo, Japan. 1989. pp. 141–150.
  • Kleber B, Birbaumer N, Veit R, Trevorrow T, Lotze M. Overt and imagined singing of an Italian aria. NeuroImage. 2007;36:889–900.
  • Lashley KS. The problem of serial order in behavior. In: Jeffress LA, editor. Cerebral mechanisms in behavior. Wiley; New York: 1951.
  • Levelt WJ. Monitoring and self-repair in speech. Cognition. 1983;14:41–104.
  • Liberman AM, Cooper FS, Shankweiler DP, Studdert-Kennedy M. Perception of the speech code. Psychological Review. 1967;74:431–461.
  • Liebenthal E, Binder JR, Piorkowski RL, Remez RE. Short-term reorganization of auditory analysis induced by phonetic experience. Journal of Cognitive Neuroscience. 2003;15:549–558.
  • Liebenthal E, Binder JR, Spitzer SM, Possing ET, Medler DA. Neural substrates of phonemic perception. Cerebral Cortex. 2005;15:1621–1631.
  • Logothetis NK. What we can do and what we cannot do with fMRI. Nature. 2008;453:869–878.
  • MacKay DG. Metamorphosis of a critical interval: age-linked changes in the delay in auditory feedback that produces maximal disruption of speech. Journal of the Acoustical Society of America. 1968;43:811–821.
  • Makris N, Pandya DN. The extreme capsule in humans and rethinking of the language circuitry. Brain Structure and Function. 2009;213:343–358.
  • Matsuzawa M, Matsuo K, Sugio T, Kato C, Nakai T. Temporal relationship between action and visual outcome modulates brain activation: An fMRI study. Magnetic Resonance in Medical Sciences. 2005;4:115–121.
  • McGuire PK, Silbersweig DA, Frith CD. Functional neuroanatomy of verbal self-monitoring. Brain. 1996;119:907–917.
  • Meyer M, Zysset S, Yves von Cramon D, Alter K. Distinct fMRI responses to laughter, speech, and sounds along the human peri-sylvian cortex. Cognitive Brain Research. 2005;24:291–306.
  • Moser D, Fridriksson J, Bonilha L, Healy EW, Baylis G, Baker JM, Rorden C. Neural recruitment for the production of native and novel speech sounds. NeuroImage. 2009;46:549–557.
  • Mottonen R, Calvert GA, Jaaskelainen IP, Matthews PM, Thesen T, Tuomainen J, Sams M. Perceiving identical sounds as speech or non-speech modulates activity in the left posterior superior temporal sulcus. NeuroImage. 2006;30:563–569.
  • Mummery CJ, Ashburner J, Scott SK, Wise RJ. Functional neuroimaging of speech perception in six normal and two aphasic subjects. Journal of the Acoustical Society of America. 1999;106:449–457.
  • Narain C, Scott SK, Wise RJ, Rosen S, Leff A, Iversen SD, Matthews PM. Defining a left-lateralized response specific to intelligible speech using fMRI. Cerebral Cortex. 2003;13:1362–1368.
  • Nasir SM, Ostry DJ. Speech motor learning in profoundly deaf adults. Nature Neuroscience. 2008;11:1217–1222.
  • Obleser J, Boecker H, Drzezga A, Haslinger B, Hennenlotter A, Roettinger M, Eulitz C, Rauschecker JP. Vowel sound extraction in anterior superior temporal cortex. Human Brain Mapping. 2006;27:562–571.
  • Okada K, Hickok G. Left posterior auditory-related cortices participate both in speech perception and speech production: Neural overlap revealed by fMRI. Brain and Language. 2006;98:112–117.
  • Okada K, Smith KR, Humphries C, Hickok G. Word length modulates neural activity in auditory cortex during covert object naming. Neuroreport. 2003;14:2323–2326.
  • O’Leary DS, Andreasen NC, Hurtig RR, Hichwa RD, Watkins GL, Ponto LL, Rogers M, Kirchner PT. A positron emission tomography study of binaurally and dichotically presented stimuli: effects of level of language and directed attention. Brain and Language. 1996;53:20–39.
  • Orfanidou E, Marslen-Wilson WD, Davis MH. Neural response suppression predicts repetition priming of spoken words and pseudowords. Journal of Cognitive Neuroscience. 2006;18:1237–1252.
  • Paus T, Perry DW, Zatorre RJ, Worsley KJ, Evans AC. Modulation of cerebral blood flow in the human auditory cortex during speech: Role of motor-to-sensory discharges. European Journal of Neuroscience. 1996;8:2236–2246.
  • Petrides M, Pandya DN. Association fiber pathways to the frontal cortex from the superior temporal region in the rhesus monkey. Journal of Comparative Neurology. 1988;273:52–66.
  • Petrides M, Pandya DN. Efferent association pathways originating in the caudal prefrontal cortex in the macaque monkey. Journal of Comparative Neurology. 2006;498:227–251.
  • Poeppel D, Guillemin A, Thompson J, Fritz J, Bavelier D, Braun AR. Auditory lexical decision, categorical perception, and FM direction discrimination differentially engage left and right auditory cortex. Neuropsychologia. 2004;42:183–200.
  • Purcell DW, Munhall KG. Adaptive control of vowel formant frequency: evidence from real-time formant manipulation. Journal of the Acoustical Society of America. 2006a;120:966–977.
  • Purcell DW, Munhall KG. Compensation following real-time manipulation of formants in isolated vowels. Journal of the Acoustical Society of America. 2006b;119:2288–2297.
  • Riecker A, Brendel B, Ziegler W, Erb M, Ackermann H. The influence of syllable onset complexity and syllable frequency on speech motor control. Brain and Language. 2008;107:102–113.
  • Riecker A, Mathiak K, Wildgruber D, Erb M, Hertrich I, Grodd W, Ackermann H. fMRI reveals two distinct cerebral networks subserving speech motor control. Neurology. 2005;64:700–706.
  • Rimol LM, Specht K, Weis S, Savoy R, Hugdahl K. Processing of sub-syllabic speech units in the posterior temporal lobe: an fMRI study. NeuroImage. 2005;26:1059–1067.
  • Saur D, Kreher BW, Schnell S, Kummerer D, Kellmeyer P, Vry M, Umarova R, Musso M, Glauche V, Abel S, Huber W, Rijntjes M, Hennig J, Weiller C. Ventral and dorsal pathways for language. Proceedings of the National Academy of Sciences. 2008;105:18035–18040.
  • Schauwers K, Gillis S, Daemers K, De Beukelaer C, De Ceulaer G, Yperman M, Govaerts PJ. Normal hearing and language development in a deaf-born child. Otology and Neurotology. 2004;25:924–929.
  • Schmahmann JD, Pandya DN, Wang R, Dai G, D’Arceuil HE, de Crespigny AJ, Wedeen VJ. Association fibre pathways of the brain: parallel observations from diffusion spectrum imaging and autoradiography. Brain. 2007;130:630–653.
  • Schroeder MR. Reference signal for signal quality studies. Journal of the Acoustical Society of America. 1968;44:1735–1736.
  • Schulz GM, Varga M, Jeffires K, Ludlow CL, Braun AR. Functional neuroanatomy of human vocalization: an H215O PET study. Cerebral Cortex. 2005;15:1835–1847.
  • Scott SK, Blank CC, Rosen S, Wise RJ. Identification of a pathway for intelligible speech in the left temporal lobe. Brain. 2000;123:2400–2406.
  • Scott SK, Johnsrude IS. The neuroanatomical and functional organization of speech perception. Trends in Neurosciences. 2003;26:100–107.
  • Scott SK, Rosen S, Lang H, Wise RJ. Neural correlates of intelligibility in speech investigated with noise vocoded speech – A positron emission tomography study. Journal of the Acoustical Society of America. 2006;120:1075–1083.
  • Siegel GM, Schork EJ, Jr, Pick HL, Jr, Garber SR. Parameters of auditory feedback. Journal of Speech and Hearing Research. 1982;25:473–475.
  • Sommer MA, Wurtz RH. Influence of the thalamus on spatial visual processing in frontal cortex. Nature. 2006;444:374–377.
  • Specht K, Reul J. Functional segregation of the temporal lobes into highly differentiated subsystems for auditory perception: an auditory rapid event-related fMRI task. NeuroImage. 2003;20:1944–1954.
  • Thierry G, Giraud AL, Price CJ. Hemispheric dissociation in access to the human semantic system. Neuron. 2003;38:499–506.
  • Tourville JA, Reilly KJ, Guenther FH. Neural mechanisms underlying auditory feedback control of speech. NeuroImage. 2008;39:1429–1443.
  • Toyomura A, Koyama S, Miyamoto T, Terao A, Omori T, Murohashi H, Kuriki S. Neural correlates of auditory feedback control in human. Neuroscience. 2007;146:499–503.
  • Tremblay S, Shiller DM, Ostry DJ. Somatosensory basis of speech production. Nature. 2003;423:866–869.
  • Uppenkamp S, Johnsrude IS, Norris D, Marslen-Wilson W, Patterson RD. Locating the initial stages of speech-sound processing in human temporal cortex. NeuroImage. 2006;31:1284–1296.
  • Vouloumanos A, Kiehl KA, Werker JF, Liddle PF. Detection of sounds in the auditory stream: event-related fMRI evidence for differential activation to speech and nonspeech. Journal of Cognitive Neuroscience. 2001;13:994–1005.
  • Warren JE, Wise RJ, Warren JD. Sounds do-able: auditory-motor transformations and the posterior temporal plane. Trends in Neurosciences. 2005;28:636–643.
  • Wildgruber D, Ackermann H, Klose U, Kardatzki B, Grodd W. Functional lateralization of speech production at primary motor cortex: An fMRI study. Neuroreport. 1996;7:2791–2795.
  • Wilson SM, Saygin AP, Sereno MI, Iacoboni M. Listening to speech activates motor areas involved in speech production. Nature Neuroscience. 2004;7:701–702.
  • Wise RJS, Greene J, Buchel C, Scott SK. Brain regions involved in articulation. Lancet. 1999;353:1057–1061.
  • Zaehle T, Wustenberg T, Meyer M, Jancke L. Evidence for rapid auditory perception as the foundation of speech processing: a sparse temporal sampling fMRI study. European Journal of Neuroscience. 2004;20:2447–2456.
  • Zatorre RJ, Evans AC, Meyer E, Gjedde A. Lateralization of phonetic and pitch discrimination in speech processing. Science. 1992;256:846–849.