The discovery of mirror neurons, a class of neurons that respond when a monkey performs an action and also when the monkey observes others producing the same action, has promoted a renaissance for the Motor Theory (MT) of speech perception. This is because mirror neurons seem to accomplish the same kind of one-to-one mapping between perception and action that MT theorizes to be the basis of human speech communication. However, this seeming correspondence is superficial, and there are theoretical and empirical reasons to temper enthusiasm about the explanatory role mirror neurons might have for speech perception. In fact, rather than providing support for MT, mirror neurons are inconsistent with the central tenets of MT.
One of the more intriguing and highly cited theories in cognitive science is the Motor Theory of speech perception (MT) proposed by Alvin Liberman and his collaborators [1–3]. According to MT, humans perceive speech sounds not as sounds, per se, but as the `intended phonetic gestures of the speaker' . The proposal is that production and perception of speech share the same neural processes and representations, based in a linguistic module evolved specifically for communication. Empirical tests of the predictions of MT have provided mixed support, at best  (Box 1), and the number of proponents of MT in the field of speech perception has dwindled. However, the discovery of a class of mirror neurons in monkeys [5,6] and a purported homologous mirror system in humans [7,8] has resulted in a recent renaissance for MT. The discovery of mirror neurons has affected research in the neuroscience of speech and language processing, speech development and language evolution, to name several domains [9–14].
Mirror neurons are a class of neurons found in premotor cortex of the monkey that respond both when performing an action, such as grasping food, and when seeing someone else (such as a human) perform the same action . Thus, it has been suggested that perception and production of action potentially share a common neural code. The salient parallel between this description of mirror neurons and MT's proposal for a common perception and production code has not escaped attention. It has become common for articles on mirror neurons to reference MT and to suggest that the mirror system in humans could have an important role in speech perception [14–17]. This link is supported by the fact that the area of monkey cortex where mirror neurons were discovered (F5) might be homologous to Broca's area in humans, which has long been implicated in speech and language [18,19] (see also Ref. ). Moreover, a sub-class of mirror neurons – echo neurons – responds both to executing an action and to the sound resulting from such an action (e.g. the action and sound of crushing a peanut ).
This constellation of findings has led to proposals that `mirror neurons represent the link between sender and receiver that Liberman postulated…' , and that `the echo-neuron system mediates…speech perception' . The resemblance of mirror neurons to tenets of MT has been included in the justification for a more encompassing theory of mirror neurons to be fundamental to the evolution of human communication (e.g. Refs. [22,23]) and has reignited debate over MT [24–26].
To this point, the link between mirror neurons and MT has been based largely on analogy and the similarity of rather coarse descriptions of both mirror neurons and MT. There have been no explicit models to indicate exactly what role the mirror system would have in speech perception and there are not many empirical tests of putative links. We believe there are reasons to temper enthusiasm about the relationship of mirror neurons and MT. In particular, we argue that (i) mirror neurons do not provide a meaningful solution to the central dilemma of speech perception, (ii) mapping from a speech sound to action is of no help to communication without linguistically relevant representations and (iii) mirror neurons have been equated with an extreme caricature of MT. We conclude that, rather than providing support for MT, mirror neurons are actually inconsistent with MT and are unlikely to have a central role in speech perception.
The typical representation of speech as a series of phonemes gives the illusion that speech is produced as a sequence of discrete movements of the articulators (e.g. tongue, jaw, etc.). In fact, speech is the result of highly overlapping and continuous movements. The shape of the vocal tract at any time is a compromise among productions of several phonemes extending forward and backward in the sequence. That is, the movements and articulator shape associated with production of a phoneme are dependent on the preceding and following phonemes; a phenomenon referred to as `coarticulation' . Because speech acoustics are a function of the shape of the articulators across time, coarticulation leads to context-dependent acoustic realizations of phonemes. As a result, the mapping between acoustics and phoneme is notoriously complex.
As an example, think of the vowel in dud. If one examines the acoustics of a typical production, the vowel will not match an `uh' spoken in isolation. It actually can resemble the vowel in pep. This is because the tongue movement for the vowel in dud is influenced by the surrounding `d's . Does this mean that listeners hear the word as dead? No; rather, listeners seem to compensate for the effects of coarticulation, perceiving the originally intended phoneme [29–31]. Liberman and Mattingly  pointed out that this pattern of perception is problematic if one presumes that listeners perceive speech as `sound', but not if they perceive the speaker's intended motor plan or `gesture'.
The problem is that invariant phonemic gestures themselves also are not readily apparent in the acoustics, articulator shape or movement pattern. Uncovering the intended gestures of a speaker must require computation by the listener. MT is the proposal that this computation is accomplished by a specialized linguistic module using the same processes that calculate the compromises among movement patterns in producing speech.
The MT proposal that speech perception is the function of a specialized hard-wired language module leads to rather strong behavioral predictions. Specifically, classic speech perception phenomena are predicted to be present only for speech sounds (which are the sole input to the module) and only for human listeners (who are the sole possessors of such a module). Unfortunately for the tenability of MT, these strong predictions have not been supported.
A phenomenon purported to be diagnostic of the phonetic module is the context-sensitive categorization of speech sounds that compensates for the effects of coarticulation. The same `phonetic context effects' have been obtained with birds (quail) listening to human speech . Human listeners will also show shifts in the identification of a phoneme as a function of surrounding non-speech sounds (such as noises and tones) [39–41]. This interaction of speech and non-speech would not be possible within a dedicated phonetic module. These findings are difficult to accommodate within a motor-based account of perception (given that quail cannot produce the actions for human speech and humans cannot produce sine wave tones).
Other classic phenomena that have been proffered as evidence for a special speech mode of perception have also turned out to be more general patterns of perceptual behavior. For example, duplex perception, in which an acoustic signal is heard as both non-speech and part of a spoken syllable, has been demonstrated for door slams . The McGurk effect of auditory-visual interaction in speech also can be modeled as the outcome of general perceptual processes without the need of a phonetic module .
Over the last several decades, the empirical basis for MT has been eroded by these non-human and non-speech demonstrations. Moreover, most of these results also contradict any motor-based account of speech perception because animals cannot produce speech and because humans cannot produce many of the non-speech sounds that have been studied.
Coarticulation was the central problem of speech perception that led Liberman to propose MT. As Liberman and Mattingly point out, MT would be `meaningless' if there were, in fact, a one-to-one relationship between the acoustic signal and phonemes. MT proposes a mechanism to deal with the complex mapping between production and perception. It has been claimed that the mirror system is the neural basis of this mechanism (e.g. Ref. ), but mirror neurons provide no insight into how the mapping is accomplished. No process has been proposed that would enable a perceiver to map from context-dependent coarticulated acoustics to a motor response.
One might suggest that something like echo neurons represent the output of MT's phonetic module even if they tell us nothing about its operation. However, speech perception is quite different from the examples of sound-motor mapping observed so far for echo neurons. Unlike distinguishing the sound of a peanut being crushed from that of a stick being dropped, speech perception requires discriminating very subtle acoustic changes (such as those in beet versus bit), and these discriminations must be sensitive to context-dependent coarticulation across extended forward and backward-going time windows. This is more similar to being able to distinguish crushing dry roasted versus honey roasted peanuts and also determining if any sound differences arise from the actor tiring from crushing previous peanuts or planning to crush future peanuts. Echo neurons have not demonstrated a sufficient level of specificity to indicate that they can represent the subtle actions that characterize speech production.
MT is not just the proposal that speech perception uses the processes and representations responsible for speech production; it is explicit that these shared processes are part of a linguistic system and are wholly separate from other perceptual processes. This is not a side proposal of MT; it is at the core of the revised MT [2,32]. The contention that mirror neurons are responsible for speech perception directly contradicts this central position. In explaining how mirror neurons might have evolved to support speech function, mirror neuron theorists have argued explicitly that `Hand/arm and speech gestures must be strictly linked and must, at least in part, share a common neural substrate'. The proposal that speech and non-verbal gestures share a common substrate is at odds with MT's contention that speech movements are fundamentally different from non-verbal actions.
In fact, the importance of the linguistic module for MT is that it makes exactly this distinction between oral gestures that are phonetic and those that are not. It does the listener no good to get back to a motor representation unless that motor representation can be used by the linguistic system. One could suggest a second perceptual stage that maps from motor representations to phonemes. MT, however, rejects the idea of a second stage of processing in favor of a linguistic module in which motor representations are the phoneme representations.
The resemblance of mirror neurons and MT is really quite shallow. The explanatory power of MT comes not simply from the suggestion that perception and production have similar or identical representations; it comes from the idea that a specialized module solves two important problems in speech perception – how perception accommodates effects of coarticulation and how this process interfaces with the language system. Equating MT and mirror neurons does not just fail to provide explanatory power; it ignores the direct contradictions between the proposals of MT and the empirical observations of mirror neurons.
MT makes a very clear and strong prediction: damage to motor speech areas should produce deficits in speech perception. Yet, it has been known at least since the time of Broca that damage to left frontal regions can cause severe motor speech deficits while leaving speech recognition intact [43,44]. For example, a recent study reported that Broca's aphasics were indistinguishable from control subjects on an auditory word comprehension test. Lesions associated with Broca's aphasia tend to be relatively large, involving most of the lateral frontal lobe, motor cortex and anterior insula, and often extend posteriorly to include the parietal lobe [44–47]. Thus, the entire left hemisphere `mirror system' can be affected in Broca's aphasia while speech recognition remains intact. Neither MT nor mirror neuron accounts offer an explanation for this spared recognition.
Whereas MT does not address the speech recognition abilities of Broca's aphasics (e.g. a recent review of MT failed even to mention the syndrome ), motor theorists often note that Broca's aphasics can be impaired on discrimination of nonsense syllables such as ba-ba versus ba-da [48,49]. This impairment would seem to provide evidence favoring MT. However, it does not; speech `discrimination' doubly dissociates from speech `recognition' , indicating that syllable discrimination tasks are tapping some ability or abilities (e.g. working memory, executive or attentional processes) not necessary for ordinary, ecologically valid speech recognition (see Refs [51–53] for reviews).
Another syndrome clearly demonstrating the dissociability of motor-speech functions and speech understanding is mixed transcortical aphasia, characterized by a severe deficit in speech comprehension despite a well-preserved ability to repeat heard speech [54,55]. Damage to left frontal and posterior parietal regions (with sparing of Broca's area, superior temporal gyrus and the tissue in between) seems to disrupt networks playing a part in mapping speech onto conceptual-semantic representations while leaving the sensory-motor functions that support repetition of speech intact. This dissociation is opposite the deficits of Broca's aphasia, indicating that – directly counter to MT – preservation of motor speech functions is neither necessary nor sufficient for speech perception.
In short, data from lesion studies of speech processing unequivocally demonstrate that MT and associated mirror neuron theories of speech perception are incorrect in any strong form. This is not to say that sensory-motor circuits cannot contribute to speech recognition (Box 3). Top-down processes initiated in any frontal circuit (not just motor) might be able to influence speech recognition to some extent via sensory-motor circuits. However, this influence is modulatory, not primary.
Since the discovery of mirror neurons, a weaker version of MT has surfaced, possibly to accommodate the superficial resemblance to mirror neurons. This denuded MT posits that speech perception involves some aspect of the motor system or `involves access to the speech motor system' . It is clear that MT requires more than access to or involvement of the motor system in speech perception. MT equates the two systems: the motor system is necessary for speech perception.
The problem is that the weak version of MT is uncontroversial. Even the most critical opponents of MT would not suggest motor and perceptual systems do not interact or that the capacity for speech production has absolutely no bearing on perception . Given that we typically perceive the speech we produce, it seems unsurprising for there to be correlated neural activity corresponding to perception and production. However, the strong version of MT is probably untrue given empirical evidence demonstrated with animals listening to speech and humans listening to non-speech sounds (Box 1).
Brain imaging studies demonstrating motor area activity during speech perception provide evidence only for a weak form of MT. Claiming that motor area activity is present during perception is very different from claiming that motor activity plays a necessary part in perception, as proposed by MT. To this point, there is very little evidence for the latter (Box 2). For example, one often-cited study applied transcranial magnetic stimulation (TMS) to cortical regions related to tongue movement while participants listened to words with speech sounds requiring strong tongue involvement, words with moderate tongue involvement or non-speech sounds. Motor-evoked potentials recorded from the tongue increased when participants listened to words with strong tongue movement. The interpretation was that hearing these words activated neural circuits related to their production, which enhanced the TMS effect. One problem with this interpretation is that the same study observed no difference between words with moderate tongue involvement and non-speech sounds. Surely, a motor-based account would predict a difference between words produced with the tongue and sounds that cannot be produced. Moreover, even if one observed a speech versus non-speech difference, these sorts of studies cannot resolve whether motor system activity is essential to speech perception (as in MT) or just concomitant with it (Box 3).
Mirror neurons have been taken as evidence for MT because they provide neural confirmation of a perception–production link. However, it is important to be clear that MT is more than just the proposal that there is a link between speech perception and production, or that processes of speech perception and production interact. There is no debate that speech production and perception interact in some manner. Auditory cortical regions, for example, are activated during speech production tasks (e.g. Ref. ). It is the `nature' of the production–perception link that has not been established.
MT makes two claims about the nature of the link: (i) processes of speech motor planning are mandatory to speech perception and (ii) the shared representations of speech perception and production are articulatory (motor) and linguistic.
However, there is little evidence for a mandatory role for production processes in speech perception and much evidence against it (Box 2). Although perhaps not mandatory, it is possible that production can aid perception, especially in challenging listening situations. For example, production processes could be used to create representations of candidate words or syllables to be compared to the auditory input, as in an analysis by synthesis model [57,58]. However, note that such assistance from the production system could be based on comparisons of auditory representations and there is no need to think that this interaction would be required of normal speech perception.
In contrast to MT's claim that the shared representations of speech are motor and linguistic, others have proposed that the complementary relationship governs the speech production-perception link. By these accounts, speech `production' relies on speech `perception' and the shared representations are auditory and general (non-linguistic). Guenther's DIVA (Directions Into Velocities of Articulators) model uses comparisons of auditory perceptual representations and internally generated speech sound maps to calibrate speech production. Based on data from functional neuroimaging and aphasia, Hickok and Poeppel [51–53] have also argued that auditory processing has an important role in speech production (a view proposed by Wernicke in 1874 [43,44]). Thus, although the exact nature of the link between speech production and perception remains to be discovered, existing evidence strongly indicates that perceptual systems have a much stronger influence on production than motor systems have on perception.
Future research to uncover the nature of the speech perception–production link can investigate how changes in audition (e.g. through perceptual learning) affect speech production, how changes in production (e.g. through motor learning) affect speech perception and whether there are asymmetries in the size of these effects. Investigating speech motor disorders could also prove informative. Dysarthria is a class of motor control disorders that leads to abnormal speech production, but it is unknown whether there are concomitant deficits in speech perception that can be predicted by the motor disruptions.
Mirror neurons have often been discussed as representing the goals of movement patterns as opposed to the movements themselves. That is, the goal might be grasping a piece of food, but the approach path or grip type can vary. But what is the goal of speech movements? A simple answer to this question might be: `to create a particular sound'. As with grasping, the actual movements to accomplish this goal can vary. For example, speakers who have a bite-block obstruction placed in their mouths adjust their articulation to create sounds acoustically similar to speech produced without obstruction. (This compensatory movement on the part of the speaker could be because of auditory or somatosensory feedback, but the result is typically a preservation of the acoustic output [36,37].) If perception were gesture-based, as in MT, then speakers presumably would attempt to reproduce the original gesture as closely as possible, because a different underlying gesture would mislead perceivers. However, if perception were based on an auditory representation, it makes sense that speakers would attempt to match the sound as closely as possible.
Of course, stating the goals for the producer (and perceiver) is quite different from explaining how they are accomplished. We know remarkably little about how humans communicate using sound. Many of the questions that led to the original proposal of MT are still with us: how do speakers plan coarticulation? How do listeners recover coarticulated phonemes? How do infant listeners develop into speakers? How are acoustic signals transformed into linguistic entities? How do speakers and listeners modulate their roles in the communicative dance depending on the communication setting? However, the current interest in applying the construct of a mirror system to speech communication provides no new answers for these questions. Worse still, coupling discussion of mirror neurons with theories of speech communication encourages the impression that the complex interaction between speech perception and production has been `solved' (Box 3) when, in fact, the challenges of understanding it have yet to be met.
Grants from the National Institutes of Health and the National Science Foundation supported manuscript preparation. The authors thank Davi Vitela for her assistance.