Observing a speaker’s mouth profoundly influences speech perception. For example, listeners perceive an “illusory” “ta” when the video of a face producing /ka/ is dubbed onto an audio /pa/. Here, we show how cortical areas supporting speech production mediate this illusory percept and audiovisual (AV) speech perception more generally. Specifically, cortical activity during AV speech perception occurs in many of the same areas that are active during speech production. We find that different perceptions of the same syllable and the perception of different syllables are associated with different distributions of activity in frontal motor areas involved in speech production. Activity patterns in these frontal motor areas resulting from the illusory “ta” percept are more similar to the activity patterns evoked by AV/ta/ than they are to patterns evoked by AV/pa/ or AV/ka/. In contrast to the activity in frontal motor areas, stimulus-evoked activity for the illusory “ta” in auditory and somatosensory areas and visual areas initially resembles activity evoked by AV/pa/ and AV/ka/, respectively. Ultimately, though, activity in these regions comes to resemble activity evoked by AV/ta/. Together, these results suggest that AV speech elicits in the listener a motor plan for the production of the phoneme that the speaker might have been attempting to produce, and that feedback in the form of efference copy from the motor system ultimately influences the phonetic interpretation.
Observable mouth movements profoundly influence speech perception. The McGurk-MacDonald effect is a striking demonstration of this influence: When participants are presented with audiovisual (AV) speech stimuli, they report hearing a phoneme that is neither what they saw nor what they heard but rather a “fusion” of the auditory and visual modalities (McGurk and MacDonald 1976). For example, participants report hearing “ta” when a sound track containing the syllable /pa/ is dubbed onto a video track of a mouth producing /ka/. Another such effect, “visual capture,” occurs when listeners hear the visually presented syllable (i.e., /ka/ in the prior example).
Other remarkable findings demonstrate the extent to which normal visual cues can affect speech perception. Adding visible facial movements to speech enhances speech recognition to a degree comparable to removing up to 20 dB of noise from the auditory signal (Sumby and Pollack 1954). Multisensory enhancements in the intelligibility of degraded auditory speech are anywhere from 2 to 6 times greater than would be expected from comprehension of words or sentences presented in the auditory or visual modality alone (Risberg and Lubker 1978; Grant and Greenberg 2001). Importantly, such effects are not limited to unnatural or degraded stimulus conditions: Visual speech contributes to understanding clear but hard-to-comprehend speech or speech spoken with an accent (Reisberg et al. 1987).
How do observable mouth movements influence speech perception? Research on mirror neurons in the macaque and a putative mirror system in humans (see Rizzolatti and Craighero 2004 for a review) led us, like others, to propose that observable mouth movements elicit a motor plan in the listener that would be used by the listener to produce the observed movement (Skipper et al. 2005, 2006). Mirror neurons are a small subset of neurons, originally found in the macaque premotor area F5, that fire both during the production of goal-directed actions and during the observation of similar actions. Similar “mirroring” functionality has been ascribed to the human motor system (Rizzolatti and Craighero 2004).
Indeed, both behavioral and neurophysiological evidence support the notion that the human mirror system and, therefore, the motor system, play a critical role in speech perception when mouth movements are observed. Behaviorally, listeners’ perception of the McGurk-MacDonald effect is altered by viewing mouth movements produced by others or by oneself in a mirror (Sams et al. 2005). Similarly, speech production performance is changed or enhanced when producing a syllable and viewing someone saying that syllable compared with when that person is saying a different syllable (Kerzel and Bekkering 2000; Gentilucci and Cattaneo 2005). Neurophysiologically, activation (Campbell et al. 2001; Nishitani and Hari 2002; Olson et al. 2002; Callan, Jones, et al. 2003; Calvert and Campbell 2003; Paulesu et al. 2003; Buccino et al. 2004; Watkins and Paus 2004; Pekkola et al. 2006) and transcranial magnetic stimulation (Sundara et al. 2001; Watkins et al. 2003; Watkins and Paus 2004) of the motor system during the observation of mouth movements have been used to argue for a role of the mirror or motor system in AV speech perception.
Using functional magnetic resonance imaging (fMRI), we have previously shown that AV speech perception activates a network of motor areas including the cerebellum and cortical motor areas involved in planning and executing speech production and areas subserving proprioception related to speech production (Skipper et al. 2005). We also showed that it is primarily the visual aspects of observable mouth movements rather than the auditory content of speech that is responsible for this motor system activity. Auditory speech alone evoked far less activity in the motor system than AV speech, which is typical in speech perception studies that involve no explicit motor responses on the part of the listener (compare Zatorre et al. 1996; Belin et al. 2000, 2002; Burton et al. 2000; Zatorre and Belin 2001). Furthermore, activity in the motor system was modulated by the visual salience of AV speech: Although an increase in visually distinguishable phonemes in AV stories was associated with a concomitant increase in motor system activity, no such effect occurred during auditory story listening without visual input (Skipper et al. 2005). Together these findings support the notion that it is the visual information about the phonetic content of the stories and not simply the lexical or acoustic properties (or corresponding motor or gestural code for those acoustic properties) per se that induces motor system activity (but see Wilson et al. 2004; Wilson and Iacoboni 2006).
Consistent with other behavioral and neurophysiological research, we concluded that AV speech perception is mediated in part by cortical activity resulting from prior experience producing speech (Stevens and Halle 1967; Liberman and Mattingly 1985). That is, recognition of AV speech involves invoking motor commands that could have generated the observed mouth movements.
Positing that mirror neurons, the mirror system, or the motor system associated with speech production is involved in AV speech perception is not, however, explanatory. Recognizing a mouth movement (in the sense of activity in mirror neurons) and interpreting the significance of that movement with respect to speech perception are qualitatively different. Similar neural activity during production and perception of the same behavior certainly demonstrates a correspondence that could reflect matching or recognition of previously produced actions but it does not indicate how the motor system mediates the effect of observable mouth movements on speech perception.
To explain the mechanism by which motor system activity is understood by the brain as relevant to speech perception, we have developed a theoretical model of AV speech perception (Skipper et al. 2005; van Wassenhove et al. 2005; Skipper et al. 2006) based on the “hypothesize-and-test” or “analysis-by-synthesis” approach (Stevens and Halle 1967; see Discussion for further elaboration). Specifically, this model proposes that early multisensory speech representations, derived from sound patterns and observed facial movements, can be thought of as multisensory hypotheses (i.e., “unconscious inferences” see Helmholtz 1867; Hatfield 2002)—but not final interpretations—about the phonemes produced by a speaker. These hypotheses, extracted from sensory information, are mapped onto motor commands used in speech production. These motor commands are associated with hypotheses based on past experience producing speech. Activated motor commands then predict the acoustic and somatosensory consequences of executing a speech movement through efference copy (von Holst and Mittelstaedt 1950; sometimes referred to as “corollary discharge,” Sperry 1950 or “forward models” in the motor control literature, Jordan and Rumelhart 1992; Miall 2003; Callan et al. 2004). These internally generated sensory consequences can constrain phonetic interpretation of the incoming sensory information by, for example, lending weight to a particular interpretation of a stretch of utterance corresponding to the sensory representation of the phoneme statistically associated with the production of the hypothesized phoneme.
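The hypothesize-and-test loop described above can be expressed as a toy computation. Everything here is a hypothetical illustration, not the authors' model: the numbers, the phoneme set, and the multiplicative combination rule are invented to show how an efference-copy prediction could lend weight to one interpretation of ambiguous sensory input.

```python
# Toy sketch of the hypothesize-and-test account. All values and the
# multiplicative combination rule are hypothetical illustrations.

def analysis_by_synthesis(sensory_likelihood, motor_prediction):
    """Combine sensory evidence with a motor-derived (efference copy) prediction.

    sensory_likelihood: dict phoneme -> strength of sensory evidence
    motor_prediction:   dict phoneme -> predicted sensory match (0..1)
    Returns (winning phoneme, normalized posterior over phonemes).
    """
    scores = {ph: sensory_likelihood[ph] * motor_prediction.get(ph, 0.0)
              for ph in sensory_likelihood}
    total = sum(scores.values())
    posterior = {ph: s / total for ph, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

# Ambiguous audiovisual evidence: audio favors /pa/, video favors /ka/.
sensory = {"pa": 0.40, "ka": 0.35, "ta": 0.25}
# Hypothetical motor hypothesis: producing /ta/ best reconciles lips and sound.
prediction = {"pa": 0.2, "ka": 0.3, "ta": 0.9}

best, posterior = analysis_by_synthesis(sensory, prediction)
print(best)  # "ta" wins: the motor prediction tips the interpretation
```

In this caricature the sensory evidence alone would favor /pa/, but weighting each hypothesis by how well the motor system's predicted sensory consequences match the input shifts the final interpretation to "ta", mirroring the fusion percept.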
We have specified the cortical areas that support the mechanism described by this model (Fig. 1; see Skipper et al. 2006 for a review of imaging data associated with these areas). These regions are visual areas, primary auditory cortex (A1), posterior superior temporal (STp) areas, supramarginal gyrus (SMG), somatosensory cortices (SI/SII), ventral premotor (PMv) cortex, and the pars opercularis (POp). The sequence of processing of observable mouth movements begins with a multisensory representation corresponding to a hypothesis in multisensory STp areas (visual area → STp ← A1). This hypothesis is specified in terms of the motor goal of that movement (STp → POp). The motor goal of the movement is mapped to the motor commands that could generate the observed movement in a somatotopically organized manner, in this case the mouth area of PMv cortex (POp → PMv ← M1). These motor commands yield a prediction of both the auditory (PMv → STp) and somatosensory (PMv → SI/SII → SMG → STp) consequences of those commands had they been produced. These predictions can be used to constrain speech processing by supporting a particular interpretation or hypothesis (STp).
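The pathway just described can be restated compactly as a directed graph. This is only a bookkeeping aid built from the connections listed in the text (node names as in Fig. 1); no additional anatomy is assumed.

```python
# The model's cortical pathway expressed as a directed graph, restating
# the connections listed in the text (visual -> STp <- A1, etc.).
pathway = {
    "visual": ["STp"],
    "A1": ["STp"],
    "STp": ["POp"],            # multisensory hypothesis -> motor goal
    "POp": ["PMv"],            # motor goal -> motor commands (mouth area)
    "M1": ["PMv"],
    "PMv": ["STp", "SI/SII"],  # efference copy: auditory and somatosensory predictions
    "SI/SII": ["SMG"],
    "SMG": ["STp"],            # somatosensory prediction reaches STp
}

def reaches(graph, start, goal, seen=None):
    """Simple reachability check along directed edges."""
    seen = seen or set()
    if start == goal:
        return True
    seen.add(start)
    return any(reaches(graph, n, goal, seen)
               for n in graph.get(start, []) if n not in seen)

# Both prediction routes from PMv loop back to the STp hypothesis stage:
print(reaches(pathway, "PMv", "STp"))  # True
```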
Using event-related fMRI, we tested specific aspects of this model. First, we looked for evidence that observing mouth movements and producing those mouth movements are associated with similar patterns of motor activity. Specifically, neural activity during production of a syllable was expected to be similar to that generated when observing this syllable in an AV condition, or in a silent visual-alone (V) condition. However, neural activity during production of a syllable was expected to be less similar to activity evoked by an audio-only (A) syllable. This would be suggestive of a shared underlying mechanism for production and observation of speech that is based on the presence of observable mouth movements and the ability of the motor system to predict the acoustic and somatosensory consequences of the observed information (van Wassenhove et al. 2005).
Second, if the motor system plays a role in determining perception as proposed by the above model, we expect that different patterns of activity in the motor system evoked by AV stimuli would correspond to different perceptual experiences of those stimuli. For example, the exact same AV stimulus perceived as “ka” or “ta” would be expected to be associated with different patterns of activity in the motor system. Similarly, different AV stimuli perceived as “ka” and “ta” would also be expected to be associated with different patterns of activity in the motor system.
Third, we looked for a particular pattern of neural activity in the motor system that would constitute evidence for the hypothesize-and-test or analysis-by-synthesis model outlined above. To do this we capitalized on the unique property of the McGurk-MacDonald effect, in which sensory aspects of the stimulus do not correspond to participants’ perceptual experience. On our account, motor system activity is proposed to be an early hypothesis about the identity of sensory patterns. If this is the case, patterns of activity during AV speech perception in frontal motor areas would correspond to the participants’ perceptual experience and not the physical stimuli transduced by sensory receptors. Therefore, from an early stage of processing, the motor activity for the stimulus that elicits the McGurk-MacDonald effect (i.e., the stimulus that results in an illusory “ta” percept as described above) would more closely resemble the AV stimulus corresponding to participants’ perception of that stimulus (i.e., /ta/) than the stimuli corresponding to the sensory information that was actually presented (i.e., /pa/ or /ka/; Table 1A).
Fourth, we looked for evidence that the hypothesis about the phonetic identity of a stimulus, reflected in frontal motor system activity, results in a prediction of the sensory consequences of producing those movements and influences sensory cortices through efference copy. Again, using the McGurk-MacDonald effect, we looked for evidence that early activity in sensory areas initially corresponds to a pattern of activity that is consistent with the sensory properties of the stimulus (i.e., /pa/ or /ka/). However, if efference copy is involved in determining perception, subsequent patterns of activity in these sensory regions should come to correspond to a pattern of activity consistent with the motor hypothesis (i.e., /ta/; Table 1B).
The specific analyses performed to address these questions are described in greater detail in the Materials and Methods and Results sections below.
Participants were 21 right-handed native speakers of American English with normal hearing and vision and no history of neurological or psychological disturbance. Handedness was determined by the Edinburgh handedness inventory (Oldfield 1971). Participants gave written consent, and the Institutional Review Board of The University of Chicago approved the study.
All tasks and stimuli are described in Table 2. Participants passively listened to and/or watched speech stimuli during 3 separate runs of AV, V, or A stimuli. These runs were presented in a randomized and counterbalanced manner across participants. No explicit motor response was required and no supplementary task (e.g., discrimination, identification, etc.) was performed during this portion of the experiment. AV stimuli were AV/pa/, AV/ka/, and AV/ta/ spoken by a female actress filmed from the neck up. The actress made no noticeable facial movements besides those used in articulation. In addition, participants watched and listened to a stimulus designed to elicit the McGurk-MacDonald effect. This stimulus was composed of an audio /pa/ (A/pa/) dubbed onto the video of a face saying /ka/ (V/ka/), henceforth denoted as ApVk. Visual-alone stimuli were V/pa/, V/ka/, and V/ta/ and were created by removing the audio tracks from the AV stimuli. Audio-alone stimuli were A/pa/, A/ka/, and A/ta/ and were created by removing the video tracks from the AV stimuli.
Following the AV, V, and A runs, there was a second AV run corresponding to the fourth functional imaging run. After completion of the second AV run, the experimenter entered the room and placed a 3-button mouse under the participant’s right hand. The participant then completed 2 behavioral tasks intended to measure perception of ApVk. Prior to placement of the mouse, participants were not aware that they would be making behavioral responses. Participants indicated with a button press which stimulus they heard most frequently: Consistent perception of ApVk as “pa,” “ka”, or “ta” would result in hearing that stimulus twice as often as the others. Following the frequency judgment was an active AV run (aAV) corresponding to the fifth functional imaging run. During the aAV run participants responded “pa,” “ka,” or “ta” using 3-button alternative forced choice (AFC) to provide a continuous measure of perception of aApVk and recognition of congruent aAV/pa/, aAV/ka/, and aAV/ta/ in the scanning environment (see van Wassenhove et al. 2005 for behavioral results in the AV, A, and V conditions out of the scanning environment).
In a sixth and final run, the participants were asked to produce speech. Participants saw “pa,” “ka,” or “ta” printed on the screen for 1.5 s. Participants then repeated each syllable at a conversational rate for 12 s until they saw the word “STOP” on the screen. Participants were asked to speak at a normal volume (i.e., to avoid trying to compete with the noise of the scanner), to not open their mouths more than they might in a quiet conversation, and to avoid excessive movement.
Stimuli were always 1.5 s in length. Audio was delivered to the participants at a sound pressure level of 85 dB SPL through headphones containing MRI-compatible electromechanical transducers (Resonance Technology, Inc., Northridge, CA). Participants viewed video stimuli through a mirror attached to the head coil that allowed them to see a screen at the end of the scanning bed. Speech productions were recorded through a microphone attached to the participant’s headphones. Participants were monitored with a video camera.
Imaging was performed with blood oxygen level-dependent fMRI at 3 Tesla (GE Medical Systems, Milwaukee, WI) using spiral acquisition (Noll et al. 1995). A volumetric T1-weighted inversion recovery spoiled GRASS sequence was used to provide anatomical images on which landmarks could be found and functional activation maps could be superimposed. The 2 AV runs were each 7 min in length (280 whole-brain images each; a whole-brain image was collected every 1.5 s in this and all conditions) and contained a combined total of 272 stimuli: 68 each of ApVk, AV/pa/, AV/ka/, and AV/ta/. The V and A runs were also 7 min in length (280 whole-brain images in each condition) and contained a total of 135 stimuli: 45 each of /pa/, /ka/, and /ta/. The aAV run was 12 min in length (480 whole-brain images) and contained a total of 136 stimuli: 34 each of aApVk, aAV/pa/, aAV/ka/, and aAV/ta/. In each of these runs, stimuli were presented in a randomized event-related manner with a variable interstimulus interval (ISI; mean ISI = ~3 s; minimum ISI = 0 s; maximum ISI = 12 s). The speech production run was 7 min in length (244 whole-brain images) and contained 15 randomized blocks of speaking: five 12-s blocks for each of “pa,” “ka,” and “ta.”
Images were spatially registered in 3D space by Fourier transformation of each of the time points and corrected for head movement, using the AFNI software package (Cox 1996). Resulting data were analyzed using deconvolution/regression. A deconvolution analysis was used to generate impulse response functions (IRFs) of the fMRI signal on a voxel-wise basis (Glover 1999). There were separate regressors for each time point of ApVk, AV/pa/, AV/ka/, AV/ta/, A/pa/, A/ka/, A/ta/, V/pa/, V/ka/, and V/ta/. For the aAV run, regressors were aApVk when it was classified as “ta,” aApVk when it was classified as “ka” (see below), aAV/pa/, aAV/ka/, aAV/ta/, and an “other” category in which aApVk was classified as “pa” or aAV/pa/, aAV/ka/, or aAV/ta/ were classified as something other than “pa,” “ka,” or “ta,” respectively. In addition, the “other” category contained a randomly chosen subset of aApVk trials when classified as “ta” or “ka” so that the 2 regressors (i.e., aApVk when classified as “ta” and aApVk when classified as “ka”) each contained an equal number of trials. These were then fit using a linear least squares model to each time point of the hemodynamic response. Each stimulus had 12 such regressors, one for each time repetition (TR) between 0 and 11 (i.e., 0–18 s). This analysis resulted in an estimate of the hemodynamic response for each stimulus relative to a baseline state without a priori assumptions about the IRF. Also incorporated were 3 regressors of interest (“pa,” “ka,” and “ta”) for the speaking task. These regressors were waveforms approximating the hemodynamic response, generated by convolving a gamma-variate function with the onset time and duration of the blocks of interest. Finally, the model included one regressor each for the mean signal, linear, and quadratic trends, and 6 motion parameters for each of the 6 runs. Resulting regression coefficients were mean normalized.
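A minimal sketch of this kind of FIR deconvolution on a synthetic voxel may help clarify the approach. The design sizes, event timing, and noise level below are illustrative assumptions, not the study's actual design; only the logic (12 delta regressors per stimulus, fit by ordinary least squares) follows the text.

```python
import numpy as np

# Simplified FIR ("deconvolution") regression on a synthetic voxel: each
# stimulus type gets 12 delta regressors, one per TR lag 0-11 after onset
# (0-18 s at TR = 1.5 s, as in the text), and ordinary least squares
# recovers the impulse response function (IRF) without assuming its shape.

TR_LAGS = 12                     # lags 0-11 TRs
n_scans = 280                    # one 7-min run at TR = 1.5 s
onsets = np.arange(5, 245, 16)   # toy event onsets (in scans)

# FIR design matrix: column j is 1 at scan (onset + j) for lag j.
X = np.zeros((n_scans, TR_LAGS))
for t in onsets:
    for j in range(TR_LAGS):
        X[t + j, j] = 1.0

true_irf = np.exp(-0.5 * (np.arange(TR_LAGS) - 4) ** 2 / 4)   # toy HRF
rng = np.random.default_rng(0)
y = X @ true_irf + 0.1 * rng.standard_normal(n_scans)         # synthetic voxel

irf_hat, *_ = np.linalg.lstsq(X, y, rcond=None)               # estimated IRF
print(np.argmax(irf_hat))       # peak lag of the estimate, near the true peak at lag 4
```

In the actual analysis one such set of 12 regressors would exist for every stimulus type, alongside the nuisance regressors (mean, trends, motion) described above.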
Next, cortical surfaces were inflated (Fischl, Sereno, and Dale 1999), registered to a template of average curvature (Fischl, Sereno, Tootell, et al. 1999), and using SUMA (Saad et al. 2004) and AFNI packages (Cox 1996), surface-based analyses of variance (ANOVAs) were performed. ANOVA dependent variables were normalized regression coefficients from the deconvolution/regression analysis. These coefficients were smoothed the equivalent of 4 mm on the surface (Chung et al. 2005). One ANOVA contained 3 factors: syllables from the AV runs (ApVk, AV/pa/, AV/ka/, and AV/ta/) and TR following stimulus onset (0–11) as fixed factors and participant as a random factor. Similar ANOVAs were performed for the V and A runs with the exception that syllable contained only 3 levels (e.g., A/pa/, A/ka/, and A/ta/). A 2-factor ANOVA for the speech production run contained the fixed factor syllable (“pa,” “ka,” and “ta”) and participants as a random factor. Overlap analysis (i.e., intersection of active voxels) revealed where activation during AV (both excluding and including ApVk in the ANOVA), V, and A perception overlapped speech production of the same syllables. Overlap images were created using images each thresholded at P < 0.05 corrected using a false discovery rate (FDR) algorithm (Genovese et al. 2002) and logically conjoined. A paired t-test was conducted on the coefficients from the active run, comparing ApVk when classified as “ta” or “ka.” Again, images were corrected for multiple comparisons using FDR (corrected to P < 0.05).
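The thresholding-and-conjunction step can be illustrated on toy p values. A standard Benjamini-Hochberg procedure is sketched here; the specific FDR algorithm of Genovese et al. (2002) may differ in detail, and the p values are invented.

```python
import numpy as np

# Sketch of the overlap analysis: threshold each map at FDR-corrected
# P < 0.05, then logically conjoin the surviving voxels to find where
# perception and production activity overlap.

def fdr_threshold(pvals, q=0.05):
    """Boolean mask of p values significant at FDR level q (Benjamini-Hochberg)."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresh = q * (np.arange(1, m + 1) / m)   # q * i / m for sorted rank i
    below = p[order] <= thresh
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])     # largest rank with p_(i) <= q*i/m
        mask[order[:k + 1]] = True
    return mask

# Toy voxel-wise p values for two contrasts (invented for illustration).
perception_p = np.array([0.001, 0.01, 0.04, 0.20, 0.6])
production_p = np.array([0.002, 0.30, 0.03, 0.01, 0.5])

# Logical conjunction of the two corrected maps.
overlap = fdr_threshold(perception_p) & fdr_threshold(production_p)
print(overlap)
```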
Additional analyses within each individual were performed within regions of interest (ROIs). ROIs were created with Freesurfer’s automated parcellation (Fischl et al. 2004). Regions were then subparcellated using landmarks established by Caviness et al. (1996). The precentral gyrus and sulcus were subparcellated into dorsal and ventral sectors, and the superior temporal gyrus and sulcus was divided into 3 sectors (see Skipper et al. 2005 for anatomical definition of these regions). To test specific hypotheses we conducted both correlational and discriminant analysis (Haxby et al. 2001; see Results section for further explication as to why these 2 analyses were conducted and specific applications). To perform these analyses, we split the design matrix into odd and even trials and reran the deconvolution/regression analysis described earlier. Then, within each ROI, regression coefficients for each syllable for odd and even trials from significantly activated voxels across all AV syllables (P < 0.05 corrected using FDR) that were also active in the speech production condition (P < 0.05 corrected using FDR) were mean corrected by dividing each coefficient by the mean from the regression analysis. For a region to be considered for further analysis, it was required that at least 4 voxels be active within any given region and that more than half of the participants activate it.
First, pairwise correlations were performed on these voxels between ApVk and AV/pa/, AV/ka/, or AV/ta/ in each ROI by collapsing over odd and even trials by averaging coefficients. For example, in a given ROI there may have been 60 voxels that were active when participants observed ApVk, AV/pa/, AV/ka/, or AV/ta/ and that overlapped the production of “pa,” “ka,” or “ta.” These 60 voxels were then correlated in a pairwise manner. That is, the correlation across the 60 voxels was calculated for ApVk and AV/pa/, ApVk and AV/ka/, and ApVk and AV/ta/. Friedman tests were used to test for differences among the correlations between ApVk and AV/pa/, ApVk and AV/ka/, and ApVk and AV/ta/ for each of the voxels in each ROI. A Friedman test is a nonparametric test used to compare observations repeated within the same participants and is an alternative to a repeated measures ANOVA. The Friedman test ranks each correlation. Tukey-type nonparametric multiple comparison post hoc tests (Nemenyi) were used to test for differences among the resulting ranks for ApVk and AV/pa/, ApVk and AV/ka/, and ApVk and AV/ta/ within each ROI. This analysis resulted in a measure of the strength of the correlation between the distribution of activity associated with ApVk and that associated with each of the other syllables in an ROI. We performed this analysis both collapsing over time by averaging the coefficients associated with each point of the IRF in each ROI (e.g., Fig. 3A) and also on each individual time point of each IRF in each ROI (e.g., Fig. 3B–D).
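On synthetic data, this correlation-plus-Friedman procedure might look like the following. The group size matches the reported "ta" group, but the voxel patterns and effect sizes are invented; the ApVk pattern is constructed to resemble AV/ta/ so the expected result is visible.

```python
import numpy as np
from scipy import stats

# Sketch of the pairwise correlation analysis: per participant, correlate
# the ApVk voxel pattern with each congruent syllable's pattern, then
# compare the three correlation series with a Friedman test.

rng = np.random.default_rng(1)
n_subj, n_vox = 13, 60          # 13 "ta"-group participants, toy 60-voxel ROI

# Simulated patterns: ApVk shares structure with AV/ta/ by construction.
base = rng.standard_normal((n_subj, n_vox))
av_ta = base + 0.3 * rng.standard_normal((n_subj, n_vox))
av_pa = rng.standard_normal((n_subj, n_vox))
av_ka = rng.standard_normal((n_subj, n_vox))
apvk = base + 0.3 * rng.standard_normal((n_subj, n_vox))

def corrs(a, b):
    """Per-participant Pearson correlation across voxels."""
    return np.array([np.corrcoef(a[s], b[s])[0, 1] for s in range(n_subj)])

r_pa, r_ka, r_ta = corrs(apvk, av_pa), corrs(apvk, av_ka), corrs(apvk, av_ta)
stat, p = stats.friedmanchisquare(r_pa, r_ka, r_ta)
print(p < 0.05, r_ta.mean() > max(r_pa.mean(), r_ka.mean()))
```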
Second, discriminant analysis was conducted on the odd and even trials (Haxby et al. 2001). Correlation coefficients were calculated within syllables (e.g., between even and odd AV/ta/) and across syllables (e.g., between even AV/ta/ with odd AV/ka/). Within-syllable correlation coefficients were then compared with each of the across-syllable correlation coefficients. If the within-syllable correlation was larger than that of the across-syllable correlation coefficient, the comparison was counted as a correct identification. If the within-syllable correlation was larger than all across-syllable correlation coefficients, it was identified as correct against all other syllables. A t-test was used to test whether the accuracy of identifying a syllable for the whole group exceeded chance for the pairwise comparisons (50%) and chance when corrected against all other syllables (25%).
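A sketch of this split-half identification scheme on synthetic patterns follows. The voxel count and noise level are assumptions; the logic (a syllable is correctly identified when its odd-trial pattern correlates better with its own even-trial pattern than with another syllable's) follows the Haxby-style procedure described above.

```python
import numpy as np

# Split-half discriminant analysis on synthetic voxel patterns: each
# syllable has a stable pattern, and odd/even halves are noisy copies.

rng = np.random.default_rng(2)
syllables = ["pa", "ka", "ta"]
n_vox = 60

proto = {s: rng.standard_normal(n_vox) for s in syllables}
odd = {s: proto[s] + 0.5 * rng.standard_normal(n_vox) for s in syllables}
even = {s: proto[s] + 0.5 * rng.standard_normal(n_vox) for s in syllables}

def r(a, b):
    """Pearson correlation between two voxel patterns."""
    return np.corrcoef(a, b)[0, 1]

correct_pairwise, total = 0, 0
for s in syllables:
    within = r(odd[s], even[s])                  # within-syllable correlation
    for t in syllables:
        if t != s:
            total += 1
            correct_pairwise += within > r(odd[s], even[t])  # vs. across-syllable

accuracy = correct_pairwise / total              # chance = 0.5 for pairwise tests
print(accuracy)
```

In the actual analysis this accuracy would be computed per participant and tested against chance across the group with a t-test.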
Analysis of participants’ responses to the aApVk stimulus during the 3AFC task during run 5 indicated that participants typically labeled aApVk as either “ta” or “ka.” Therefore, K-means analysis was used to assign participants to 2 different groups based on their responses to aApVk. This resulted in a “ta” (i.e., “fusion”) group (N = 13) who responded “ta” when presented aApVk, and a “ka” (i.e., “visual capture”) group (N = 8) who responded “ka” when presented aApVk or who responded “ka” or “ta” with near equal likelihood when presented aApVk. Sixty-two percent of the “ta” group indicated that they heard “ta” most frequently during the frequency judgment and responded “ta” 83% of the time during the 3AFC when presented aApVk. Sixty-three percent of the “ka” group indicated that they heard “ka” most frequently and responded “ka” 61.5% of the time when presented aApVk during the 3AFC. Both groups responded “pa” less than 2% of the time when presented aApVk during the 3AFC. All participants were accurate (>95% correct) in classifying aAV/pa/, aAV/ka/, and aAV/ta/. There were no differences in participants’ accuracy in classifying aAV/pa/, aAV/ka/, and aAV/ta/.
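The participant grouping could be sketched as a simple 1-D 2-means clustering on response proportions. The per-participant proportions below are hypothetical (the paper does not report individual values), and the hand-rolled clustering stands in for whatever K-means implementation was actually used.

```python
import numpy as np

# Toy grouping of participants by proportion of "ta" responses to aApVk
# (values invented; 13 high responders + 8 would match the reported groups
# only by construction, so only 13 hypothetical participants are shown).
p_ta = np.array([0.9, 0.85, 0.8, 0.95, 0.88, 0.82, 0.9,
                 0.3, 0.2, 0.45, 0.5, 0.1, 0.35])

def two_means(x, iters=20):
    """1-D k-means with k = 2; label 1 = cluster nearer the higher center."""
    centers = np.array([x.min(), x.max()], dtype=float)
    for _ in range(iters):
        labels = (np.abs(x - centers[0]) > np.abs(x - centers[1])).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean()
    return labels, centers

labels, centers = two_means(p_ta)
# Cluster 1 (high p_ta) ~ "fusion"/"ta" group; cluster 0 ~ "visual capture".
print((labels == 1).sum(), (labels == 0).sum())
```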
Unless otherwise noted, all analyses were conducted on the “ta” (i.e., “fusion”) group of participants. Analyses focused on this group because participants’ responses during the behavioral tasks and participants’ activation patterns were relatively more homogeneous than those of the “ka” group. That is, the “ka” group was associated with higher variability in both the behavioral responses and activation patterns relative to the “ta” group.
Above-baseline activity for syllables in the AV, A, and V ANOVAs shows activation of areas typically associated with both speech perception and speech production (orange and blue in Fig. 2). To investigate whether passive viewing of the McGurk-MacDonald syllable elicited a different pattern than passive viewing of the congruent AV syllables, we contrasted ApVk with AV/pa/, AV/ka/, or AV/ta/ (Table 3 and Table 4). When contrasted, ApVk was significantly more active than AV/pa/ and AV/ka/ in more cortical areas than AV/ta/, especially in frontal areas (Table 3). ApVk was also significantly less active than AV/pa/ and AV/ka/ in more cortical areas than AV/ta/, again, especially in frontal regions (Table 4). With respect to these frontal areas, ApVk differed from both AV/pa/ and AV/ka/ in the ventral aspect of the premotor cortex, whereas ApVk did not differ from AV/ta/ in this area.
Thus, the activation patterns associated with ApVk showed a smaller difference in the extent of activity when compared with AV/ta/ than when compared with AV/pa/ or AV/ka/. Though the relative lack of difference between ApVk and AV/ta/ is a null result, these findings indicate that the incongruent ApVk stimulus produces patterns of cortical activity that are more similar to those of a congruent AV/ta/ syllable, especially in frontal areas including PMv cortex (see Olson et al. 2002 for a similar result). This suggests that the motor system treats ApVk as if it were the perceived “ta” rather than the observed (i.e., /ka/) or heard (i.e., /pa/) speech. We test this more explicitly below.
We next looked for evidence that perception of observable mouth movements results in similar activity as during speech production. This involvement should be detected by comparing activity during AV speech perception with speech production. Stimulus-evoked cortical activation during the passive viewing of congruent AV syllables spatially overlapped with activation found during the speaking of the same syllables, as determined by the intersection of the activity patterns from the main effect of syllable for the AV and speaking ANOVAs (orange in Fig. 2A and Table 5; P < 0.05 corrected). Excluding the occipitotemporal areas, Figure 2A and Table 5 show that overlap occurred primarily in anterior superior temporal (STa) cortex, STp, POp, PMv and dorsal premotor (PMd) cortex, SMG, and SI/SII. Of these areas, Table 5 indicates that speech perception most extensively overlapped activity elicited by speech production, that is, overlap of greater than 50% of the activated voxels, in STa, STp, and PMv.
Because ApVk is not a naturally spoken syllable and could result in a different pattern of activity compared with the congruent AV syllables, it was excluded from the ANOVA used in the above logical intersection analysis. When the logical intersection analysis was repeated with the ApVk stimulus in the ANOVA, neither the activated areas nor the distribution of activity within those areas significantly changed.
Overlap analysis of the activity resulting from V and A stimuli with speech production was used to assess the hypothesis that the recruitment of areas of cortex involved in speech production during AV stimuli is largely due to the participation of the speech production system in the analysis of observable mouth movements. Results indicate that activity associated with the intersection of the V and speech production conditions was found in the same areas identified in the intersection of the AV and speech production conditions (orange in Fig. 2B and Table 5; P < 0.05 corrected). If anything, the V condition yielded a more robust pattern of overlap of activity with the speech production condition. Logical conjunction of activity resulting from the A and speech production conditions, however, showed little overlap except in temporal and parietal areas (orange in Fig. 2C and Table 5; P < 0.05 corrected).
The remaining analyses were conducted to understand the computational role of the motor system, operationally defined here as those regions active in both speech perception and production in the frontal lobe, in creating the AV percept. The experimental prediction is that the distribution of motor cortical activity associated with the perception of the “ta” McGurk-MacDonald effect will more closely resemble the distribution of activity for AV/ta/ (i.e., the stimulus corresponding to the participants’ “ta” perception) than AV/pa/ or AV/ka/ for the “ta” group of participants (Table 1). Alternatively, the distribution of activity for the perception of the McGurk-MacDonald stimulus as “ta” in the motor system could resemble the distribution of activity for AV/ka/ (i.e., the stimulus corresponding to the visual information about mouth movements in this stimulus) and/or AV/pa/ (i.e., the stimulus corresponding to the audio component of the stimulus), suggesting that the motor system more veridically represents the visual or auditory input.
Pairwise correlations were calculated between the distribution of activity associated with ApVk and the activity separately associated with each of the AV/pa/, AV/ka/, or AV/ta/ stimuli in the passive task in frontal regions that overlapped speech production (see Table 5 for regions). A 2-way nonparametric Friedman test indicated a significant difference among the pairwise correlations (Friedman ranks test = 14.00, P = 0.001). A nonparametric post hoc test of the resulting ranks indicated that—for frontal regions that overlap speech production—activity for ApVk was significantly more correlated with the distribution of activity corresponding to AV/ta/ than it was with either AV/pa/ (Nemenyi = 4.43, 0.005 > P > 0.002) or AV/ka/ (Nemenyi = 4.72, 0.005 > P > 0.002) (Fig. 3A). Similarly, in frontal regions that overlap speech production, activity resulting from ApVk was more correlated with AV/ta/ than either AV/pa/ or AV/ka/ when performing the same analysis over the entire time course of activity for the syllables in these motor regions (P values < 0.05; see Fig. 3B for an example).
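The logic of this pattern-correlation analysis can be sketched in code. The following toy example is illustrative only, not the analysis pipeline used in the study: the voxel patterns, region size, participant count, and noise levels are all invented, and the Friedman statistic is computed in its basic form (ignoring ties, which cannot arise with continuous data).

```python
# Illustrative sketch of the correlation analysis: for each simulated
# participant, correlate the ApVk voxel pattern with each congruent AV
# pattern, then submit the three correlations to a Friedman test.
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def friedman_statistic(rows):
    """Friedman chi-square for a participants x conditions table."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: row[j])  # rank 1 = smallest
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return (12.0 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums)
            - 3.0 * n * (k + 1))

random.seed(0)
n_voxels, n_subjects = 50, 10
correlations = []  # per participant: r(ApVk, ta), r(ApVk, pa), r(ApVk, ka)
for _ in range(n_subjects):
    base = [random.gauss(0, 1) for _ in range(n_voxels)]   # shared signal
    apvk = [b + random.gauss(0, 0.3) for b in base]        # McGurk pattern
    av_ta = [b + random.gauss(0, 0.3) for b in base]       # resembles ApVk
    av_pa = [random.gauss(0, 1) for _ in range(n_voxels)]  # unrelated
    av_ka = [random.gauss(0, 1) for _ in range(n_voxels)]  # unrelated
    correlations.append([pearson(apvk, p) for p in (av_ta, av_pa, av_ka)])

chi2 = friedman_statistic(correlations)
print(round(chi2, 2))  # large values indicate the three correlations differ
```

With 3 conditions, the statistic is compared against a chi-square distribution with 2 degrees of freedom; a post hoc rank test such as the Nemenyi procedure would then localize which pairwise difference drives the effect.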
These analyses were repeated for the “ka” group. Though not significant, a trend was observed in which ApVk was more correlated with the distribution of activity corresponding to AV/ka/ than either AV/pa/ or AV/ta/ (Friedman ranks test = 2.25, P > 0.355; Friedman rank sums = 13, 19, and 16 for ApVk and AV/pa/, ApVk and AV/ka/, and ApVk and AV/ta/, respectively). This lack of significance was likely due to higher variability in both the responses and activation patterns for the “ka” group relative to the “ta” group.
Therefore, the distribution of cortical activity evoked by ApVk in those frontal regions involved in speech production (for listeners who perceived ApVk as “ta”) was more similar to that seen for the veridical AV/ta/ than it was to that associated with any other stimulus. This result is consistent with the finding presented above in which the activation patterns associated with ApVk showed a smaller difference in the extent of activity when compared with AV/ta/ than when compared with AV/pa/ or AV/ka/. Conversely, for listeners who perceived ApVk as “ka,” the trend was for motor system activity to be more like AV/ka/ activity than anything else. These results suggest that activity in frontal motor areas participating in both AV speech perception and production does not simply register visual and/or auditory information but rather represents hypotheses based on an early integration of AV information.
Furthermore, that ApVk was more like a true AV/ta/ for participants who perceived “ta” and more like a true AV/ka/ for participants who perceived “ka” suggests that different hypotheses activate different motor plans, resulting in different perceptions. That is, just as producing different syllables requires coordination of different muscles and is therefore mediated by nonidentical neuronal assemblies, the same seems to hold during the motor hypothesis testing associated with AV speech perception. To further test this idea, the activation patterns for a subset of trials from the condition in which participants actively classified ApVk as “ta” or “ka” were compared. The classification of ApVk as “pa” was excluded from this analysis because it occurred on fewer than 2% of the trials. Statistical maps (P < 0.05 corrected; Fig. 4) show that when ApVk was classified as “ka,” significant activation occurred in the middle and inferior frontal gyri and insula. Classifying ApVk as “ta” or “ka” yielded cortical activity in spatially adjacent but distinct areas in the right inferior and superior parietal lobules, left somatosensory cortices, left PMv, and left M1.
One interpretation of this result is that the observed topography in motor areas could be due to the motor response required of the participants when classifying ApVk, or to the incongruent or unnatural nature of the ApVk stimulus; that is, the observed topography could be an artifact of the task rather than a reflection of distinct motor hypotheses about AV stimuli. To address this concern, a discriminant analysis was performed to assess the presence of topographic population codes in these regions for the congruent AV/ta/, AV/ka/, and AV/pa/ stimuli in the condition in which participants made no button responses. Discriminant analysis of the activation patterns resulting from these syllables shows that they are distinguishable from one another in the same motor and somatosensory cortices in which activation occurred during the active task shown in Figure 4 (P < 0.05).
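The underlying logic of such a pattern-classification check can be illustrated with a toy sketch. This is not the discriminant analysis used in the study: a simple leave-one-out nearest-centroid classifier stands in for full discriminant analysis, and the voxel patterns, trial counts, and noise levels are invented. The point is only that if the three syllables evoke distinct spatial patterns, held-out trials should be assigned to the correct syllable above the 1/3 chance level.

```python
# Hypothetical sketch: train a nearest-centroid classifier (a stand-in for
# discriminant analysis) on simulated voxel patterns for three syllables
# and test it with leave-one-out cross-validation.
import random

random.seed(1)
n_voxels, n_trials = 30, 20
# each syllable gets its own (invented) mean pattern plus trial noise
prototypes = {s: [random.gauss(0, 1) for _ in range(n_voxels)]
              for s in ("ta", "ka", "pa")}
trials = [(s, [m + random.gauss(0, 0.8) for m in prototypes[s]])
          for s in prototypes for _ in range(n_trials)]

def centroid(patterns):
    n = len(patterns)
    return [sum(p[i] for p in patterns) / n for i in range(len(patterns[0]))]

def dist2(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# leave-one-out: classify each trial by the nearest centroid of the rest
correct = 0
for i, (label, pattern) in enumerate(trials):
    rest = [t for j, t in enumerate(trials) if j != i]
    cents = {s: centroid([p for l, p in rest if l == s]) for s in prototypes}
    guess = min(cents, key=lambda s: dist2(pattern, cents[s]))
    correct += guess == label
accuracy = correct / len(trials)
print(accuracy)  # well above the 1/3 chance level for separable patterns
```

In practice, significance of such decoding is typically established with a permutation test on shuffled labels rather than by eye.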
The proposed model (Fig. 1) predicts that activity in auditory and somatosensory areas might initially (i.e., in early stages of stimulus processing) correspond to the physical properties of the stimulus impinging on the sensory system but subsequently (in later processing and through efference copy) correspond to the hypothesis. If so, activity associated with the “ta” McGurk-MacDonald effect in auditory and somatosensory areas involved in both speech perception and production should initially resemble the distribution of activity for AV/pa/ (the auditory stimulus) and later that of AV/ta/ (the fused percept). The activity resulting from the perception of the McGurk-MacDonald effect as “ta” should throughout be less correlated with AV/ka/ (the visual stimulus), because processing of the visual component of the stimulus by these areas is presumably not as robust as processing associated with the auditory component.
To test this prediction, the correlation analysis described in the previous section was performed in active temporal and parietal areas from the passive AV condition that were also active during speech production for the “ta” group. A 2-way nonparametric Friedman test indicated a significant difference between the pairwise correlations of ApVk with the other AV syllables for those participants who perceived ApVk as “ta” (Friedman test = 12.46, P = 0.001). Post hoc tests indicated that ApVk was more highly correlated with AV/ta/ than AV/ka/ (Nemenyi = 4.99, 0.005 > P > 0.002) but not significantly different from AV/pa/ (Nemenyi = 2.50, 0.2 > P > 0.1) (Fig. 3A). Similarly, looking over the entire time course of activity in these auditory and somatosensory regions, a 2-way nonparametric Friedman test indicated a significant difference between the pairwise correlations of ApVk with the other AV syllables at the onset of activity (Friedman test = 17.08, P < 0.0001). Post hoc tests indicated that, for the first 1.5 s of the hemodynamic response, activity evoked by ApVk was more highly correlated with activity evoked by AV/pa/ than with activity evoked by AV/ka/ (Nemenyi = 3.33, 0.05 > P > 0.02) or AV/ta/ (Nemenyi = 5.82, P < 0.001). At later time points, however, activity was significantly more correlated with AV/ta/ than AV/pa/ or AV/ka/ (P values < 0.05; see Fig. 3C for an example).
Similarly, the distribution of activity for ApVk in visual areas was predicted to resemble the distribution of activity for AV/ka/ (the visual stimulus) and AV/ta/ (the “fused” percept) but not AV/pa/ (the auditory stimulus). That is, the visual system receives visual stimulation consistent with AV/ka/ but not AV/pa/ and then shifts to a pattern consistent with the stimulus corresponding to the participant’s perception, AV/ta/. Indeed, a 2-way nonparametric Friedman test indicated a significant difference between the pairwise correlations of ApVk with the other AV syllables for those participants who perceived ApVk as “ta” (Friedman test = 11.23, P = 0.004). Post hoc tests indicated that ApVk was more highly correlated with AV/ka/ than AV/pa/ (Nemenyi = 4.72, 0.005 > P > 0.002) but not significantly different from AV/ta/ (Nemenyi = 2.77, 0.2 > P > 0.1) (Fig. 3A; see Fig. 3D for an example over the entire time course). This suggests that, like the auditory and somatosensory systems, the visual system shifts from a sensory-based activity pattern (i.e., one consistent with /ka/) to one that is more consistent with activity in the motor system.
Above it was shown that ApVk was more highly correlated with activity evoked by AV/ta/ over the entire time course of activity in motor regions. In auditory and somatosensory cortices, however, ApVk was more highly correlated with activity evoked by AV/pa/ for the first 1.5 s of the hemodynamic response and was thereafter more correlated with AV/ta/. Here we test whether the strong correlation of ApVk with AV/ta/ in motor regions precedes this shift in the correlation of ApVk from AV/pa/ to AV/ta/ in auditory and somatosensory cortices. Indeed, the correlation of ApVk-evoked activity with AV/ta/-evoked activity in motor regions is significantly stronger than the correlation of ApVk with AV/ta/ in auditory and somatosensory areas for the first 4.5 s of the hemodynamic response following stimulus presentation (P values < 0.05). Thereafter, however, there is no significant difference between the correlations of ApVk with AV/ta/ in motor versus auditory and somatosensory cortices.
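The temporal logic of this comparison can be sketched with simulated data. The following is a toy illustration of the analysis idea only, under invented assumptions: a "motor" region whose pattern resembles AV/ta/ from the outset, and a "sensory" region whose pattern starts out /pa/-like and gradually converges on the /ta/ pattern, with the spatial correlation computed at each time point.

```python
# Toy sketch (invented data) of the time-course comparison: at each time
# point, correlate simulated "motor" and "sensory" voxel patterns with an
# AV/ta/ template and observe when the sensory correlation catches up.
import math
import random

random.seed(2)
n_voxels, n_time = 40, 10  # e.g., time points at 1.5-s intervals

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

ta_pattern = [random.gauss(0, 1) for _ in range(n_voxels)]  # /ta/ template
pa_pattern = [random.gauss(0, 1) for _ in range(n_voxels)]  # /pa/ template

motor_r, sensory_r = [], []
for t in range(n_time):
    w = min(1.0, t / 4.0)  # sensory pattern drifts from /pa/-like to /ta/-like
    motor = [v + random.gauss(0, 0.4) for v in ta_pattern]  # /ta/-like throughout
    sensory = [(1 - w) * p + w * v + random.gauss(0, 0.4)
               for p, v in zip(pa_pattern, ta_pattern)]
    motor_r.append(pearson(motor, ta_pattern))
    sensory_r.append(pearson(sensory, ta_pattern))

# early on the motor correlation leads; later the two converge
print([round(m - s, 2) for m, s in zip(motor_r, sensory_r)])
```

A significance test on the real data would compare these per-participant correlation differences at each time point, as in the analysis described above.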
The present results show that 1) certain cortical areas active during speech production are also active during both congruent and incongruent AV speech perception, and this activity primarily occurs when mouth movements are observed by listeners (Fig. 2 and Table 5); 2) the active classification of the McGurk-MacDonald stimulus as 2 different syllables (/ta/ or /ka/) evokes 2 different patterns of activity in frontal cortical areas involved in speech production (Fig. 4); 3) similarly, different congruent AV syllables, when passively viewed and heard, result in different patterns of activity in frontal cortical areas involved in speech production; 4) the pattern of activity evoked by the McGurk-MacDonald syllable in frontal cortical areas involved in speech production most resembles that evoked by the congruent AV syllable corresponding to participants’ perception of the stimulus (Fig. 3); and 5) at the onset of the hemodynamic response, the pattern of activity evoked by the McGurk-MacDonald syllable in sensory areas most resembles the pattern of activity evoked by the congruent AV syllable that corresponds to the sensory input, whereas later it most resembles the pattern evoked by the congruent AV syllable that corresponds to participants’ ultimate perception (see Fig. 3 for an example).
Our results show that aspects of the cortical motor system involved in speech production are also involved in perception, which is suggestive of a shared underlying system for AV speech perception and production (see (1) above) and also informative about the mechanisms of this system. Motor areas involved in production, when activated differentially, result in different perceptual experiences of AV speech (see (2) and (3) above). This can be explained by a shared mechanism of perception and production, in that producing different sounds requires the coordination of different muscle groups or the same groups in different proportions. Furthermore, our results suggest that this motor system activity is involved in determining the ultimate phonetic perception of AV stimuli, because the same stimulus perceived in different ways is associated with 2 distinct patterns of activity in the motor areas involved in speech production (see (2) above).
In what follows we relate our interpretation of the results—that the motor system is involved in determining our perceptual experience of AV speech through a shared mechanism of perception and production—to 2 aspects of the model outlined in the Introduction (see also Fig. 1 and Table 1). Recall that our model was formulated to go beyond the claim that a mirror or motor system is associated with both speech production and AV speech perception and to explain the mechanism by which motor system activity determines speech perception. We proposed that motor system activity be thought of as a multisensory hypothesis about the phonemes produced by a speaker and that this hypothesis predicts the acoustic and somatosensory consequences of executing a speech movement through efference copy.
We believe that our results show that activity in areas of the motor system associated with speech production during observation of AV speech is a hypothesis about a particular (phonetic) interpretation of the stimulus rather than an accurate or veridical representation of stimulus properties. Specifically, by traditional contrasts (Table 3 and Table 4), we showed that in terms of patterns of cortical activity, the McGurk stimulus was least different from the congruent AV stimulus /ta/ in PMv cortex. Consistent with this result, we used the McGurk-MacDonald-eliciting stimulus to show that the response in the motor system associated with speech production was independent of the physical stimulus impinging on the sensory receptors and, rather, corresponded to the participants’ ultimate perception (i.e., the congruent AV stimulus /ta/; see (4) above).
At this point the argument could be made that activation of the motor system occurs after perception, in that participants may be imitating or rehearsing the McGurk-MacDonald stimulus after they have perceived it. Similarly, one could argue that activation of the motor system constitutes perception. Contrary to these interpretations, however, we show that whereas the motor system reflects activity consistent with the ultimate perception of a syllable, the auditory and somatosensory systems shift from a sensory-based activity pattern to one that is more consistent with activity in the frontal motor system (see (5) above). Furthermore, activity in frontal motor areas involved in speech production precedes this switch in sensory areas from a sensory-based activity pattern to one more consistent with activity in the motor system (see Results). Thus, rather than interpreting motor system activity as imitation, rehearsal, or as constituting perception, we interpret our results as consistent with our model, in which activity in the motor system associated with speech production constitutes a hypothesis that is used to predict the acoustic and somatosensory consequences of executing that hypothesis through efference copy.
In the following section we further discuss the significance of these results and their interpretation with respect to theories of speech perception.
The model described in the Introduction for which we have garnered evidence was formulated in order to explain the neural mechanisms by which listeners come to understand speech in face-to-face contexts (Fig. 1, for similar theories see Stevens and Halle 1967; Callan et al. 2004; van Wassenhove et al. 2005). In so doing, this model addresses a specific theoretical problem that has been the focus of speech perception research for decades, namely, the “lack of invariance problem.” Lack of invariance refers to the observation that different acoustic speech signals map onto identical phonetic categories and, conversely, that identical acoustic signals map onto different phonetic categories (e.g., Liberman et al. 1967), meaning that sensory information alone is insufficient to determine the identity of a phoneme (see Nusbaum and Magnuson 1997).
Many theories of speech perception (Fant 1969; Stevens and Blumstein 1978; Klatt 1979; Liberman and Mattingly 1985; McClelland and Elman 1986) and indeed theories of perception more broadly (Hubel and Wiesel 1959, 1977; Barlow 1972; Barlow et al. 1972) argue for a direct mapping of stimulus signals (e.g., acoustic properties) onto perception (e.g., phonetic categories). These theories rest on the notion of tuned neural feature detectors that are presumed to respond selectively to successively more abstract transformations of stimulus input from thalamic systems to primary sensory cortices into secondary and associated sensory cortices until a memory system is contacted representing perception (Whitfield and Evans 1965; Abbs and Sussman 1971; Barlow 1972; Barlow et al. 1972).
With respect to speech perception, a direct-mapping approach cannot computationally manage the nondeterministic relationship between acoustic patterns and phonetic categories (Nusbaum and Magnuson 1997). Just as neural models of bat echolocation (previously thought to exist as an isolated modular system) account for echolocation through complex sensorimotor interactions over distributed neural systems (Suga and Schlegel 1972; Suga and Shimozawa 1974; Yan and Suga 1996), we suggest that the process of speech perception involves interactions over a distributed set of sensorimotor cortices through mechanisms of both feedforward and feedback processing. In the following section, we discuss how this conclusion specifically relates to 2 influential theories of speech perception, the motor theory of speech perception (Liberman and Mattingly 1985) and analysis-by-synthesis (Stevens and Halle 1967).
The motor theory of speech perception claims to solve the lack of invariance problem by positing that speech perception is directly mediated solely by a motor or gestural code (Liberman and Mattingly 1985). The theory states that speech perception occurs directly by reference to invariant motor programs for speech production. Thus, all speech is directly transduced into a gestural code. Liberman and Mattingly (1985) further state that there is no auditory processing of speech in service of phoneme perception.
Neuroimaging studies have shown that auditory (Wilson et al. 2004; Pulvermuller et al. 2006; Wilson and Iacoboni 2006) and visual (Olson et al. 2002) speech perception alone evoke motor cortical activity, and this has been used to argue in favor of the motor theory of speech perception. The present study fleshes out the mechanism underlying these findings by showing that although speech-associated mouth movements (heard or observed) evoke activity in speech production brain regions (Fig. 2 and Table 5), speech perception is not simply determined through feedforward or direct-mapping mechanisms. Instead, our results suggest that sensory input interacts with feedback or efference copy from the motor system involved in speech production (see Results and Fig. 3). Although this argues against the specific claims of the motor theory of speech perception (Liberman and Mattingly 1985), it also argues against purely sensory theories of speech perception (e.g., Fant 1969; Blumstein and Stevens 1981). Furthermore, it raises a question regarding the nature of the interaction between motor and sensory cortices in speech perception.
The “analysis-by-synthesis” model of speech perception developed by Stevens and Halle (1967) addresses the nature of these interactions. In this model, sensory information is used to generate knowledge-based hypotheses about the speech category that the speaker intended. That is, these candidate speech categories are internally “synthesized.” Matching between the synthesized hypothesis and sensory information is used to select the intended speech category. When the discrepancy between the hypothesis and sensory inputs is small, no further processing is necessary. In this case, the mapping between sensory information and phonetic categories is sufficiently deterministic (see Nusbaum and Magnuson 1997). When the discrepancy between the hypothesis and the sensory inputs is large, however, the motor system is recruited. That is, when there is substantial phonetic ambiguity or lack of invariance, the nondeterministic relationship needs to be resolved by additional information provided by the motor system.
There are neurobiological data to support the view that the motor system is actively recruited to resolve phonetic ambiguity associated with the lack of invariance. Callan et al. (2004) have shown that nonnative speakers evoke more activity in motor areas during the perceptual identification of a phonetic contrast that is ambiguous in their language compared with native speakers of a language in which that contrast is not ambiguous. Native speakers, by comparison, evoked greater activity exclusively in auditory cortices. Callan, Tajima, et al. (2003) have also shown that more ambiguous distinctions for nonnative speakers activate motor cortices to a much larger extent than less ambiguous distinctions.
In our neurobiologically specified version of this analysis-by-synthesis model, the motor system involved in speech production is recruited because it can be used to predict the sensory consequences of the synthesized hypotheses, which can then be matched with incoming sensory information to constrain interpretation. There is considerable behavioral (Knoblich and Flach 2001; Wolpert and Flanagan 2001) and neurophysiological (Zaretsky and Rowell 1979; Bell 1981; Sillar and Skorupski 1986; Sillar and Roberts 1988; Gossard et al. 1991; Wolf and Burrows 1995; el Manira et al. 1996; Blakemore et al. 1998; Wolpert et al. 1998; Bodznick et al. 1999; Iacoboni et al. 2001; Iacoboni 2005) data to support the proposal that prediction and efference copy are general properties of the motor system across vertebrates.
If the motor system is particularly good at prediction, then it should be particularly attuned to observable mouth movements, because this information can be used to predict auditory sensory information, making speech perception easier in everyday environments (see van Wassenhove et al. 2005). This may explain why, in the present experiment, AV and visual speech perception yielded more robust activity in the motor system associated with speech production compared with the auditory condition alone (see Fig. 2).
In summary, the results of the present experiment point to a model of speech perception in which visual contextual information is used to test hypotheses about the identity of speech sounds. In our hypothesis-and-test or analysis-by-synthesis model, a hypothesis is specified in terms of the motor commands that might elicit the hypothesized movements. These motor commands can then be used to predict the sensory consequences of producing those movements through efference copy. The predicted sensory consequences of the motor hypothesis can be compared with, or lend support to, a particular interpretation of a stretch of utterance as a sensory representation of that utterance is being built. An interpretation is reached and perception occurs only after integrating the predicted sensory consequences of the motor representation with ongoing sensory processing. It is in this manner that the brain actually comes to see voices and hear (its own) lips, as suggested by McGurk and MacDonald (1976).
This work was supported by National Institutes of Health RO1-DC03378. Thanks to the following people who each contributed to this paper in some manner: Michael Arbib, Catherine Barber, Bernadette Brogan, E. Chen, Shahrina Chowdhury, Jean Decety, Fred Dick, Kim Fenn, Tom Hummer, Goulven Josse, Philippa Lauben and Leo Stengel, Matt Longo, Robert Lyons, Xander Meadow, Lisa Patterson, David Poeppel, Anjali Raja, Ana Solodkin, Ryan Walsh, and Nicholas Wymbs. Special thanks to Alison Wiener-Skipper for discussion and much support and Uri Hasson for discussion and advice.
Conflict of Interest: None declared.