A compelling example of auditory-visual multisensory integration is the McGurk effect, in which an auditory syllable is perceived very differently depending on whether it is accompanied by a visual movie of a speaker pronouncing the same syllable or a different, incongruent syllable. Anatomical and physiological studies in human and non-human primates have suggested that the superior temporal sulcus (STS) is involved in auditory-visual integration for both speech and non-speech stimuli. We hypothesized that the STS plays a critical role in the creation of the McGurk percept. Because the location of multisensory integration in the STS varies from subject to subject, the location of auditory-visual speech processing in the STS was first identified in each subject with fMRI. Then, activity in this region of the STS was disrupted with single-pulse TMS as subjects rated their percept of McGurk and non-McGurk stimuli. Across three experiments, TMS of the STS significantly reduced the likelihood of the McGurk percept, but did not interfere with perception of non-McGurk stimuli. TMS of the STS was only effective at disrupting the McGurk effect in a narrow temporal window from 100 ms before auditory syllable onset to 100 ms after onset, and TMS of a control location did not influence perception of McGurk or control stimuli. These results demonstrate that the STS plays a critical role in the McGurk effect and auditory-visual integration of speech.
A textbook example of how both the auditory and visual modalities are important for speech perception is the McGurk effect, in which an auditory syllable (phoneme) is perceived very differently depending on whether it is accompanied by a visual movie of a speaker pronouncing the same syllable or a different, incongruent syllable (McGurk and MacDonald, 1976). The superior temporal sulcus (STS) has been implicated in auditory-visual multisensory integration for both speech and non-speech stimuli (Calvert et al., 2000; Sekiyama et al., 2003; Beauchamp, 2005a; Miller and D'Esposito, 2005). However, neuroimaging studies that have compared incongruent auditory-visual speech, including McGurk syllables, to congruent speech or other baselines have reported differences in a broad network of brain regions in addition to the STS, including the supramarginal gyrus (Jones and Callan, 2003; Bernstein et al., 2008), the inferior parietal lobule (Jones and Callan, 2003), the precentral gyrus (Jones and Callan, 2003), the superior frontal gyrus (Miller and D'Esposito, 2005), Heschl's gyrus (Miller and D'Esposito, 2005) and the middle temporal gyrus (Callan et al., 2004). A recent fMRI study even claims that the STS is not especially important for auditory-visual integration (Hocking and Price, 2008). Interpreting these conflicting results is problematic because the BOLD signal is an indirect measure of neural activity with poor temporal resolution, limiting inferences about whether activity in an area is causally related to (or merely correlated with) a particular cognitive process. For instance, frontal areas could respond to incongruity in the stimulus but not be responsible for the dramatically altered percept that defines the McGurk effect.
TMS can be used to create a temporary virtual lesion and assess the necessity of a brain area for perception; delivering single-pulse TMS at various latencies allows the time window in which the area contributes to perception to be determined (Pascual-Leone et al., 2000; Walsh and Cowey, 2000). However, TMS of cortical association areas such as the STS does not produce phosphenes (unlike V1) or motor twitches (unlike M1), making localization by TMS alone ineffective. Because there is substantial intersubject variability in the anatomical location of the STS multisensory area (Beauchamp et al., 2004a), it is difficult to target the STS and other association areas with skull-based landmarks (Sack et al., 2009). This has spurred the development of methods that combine fMRI and TMS in individual subjects (Andoh et al., 2006; Sack et al., 2006; Sparing et al., 2008). Using these methods, we created temporary lesions in the STS and found a dramatic reduction in the likelihood of the McGurk effect, demonstrating that the STS is a cortical locus for the McGurk effect and auditory-visual integration in speech.
Experiments were conducted with the approval of the Committee for the Protection of Human Subjects of the University of Texas Health Science Center at Houston. Twelve healthy volunteers (four females, two left-handed, mean age 25 years) with no history of neurological or sensory disorders participated in the study (nine in experiment 1, nine in experiment 2, six in experiment 3). Subjects were screened for the exclusion criteria for MRI and TMS, and written informed consent was obtained from each subject prior to experimentation.
McGurk and non-McGurk stimuli were digitally recorded and edited (see supplementary online materials for a sample stimulus). In experiments one and three, a male speaker was used. The McGurk stimulus consisted of an auditory recording of “ba” and a digital video of the speaker enunciating “ga.” This produced the McGurk effect percept of “da.” The control stimulus consisted of an auditory recording of “ba” and a gray screen with white fixation crosshairs (but no digital video), producing the percept “ba.” To demonstrate that the disruption of the McGurk effect was robust to stimulus changes, a different stimulus set with a female speaker was used in experiment two. The McGurk stimulus consisted of an auditory recording of “pa” and a digital video of the speaker enunciating “ka” or “na,” producing the McGurk effect percept of “ta” (Sekiyama, 1994). The control stimulus consisted of an auditory recording of “pa” and a digital video of the speaker enunciating “pa,” producing the percept “pa.” Some subjects do not exhibit the McGurk effect (MacDonald et al., 2000). Therefore, informal screening was conducted at the first stage of subject recruitment. Four subjects who did not report the McGurk percept were excluded; the remaining 12 subjects participated in the experiments.
During the TMS experiments, seated subjects viewed visual stimuli on an LCD screen placed at eye level 65 cm from the subject. In-ear headphones were used to deliver auditory speech stimuli and subjects made button-press responses with their left hand. A biphasic TMS unit (Magstim Rapid; Magstim Co., Whitland, UK) with a 70 mm figure-of-eight coil was used to deliver TMS. The coil was positioned using an image-guided neuro-navigation system for frameless stereotaxy (Brainsight, Rogue Research, Montreal, Canada). The stimulation site was plotted as the coordinate on the surface of the brain closest to the TMS coil, calculated as the position where a line normal to the surface of the scalp first intersects the brain surface. The motor threshold intensity was determined for each subject (on average 68% ± 2% of machine output) and used throughout the session (Ro et al., 2004; Stokes et al., 2005; Balslev et al., 2007).
In experiments one and two, each run consisted of ten trials in each of four randomly intermixed conditions (McGurk and control stimuli, each with and without TMS), for a total of 40 trials. Each subject completed two runs, one with the TMS coil targeting the left STS and one with the TMS coil targeting a control site. Single-pulse TMS was delivered at the onset of the video frame nearest to the onset of the auditory syllable (experiment 1: 1155 ms after visual stimulus onset, 32 ms after auditory onset; experiment 2: 528 ms after visual onset, 19 ms after auditory onset).
In experiment three, single-pulse TMS was delivered to the STS in every trial, with the latency of the pulse varied across eleven values aligned to video frame onsets, ranging from 298 ms before to 362 ms after auditory onset in steps of 66 ms (two video frames). As in experiments 1 and 2, there were 10 trials per condition, for a total of 110 trials.
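The latency schedule for experiment three follows directly from the values given above; a short sketch makes the arithmetic explicit (frame duration and variable names are illustrative, taken from the stated 66 ms step of two video frames):

```python
# Sketch of the experiment-3 TMS latency schedule described above.
# Values come from the text; names are illustrative.
FRAME_MS = 33            # one video frame at ~30 frames/s
STEP_MS = 2 * FRAME_MS   # latencies advance two frames (66 ms) per step
START_MS = -298          # 298 ms before auditory onset
N_LATENCIES = 11

latencies = [START_MS + i * STEP_MS for i in range(N_LATENCIES)]
print(latencies)  # -298, -232, ..., 296, 362 ms relative to auditory onset
```

With eleven latencies and ten trials each, this reproduces the 110-trial total reported above.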
Anatomical MRI scans were obtained from each subject using a 3-tesla whole-body MR scanner (Philips Medical Systems, Bothell, WA) with a 16-channel head gradient coil. Images were collected using a magnetization-prepared 180° radio-frequency pulse and rapid gradient-echo (MP-RAGE) sequence optimized for gray-white matter contrast, with 1 mm thick sagittal slices and an in-plane resolution of 0.938 × 0.938 mm. AFNI software (Cox, 1996) was used to analyze MRI data. 3D cortical surface models were created with FreeSurfer (Fischl et al., 1999) and visualized in SUMA (Argall et al., 2006). Cortical surfaces were partially inflated using 500 iterations of a smoothing algorithm to better visualize the deeper sulcal areas (Van Essen, 2004). To allow reporting of the stimulation sites in standard coordinates, each individual subject brain was normalized to the N27 atlas brain (Mazziotta et al., 2001). For five subjects, the STS multisensory area was localized using an anatomical landmark: the inflection point in the posterior STS where it angles upwards towards the parietal lobe (Beauchamp et al., 2008). For the remaining seven subjects, the STS multisensory area was localized using BOLD fMRI. Functional images were collected using gradient-echo, echo planar imaging (EPI) (TR = 2015 ms, TE = 30 ms, flip angle = 90°) with a voxel size of 3 × 2.75 × 2.75 mm. The localizer consisted of one run of 150 TRs with auditory and visual stimulus blocks. One hundred single-syllable words from the MRC Psycholinguistic Database (Wilson, 1988) were spoken by a female speaker. Auditory stimuli consisted of auditory words presented during visual presentation of crosshairs. Visual stimuli consisted of word videos with no accompanying sound. Five blocks of auditory words and five blocks of visual words were presented in random order. Three subjects were also presented with blocks of simultaneous auditory (A) and visual (V) words.
Each block consisted of ten consecutive stimuli, one per TR, followed by 5 TRs of fixation with crosshairs at the approximate location of the mouth. To reduce any visual after-image, a static image of a scrambled face was presented for 50 ms after each video. Visual stimuli were projected onto a screen and viewed by the subject using a mirror. MRI-compatible pneumatic headphones were used to present auditory stimuli inside the scanner. During the course of scanning, MR-compatible eye tracking (Applied Science Laboratories, Bedford, MA) was used to ensure arousal and attention. Stimuli were presented using Presentation software (Neurobehavioral Systems, Albany, CA). The general linear model was used to detect voxels showing a significant response to auditory or visual speech, followed by a conjunction analysis of auditory ∩ visual activation. An automated parcellation routine determined all A ∩ V voxels in the STS (Fischl et al., 2004; Beauchamp et al., 2008). The center-of-mass of the resulting cluster of active voxels was used as the target site for STS TMS. Different statistical criteria for classifying multisensory voxels in the STS, such as the mean response to audiovisual (AV) words or the contrast of AV vs. mean(A,V), changed the extent but had little impact on the center-of-mass of the STS multisensory area (Beauchamp, 2005b).
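The targeting step above, intersecting auditory- and visual-responsive voxels and taking the centroid of the surviving cluster, can be sketched in a few lines. All coordinates and significance masks below are toy values for illustration, not data from the study:

```python
import numpy as np

# Toy sketch of the conjunction-and-centroid targeting step:
# keep voxels significant for BOTH auditory and visual speech (A ∩ V),
# then use the cluster's center of mass as the TMS target.
coords = np.array([[-58, -30, 8],    # voxel coordinates in mm (illustrative)
                   [-56, -26, 10],
                   [-54, -28, 6],
                   [-60, -24, 9]])
aud_sig = np.array([True, True, True, False])  # significant auditory response
vis_sig = np.array([True, True, False, True])  # significant visual response

conjunction = aud_sig & vis_sig            # A ∩ V voxels
target = coords[conjunction].mean(axis=0)  # center of mass of the cluster
print(target)  # centroid of the two surviving voxels
```

In the study the conjunction was computed over GLM statistical maps and restricted to the STS by an automated parcellation; the sketch only shows the final intersect-and-average logic.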
Consistent with previous studies, the fMRI localizers revealed a focus of activity in the posterior STS that responded to both auditory and visual speech (Fig. 1A). There was a high degree of intersubject variability in the standard coordinates of the STS multisensory area, especially in the anterior-to-posterior direction (mean ± SD: x = -56 ± 4 mm; y = -27 ± 12 mm; z = 8 ± 9 mm).
To verify the effectiveness of our stimuli, we examined subjects' percepts without TMS (Figs. 1B and 1C). The non-McGurk control stimuli consisted of an auditory syllable (experiment 1) or a congruent auditory-visual syllable (experiment 2). When presented with these stimuli, subjects almost always reported a percept that matched the auditory stimulus (mean likelihoods and reaction times in Table 1). The McGurk stimuli consisted of an incongruent auditory-visual syllable. For these stimuli, subjects rarely reported a percept that matched the auditory stimulus, instead reporting the McGurk percept of a fused syllable different from both the auditory and visual syllables.
When TMS was delivered to the STS, subjects were significantly less likely to report the McGurk effect (experiment 1: P = 5e-5; experiment 2: P = 0.004). A concern was that non-specific effects of TMS could confound these results. For instance, the brief click of the TMS pulse could somehow interfere with auditory perception. To address this concern, we stimulated a control TMS site dorsal to the STS, producing a similar behavioral experience for the subject. The mean coordinates of the control site in standard space were (x,y,z) = (-42, -19, 46), a distance of 39 ± 12 mm (SD) from the STS site (-60, -35, 16). TMS of the control site did not reduce the likelihood of perceiving the McGurk effect (experiment 1: P = 0.2; experiment 2: P = 0.5). A second concern was that TMS of the STS might interfere with speech perception in general. However, TMS of the STS did not affect discrimination of the control stimuli (experiment 1: P = 0.5; experiment 2: P = 0.3). If multisensory integration in the STS is the basis of the McGurk effect, the relevant neural computation must occur in a relatively narrow time window after the auditory and visual stimuli are delivered but before perception occurs. To test this idea, in the third experiment the likelihood of the McGurk effect was measured while single-pulse TMS was delivered to the STS at a range of times. There was a significant effect of stimulation time on the McGurk effect [F(10,50) = 4.66, P = 0.0001]. This was driven by a reduction in the McGurk effect at four time points, spanning from 100 ms before to 100 ms after onset of the auditory stimulus (P < 0.05 by Mann-Whitney U test). At other times, STS TMS did not significantly change the McGurk percept (Fig. 1D).
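The separation between the two stimulation sites follows from the mean coordinates reported above; a one-line check confirms the figure (note the reported 39 ± 12 mm is the mean of per-subject distances, which is close to, but not identical with, the distance between the group-mean coordinates computed here):

```python
import math

# Mean control-site and STS-site coordinates (mm, standard space),
# as reported in the text.
control = (-42, -19, 46)
sts = (-60, -35, 16)

d = math.dist(control, sts)  # Euclidean distance between the group means
print(round(d, 1))  # ≈ 38.5 mm, consistent with the reported 39 ± 12 mm
```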
Subjects reported a variety of percepts during auditory-visual trials in which STS TMS disrupted the McGurk effect. The most common experience, reported 66% of the time, was a percept similar to auditory-only trials (e.g. TMS delivered with auditory “ba” + visual “ga” resulted in the percept “ba” instead of the McGurk percept “da”). The second most common experience was a percept between the auditory and McGurk percepts (e.g. between “ba” and “da”). Other reports were of a hybrid percept (e.g. “b-da”) or a completely different syllable (e.g. “ha”).
These experiments demonstrate that temporary disruption of the STS with TMS causes a dramatic reduction in the perception of the McGurk effect. The role of the STS in auditory-visual integration has been called into question because the STS also shows fMRI responses during other tasks, such as visual-visual associations (Hocking and Price, 2008). However, BOLD fMRI is an indirect method of measuring neural activity with limited temporal precision. TMS allows for the perturbation of brain areas to demonstrate a causal link between brain and behavior (Pascual-Leone et al., 2000; Walsh and Cowey, 2000). Combining TMS with anatomical and functional MRI allows the same functional brain region to be targeted in each subject, greatly increasing the statistical power of TMS studies of association areas, like the STS multisensory area, that are difficult to accurately localize with other methods (Sparing et al., 2008; Sack et al., 2009). A privileged role for the STS in auditory-visual integration is demonstrated by the finding that temporary disruption of the STS interferes with the McGurk effect, which depends on the interaction between the auditory and visual modalities.
TMS of the STS disrupted the McGurk effect only when single-pulse TMS was delivered in a 200 ms window spanning 100 ms before to 100 ms after auditory stimulus presentation, supporting the notion that TMS disrupts a specific neural computation in the STS—auditory-visual integration—that is time-locked to stimulus presentation. This finding is consistent with behavioral results showing that the McGurk effect is robust to auditory-visual asynchronies within a 200 ms integration window (van Wassenhove et al., 2007) and results from electrophysiological recording demonstrating strong responses in STS beginning ~100 ms after stimulus presentation in monkeys (Schroeder and Foxe, 2002; Barraclough et al., 2005) and humans (Canolty et al., 2007; Puce et al., 2007). TMS disruption of the STS after the auditory and visual syllables have been integrated should not, and did not, affect perception.
What is the neuronal architecture of the STS that produces the McGurk effect? High-resolution fMRI and single-unit recording studies have shown that cortex in the STS contains a patchy distribution of neurons that respond to auditory, visual, or auditory-visual stimuli (Beauchamp et al., 2004b; Dahl et al., 2009). Multi-voxel pattern analysis of BOLD fMRI data has demonstrated that activity in a region that includes the STS can discriminate between individual auditory syllables (e.g. “ba” vs. “da”) (Formisano et al., 2008; Raizada et al., 2009). These results suggest an architecture in which the STS contains small patches of neurons that respond to specific syllables. Activity across multiple syllable patches would be compared using a winner-take-all algorithm, with the most active patch determining perception. Each patch might receive input from neurons in visual and auditory association areas coding for specific visemes and phonemes. During presentation of congruent auditory-visual speech, input from auditory and visual neurons would be integrated, improving sensitivity. During presentation of incongruent McGurk stimuli, this process could result in unexpected percepts. For instance, if an STS patch representing “da” received input from both auditory “ba” and visual “ga” neurons, the patch would have a large response during presentation of the “ba” + “ga” McGurk stimulus, producing a “da” percept. Disrupting activity in the STS would eliminate this multisensory integration. In future studies, it will be important to test this model and refine our understanding of the temporal and spatial organization of multisensory responses in the STS.
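The winner-take-all patch model proposed above can be made concrete with a toy simulation. The wiring and unit names below are illustrative assumptions chosen to mirror the "ba" + "ga" → "da" example, not measured connectivity:

```python
# Toy sketch of the winner-take-all syllable-patch model described above.
# Each hypothetical STS "patch" sums input from the phoneme (aud_*) and
# viseme (vis_*) units it is wired to; the most active patch determines
# the percept. Wiring is an illustrative assumption.
patch_inputs = {
    "ba": ["aud_ba", "vis_ba"],
    "ga": ["aud_ga", "vis_ga"],
    "da": ["aud_ba", "vis_ga"],  # fusion patch receives cross-modal input
}

def percept(active_units):
    """Winner-take-all: the patch with the most active inputs wins."""
    scores = {patch: sum(unit in active_units for unit in units)
              for patch, units in patch_inputs.items()}
    return max(scores, key=scores.get)

# McGurk stimulus: auditory "ba" paired with visual "ga".
# Only the "da" patch receives both of its inputs, so it wins.
print(percept({"aud_ba", "vis_ga"}))
```

Disrupting the STS in this model would remove the fusion patch's advantage, leaving the percept to be driven by the auditory input alone, matching the most common report ("ba") on TMS-disrupted McGurk trials.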
This research was supported by National Science Foundation Cognitive Neuroscience Initiative Research Grant 0642532 to MSB. SP was supported by NIH T32 HD049350 (Harvey S. Levin, PI). AN was supported by NIH TL1 RR024147. Partial funding for purchase of the MRI scanner was provided by NIH S10 RR19186. We thank Tony Ro and Nafi Yasar for assistance with TMS and Vips Patel for assistance with MR data collection.