Incongruent auditory and visual stimuli can elicit audiovisual illusions such as the McGurk effect, in which visual /ka/ and auditory /pa/ fuse into another percept such as /ta/. In the present study, human brain activity was measured with adaptation functional magnetic resonance imaging to investigate which brain areas support such audiovisual illusions. Subjects viewed trains of four movies beginning with three congruent /pa/ stimuli to induce adaptation. The fourth stimulus could be (i) another congruent /pa/, (ii) a congruent /ka/, (iii) an incongruent stimulus that evokes the McGurk effect in susceptible individuals (lips /ka/ voice /pa/), or (iv) the converse combination that does not cause the McGurk effect (lips /pa/ voice /ka/). This paradigm was predicted to show increased release from adaptation (i.e. stronger brain activation) when the fourth movie and the related percept were increasingly different from the three previous movies. A stimulus change in either the auditory or the visual stimulus from /pa/ to /ka/ (iii, iv) produced within-modality and cross-modal responses in primary auditory and visual areas. A greater release from adaptation was observed for incongruent non-McGurk (iv) compared to incongruent McGurk (iii) trials. A network including the primary auditory and visual cortices, nonprimary auditory cortex, and several multisensory areas (superior temporal sulcus, intraparietal sulcus, insula, and pre-central cortex) showed a correlation between perceiving the McGurk effect and the fMRI signal, suggesting that these areas support the audiovisual illusion.
The merging of senses can modulate our overall perceptions. For example, multisensory perceptual changes have been demonstrated with simple stimuli such as visual light bursts paired with auditory tones [Shams et al., 2005; Watkins et al., 2006] and with more complex language paradigms where auditory speech discrimination is improved when the speaker’s mouth is seen [Grant and Seitz, 2000; Sumby and Pollack, 1954]. In prior neuroimaging studies, multisensory integration has been demonstrated in an extended network including the superior temporal sulcus (STS), intraparietal sulcus (IPS), inferior frontal gyrus (IFG), premotor cortex, and superior and inferior colliculi [Calvert and Lewis, 2004; Champoux et al., 2006; Musacchia et al., 2006; Wallace et al., 1993]. Integration effects are most noticeable when one sensory system is impaired or when a component of the audiovisual signal is degraded [Callan et al., 2003; Stein and Meredith, 1993; Sumby and Pollack, 1954]. For example, in the hearing impaired (HI) population, visual contributions take on a larger role in speech perception [Grant and Seitz, 2000; Grant et al., 1998]. Advances in understanding the neural mechanisms of audiovisual integration have the potential to improve speech perception and production in hearing impaired and cochlear implant patients.
Integration effects are contingent upon the phonetic and temporal congruency of stimuli. Congruent audiovisual information can improve accuracy and shorten reaction time compared to unimodal stimuli [Dodd, 1977; Giard and Peronnet, 1999; Schroger and Widmann, 1998; Sumby and Pollack, 1954]. In contrast, the incongruent pairing of auditory and visual stimuli slows reaction times and increases error rates in speech perception [Giard and Peronnet, 1999]. There is evidence that the incongruency of audiovisual stimuli can be reflected in many cortical areas, with greater activity for incongruent compared to congruent stimuli [Bushara et al., 2001; Jones and Callan, 2003; Miller and D’Esposito, 2005; Ojanen et al., 2005; Pekkola et al., 2006; Raij et al., 2000]. It is possible that incongruent audiovisual speech stimuli take more time and neural resources to process because most natural speech is congruent and temporally aligned.
Studies of congruent and incongruent audiovisual information processing in normal hearing subjects may clarify which additional areas are activated during processing of incongruent stimuli and how this relates to perception of speech and language.
During natural speech, visual cues typically precede auditory information by 80–100 ms [Munhall and Vatikiotis-Bateson, 2004]. Visual lip movements suggest a limited selection of possible auditory speech articulations [Sams et al., 1991], and when subsequent auditory information is incongruent with visual information, the perception is susceptible to mistakes in syllable identification. A well-known example of this type of perceptual mistake is the McGurk effect [McGurk and MacDonald, 1976]. The McGurk effect is an audiovisual illusion whereby incongruent audiovisual combinations alter the phonetic perception of speech syllables. For example, the specific combination of visual /ka/ and auditory /pa/ results in the perception of a third syllable /ta/, distinct from either stimulus. The velar movement for the syllable /ka/ is ambiguous, since both /ka/ and /ta/ have a similar visual appearance. Combined with the auditory /pa/, subjects are more likely to “hear” /ta/ than /ka/. This perceptual event is variable across subjects. Some individuals perceive every McGurk stimulus as /ta/ while others experience the illusion on a minority of presentations. In addition, not every incongruent audiovisual pairing results in this type of fusion. Switching the syllables (visual /pa/ and auditory /ka/) results in an unfused awareness of both stimuli in nearly all subjects, sometimes described as /pka/. The McGurk effect has been extensively used in studies of multisensory speech processing as a measurable and behaviorally relevant index of audiovisual integration [Grant and Seitz, 1998; Grant et al., 1998; Jones and Callan, 2003; Jones and Jarick, 2006; Möttönen, 2002; Munhall et al., 1996; Sams et al., 1991]. This illusion enables researchers to maintain a constant stimulus (incongruent audiovisual information) while measuring a reproducible alteration in multisensory perception. 
We sought to investigate whether brain activity reflects the individual auditory and visual stimuli or the fused illusory percept.
Adaptation fMRI is a powerful technique that can detect subtle differences in neural processing. A population of neurons will show decreased activity in response to similar repeated stimuli and a subsequent “release from adaptation” when presented with a novel stimulus [for a review, see Grill-Spector et al., 2006]. In functional studies, adaptation could be responsible for the blood oxygen level dependent (BOLD) signal decreases, faster reaction times, and improved accuracy often noted in repetition priming experiments [Grill-Spector et al., 2006; Schacter and Buckner, 1998; Wiggs and Martin, 1998]. After adaptation, small subsequent stimulus changes cause an increase in the BOLD signal [Grill-Spector et al., 1999]. This is thought to reflect increased activity in small populations of neurons within a voxel that are more sensitive to the new stimulus as opposed to the previous stimulus [Grill-Spector and Malach, 2001; Grill-Spector et al., 1999; Grill-Spector et al., 2006]. This type of activity is not detectable in voxel-wise subtraction techniques because subpopulations of neurons within the same voxel may be active in two different conditions.
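The logic of this argument can be illustrated with a toy model. The sketch below is not taken from the study: it assumes a single voxel containing two equally responsive neuronal subpopulations, one tuned to /pa/ and one tuned to /ka/, and a hypothetical attenuation factor of 0.7 per repetition. Under these assumptions a conventional subtraction of the two unadapted responses is zero, whereas the response to the fourth stimulus of an adapted train distinguishes the two syllables.

```python
# Toy model (illustrative assumption, not from the study): one voxel holds two
# neuronal subpopulations, one tuned to /pa/ and one tuned to /ka/.
ADAPT_FACTOR = 0.7  # hypothetical per-repetition response attenuation


def voxel_responses(train, adapt_factor=ADAPT_FACTOR):
    """Return the summed voxel response to each stimulus in a train."""
    gain = {"pa": 1.0, "ka": 1.0}  # current gain of each subpopulation
    responses = []
    for syllable in train:
        responses.append(gain[syllable])  # only the tuned subpopulation responds
        gain[syllable] *= adapt_factor    # repetition adapts that subpopulation
    return responses


same = voxel_responses(["pa", "pa", "pa", "pa"])
novel = voxel_responses(["pa", "pa", "pa", "ka"])

# Subtracting the unadapted responses to single stimuli shows no difference,
# because both subpopulations sit in the same voxel.
subtraction_contrast = voxel_responses(["ka"])[0] - voxel_responses(["pa"])[0]

# After adaptation, the fourth-stimulus response separates the two trains.
release = novel[3] - same[3]

print(f"subtraction contrast: {subtraction_contrast:.3f}")
print(f"release from adaptation: {release:.3f}")
```

The attenuation factor and the equal baseline gains are arbitrary choices; any factor below 1 gives the same qualitative result.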
The current study is the first to investigate cortical activity in response to multisensory speech stimuli using the McGurk effect and adaptation fMRI. We presented trains of congruent stimuli to induce adaptation followed by a congruent or an incongruent stimulus. The incongruent stimuli were designed to induce a release from adaptation, reflected in an increased BOLD signal. We hypothesized that McGurk incongruent stimuli (sometimes perceived as a congruent syllable) should result in a lesser degree of release from adaptation compared to the other incongruent condition. In addition, we asked whether the degree of perceptual fusion across subjects correlated with the change in BOLD signal in the activated brain areas.
Subjects included 16 healthy, right-handed, English speaking adults (11 female, mean age 29.6 years, range 19–47 years). Each subject gave informed consent prior to participation. The study protocol and methods were approved by the institutional human studies review boards at the Massachusetts General Hospital and the Massachusetts Eye and Ear Infirmary.
Stimuli were short audiovisual movie clips. The visual component was the lower half of a female face mouthing the syllables /pa/ and /ka/ [Sams et al., 1991]. The stimuli subtended a visual angle of 11°, projected on a screen in the scanner through mirrors. The auditory component was the same person voicing the syllables /pa/ and /ka/ at a comfortable listening level for each subject (~75-dB SPL) presented through MRI-compatible headphones fitted around subjects’ ears (MR CONFON, Magdeburg, Germany). The duration of all auditory stimuli was 172 ms and the duration of visual stimuli was 590 ms. The voice onset time (VOT) was 270 ms for audiovisual pairs including auditory /pa/ and 150 ms for audiovisual pairs including auditory /ka/. Initial lip movements began at ~50 ms. Mouth opening and voice onset occurred at 300 ms. Each movie clip lasted 850 ms.
Auditory and visual components were paired to make four different types of stimuli (Table I): (i) congruent pa (visual /pa/ auditory /pa/), (ii) congruent ka (visual /ka/ auditory /ka/), (iii) the incongruent stimulus likely to evoke the McGurk effect (IncM = Incongruent McGurk, visual /ka/ auditory /pa/), and (iv) the converse incongruent stimulus that does not produce the McGurk effect (IncN = Incongruent Non-McGurk, visual /pa/ auditory /ka/). Subjects were required to push one button if they thought the stimuli were “congruent” (voice and lips saying the same syllable) and a different button if they thought the stimuli were “incongruent” (voice and lips saying different syllables). The IncN combination was designed as a control that would elicit an unfused percept in all subjects. Only one aspect of the stimulus—auditory or visual—was changed to produce each type of incongruent movie. For example, a trial consisting of three congruent /pa/ movies followed by an IncN movie (visual /pa/ auditory /ka/) could be created by changing only the auditory component of the final movie from /pa/ to /ka/. This allowed independent investigation of small auditory and visual stimulus changes on activations in multisensory and primary sensory cortices.
Each trial consisted of a train of four audiovisual video clips. To induce adaptation, three congruent /pa/ movies began each trial, followed by (i) a congruent /pa/, (ii) a novel congruent /ka/, or (iii–iv) an incongruent movie (IncM or IncN). The total time to present all four movies in each trial was 3,400 ms. During the 4,600 ms intertrial interval, a still face was visible with no sound, and subjects indicated with a right hand button push whether they thought the fourth movie was congruent (index finger response) or incongruent (middle finger response), regardless of the syllable. This task ensured that subjects always paid the same level of attention to the fourth movie. Subjects were informed that they might hear and see a variety of syllables during the study. Pilot data revealed that any baseline stimulus other than a still face (cross hair, oval shape, scrambled face) resulted in a large response to the novelty of face presentation, an effect that is undesirable in a paradigm designed to investigate subtle differences in syllable perception. In addition, the cross hair baseline condition resulted in a robust “off” response in areas thought to be involved in the resting state network [Gusnard et al., 2001]. Setting this resting state as an absolute baseline can lead to a consistent set of deactivations when an auditory or visual task is presented [Mazoyer et al., 2001; Shulman et al., 1997]. Any cognitive subtraction required by the baseline stimulus could produce misleading activations in voxels not involved in the speech perception task. Therefore we chose a still face for our “baseline” condition and used adaptation fMRI to reduce this confounding activity. There were six randomized presentations of each trial type per run, and runs lasted 4 min and 48 s. Each subject completed between four and eight runs.
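The timing figures above can be checked with a few lines of arithmetic. The sketch below uses only values stated in the text (clip duration, intertrial interval, run length, and presentations per trial type); the variable names are illustrative. It shows that four 850 ms clips fill the 3,400 ms train, that train plus intertrial interval gives an 8 s cycle, and how many such cycles fit in one run. The text does not describe how cycles beyond the 24 stimulus trials are used, and the sketch makes no assumption about them.

```python
MOVIE_MS = 850            # duration of one movie clip
MOVIES_PER_TRIAL = 4      # three adapting clips plus the test clip
ITI_MS = 4600             # intertrial interval with still face
RUN_S = 4 * 60 + 48       # run length: 4 min 48 s
TRIAL_TYPES = 4           # congruent /pa/, congruent /ka/, IncM, IncN
REPS_PER_TYPE = 6         # presentations of each trial type per run

train_ms = MOVIE_MS * MOVIES_PER_TRIAL        # time to present four clips
cycle_ms = train_ms + ITI_MS                  # one trial plus its interval
slots_per_run = RUN_S * 1000 // cycle_ms      # 8 s cycles that fit in a run
stimulus_trials = TRIAL_TYPES * REPS_PER_TYPE # stimulus trials per run

print(f"train: {train_ms} ms, cycle: {cycle_ms} ms")
print(f"cycles per run: {slots_per_run}, stimulus trials per run: {stimulus_trials}")
```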
Functional brain images were acquired in a 3 Tesla MRI scanner (Tim Trio, Siemens Medical Solutions, Erlangen, Germany) with a 12-channel head coil array. Using clustered volume acquisition [Edmister et al., 1999], the stimuli were presented between echo planar images (EPIs) without acoustic scanner interference. The whole-head acquisition occurred from 2.6 to 4.6 s after the last stimulus in a trial. A 2D gradient-echo EPI sequence was used to collect T2*-weighted image volumes with the following parameters: TR 2,000 ms with 8 s in between onsets of consecutive EPIs, TE 30 ms, flip angle 90°, slice thickness 5 mm with 1-mm gap, 25 slices, voxel size 3.1 × 3.1 × 5 mm³. The long TR was chosen to allow for measurement of the hemodynamic response in the auditory cortex without contamination by the response to scanner noise. The sparse sampling design also enabled us to present audiovisual stimuli with minimal contamination by scanner noise [Bandettini et al., 1998; Hall et al., 1999].
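The sparse sampling timeline can be laid out as a short sketch. It assumes, as an interpretation not stated in the text, that "after the last stimulus" is measured from the end of the fourth movie (3,400 ms into the trial). Under that assumption the acquisition window has the length of one TR, ends exactly at the 8 s cycle boundary, and does not overlap stimulus presentation. Variable names are illustrative.

```python
MOVIE_MS = 850
MOVIES_PER_TRIAL = 4
CYCLE_MS = 8000             # interval between onsets of consecutive EPIs
TR_MS = 2000                # duration of one volume acquisition
ACQ_START_AFTER_MS = 2600   # acquisition starts 2.6 s after the last stimulus
ACQ_END_AFTER_MS = 4600     # acquisition ends 4.6 s after the last stimulus

# Assumption: the reference point is the END of the fourth movie.
last_stimulus_end_ms = MOVIE_MS * MOVIES_PER_TRIAL
acq_start_ms = last_stimulus_end_ms + ACQ_START_AFTER_MS
acq_end_ms = last_stimulus_end_ms + ACQ_END_AFTER_MS


def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


noise_during_stimuli = overlaps(0, last_stimulus_end_ms, acq_start_ms, acq_end_ms)

print(f"stimuli: 0-{last_stimulus_end_ms} ms")
print(f"acquisition: {acq_start_ms}-{acq_end_ms} ms of a {CYCLE_MS} ms cycle")
print(f"scanner noise during stimuli: {noise_during_stimuli}")
```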
In the same session, two anatomical whole-brain T1-weighted MP-RAGE images were acquired for each subject. The cortical surfaces were reconstructed using FreeSurfer software (https://surfer.nmr.mgh.harvard.edu/) on the basis of the anatomical images, incorporating motion correction, intensity normalization, skull stripping, and hemisphere inflation. EPI images were motion-corrected and spatially smoothed using a kernel of full-width-half-maximum 8 mm, reconstructed using FS-FAST, and coregistered to and overlaid on the inflated surface of the anatomical scan [Dale et al., 1999; Fischl et al., 1999].
Behavioral data including reaction times were analyzed for each subject. The number of in-scanner trials was limited by time constraints due to the long TR and adaptation design. Separate behavioral trials outside the scanner were conducted to obtain more trials, and in all possible cases in-scanner and out-of-scanner data were combined in determining a subject’s fusion rates. A response of “congruent” for congruent trials and “incongruent” for both the McGurk and non-McGurk incongruent trials was scored as correct in determining overall performance. When the IncM stimulus was keyed as congruent, the trial was considered “fused,” indicating that the subject experienced the McGurk effect. We expected that most subjects would respond “incongruent” for the IncN stimuli and that responses to McGurk stimuli would be more varied. In addition, response reaction times were extracted from available out-of-scanner data. Reaction time data were analyzed separately for subjects who experienced the McGurk effect more than 50% of the time and those who rarely experienced the McGurk effect. A Student’s t test was performed to look for differences between the two groups.
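The scoring rule for fusion can be sketched as follows. The function names, the group labels, and the response lists are hypothetical and serve only to illustrate the rule described above: an IncM trial keyed as "congruent" counts as fused, and subjects are split at 50% fusion for the reaction time comparison.

```python
def percent_fusion(incm_responses):
    """Percent of IncM (McGurk) trials keyed as 'congruent', i.e. fused."""
    if not incm_responses:
        raise ValueError("no IncM responses to score")
    fused = sum(1 for response in incm_responses if response == "congruent")
    return 100.0 * fused / len(incm_responses)


def fusion_group(pct, cutoff=50.0):
    """Assign a subject to a group for the reaction time comparison.

    The labels 'fuser' and 'non-fuser' are illustrative, not the study's terms.
    """
    return "fuser" if pct > cutoff else "non-fuser"


# Hypothetical button presses on IncM trials for two subjects.
subject_a = ["congruent"] * 9 + ["incongruent"] * 1
subject_b = ["congruent"] * 2 + ["incongruent"] * 8

for name, responses in (("A", subject_a), ("B", subject_b)):
    pct = percent_fusion(responses)
    print(f"subject {name}: {pct:.0f}% fusion, group: {fusion_group(pct)}")
```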
Individual BOLD data were analyzed using a general linear model. At the group level, a fixed effects analysis was carried out and the averaged data were overlaid using spherical intersubject averaging onto an average anatomical brain from the same subjects. We compared trials that ended in an incongruent IncM or IncN stimulus to trials with only congruent /pa/ stimuli, and performed a direct comparison of the two incongruent stimuli (IncN-IncM). The comparisons between incongruent and congruent trials were corrected for multiple comparisons using the false discovery rate (FDR) method at a level of P = 0.01 and the color maps are depicted with thresholds at P < 0.05 and P < 0.01. The subtraction of IncN-IncM is thresholded at P < 0.05 and P < 0.01, uncorrected. For clusters exceeding 20 voxels at a significance corresponding to P < 0.01, the cluster size was recorded along with the significance of the t test, anatomical area, and Talairach coordinates.
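For readers unfamiliar with FDR control, a minimal sketch is given below. The text does not specify which FDR procedure was applied; the sketch assumes the Benjamini-Hochberg step-up procedure, and the p-values are hypothetical. The function name is illustrative and does not correspond to any routine in the analysis software used in the study.

```python
def benjamini_hochberg(p_values, q=0.01):
    """Return one boolean per test: True where the test survives FDR at level q.

    Assumes the Benjamini-Hochberg step-up procedure: find the largest rank k
    such that the k-th smallest p-value is <= (k / m) * q, then accept the k
    tests with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    significant = [False] * m
    for idx in order[:k_max]:
        significant[idx] = True
    return significant


# Hypothetical voxel-wise p-values.
p_vals = [0.03, 0.0001, 0.5, 0.003, 0.019]
print(benjamini_hochberg(p_vals, q=0.01))
```

Because the procedure is step-up, a test can survive even when its own p-value exceeds its rank threshold, provided a test with a larger p-value meets its threshold.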
To further investigate the relationship between BOLD signal and perception of the McGurk effect, an ROI analysis was performed on a group of regions that were activated in the group statistical maps. Standard ROIs were delineated anatomically based on the FreeSurfer parcellation and then morphed from the standard space into each individual’s brain via a spherical surface [Fischl et al., 2004]. This allowed ROIs to be standardized across subjects. The auditory cortex was subdivided into primary auditory cortex (Heschl’s gyrus) and the nonprimary auditory areas in the superior temporal gyrus and supratemporal cortex (STG). For each subject, the BOLD percent signal change was extracted for each ROI in both the IncM and IncN conditions. A regression analysis was then performed comparing “percent fusion” (percent of McGurk incongruent trials keyed as congruent) and BOLD percent signal change. Plots were generated depicting the relationship with superimposed trendlines. The P-value of the t-statistic is reported.
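The regression step can be sketched as an ordinary least-squares fit of BOLD percent signal change on percent fusion, with a t-statistic for the slope. The exact regression model used in the study is not specified beyond the description above, so simple linear regression is an assumption here, and the five data points are hypothetical. The P-value would be obtained from a t distribution with n - 2 degrees of freedom; that step is omitted to keep the sketch free of external libraries.

```python
import math


def simple_regression(x, y):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, t_statistic) where t tests slope == 0
    with len(x) - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return slope, intercept, slope / se_slope


# Hypothetical per-subject values for one ROI in the IncM condition.
fusion_pct = [0, 25, 50, 75, 100]        # percent fusion
bold_pct = [1.0, 0.8, 0.7, 0.4, 0.3]     # BOLD percent signal change

slope, intercept, t_stat = simple_regression(fusion_pct, bold_pct)
print(f"slope: {slope:.4f} per percent fusion, intercept: {intercept:.2f}")
print(f"t statistic (df = {len(fusion_pct) - 2}): {t_stat:.2f}")
```

A negative slope in this sketch corresponds to an inverse relationship between fusion and BOLD signal change.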
Behavioral data were pooled from all runs for each subject. One subject had insufficient behavioral data and was excluded from further analysis. The remaining 15 subjects correctly categorized the IncN stimuli (iv) as incongruent in 87.8% of trials. The percent of McGurk stimuli reported as congruent for each subject is reported as “percent fusion” and plotted in Figure 1. The degree of fusion was ranked and the subject number reordered. The curve shows that subject responses fell along a continuum from a minimal fusion rate of 1% (most McGurk stimuli perceived as incongruent) to a maximum of 91% (most McGurk stimuli perceived as congruent). An analysis of reaction times revealed that those who responded to the McGurk stimuli as incongruent more than 50% of the time had a slightly longer processing time than those subjects who responded that the McGurk stimuli were congruent more than 50% of the time (1,100 ms compared to 930 ms, P < 0.05).
Figure 2 top row shows the group analysis for a subtraction between incongruent non-McGurk trials (IncN) and congruent /pa/ trials. The IncN condition included an auditory change only (from congruent /pa/ to incongruent visual /pa/ auditory /ka/). There was a release from adaptation (increased BOLD signal) in the IncN condition in widespread areas including auditory (both Heschl’s gyrus and STG), multisensory (STS, insula, precentral gyrus, intraparietal sulcus, anterior cingulate/medial superior frontal lobe, fusiform, and lingual gyri), and primary visual areas in both hemispheres. Talairach coordinates for significant clusters are listed in Table II.
Figure 2 middle row shows the group analysis for subtraction between incongruent McGurk trials (IncM) and congruent /pa/ trials. The IncM condition included a visual change only, from congruent /pa/ to incongruent visual /ka/ auditory /pa/. A similar network of auditory (STG) and multisensory areas (STS, precentral gyrus, right insula, IPS, medial superior frontal lobe, and left fusiform gyrus) showed release from adaptation; however, this appeared to be less robust than the IncN comparison. Much less BOLD activity was seen in primary visual cortex, despite the visual stimulus change.
To ensure that the type of change (auditory or visual) did not act as a confounding variable in our results, we also compared incongruent trials (IncN and IncM) to the congruent trials that consisted of three congruent /pa/ syllables followed by a congruent /ka/ (CppCkk). In this case, the congruent condition resulted in a change in both auditory and visual stimuli. This allowed us to compare conditions in which there was an auditory change with an unfused perception (IncN) to trials in which there was an auditory change with a fused perception (CppCkk). There was a similar pattern of activation in multisensory areas such as the insula, STS, and anterior cingulate, and in primary visual cortex for this subtraction (not shown). Similarly, a comparison of a visual change with variable fused perception (IncM) to a visual change with fused perception (CppCkk) resulted in activity in multisensory and primary visual areas. These results suggest that the activations shown in Figure 2 are not due to the change in auditory or visual stimulus alone but are related to perceptual differences.
To further investigate the difference between McGurk and non-McGurk conditions, Figure 2 bottom row shows IncM trials (three congruent /pa/ followed by the McGurk visual /ka/ auditory /pa/) subtracted from IncN trials (three congruent /pa/ followed by the non-McGurk visual /pa/ auditory /ka/). This comparison revealed significant differences in the left STS, right insula, right IPS, bilateral anterior cingulate, and bilateral primary visual cortex.
For both auditory and visual changes, a release from adaptation was also seen in the sensory cortex of the opposite modality. For the IncN comparison with a change in only the auditory stimulus, there was a release from adaptation in pericalcarine areas bilaterally (Fig. 2, top row). For the IncM condition with only a visual change, there was a release from adaptation in the STG bilaterally (Fig. 2, middle row).
Figure 3 shows the results of the ROI correlation analysis. The thumbnails at the sides illustrate the ROIs plotted on an individual brain. There were significant correlations between the subject’s behavioral data and the BOLD response to the McGurk stimulus (IncM vs. congruent /pa/) in primary and multisensory cortices. In the left primary auditory cortex (Heschl’s gyrus), left STG (auditory association cortex), left STS, left IPS (not shown), left insula, and right precentral gyrus there was an inverse relationship between percent fusion and percent BOLD signal change in the IncM condition. In the left calcarine sulcus there was a positive correlation between the degree of McGurk fusion and BOLD response in the IncN condition.
In the present study we investigated neural correlates of phonemic perception using the McGurk audiovisual illusion in an adaptation fMRI design. Our experimental question was fundamentally different from asking where in the brain multisensory integration takes place. Prior research on the McGurk effect has been directed either at characterizing the behavioral phenomenon or at understanding the neural correlates of multisensory integration. Our study is unique in showing a correlation between perception of the McGurk effect and a measurable BOLD response, highlighting a network of cortical regions in the brain where multisensory speech perception takes place, independent of changes in the physical stimuli themselves.
Localizing the areas that correlate with behavioral perception lends further insight into the process of audiovisual speech comprehension. By adulthood, normal hearing individuals have acquired years of experience processing multisensory speech, and most develop a tendency to integrate simultaneous auditory and visual information even when there are discrepancies between the two modalities [Fowler, 2004]. When simultaneous auditory and visual inputs are perceived as incongruent, additional neural processing may be required. This is particularly relevant for HI patients, where altered auditory signals (from hearing aids or cochlear implants) may be perceived to be incongruent with visual information, requiring more time and neural resources for speech comprehension.
Studies that alter the congruency of multisensory conditions have consistently found greater activation for incongruent compared to congruent stimuli [Bushara et al., 2001; Jones and Callan, 2003; Miller and D’Esposito, 2005; Ojanen et al., 2005; Pekkola et al., 2006; Raij et al., 2000]. Our study extends these observations by demonstrating an incongruency effect for perceptually unfused (IncN) compared to perceptually fused (IncM) stimuli. The reaction times support this concept as well, with increased processing time required for subjects to respond to stimuli which were perceived as incongruent compared to the time required to respond to stimuli which were perceived as congruent syllables. In the present study, the whole brain analysis revealed that there was a greater BOLD percent signal change for incongruent stimuli than for congruent stimuli, and that McGurk stimuli (incongruent but sometimes perceived as congruent) fell in between. Incongruent audiovisual stimuli (IncN and IncM) induced a release from adaptation in both primary sensory and multisensory cortical areas, and the degree of adaptation release was reflected in the percept.
Prior anatomical and functional imaging studies have demonstrated that the STS is important for multisensory processing [Beauchamp et al., 2004; Benevento et al., 1977; Falchier et al., 2002; Kaas and Collins, 2004; Kaas and Hackett, 2000; Miller and D’Esposito, 2005; Watkins et al., 2006], particularly for language related stimuli [Callan et al., 2003; Calvert et al., 2000; Campbell et al., 2001; Levänen et al., 2001; Raij et al., 2000; van Atteveldt et al., 2004]. Our results showing greater STS activity for incongruent stimuli are, however, contrary to some fMRI reports [Calvert et al., 1999, 2000] in which the STS showed increased activity for congruent stimuli and decreased activity for incongruent stimuli. These studies may not be directly comparable to our findings because they compare the multisensory response to the predicted response from the addition of independent unimodal auditory and visual stimuli. In contrast, our study investigated the effect of incongruent stimuli on neurons already adapted to congruent audiovisual stimuli, probing for a response to subtle syllable changes.
Additional regions that showed greater activity for incongruent compared to congruent stimuli were the STG (auditory association area), IPS, fusiform gyrus, insula, and precentral gyrus, all areas that have been previously implicated in multisensory processing [Benevento et al., 1977; Bushara et al., 2001; Calvert, 2001; Calvert and Lewis, 2004; Kaas and Collins, 2004; Kaas and Hackett, 2000; Miller and D’Esposito, 2005]. Incongruency in phonetically conflicting compared to matching vowels has been linked to activity in Broca’s area and the left premotor cortex [Ojanen et al., 2005]. Perceptual incongruency due to temporal offset of audiovisual speech stimuli has also been shown to activate an extensive multisensory network compared with synchronous stimuli [Jones and Callan, 2003; Miller and D’Esposito, 2005]. In an event-related fMRI study on the perception of audiovisual temporal synchrony, Miller and D’Esposito [2005] reported distinct patterns of neural activity in the primary auditory cortex, STS, IPS, and inferior frontal gyrus. In our study, adaptation revealed a similar multisensory network with differential responses based on the phonetic perception of congruency.
In addition to multisensory effects, subjects showed different patterns of activity in primary sensory cortices for incongruent audiovisual stimuli. These effects differed based on percepts. The BOLD response to a change from congruent /pa/ to the IncM stimulus was small, reflecting an extension of the adaptation effect to similar nonidentical stimuli [Sawamura et al., 2006] or a more general adaptation to the congruency of audiovisual stimuli. When presented with a stimulus consistently perceived as incongruent (IncN), there was a release from adaptation in both primary auditory and visual areas as well as multisensory areas. This effect could be mediated by top-down differential feedback loops from multisensory cortex to primary sensory areas.
Prior anatomical and functional research has identified a network of areas involved in audiovisual processing with projections that extend from primary sensory areas to association areas and vice versa, along with direct connections between primary auditory and visual cortex [Benevento et al., 1977; Kaas and Hackett, 2000; Falchier et al., 2002]. Our results support previous work regarding visual influence on auditory processing during speech discrimination [Bertelson et al., 2003; Calvert and Campbell, 2003; Calvert et al., 1997; Fingelkurtz et al., 2003; Finney et al., 2001; Hayes et al., 2003; Jones and Jarick, 2006; Levänen et al., 2001; MacSweeney et al., 2002; Massaro, 2004; Möttönen, 2002; Pekkola et al., 2005; Petitto et al., 2000]. However, natural speech discrimination is inherently multimodal, and perceptual differences likely reflect multisensory integration rather than merely visual interference on the auditory processing stream [Grant et al., 1998]. Interestingly, our results show a clear response in the calcarine sulcus during changes in the auditory speech component despite an identical visual stimulus (IncN) (Fig. 2 top row). Similar effects of auditory input on early visual processing have been shown in studies using basic audiovisual stimuli such as bleeps and light flashes [Watkins et al., 2006] or colored circles and tones [Molholm et al., 2002]. In addition, previous work in blind subjects [Roder et al., 2002] demonstrated activity in V1 when subjects listened to spoken sentences. It is likely that the same anatomical projections from multisensory areas to V1 are present in normal individuals [Falchier et al., 2002]; however, this influence on the earliest visual processing centers has not previously been shown in healthy adults for language related stimuli. 
These effects may have been subtracted out in previous voxel-wise fMRI comparisons because V1 is strongly activated by still faces (often serving as the control condition), whereas other extrastriate visual areas such as V5/MT are not [Callan et al., 2003; Calvert and Campbell, 2003]. By using adaptation to moving faces and observing neural responses to perceptual phoneme changes, we were able to see these effects extending beyond V5/MT and into earlier visual areas.
Once this network of primary sensory and multisensory areas involved in the perception of incongruent McGurk stimuli was identified via whole brain analysis, we probed regions of interest more intensely for an association between behavioral fusion and BOLD activity. The behavioral curve of responses to the McGurk illusion (Fig. 1) shows a continuum of perceptual change from subjects who experience the McGurk effect greater than 90% of the time to those who rarely to never experience the fusion of the two audiovisual syllables into a coherent third syllable. This graded effect is similar to psychometric curves in other areas of neuroscience, particularly those involved with perception [Raizada and Poldrack, 2007; Xu, 2008]. The degree of variation among subjects allowed for further analysis of the relationship between perception and brain function by correlating the strength of the behavioral effect with the strength of BOLD activity in regions of interest, as shown in Figure 3. The correlation analysis revealed an inverse relationship between behavioral fusion and BOLD percent signal change in the McGurk condition in multisensory areas such as the left STS, insula, IPS, and the right precentral gyrus. The primary auditory cortex (Heschl’s gyrus) and nonprimary auditory areas (STG) also demonstrated an inverse relationship. This result is consistent with earlier MEG work by Sams et al. showing activity in primary auditory cortex during the McGurk effect [Sams et al., 1991]. Activity in these brain regions appears to be modified by perception of audiovisual congruency. Subjects who experienced McGurk fusion more often had less release from adaptation while those who perceived most McGurk stimuli as incongruent had a robust BOLD response. The inverse relationship supports the above findings that audiovisual stimuli perceived as incongruent induce a greater release from adaptation. 
This holds even when only the perception of congruency changes while the incongruent audiovisual stimulus itself remains constant. This finding implies that the increase in BOLD signal is due to additional neural resources required to clarify a discrepant perception, rather than a simple response to a change in components of the stimuli.
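As a minimal sketch of the kind of across-subject correlation described above, the following Python snippet computes a Pearson coefficient between per-subject fusion rates and percent signal change in one region of interest. The numbers are hypothetical placeholders chosen only to illustrate an inverse relationship; they are not data from this study, and the function and variable names are assumptions introduced for illustration rather than the analysis pipeline actually used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL per-subject values (not taken from the study):
# proportion of McGurk trials perceived as fused, and BOLD percent
# signal change in one region of interest for the McGurk condition.
fusion_rate = [0.01, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.91]
mcgurk_psc = [0.42, 0.40, 0.33, 0.30, 0.22, 0.18, 0.12, 0.09]

r = pearson_r(fusion_rate, mcgurk_psc)
print(f"Pearson r between fusion rate and percent signal change: {r:.2f}")
```

A negative coefficient in such an analysis corresponds to the pattern reported here: subjects with more frequent fusion show less release from adaptation.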
In the left calcarine sulcus, there was a positive correlation between McGurk perception and BOLD signal change for the IncN condition. Subjects who perceived the McGurk stimuli as congruent had more BOLD activity during the IncN condition, while those who perceived the McGurk stimuli as incongruent showed a smaller BOLD signal change for this condition. This result could be due to additional release from adaptation for subjects who perceive the McGurk stimuli as congruent and are then exposed to the IncN incongruent stimuli. Alternatively, this result could represent inhibitory top-down effects on the primary visual cortex following presentation of perceptually incongruent stimuli, decreasing reliance on the perceived incongruent or inconsistent visual modality. Alternative imaging methods with better temporal resolution may be required for further investigation of this effect.
Recent work by Raizada and Poldrack [2007] also showed alterations in BOLD response that were linked to behavioral perception. The authors measured selective amplification in the brain during categorical phonetic perception. Using a behaviorally weighted general linear model, they found correlated activity in six regions including the left inferior supramarginal gyrus, right cerebellum, anterior cingulate, left IPS, left middle frontal cortex, and right prefrontal cortex. These areas proposed to be important for phonetic perception overlapped with our results in the anterior cingulate, IPS, and middle frontal cortex. Our results also included a greater network of speech processing areas, perhaps because the task was directly related to understanding and judging speech syllables.
Jones and Callan have also investigated the effects of audiovisual perception on brain functioning. The authors studied the effects of altering congruency (using congruent and incongruent McGurk stimuli) and temporal synchrony. The McGurk effect was strongest when the audiovisual stimuli were temporally synchronous. Similar to our current findings, a comparison of the synchronous conditions showed greater activity in the right supramarginal gyrus and left inferior parietal lobule for incongruent compared to congruent stimuli. They did not find a correlation between perceptual performance and activity in the STS or auditory cortex. The authors did not report the degree of McGurk fusion across subjects, and it is possible that their subject group was homogeneous in its susceptibility to the McGurk effect. In our study, subjects’ responses to the McGurk stimuli varied across a continuum from 1% to 91% fusion. This wide variation in the degree to which subjects perceive the McGurk effect is what made a correlation apparent in our experiment.
Perceptual fusion depends on the syllable pair, the timing offset, and on the individual subject [Champoux et al., 2006; Massaro, 2004; McGurk and MacDonald, 1976]. The average McGurk fusion rate in this study (49%) was comparable with earlier published fusion rates ranging from 40% to near 100% [Hayes et al., 2003; McGurk and MacDonald, 1976; Möttönen, 2002; Munhall et al., 1996; Olson et al., 2002; Saldana and Rosenblum, 1993; Sams et al., 1991]. While our stimulus model was a native Finnish speaker and subjects were native English speakers, it is unlikely that differences in fusion between subject groups were due to cultural differences in syllable production [see Munhall and Vatikiotis-Bateson, 2004 for a review of the spatial constraints of speech information], and indeed most subjects did not report any difficulty in identifying the two syllables /pa/ and /ka/. The effect of task must also be taken into account when evaluating multisensory results. Prior work on the McGurk effect sought to investigate the interference caused by visual information on a subject’s correct identification of the auditory syllable [Bertelson et al., 2003; Massaro, 1998; Rosenblum and Saldana, 1992; Saldana and Rosenblum, 1993]. Subjects were instructed to concentrate on what they heard to reduce any effect of selective attention to the visual stimuli. In the current study, subjects focused equally on auditory and visual information in order to make a decision about whether the auditory and visual stimuli were congruent. It is possible that this resulted in greater bimodal interaction or a greater contribution from the visual system. These task effects, however, do not account for the new finding of V1 modulation by a change in auditory syllable (Fig. 2, top row). Directing more attention to the visual stimulus would instead be expected to reduce the influence of the auditory syllable on visual processing. 
In comparing any incongruent stimulus to the baseline of four congruent movies, the effect of differential attention to novel stimuli must also be considered. However, additional comparisons of the incongruent conditions to a congruent change (congruent /pa/ to congruent /ka/) also demonstrated greater BOLD activity for the incongruent conditions. Thus, the effect of attention to a novel stimulus cannot explain the current findings.
The adaptation paradigm may be instrumental in revealing these small signal changes based on percept. In adaptation fMRI, a population of neurons exposed to similar repeated stimuli will gradually decrease their activity, with a subsequent “release from adaptation” when presented with a novel stimulus [Grill-Spector et al., 1999, 2006]. The adaptation effect has been demonstrated to detect differences in retinotopic spatial and orientation specificity [Boynton and Finney, 2003; Grill-Spector et al., 1999; Murray et al., 2006], object shape and format [Kourtzi and Kanwisher, 2001; Vuilleumier et al., 2005; Winston et al., 2004], identity and face perception [Winston et al., 2004], and higher order discrimination of words and phonetic categories [Jaaskelainen et al., 2004; Raizada and Poldrack, 2007; Wheatley, 2006]. Studies have suggested that the reduced activity seen with repetition or adaptation is due to similarity in physical stimuli [Roberts and Summerfield, 1981; Shigeno, 2002]; however, cross-modal adaptation in language tasks [Buckner et al., 2000; Carlesimo et al., 2003; Jaaskelainen et al., 2004; Kim et al., 2004; McKone and Dennis, 2000; Schacter et al., 2004] implies that adaptation can occur in response to percepts [Carlesimo et al., 2003; Kim et al., 2004] since the visual and auditory stimuli do not share basic sensory properties. If distinct populations of neurons contained within the same imaging voxel respond differentially to percept, this result would be cancelled out in a standard voxel-wise subtraction fMRI analysis [Calvert, 2001; Grill-Spector and Malach, 2001; Grill-Spector et al., 1999, 2006].
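The logic of the preceding argument can be illustrated with a toy model. The sketch below assumes a single voxel containing two equally sized neural populations, one selective for /pa/ and one for /ka/, and a fixed multiplicative suppression of whichever population was driven by the adapting stimulus. All response values, the suppression factor, and the function name are hypothetical simplifications introduced here; they are not parameters estimated in this study.

```python
def voxel_response(stimulus, adapted_to=None, suppression=0.5):
    """Summed response of a toy voxel holding two selective populations.

    Each population responds strongly (1.0) to its preferred syllable and
    weakly (0.2) to the other. If the voxel was adapted to a syllable, the
    population preferring that syllable is scaled down by `suppression`.
    All numbers are illustrative assumptions.
    """
    total = 0.0
    for preferred in ("pa", "ka"):
        drive = 1.0 if preferred == stimulus else 0.2
        if adapted_to == preferred:
            drive *= suppression  # the adapted population responds less
        total += drive
    return total


# Standard subtraction without adaptation: both syllables drive the voxel
# equally overall, so the voxel-wise contrast is zero.
standard_difference = voxel_response("ka") - voxel_response("pa")

# Adaptation design: after repeated /pa/, a fourth /pa/ meets a suppressed
# population, whereas a /ka/ recruits the unadapted population.
repeated = voxel_response("pa", adapted_to="pa")
novel = voxel_response("ka", adapted_to="pa")
release = novel - repeated

print(f"Subtraction contrast (ka - pa), no adaptation: {standard_difference:.2f}")
print(f"Release from adaptation (novel - repeated): {release:.2f}")
```

Under these assumptions the plain subtraction shows no difference between syllables, while the adaptation comparison yields a positive release, which is why adaptation designs can reveal selectivity below the resolution of a voxel.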
These findings have implications for the clinical evaluation of hearing impaired patients. Grant et al. [1998] reported significant benefits for audiovisual stimuli relative to auditory speech alone in hearing impaired subjects. Pre-lingually deaf children with cochlear implants who show the greatest degree of audiovisual benefit also produce more intelligible speech [Lachs et al., 2001]. If hearing loss is chronic or slowly progressive, patients develop an expectation of unreliable auditory information and are more likely to favor the visual modality when audiovisual information is incongruent. Over the long term, this selective neural processing could lead to changes in brain connectivity between multisensory and primary areas. Hearing aids and cochlear implants may be capable of reversing this pattern [Lee et al., 2001, 2003; Roland et al., 2001]. If patients retain cortical plasticity, particularly in areas responsible for the perception of audiovisual speech, they may be expected to achieve greater success with these devices. Pre-operative functional imaging of cochlear implant candidates could be a helpful prognostic tool. Further research in this area is indicated.
The authors thank Don Eddington, Daniel Goldenholtz, Sari Levänen, and Jyrki Ahveninen for helpful comments on the experiments and the manuscript; Mikko Sams and Riikka Möttönen for stimulus design; and Dan Wakeman, Deirdre Foxe, Balaji Lakshmanan, and Mary Foley for technical assistance.
Contract grant sponsor: Otolaryngology Research Training Grant (MEEI); Contract grant number: T32 DC000020-18; Contract grant sponsor: NIH; Contract grant number: R01 NS048279-01A1; Contract grant sponsor: The Finnish Cultural Foundation; Contract grant numbers: K08 MH067966, 5P41RR14075
Contract grant sponsors: The MIND Institute, The Academy of Finland, The Sigrid Jusélius Foundation.