In the present study we investigated neural correlates of phonemic perception using the McGurk audiovisual illusion in an adaptation fMRI design. Our experimental question was fundamentally different from asking where in the brain multisensory integration takes place. Prior research on the McGurk effect has been directed either at characterizing the behavioral phenomenon or at understanding the neural correlates of multisensory integration. Our study is unique in showing a correlation between perception of the McGurk effect and a measurable BOLD response, highlighting a network of cortical regions in the brain where multisensory speech perception takes place, independent of changes in the physical stimuli themselves.
Localizing the areas that correlate with behavioral perception lends further insight into the process of audiovisual speech comprehension. By adulthood, normal hearing individuals have acquired years of experience processing multisensory speech, and most develop a tendency to integrate simultaneous auditory and visual information even when there are discrepancies between the two modalities [
Fowler, 2004]. When simultaneous auditory and visual inputs are perceived as incongruent, additional neural processing may be required. This is particularly relevant for hearing-impaired (HI) patients, for whom altered auditory signals (from hearing aids or cochlear implants) may be perceived as incongruent with visual information, requiring more time and neural resources for speech comprehension.
Studies that alter the congruency of multisensory conditions have consistently found greater activation for incongruent compared to congruent stimuli [
Bushara et al., 2001;
Jones and Callan, 2003;
Miller and D’Esposito, 2005;
Ojanen et al., 2005;
Pekkola et al., 2006;
Raij et al., 2000]. Our study extends these observations by demonstrating an incongruency effect for perceptually unfused (IncN) compared to perceptually fused (IncM) stimuli. The reaction times support this interpretation as well: subjects took longer to respond to stimuli perceived as incongruent than to stimuli perceived as congruent syllables. In the present study, the whole brain analysis revealed a greater BOLD percent signal change for incongruent stimuli than for congruent stimuli, with McGurk stimuli (incongruent but sometimes perceived as congruent) falling in between. Incongruent audiovisual stimuli (IncN and IncM) induced a release from adaptation in both primary sensory and multisensory cortical areas, and the degree of adaptation release was reflected in the percept.
Prior anatomical and functional imaging studies have demonstrated that the STS is important for multisensory processing [
Beauchamp et al., 2004;
Benevento et al., 1977;
Falchier et al., 2002;
Kaas and Collins, 2004;
Kaas and Hackett, 2000;
Miller and D’Esposito, 2005;
Watkins et al., 2006], particularly for language related stimuli [
Callan et al., 2003;
Calvert et al., 2000;
Campbell et al., 2001;
Levänen et al., 2001;
Raij et al., 2000;
van Atteveldt et al., 2004]. Our results showing greater STS activity for incongruent stimuli are, however, contrary to some fMRI reports [
Calvert et al., 1999,
2000] in which the STS showed increased activity for congruent stimuli and decreased activity for incongruent stimuli. These studies may not be directly comparable to our findings because they compare the multisensory response to the response predicted by summing independent unimodal auditory and visual stimuli. In contrast, our study investigated the effect of incongruent stimuli on neurons already adapted to congruent audiovisual stimuli, probing for a response to subtle syllable changes.
Additional regions that showed greater activity for incongruent compared to congruent stimuli were the STG (auditory association area), IPS, fusiform gyrus, insula, and precentral gyrus—all areas that have been previously implicated in multisensory processing [
Benevento et al., 1977;
Bushara et al., 2001;
Calvert, 2001;
Calvert and Lewis, 2004;
Kaas and Collins, 2004;
Kaas and Hackett, 2000;
Miller and D’Esposito, 2005]. Incongruency in phonetically conflicting compared to matching vowels has been linked to activity in Broca’s area and the left premotor cortex [
Ojanen et al., 2005]. Perceptual incongruency due to temporal offset of audiovisual speech stimuli has also been shown to activate an extensive multisensory network compared with synchronous stimuli [
Jones and Callan, 2003;
Miller and D’Esposito, 2005]. In an event-related fMRI study on the perception of audiovisual temporal synchrony,
Miller and D’Esposito [2005] reported distinct patterns of neural activity in the primary auditory cortex, STS, IPS, and inferior frontal gyrus. In our study, adaptation revealed a similar multisensory network with differential responses based on the phonetic perception of congruency.
In addition to multisensory effects, subjects showed different patterns of activity in primary sensory cortices for incongruent audiovisual stimuli. These effects differed based on percepts. The BOLD response to a change from congruent /pa/ to the IncM stimulus was small, reflecting an extension of the adaptation effect to similar nonidentical stimuli [
Sawamura et al., 2006] or a more general adaptation to the congruency of audiovisual stimuli. When presented with a stimulus consistently perceived as incongruent (IncN), there was a release from adaptation in both primary auditory and visual areas as well as multisensory areas. This effect could be mediated by top-down differential feedback loops from multisensory cortex to primary sensory areas.
Prior anatomical and functional research has identified a network of areas involved in audiovisual processing with projections that extend from primary sensory areas to association areas and vice versa, along with direct connections between primary auditory and visual cortex [
Benevento et al., 1977;
Kaas and Hackett, 2000;
Falchier et al., 2002]. Our results support previous work regarding visual influence on auditory processing during speech discrimination [
Bertelson et al., 2003;
Calvert and Campbell, 2003;
Calvert et al., 1997;
Fingelkurts et al., 2003;
Finney et al., 2001;
Hayes et al., 2003;
Jones and Jarick, 2006;
Levänen et al., 2001;
MacSweeney et al., 2002;
Massaro, 2004;
Möttönen, 2002;
Pekkola et al., 2005;
Petitto et al., 2000]. However, natural speech discrimination is inherently multimodal, and perceptual differences likely reflect multisensory integration rather than merely visual interference on the auditory processing stream [
Grant et al., 1998]. Interestingly, our results show a clear response in the calcarine sulcus during changes in the auditory speech component despite an identical visual stimulus (IncN) ( top row). Similar effects of auditory input on early visual processing have been shown in studies using basic audiovisual stimuli such as bleeps and light flashes [
Watkins et al., 2006] or colored circles and tones [
Molholm et al., 2002]. In addition, previous work in blind subjects [
Roder et al., 2002] demonstrated activity in V1 when subjects listened to spoken sentences. It is likely that the same anatomical projections from multisensory areas to V1 are present in normal individuals [
Falchier et al., 2002]; however, this influence on the earliest visual processing centers has not previously been shown in healthy adults for language related stimuli. These effects may have been subtracted out in previous voxel-wise fMRI comparisons because V1 is strongly activated by still faces (often serving as the control condition), whereas extrastriate visual areas such as V5/MT are not [
Callan et al., 2003;
Calvert and Campbell, 2003]. By using adaptation to moving faces and observing neural responses to perceptual phoneme changes, we were able to see these effects extending beyond V5/MT and into earlier visual areas.
Once this network of primary sensory and multisensory areas involved in the perception of incongruent McGurk stimuli was identified via whole brain analysis, we probed regions of interest more intensely for an association between behavioral fusion and BOLD activity. The behavioral curve of responses to the McGurk illusion () shows a continuum of perceptual change from subjects who experience the McGurk effect greater than 90% of the time to those who rarely to never experience the fusion of the two audiovisual syllables into a coherent third syllable. This graded effect is similar to psychometric curves in other areas of neuroscience, particularly those involved with perception [
Raizada and Poldrack, 2007;
Xu, 2008]. The degree of variation among subjects allowed for further analysis of the relationship between perception and brain function by correlating the strength of the behavioral effect with the strength of BOLD activity in regions of interest, as shown in . The correlation analysis revealed an inverse relationship between behavioral fusion and BOLD percent signal change in the McGurk condition in multisensory areas such as the left STS, insula, IPS, and the right precentral gyrus. The primary auditory cortex (Heschl’s gyrus) and nonprimary auditory areas (STG) also demonstrated an inverse relationship. This result is consistent with earlier MEG work by Sams et al. showing activity in primary auditory cortex during the McGurk effect [
Sams et al., 1991]. Activity in these brain regions appears to be modified by perception of audiovisual congruency. Subjects who experienced McGurk fusion more often had less release from adaptation while those who perceived most McGurk stimuli as incongruent had a robust BOLD response. The inverse relationship supports the above findings that audiovisual stimuli perceived as incongruent induce a greater release from adaptation. This is true when only the perception of congruency changes despite a constant incongruent audiovisual stimulus. This finding implies that the increase in BOLD signal is due to additional neural resources required to clarify a discrepant perception, rather than a simple response to a change in components of the stimuli.
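The across-subject correlation described above can be illustrated with a minimal sketch. It assumes a Pearson correlation between each subject's fusion rate and the BOLD percent signal change in one region of interest; the text does not specify which correlation coefficient was used, and the per-subject values and the names `fusion_rate` and `bold_change` are hypothetical, invented only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-subject values (NOT data from this study).
fusion_rate = [0.01, 0.20, 0.45, 0.60, 0.91]  # proportion of McGurk trials fused
bold_change = [0.42, 0.35, 0.28, 0.20, 0.08]  # % signal change in one ROI, McGurk condition

r = pearson_r(fusion_rate, bold_change)
# A negative r corresponds to the inverse relationship described in the text:
# more frequent fusion goes with less release from adaptation.
print(f"r = {r:.2f}")
```

In this toy data set the subjects who fuse more often have smaller signal changes, so the coefficient is negative, matching the direction of the inverse relationship reported for the multisensory and auditory regions.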
In the left calcarine sulcus, there was a positive correlation between McGurk perception and BOLD signal change for the IncN condition. Subjects who perceived the McGurk stimuli as congruent had more BOLD activity during the IncN condition, while those who perceived the McGurk stimuli as incongruent showed a smaller BOLD signal change for this condition. This result could be due to additional release from adaptation for subjects who perceive the McGurk stimuli as congruent and are then exposed to the IncN incongruent stimuli. Alternatively, this result could represent inhibitory top-down effects on the primary visual cortex following presentation of perceptually incongruent stimuli, decreasing reliance on the visual modality when it is perceived as incongruent or inconsistent. Alternative imaging methods with better temporal resolution may be required for further investigation of this effect.
Recent work by
Raizada and Poldrack [2007] also showed alterations in BOLD response that were linked to behavioral perception. The authors measured selective amplification in the brain during categorical phonetic perception. Using a behaviorally weighted general linear model, they found correlated activity in six regions including the left inferior supramarginal gyrus, right cerebellum, anterior cingulate, left IPS, left middle frontal cortex, and right prefrontal cortex. These areas, proposed to be important for phonetic perception, overlapped with our results in the anterior cingulate, IPS, and middle frontal cortex. Our results also included a greater network of speech processing areas, perhaps because the task was directly related to understanding and judging speech syllables.
Jones and Callan [2003] have also investigated the effects of audiovisual perception on brain functioning. The authors studied the effects of altering congruency (using congruent and incongruent McGurk stimuli) and temporal synchrony. The McGurk effect was strongest when the audiovisual stimuli were temporally synchronous. Similar to our current findings, a comparison of the synchronous conditions showed greater activity in the right supramarginal gyrus and left inferior parietal lobule for incongruent compared to congruent stimuli. They did not find a correlation between perceptual performance and activity in the STS or auditory cortex. The authors did not report the degree of McGurk fusion across subjects, and it is possible that their subject group was homogeneous in its susceptibility to the McGurk effect. In our study, subjects’ responses to the McGurk effect varied across a continuum from 1% to 91% fusion. This wide variation in the degree to which subjects perceived the McGurk effect may be why a correlation was apparent in our experiment.
Perceptual fusion depends on the syllable pair, the timing offset, and on the individual subject [
Champoux et al., 2006;
Massaro, 2004;
McGurk and MacDonald, 1976]. The average McGurk fusion rate in this study (49%) was comparable with earlier published fusion rates ranging from 40% to near 100% [
Hayes et al., 2003;
McGurk and MacDonald, 1976;
Möttönen, 2002;
Munhall et al., 1996;
Olson et al., 2002;
Saldana and Rosenblum, 1993;
Sams et al., 1991]. While our stimulus model was a native Finnish speaker and subjects were native English speakers, it is unlikely that differences in fusion between subject groups were due to cultural differences in syllable production [see
Munhall and Vatikiotis-Bateson, 2004 for a review of the spatial constraints of speech information], and indeed most subjects did not report any difficulty in identifying the two syllables /pa/ and /ka/. The effect of task must also be taken into account when evaluating multisensory results. Prior work on the McGurk effect sought to investigate the interference caused by visual information on a subject’s correct identification of the auditory syllable [
Bertelson et al., 2003;
Massaro, 1998;
Rosenblum and Saldana, 1992;
Saldana and Rosenblum, 1993]. Subjects were instructed to concentrate on what they heard to reduce any effect of selective attention to the visual stimuli. In the current study, subjects focused equally on auditory and visual information in order to make a decision about whether the auditory and visual stimuli were congruent. It is possible that this resulted in greater bimodal interaction or a greater contribution from the visual system. These task effects, however, do not account for the new finding of V1 modulation by a change in auditory syllable (, top row). Directing more attention to the visual stimulus would instead be expected to reduce the influence of the auditory syllable on visual processing. In comparing any incongruent stimulus to the baseline of four congruent movies, the effect of differential attention to novel stimuli must also be considered. However, additional comparisons of the incongruent conditions to a congruent change (congruent /pa/ to congruent /ka/) also demonstrated greater BOLD activity for the incongruent conditions. Thus, the effect of attention to a novel stimulus cannot explain the current findings.
The adaptation paradigm may be instrumental in revealing these small signal changes based on percept. In adaptation fMRI, a population of neurons exposed to similar repeated stimuli will gradually decrease their activity, with a subsequent “release from adaptation” when presented with a novel stimulus [
Grill-Spector et al., 1999,
2006]. The adaptation effect has been demonstrated to detect differences in retinotopic spatial and orientation specificity [
Boynton and Finney, 2003;
Grill-Spector et al., 1999;
Murray et al., 2006], object shape and format [
Kourtzi and Kanwisher, 2001;
Vuilleumier et al., 2005;
Winston et al., 2004], identity and face perception [
Winston et al., 2004], and higher order discrimination of words and phonetic categories [
Jaaskelainen et al., 2004;
Raizada and Poldrack, 2007; Wheatley, 2006]. Studies have suggested that the reduced activity seen with repetition or adaptation is due to similarity in physical stimuli [
Roberts and Summerfield, 1981;
Shigeno, 2002]; however, cross-modal adaptation in language tasks [
Buckner et al., 2000;
Carlesimo et al., 2003;
Jaaskelainen et al., 2004;
Kim et al., 2004;
McKone and Dennis, 2000;
Schacter et al., 2004] implies that adaptation can occur in response to percepts [
Carlesimo et al., 2003;
Kim et al., 2004] since the visual and auditory stimuli do not share basic sensory properties. If distinct populations of neurons contained within the same imaging voxel respond differentially to percept, this result would be cancelled out in a standard voxel-wise subtraction fMRI analysis [
Calvert, 2001;
Grill-Spector and Malach, 2001;
Grill-Spector et al., 1999,
2006].
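The cancellation argument above can be made concrete with a toy model. The sketch below is purely illustrative and rests on stated assumptions: a voxel containing two equally sized, hypothetical neural populations ("A" and "B"), each driven by only one percept, and an arbitrary `adaptation_factor` of 0.5 for the adapted population. None of these values come from the study.

```python
def voxel_response(stimulus, adapted_to=None, adaptation_factor=0.5):
    """Summed response of a toy voxel holding two equally sized populations.

    Population "A" responds only to percept "A" and population "B" only to
    percept "B". A population that has been adapted by repeated exposure
    responds at a reduced level (adaptation_factor, a hypothetical value).
    """
    response = 0.0
    for population in ("A", "B"):
        if population != stimulus:
            continue  # this population is not driven by the stimulus
        gain = adaptation_factor if population == adapted_to else 1.0
        response += gain
    return response


# Standard subtraction: each percept drives a different population equally,
# so the voxel-level difference is zero and the selectivity is hidden.
subtraction = voxel_response("A") - voxel_response("B")

# Adaptation design: after repeated "A", a further "A" meets an adapted
# population, while a change to "B" drives a fresh, unadapted population.
repeat = voxel_response("A", adapted_to="A")
change = voxel_response("B", adapted_to="A")
release = change - repeat

print(f"subtraction contrast: {subtraction}")
print(f"release from adaptation: {release}")
```

Under these assumptions the subtraction contrast is zero even though the two populations are fully selective, whereas the adaptation contrast is positive, which is the sense in which adaptation can expose percept-specific populations within a single voxel.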
These findings have implications for the clinical evaluation of hearing impaired patients.
Grant et al. [1998] reported significant benefits for audiovisual stimuli relative to auditory speech alone in hearing impaired subjects. Pre-lingually deaf children with cochlear implants who show the greatest degree of audiovisual benefit also produce more intelligible speech [
Lachs et al., 2001]. If hearing loss is chronic or slowly progressive, patients develop an expectation of unreliable auditory information and are more likely to favor the visual modality when audiovisual information is incongruent. Over the long term, this selective neural processing could lead to changes in brain connectivity between multisensory and primary areas. Hearing aids and cochlear implants may be capable of reversing this pattern [
Lee et al., 2001,
2003;
Roland et al., 2001]. If patients retain cortical plasticity, particularly in areas responsible for the perception of audiovisual speech, they may be expected to achieve greater success with these devices. Pre-operative functional imaging of cochlear implant candidates could be a helpful prognostic tool. Further research in this area is indicated.