Visible speech enhances the intelligibility of auditory speech when listening conditions are poor, and can even modify the perception of otherwise perfectly audible utterances. This audiovisual perception is our most natural form of communication and one of our most common multisensory phenomena. However, where and in what form the visual and auditory representations interact is still not completely understood. While there are longstanding proposals that multisensory integration occurs relatively late in the speech processing sequence, there is considerable neurophysiological evidence that audiovisual interactions can occur in the brain stem and primary auditory and visual cortices [4,5]. One of the difficulties in testing such hypotheses is that when the degree of integration is manipulated experimentally, the visual and/or auditory stimulus conditions are drastically modified [6,7] and thus the perceptual processing within a modality and the corresponding processing loads are affected. Here we used a novel bistable speech stimulus to examine the conditions under which there is a visual influence on auditory perception in speech. The results indicate that visual influences on auditory speech processing, at least for the McGurk illusion, necessitate the conscious perception of the visual speech gestures, thus supporting the hypothesis that multisensory speech integration is not completed in early processing stages.
In the present studies we held audiovisual stimulus conditions constant and allowed subjective organization of the percept to determine the extent of multisensory integration. This was achieved through the use of a dynamic version of Rubin's vase illusion. In our stimulus, an irregular vase rotated and its changing profile produced a talking face profile (see Figure 1). The face articulated the nonsense utterance /aba/ while the accompanying acoustic signal was a voice saying the nonsense utterance /aga/. Two visual and two auditory percepts occur with these stimuli. Visually, the faces appeared to be the figure and the vase was the background or vice versa. Auditorily, subjects heard either the recorded audio track, /aga/, or heard the so-called combination McGurk effect, /abga/. In this illusion, both consonants are ‘heard’ even though only the /g/ is present in the acoustic signal. This percept results from visual influences on auditory perception. When subjects only heard the acoustic signal, /aga/, there was no phonetic influence of the visual information. Three experiments are presented here.
Experiment 1 looked at the association of the McGurk illusion and the perception of either the vase or face. Complete independence of these percepts would suggest that visual influences on auditory speech perception might occur at an early stage of processing, either subcortically or in primary sensory cortex. Recent work on figure-ground perception indicates that, beyond the simple competition between low-level processing units, figural assignment may involve widespread recurrent processing [e.g., 10] and biased competition between high-level shape perception units. If audiovisual integration in speech is not sensitive to the suppression of face perception in the bistable stimulus, it must precede or be independent of this process. Alternatively, complete association of face perception and perception of the McGurk illusion would suggest that audiovisual integration of speech depends on categorical stimulus representations for object perception. Two different stimuli were presented to subjects. In the first condition, the vase rotated and its shape produced a profile of an articulating face saying the utterance /aba/ (Figure 1A: moving face, moving vase). In the second condition, the vase rotated but the face profile remained constant (Figure 1B: still face, moving vase). This was achieved by subtle changes to the 3D vase in this condition such that its visible rotation did not produce any profile changes. Such a stimulus could only be produced using animation. Each of these stimuli was combined with a recording of /aga/. The control condition (still face, moving vase) was not expected to produce the McGurk effect since there was no visual information for a consonant. Subjects watched single tokens and gave two responses. First they reported whether they perceived a vase or a face, and then they told the experimenter whether they heard /aga/ or /abga/.
For the moving face, moving vase stimulus, the results show a strong association between consciously perceiving the face and perceiving the McGurk effect (Figure 2); 66 percent of the responses shared this perceptual pattern. Only 9 percent of the responses reported the McGurk effect when the vase was the percept. The control stimulus (still face, moving vase) produced a quite different pattern of responses. Approximately 90 percent of the speech responses were percepts of the auditory stimulus /aga/. These responses were split between the vase and face percepts with a slight bias toward perceiving the face. The /abga/ responses (~10%) were split between the face and vase percepts. This three-way interaction was reliable by chi-square test (p<.001). When the 2×2 response contingency tables were evaluated separately for each stimulus, the moving face, moving vase stimulus showed a reliable association between face perception and the perception of the McGurk combination (Fisher Exact Probability test, p<.05) while the stimulus with a still face and moving vase showed no association (p>.5).
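To make the per-stimulus analysis concrete, the following is a minimal sketch of a two-sided Fisher exact test on a single 2×2 table of visual percept (face, vase) by auditory percept (/abga/, /aga/). The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data, and the sketch is not claimed to reproduce the authors' analysis software.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts (NOT the study's data).
# Rows: face reported, vase reported. Columns: /abga/ heard, /aga/ heard.
illustrative_counts = (20, 4, 3, 3)
p_value = fisher_exact_two_sided(*illustrative_counts)
print(f"two-sided p = {p_value:.3f}")
```

A small p-value for such a table would indicate that the auditory report depends on which visual percept was reported, which is the kind of association described for the moving face, moving vase stimulus.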
The small number of /bg/ percepts in the moving face, moving vase condition when the vase was reported was approximately equal to the number of /bg/ percepts for the still face, moving vase condition (~10%). This common response rate suggests that this may be simply response bias or error. While motion in a suppressed image in binocular rivalry can still elicit motion aftereffects and contribute to the perception of apparent motion, the moving face seems to require conscious perception in order to influence auditory speech.
The presence of vase motion alone produced a large number of face percepts. This is not associated with audiovisual integration as virtually no McGurk effects were observed for this condition. When the two motion conditions in Experiment 1 are contrasted, we see strong evidence for the importance of dynamic facial information and its conscious perception as prerequisites for audiovisual speech perception. These findings are consistent with studies showing that awareness that an acoustic signal is speech is a prerequisite for audiovisual integration.
Two control experiments were carried out to help clarify the results. Experiment 2 tested further how motion influenced vase/face perception and, in addition, how sound influenced this percept. Three different levels of movement of the stimulus were shown with and without the speech soundtrack. In one condition, a static frame of the vase and face was shown for the duration of the dynamic stimuli. This frame was identical to the leftmost frame in Figure 1A. The other two conditions were identical to the visual conditions tested in Experiment 1.
Figure 3 shows the mean proportion of face percepts for the three movement conditions as a function of whether a speech utterance was played along with the visual stimuli. A robust effect of movement condition is evident (F(2,24) = 36.4, p<.001) while only a modest influence of the presence of sound can be seen (F(1,24) = 3.8, p=.06) and no interaction. The presence of motion dramatically decreased the percentage of vase percepts from the high of 76 percent in the static image condition to a low of 28 percent in the moving face, moving vase stimulus. Each of the three motion conditions was reliably different from the others (p<.01). The presence of auditory speech increased the percentage of face percepts but by less than 10 percent on average.
From a pictorial viewpoint, the stimulus was biased toward perceiving the vase by the surface texture information and three-dimensional rendering of the vase. The still image's high proportion of vase percepts reflects this. When auditory speech and any motion (either the vase alone or both vase and face) were presented, the proportion of vase percepts consistently decreased from the high-water mark of the silent, still-image condition. The onset of motion in an image during binocular rivalry or higher velocity in an image tends to increase the likelihood of that image dominating perception. The reduction in vase percepts in the still face, moving vase condition is inconsistent with these findings. The independence of facial form and motion pathways suggests a possible high-level associative account. Nevertheless, the moving face, moving vase condition is the only visual condition in which face percepts dominate (>50%). The influence of sound was modest and relatively consistent across the different visual conditions. If early audiovisual interactions were driving the visual percepts, the moving face condition would have been expected to show the strongest influence of the presence of sound. This was not the case. The results suggest that the conditions determining the perception of the unimodal stimulus (vision) are primarily determining multisensory integration.
Experiment 3 was carried out to test whether perceptual alternations could be accounted for simply by an alternation between eccentric (face) and central (vase) fixations. The distributions of gaze fixation positions associated with either of the two reported percepts were compared using an analysis derived from signal detection theory. For each subject, we found that the distributions of fixation positions associated with each percept overlapped extensively and only in 0.1% of the cases (22/23506) could the gaze distributions be considered as significantly different. This finding is consistent with the report that the changes in perception of the Rubin's face-vase stimulus are not associated with changes in eye positions and with work showing that the McGurk effect is not dependent on whether the visual speech is viewed using central (foveal) or paracentral vision [e.g., 21].
Recent evidence indicates that the attentional state of the subject influences audiovisual integration of speech [22, 23]. The McGurk effect is reduced under high attention demands. Further, subjects appear to have perceptual access to the individual sensory components as well as the unified multisensory percept. Findings such as these contradict the view that multisensory integration is pre-attentive and thus automatic and mandatory and are consistent with the involvement of higher order processes in phonetic decisions. The evidence that auditory processing is influenced by visual information subcortically as early as 11 ms following acoustic onset for speech stimuli or cortically in less than 50 ms for tone stimuli is, at first look, difficult to reconcile with such findings.
One possible solution is that multisensory speech processing involves interaction between auditory and visual information at many levels of perception yet the final phonetic categorization, and ultimately audiovisual integration, takes place quite late. Multisensory processing may involve rapid attentional mechanisms that modulate early auditory or visual activity, promote spatial orienting or provide contextual modulation of activity. Yet, the dynamic structure of speech may require integration over longer timescales than the speed at which vision and audition can initially interact. The production of human speech is quite slow with the modal syllable rate being approximately 3-6 Hz. It has long been recognized that information for speech sounds does not reside at any instant in time but rather is extended over the syllable. Thus, even within a modality the temporal context of information determines its phonetic identity. For audiovisual speech of the kind presented here, the information for consonant identity is extended in time and perception requires extended processing to integrate this perceptual information.
It remains to be seen whether this conclusion extends to all audiovisual speech phenomena. Vision can influence auditory speech perception in at least two distinct ways. The first involves correlational modulation. Visible speech strongly correlates with some parts of the acoustic speech signal. The acoustic amplitude envelope and even the detailed acoustic spectrum can be predicted by the visible speech articulation. This redundancy may permit early modulation of audition by vision, for example, by the visual signal amplifying correlated auditory inputs.
The second way in which visible speech influences auditory speech is by providing complementary information. In this case, vision provides stronger cues than the auditory signal or even information missing from the auditory signal. This latter case is the situation that best describes the perception of speech in noise and the combination McGurk effect. In both of these examples the correlation between auditory and visual channels is broken because of the loss of information in the auditory channel. For the combination McGurk effect, a /b/ could plausibly be produced during the intervocalic closure in /aga/ with minimal or no auditory cues. The strong cue of a visible bilabial closure provides independent information to the speech system. It is possible that such complementary visual information can only be combined with the auditory signal late during phonetic decision making, after both modalities carry out considerable processing.
In experimental settings, the natural correlation between auditory and visual speech can also be broken by having the visible speech provide contradictory cues for the auditory signal. This is the case for the standard fusion McGurk effect where an auditory /b/ is combined with a visual /g/ and /d/ is heard. Both modalities yield sufficient but contradictory cues for consonant perception, though for the strongest effect the auditory /b/ must be a weak percept. Whether the perceptual system also makes a late phonetic decision under these conditions is unclear. The evidence from attention studies suggests that this is the case [22, 23].
Bistable phenomena in vision, audition, and multisensory processing are well accounted for by ideas of distributed competition involving different neural substrates and perceptual processes [35, 36, 37]. Audiovisual speech perception may share this form of distributed processing. However, the data presented here indicate that multisensory decision making in speech perception requires high-level phonetic processes including the conscious perception of facial movements. The unique stimuli used in these experiments will be an important tool in further characterizing the network of processes involved in this multisensory perception.
The studies were approved by the Queen's University General Research Ethics Board and all subjects gave informed consent before participating in the research.
Audiovisual stimuli were created using a dynamic version of the Rubin Vase illusion. Experiments 1 and 2 used an animated version (Figure 1) of a vase created with Maya (Autodesk) with the face profile determined by the same video sequence as Experiment 3. In both stimuli, the vase was irregular and as it rotated, its edge would produce a different face profile. Figure 1A shows three frames from the movie in which the vase rotates and its changing shape produces a face profile that articulates the utterance /aba/. The face profile matches the original movie exactly on a frame-by-frame basis. Figure 1B shows three frames from the control movie in which a slightly different vase rotates but its changing shape produces no change in the face profile. The difference in profile changes between 1A and 1B is due to subtle differences in the animated vase 3D shape between the two conditions. In Experiment 3, a video of a rotating, custom-constructed vase was edited using the profile of a female speaker saying the utterance /aba/.
In Experiment 1, twelve subjects were presented with two types of stimuli in single trials. The stimuli were the two dynamic stimuli from Experiment 2 (rotating vase that produced an articulating face, rotating vase that produced a still face), both presented with the audio track, /aga/. Subjects were asked to indicate whether they saw a face or vase. Only a single response was permitted for each trial. After reporting this, they were instructed to record whether the sound they perceived was most like /aga/ or /abga/. Following 10 warm-up trials, the subjects were presented with 60 experimental trials, 30 of each condition in randomized order.
In Experiment 2, fourteen subjects were presented with six types of stimuli in single trials. Three visual stimuli (still frame; moving face, moving vase: rotating vase that produced an articulating face; and still face, moving vase: rotating vase that produced a still face) were presented with either the audio track, /aga/, or silence. Each trial was composed of a single rotation of the vase or, in the case of the still frame, a period equaling the duration of the dynamic trials. The subjects' task was to indicate whether they saw the face or the vase first, then indicate each time it changed within a trial. Following 12 warm-up trials, each of the stimuli was presented five times with order randomized across stimulus type, making 30 experimental trials in total. When subjects reported more than one state in a single trial, both responses were included in the analyses as separate responses for that condition. Subjects generally reported only one perceptual state for the bistable stimulus in each trial, with the overall average number of states reported being 1.05 states per trial. There were no differences in the number of states seen across the conditions.
In Experiment 3, we tested 7 subjects on a behavioural task in which the stimulus was displayed in loops of 10 continuous utterances. Subjects responded after each loop with a keypress indicating whether they heard a /b/ sound or not. In addition to the behavioural task, we examined whether the varying audiovisual percept of the bistable stimulus depended on the subject's gaze fixation positions by monitoring the horizontal and vertical eye positions of the subjects while they viewed the stimulus during repeated trials and reported when their percept changed.
Horizontal and vertical eye positions were sampled at a rate of 1 kHz using the search-coil-in-magnetic-field technique with an induction coil that consisted of a light coil of wire embedded in a flexible ring of silicone rubber (Skalar) that adheres to the limbus of the human eye, concentric with the cornea. The search coil was positioned in the dominant eye of the subjects, and only after the surface of the eye had been anesthetized with a few drops of anesthetic (Tetracaine HCl, 0.5%). Details of this method were described previously.
The distributions of fixations for the signal detection analysis were computed in the following manner. At each millisecond in each type of utterance (i.e., ones perceived either as /bg/ or /g/), we calculated separately the probabilities that the positions of horizontal and vertical gaze fixation were greater than a position criterion, which was incremented in 1-deg steps across the image from either the left margin of the image or its bottom margin. The ensuing fixation position probabilities (for each percept) were then plotted against each other in a receiver operating characteristic (ROC) curve, and the area under each curve (AUROC) computed to capture the amount of separation between the two distributions of fixation positions. This quantitative measure gives the general probability that, given one draw from each distribution, the fixation positions from the distributions associated with the two percepts would be distinct.
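The criterion-sweep procedure described above can be sketched as follows for one gaze dimension at one time point. This is an illustrative sketch only: the function names, the trapezoidal integration, and the sample gaze positions are assumptions introduced here and are not taken from the authors' analysis code.

```python
def exceed_prob(positions, criterion):
    """Proportion of fixation positions greater than the criterion."""
    return sum(p > criterion for p in positions) / len(positions)


def auroc(pos_a, pos_b, left, right, step=1.0):
    """Area under the ROC curve comparing two sets of fixation positions.

    The criterion is swept from `left` to `right` (in degrees) in `step`
    increments, as in the 1-deg sweep from an image margin.
    """
    n_criteria = int(round((right - left) / step)) + 1
    # Anchor the curve at (1, 1); both probabilities fall as the criterion rises.
    points = [(1.0, 1.0)]
    for i in range(n_criteria):
        criterion = left + i * step
        points.append(
            (exceed_prob(pos_b, criterion), exceed_prob(pos_a, criterion))
        )
    points.append((0.0, 0.0))
    points.reverse()  # x now ascends from 0 to 1

    # Trapezoidal integration (an assumed choice of integration rule).
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical horizontal gaze positions in degrees (NOT the study's data).
gaze_when_bg_heard = [4.2, 5.1, 5.8, 6.4, 7.0]
gaze_when_g_heard = [3.9, 5.0, 5.5, 6.6, 7.3]
print(auroc(gaze_when_bg_heard, gaze_when_g_heard, left=0, right=12))
```

An area near 0.5 corresponds to extensively overlapping distributions, whereas values near 0 or 1 correspond to well-separated ones.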
This work was supported by the National Institute on Deafness and Other Communication Disorders (grant DC-00594) and the Natural Sciences and Engineering Research Council of Canada. Bryan Burt assisted in data collection. The authors acknowledge the artistic work of Patty Petkovich, Rob Loree, and particularly Mike Watters for creation of the vase stimuli.