Watching the lips of a speaker enhances speech perception. At the same time, the 100-ms response to speech sounds is suppressed in the observer’s auditory cortex. Here, we used whole-scalp 306-channel magnetoencephalography (MEG) to study whether lipreading modulates human auditory processing already at the level of the most elementary sound features, i.e., pure tones. We further examined the temporal dynamics of the suppression to tell whether the effect is driven by top-down influences. Nineteen subjects were presented with 50-ms tones spanning six octaves (125–8000 Hz) (1) during “lipreading”, i.e., when they watched video clips of silent articulations of Finnish vowels /a/, /i/, /o/, /y/, and reacted to vowels presented twice in a row, (2) during a visual control task, (3) during a still-face passive control condition, and, in a separate experiment with a subset of nine subjects, (4) during covert production of the same vowels. Auditory-cortex 100-ms responses (N100m) were equally suppressed in the lipreading and covert speech-production tasks compared with the visual control and baseline tasks; the effects involved all frequencies and were most prominent in the left hemisphere. Responses to tones presented at different times with respect to the onset of the visual articulation showed significantly increased N100m suppression immediately after the articulatory gesture. These findings suggest that the lipreading-related suppression in the auditory cortex is caused by top-down influences, possibly by an efference copy from the speech-production system, generated during both own speech and lipreading.
Our senses interact and usually support each other. For example, watching the lips of a speaker enhances speech perception in noisy conditions (Sumby and Pollack, 1954). On the other hand, the cortical 100-ms response to speech sounds (N100/N100m) is suppressed during audiovisual, compared with auditory-only, presentation in both electroencephalographic (EEG; Klucharev et al., 2003, Besle et al., 2004, van Wassenhove et al., 2005, Stekelenburg and Vroomen, 2007) and magnetoencephalographic (MEG) recordings (Jääskeläinen et al., 2004). Lipreading-related suppression specific to formant components of speech sounds has also been found (Jääskeläinen et al., 2008), with modulation of hemodynamic activity even in primary auditory cortex (Calvert et al., 1997; MacSweeney et al., 2000; Pekkola et al., 2005), suggesting effects already at the level of elementary sound features.
The suppression of the neural population-level N100/N100m response with stimulus repetition has been attributed to active inhibition (Loveless et al., 1989). Hypothetically, the lipreading-related suppression could be due to top-down inhibitory influences that increase frequency specificity in the auditory system (Jääskeläinen et al., 2007), possibly via direct anatomical connections from visual areas (Falchier et al., 2002; Rockland and Ojima, 2003; Cappe and Barone, 2005). Alternatively, the suppression might be explained by subcortical projections (Cappe et al., 2009a), or by back-projections from heteromodal cortical areas (Lewis and Van Essen, 2000). Still another possibility is that a rather similar efference copy signal is sent from the speech-production system both during articulation and lipreading, because Broca’s region is activated in both (for a review, see Nishitani et al., 2005). This view is supported by a study showing that both silent articulation and lipreading modify perception of speech sounds similarly (Sams et al., 2005).
Recently, Skipper and coworkers (2007) addressed the role of efference copy signals in audiovisual speech perception by visually presenting /ka/, dubbed with auditory /pa/, to produce a McGurk illusion: perception of /ta/. The fMRI pattern in the auditory cortex initially resembled that of /pa/, but later matched that elicited by /ta/, thus paralleling categorization at the behavioral level as well as the neuronal activity patterns in frontal speech-production areas. Articulation-related efference copy signals suppress the auditory-cortex responses to both self- and externally produced sounds, as N100m to phonetic stimuli is suppressed during both overt and covert speech production (Numminen and Curio, 1999; Curio et al., 2000). While both lipreading and speech production may suppress auditory-cortex reactivity, it remains unclear whether both effects can be explained similarly, because no studies have directly compared their specificity to sound features, such as frequency bands important to speech (Warren et al., 1995).
Here, we hypothesized that lipreading modulates auditory processing already at the level of the most elementary sound features, pure tones. We envisioned that the modulation could be different for frequencies critical for speech perception compared with other frequencies. We further hypothesized that lipreading and covert self-production of vowels have similar suppressive effects on the auditory-cortex reactivity, suggesting that the N100m suppression is caused by an efference copy from the speech-production system.
Twenty healthy subjects participated voluntarily in the study; one of them was excluded because of technical problems. All subjects included in the analysis (N = 19) were right-handed, native Finnish speakers with normal hearing and normal or corrected-to-normal sight (10 women, 9 men, age 20–32 years, mean ± standard deviation 23.7 ± 3.2 years). The subset of subjects (N = 9) with an additional covert speech production task included 4 women and 5 men (21–32 years, mean ± s.d. 23.7 ± 3.3 years). The subjects gave informed consent before the experiment and were not paid for their participation. The experiment was run in accordance with the Helsinki Declaration, and the MEG recordings had prior approval from the Ethics Committee of the Hospital District of Helsinki and Uusimaa, Finland.
The visual stimuli used in the experiment were similar to those used by Pekkola et al. (2005). Figure 1 summarizes the stimuli and experimental paradigm. During the lipreading condition, video clips of a woman articulating Finnish vowels /a/, /i/, /o/ or /y/ were presented on a back-projection screen located 100 cm in front of the subject. The face extended ~5.9° × 7.8° of visual angle (width of the mouth ~1.7°). Each single vowel clip lasted for 1.28 seconds and was extended with 1–4 frames (0.04–0.16 s) of the still face to introduce jitter into the presentation (i.e., the stimulus onset asynchrony, SOA, for each vowel was variable). These short video clips were concatenated in pseudorandom order to form a long, continuous video. One tenth of the time, two identical vowels followed each other, constituting a target stimulus. During the “expanding rings” control condition, a blue ring with a diameter corresponding to 1.0° visual angle was overlaid on the still face. The ring was manipulated to change its shape in one of four directions: horizontal, vertical, or tilted ±45 degrees. The ring transformation took place at roughly the same pace as the mouth openings during the lipreading condition. Similarly, the short video clips of ring transformations were extended with still frames to introduce jitter, and combined into one long presentation in pseudorandom order: 10% of the clips were targets. During the still-face and covert-speech-production conditions, only the still face was continuously shown on the screen.
Auditory stimuli were identical in all conditions: 50-ms sine-wave tones with a frequency of 125, 250, 500, 1000, 2000, 4000, and 8000 Hz and an interstimulus interval of 1005 ms were presented, in random order, through ear inserts (Etymotic Research Inc., Elk Grove Village, IL). Each tone had 5-ms Hann-windowed rise and fall times. Sound files were generated with Matlab (R14, MathWorks Inc., Natick, MA, USA) using a 44.1-kHz sampling rate with 16-bit precision. Random playback order of the tones was controlled so that two consecutive tones were at least two octaves apart. The sounds were played 55 dB above individual hearing threshold, measured separately at 1000 Hz for both ears.
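The tone parameters above can be illustrated with a minimal Python sketch. The original sounds were generated in Matlab, so this is not the authors' code, and the function names are hypothetical. The sketch assumes that the 5-ms rise and fall are half-Hann (raised-cosine) segments inside the 50-ms duration, and it implements the two-octave ordering rule as a simple per-step random draw, which by itself does not balance the number of presentations per frequency.

```python
import math
import random

FS = 44100                                        # sampling rate (Hz), from the text
FREQS = [125, 250, 500, 1000, 2000, 4000, 8000]   # tone frequencies (Hz)

def make_tone(freq_hz, dur_s=0.050, ramp_s=0.005, fs=FS):
    """Return a sine tone as 16-bit integer samples with half-Hann ramps."""
    n = round(dur_s * fs)
    n_ramp = round(ramp_s * fs)
    samples = []
    for i in range(n):
        # Raised-cosine gain: 0 -> 1 over the onset ramp, 1 -> 0 over the offset ramp.
        if i < n_ramp:
            gain = 0.5 * (1 - math.cos(math.pi * i / n_ramp))
        elif i >= n - n_ramp:
            gain = 0.5 * (1 - math.cos(math.pi * (n - 1 - i) / n_ramp))
        else:
            gain = 1.0
        value = gain * math.sin(2 * math.pi * freq_hz * i / fs)
        samples.append(int(round(value * 32767)))  # scale to the 16-bit range
    return samples

def tone_sequence(n_tones, rng):
    """Draw tones so that consecutive tones are at least two octaves apart."""
    seq = [rng.choice(FREQS)]
    while len(seq) < n_tones:
        prev = seq[-1]
        allowed = [f for f in FREQS if abs(math.log2(f / prev)) >= 2]
        seq.append(rng.choice(allowed))
    return seq

tone = make_tone(1000)                          # one 50-ms, 1000-Hz tone
order = tone_sequence(20, random.Random(0))     # seeded only for reproducibility
```

With seven frequencies one octave apart, every tone has at least four permitted successors, so the draw never runs out of candidates.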
Auditory and visual stimulus presentation rates differed (tones presented at a fixed rate of ~1 Hz, video clips on average at ~0.7 Hz; see Figure 1) so that the stimuli were at constantly varying synchrony with respect to the other modality and thus could not be fused together to form an audiovisual object. Both stimuli were delivered using Presentation software (v10.1, Neurobehavioral Systems, Inc., Albany, CA, USA). Each of the three conditions (still face, lipreading, expanding rings) was presented in short 6–7 min interleaved blocks with counterbalanced order across subjects. At least 100 artifact-free MEG epochs were collected for the online average.
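The two presentation rates can be checked with a short arithmetic sketch in Python. It assumes that the 1–4 added still frames were equally likely and that the 1005-ms interstimulus interval was measured from tone offset to the next tone onset; neither detail is stated explicitly above.

```python
FRAME_S = 0.04                 # one video frame (s); 1-4 frames = 0.04-0.16 s
CLIP_S = 1.28                  # duration of one vowel clip (s)
TONE_SOA_S = 0.050 + 1.005     # tone duration + interstimulus interval (s), assumed

# Video SOA for each possible number of added still frames (1-4).
video_soas = [CLIP_S + k * FRAME_S for k in range(1, 5)]
mean_video_soa = sum(video_soas) / len(video_soas)   # 1.38 s

video_rate_hz = 1 / mean_video_soa   # roughly 0.7 Hz
tone_rate_hz = 1 / TONE_SOA_S        # roughly 1 Hz
```

Because the two rates are not related by a simple ratio and the video SOA is jittered, the tone-to-articulation lag drifts continuously across a block.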
The subjects were instructed to perform a one-back task during both the lipreading and expanding-rings conditions by lifting their right index finger whenever they detected the target, two identical vowels or ring transformations following each other: the response was detected with an optical response pad. During the still-face condition, the only instruction was to keep the gaze focused on the mouth area of the face. All nineteen subjects were measured in three different experimental conditions: (1) lipreading, (2) expanding rings, and (3) still face. A subset of nine subjects additionally participated in the (4) covert-speech-production condition, where the subjects were instructed to covertly produce the same Finnish vowels that were presented visually during the lipreading condition while the same still face of a woman was shown on the screen. The subjects were further instructed to avoid movements of the head and mouth to minimize artifacts caused by muscular activity, and to keep roughly the same pace as during lipreading and expanding rings (i.e., one vowel every 1.5 s).
The reaction times were measured from the onset of the video clips. As each video clip started with frames showing still face (see Figure 1), the visual movement did not start at 0 ms, and thus the onset times of visual motion differed slightly between the lipreading and expanding-rings conditions. The correction was 360 ms for all ring transformations in the expanding-rings condition (edited to occur in exact synchrony) and 440, 400, 400, and 440 ms for vowels /a/, /i/, /o/, and /y/, respectively (1–2 frames difference to expanding-rings).
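The onset correction described above amounts to subtracting a stimulus-specific offset from each raw reaction time. A minimal Python sketch follows; the function name and the example trials are hypothetical, and only the offsets come from the text.

```python
# Visual-motion onset within each clip (ms), taken from the text.
MOTION_ONSET_MS = {"ring": 360, "a": 440, "i": 400, "o": 400, "y": 440}

def corrected_rt(raw_rt_ms, stimulus):
    """Reaction time measured from visual-motion onset instead of clip onset."""
    return raw_rt_ms - MOTION_ONSET_MS[stimulus]

# Hypothetical single trials, for illustration only.
ring_rt = corrected_rt(1066, "ring")   # 706 ms
vowel_rt = corrected_rt(1293, "a")     # 853 ms
```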
MEG was measured with a 306-channel whole-head neuromagnetometer (Vectorview, Elekta Neuromag Ltd, Helsinki, Finland) in a magnetically shielded room. This device has 102 sensor elements, each with two orthogonal planar gradiometers and one magnetometer. The sampling rate for the recording was 601 Hz, and the passband was 0.01–172 Hz. Additionally, one electro-oculogram channel with electrodes placed below and on the outer canthus of the left eye was recorded to detect eye blinks and eye movements. The signals time-locked to auditory events were averaged offline, with epochs exceeding 3000 fT/cm or 150 µV rejected as containing extracerebral artifacts. Each epoch lasted for 700 ms, starting 200 ms before the stimulus onset. All amplitudes were measured with respect to a 100-ms pre-stimulus baseline. The averaged MEG signals were low-pass filtered at 40 Hz.
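The epoching steps above can be sketched for a single gradiometer channel in plain Python. This is an illustration, not the acquisition software's procedure: the helper names are hypothetical, the rejection criterion is assumed to be peak-to-peak amplitude (the text gives only the limits), and the 150-µV electro-oculogram check is omitted for brevity.

```python
FS = 601.0            # sampling rate (Hz)
PRE_S = 0.200         # epoch starts 200 ms before stimulus onset
EPOCH_S = 0.700       # total epoch length (s)
BASELINE_S = 0.100    # pre-stimulus baseline length (s)
GRAD_LIMIT = 3000.0   # gradiometer rejection limit (fT/cm)

N_PRE = round(PRE_S * FS)         # samples before onset
N_EPOCH = round(EPOCH_S * FS)     # samples per epoch
N_BASE = round(BASELINE_S * FS)   # samples in the baseline

def cut_epoch(signal, onset_sample):
    """Extract one epoch around a stimulus onset from a continuous signal."""
    start = onset_sample - N_PRE
    return signal[start:start + N_EPOCH]

def is_clean(epoch, limit=GRAD_LIMIT):
    """Assumed criterion: peak-to-peak amplitude must not exceed the limit."""
    return max(epoch) - min(epoch) <= limit

def baseline_correct(epoch):
    """Subtract the mean of the last 100 ms before stimulus onset."""
    base = epoch[N_PRE - N_BASE:N_PRE]
    mean = sum(base) / len(base)
    return [x - mean for x in epoch]

def average_clean(epochs):
    """Average the baseline-corrected epochs that pass artifact rejection."""
    kept = [baseline_correct(e) for e in epochs if is_clean(e)]
    if not kept:
        raise ValueError("no artifact-free epochs")
    avg = [sum(col) / len(kept) for col in zip(*kept)]
    return avg, len(kept)
```

The 40-Hz low-pass filter would be applied to the resulting average and is left out of the sketch.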
Before MEG recording, the 3D locations of preauricular points and nasion were digitized to obtain a right-handed head-coordinate frame. After this, locations of four head-position indicator coils, fixated on the scalp, were digitized. The coils were energized in the beginning of each recording session, providing information about head position with respect to the MEG sensors. Finally, extra points along the subjects’ scalp were digitized to obtain a better head shape for later co-registration with the individual MR image and to estimate head size and the origin of the spherical head model used in dipole fitting.
The cortical current sources of the MEG signals were modeled as two equivalent current dipoles (ECDs) that were fitted, using a spherical head model, to left- and right-hemisphere planar-gradiometer data (Hämäläinen et al., 1993). For each subject and condition, ECDs were estimated for the N100m responses elicited by the 1000-Hz tones. Thereafter, the dipole locations and orientations were kept fixed and MEG signals across other auditory stimuli were projected to these dipoles to yield N100m source waveforms for each subject, condition, and stimulus. The N100m peak strengths and peak latencies were determined from the individual source waveforms using a semi-automatic peak-seeking algorithm. Grand-average source waveforms were calculated by averaging the individual source waveforms.
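The automatic part of such a peak search can be sketched in Python as follows. The algorithm actually used is not described, so this is only one plausible form: the 70–150 ms search window is an assumption, as is the convention that the N100m appears as a positive deflection in the source waveform. The "semi-automatic" aspect, a visual check of each detected peak, is not modeled.

```python
import math

def find_peak(times_ms, strengths_nam, window_ms=(70.0, 150.0)):
    """Return (latency, strength) of the largest value inside the search window."""
    candidates = [
        (s, t)
        for t, s in zip(times_ms, strengths_nam)
        if window_ms[0] <= t <= window_ms[1]
    ]
    strength, latency = max(candidates)
    return latency, strength

# Hypothetical source waveform: a 30-nAm bump centered at 100 ms.
times = list(range(-200, 500, 2))
wave = [30.0 * math.exp(-((t - 100) / 20.0) ** 2) for t in times]
latency_ms, peak_nam = find_peak(times, wave)
```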
The peak strengths and peak latencies of the current dipoles were statistically analyzed using the non-parametric Kruskal-Wallis test for the main effects. For specific effects, Mann-Whitney U tests were used. The tests were conducted separately for the whole three-condition dataset (N = 19 subjects) and for the four-condition subset of subjects with covert-speech-production condition (N = 9); p-values from the latter are here referred to as p_4cond. All statistical analyses were done in SPSS (version 15.0 for Windows, SPSS Inc., Chicago, IL, USA).
The impact of the onset time of the visual stimulus on the auditory responses was studied in the lipreading condition by selectively averaging the responses according to the time difference (lag) between the tone and the visual articulation. As the auditory and visual stimuli were presented asynchronously, the lags were evenly distributed. Then, the subsets of epochs that were presented during overlapping 300-ms sliding windows (later referred to as ranges) were pooled together and averaged. These averaged MEG signals were projected through the same per-subject current dipoles as in the normal analysis to obtain source waveforms across frequencies, hemispheres, and ranges. Thereafter, the ECD peak strengths and latencies were analyzed using Kruskal-Wallis and Mann-Whitney U tests.
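The selective averaging can be sketched as a pooling of epoch indices by lag. In this Python illustration the function name is hypothetical, the lag is assumed to be tone onset minus video-clip onset, and the 100-ms step between overlapping windows is an assumption because the text gives only the 300-ms window width.

```python
def pool_by_lag(lags_ms, width_ms=300.0, start_ms=-200.0,
                stop_ms=1300.0, step_ms=100.0):
    """Map each lag window (start, end) to the indices of epochs falling in it."""
    pools = {}
    start = start_ms
    while start + width_ms <= stop_ms:
        end = start + width_ms
        pools[(start, end)] = [
            i for i, lag in enumerate(lags_ms) if start <= lag < end
        ]
        start += step_ms
    return pools

# Hypothetical lags (ms) for six epochs.
lags = [-150.0, 50.0, 250.0, 750.0, 999.0, 1000.0]
overlapping = pool_by_lag(lags)                    # sliding 300-ms windows
non_overlapping = pool_by_lag(lags, step_ms=300.0)  # step equal to the width
```

Setting the step equal to the window width yields five non-overlapping ranges from −200 to 1300 ms. The epochs in each pool would then be averaged and projected through the fixed dipoles as in the main analysis.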
As MRI images were not available for all subjects, we adopted a different method of normalizing the head-coordinate system to a stereotactic space, suitable for group-level studies (Steinstraeter et al., 2009). The procedure included fitting, by least squares for each individual, a sphere to the digitized anatomical landmarks (nasion and preauricular points), the locations of the four coils, and a number of extra points on the scalp (7–34, median 15 points). In this fitting procedure, points below nasion were discarded. For normalization, the head-coordinate system was first 3D-rotated to match the MNI space obtained from the “colin27” MRI image. Second, the coordinate system was translated so that the spheres of the “colin27” template and the MEG coordinates coincided. Third, the sphere size was matched to the sphere from the MNI template. These steps were combined into a 4 × 4 matrix defining an affine transform, which was then used to convert the dipole locations from the MEG head-coordinate system to the MNI space.
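The three normalization steps can be composed into a single 4 × 4 homogeneous matrix, as in the following Python sketch. It is illustrative only: the helper names are hypothetical, a rotation about one axis stands in for the full 3D rotation, the sphere parameters are invented, and the least-squares sphere fit itself is not shown.

```python
import math

def matmul(a, b):
    """Product of two 4 x 4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply(m, p):
    """Apply a 4 x 4 affine transform to a 3D point."""
    x = [p[0], p[1], p[2], 1.0]
    return [sum(m[i][k] * x[k] for k in range(4)) for i in range(3)]

def translation(v):
    return [[1, 0, 0, v[0]], [0, 1, 0, v[1]], [0, 0, 1, v[2]], [0, 0, 0, 1]]

def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]

def rotation_z(angle_rad):
    """Stand-in for the full 3D rotation: rotate about the z axis only."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def head_to_template(rot, center_meg, radius_meg, center_tpl, radius_tpl):
    """Compose: rotate, align sphere centers, then match sphere size."""
    rotated_center = apply(rot, center_meg)
    shift = translation([t - r for t, r in zip(center_tpl, rotated_center)])
    # Scale about the template sphere center so the centers stay aligned.
    scale = matmul(translation(center_tpl),
                   matmul(scaling(radius_tpl / radius_meg),
                          translation([-c for c in center_tpl])))
    return matmul(scale, matmul(shift, rot))

# Hypothetical sphere parameters (mm), for illustration only.
c_meg, r_meg = [0.0, 5.0, 45.0], 85.0
c_tpl, r_tpl = [0.0, -15.0, 5.0], 90.0
M = head_to_template(rotation_z(math.radians(10)), c_meg, r_meg, c_tpl, r_tpl)
mapped_center = apply(M, c_meg)
mapped_surface = apply(M, [c_meg[0] + r_meg, c_meg[1], c_meg[2]])
```

By construction, the fitted MEG sphere center lands on the template sphere center and points on the MEG sphere land on the template sphere, so dipole locations can be converted with a single call to `apply`.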
Figure 2 displays grand-average source waveforms for the 1-kHz tones (for a single-subject field pattern of the responses, see Supplemental Fig. S1). A clear N100m response peaks at ~100 ms in all conditions, without latency jitter, but with amplitude reduction during lipreading and covert speech production. N100m was suppressed at all tested sound frequencies (Fig. 3).
During the lipreading task and covert speech production task, the sources of auditory responses to task-irrelevant tones were on average 20–25% (6–7 nAm) weaker than in the still-face condition across all frequencies (frequencies pooled together). The sources overall were 40% (8–9 nAm) stronger in the right hemisphere than in the left, and 18–120% (5–20 nAm) stronger for the 1000-Hz tone than for other frequencies used, resulting in an inverted V-shape curve for source strengths as a function of frequency (Fig. 4). Table 1 summarizes the results of Kruskal-Wallis statistical tests for the main effects of task condition, frequency and hemisphere. Paired Mann-Whitney U tests showed no significant differences between the expanding-rings and still-face conditions, but confirmed significant differences between expanding-rings vs. lipreading and still-face vs. lipreading (see Table 1). Further, for the four-condition subset, differences were statistically significant between expanding-rings vs. covert-speech and still-face vs. covert-speech, but not between lipreading and covert-speech condition.
The suppression of N100m in the lipreading and covert-speech conditions became even clearer when the source strengths were computed with respect to the passive still-face condition (Fig. 4, bottom). During both lipreading and covert self-production, N100m was on average suppressed by 7 nAm (range 3–11 nAm), approximately similarly at all frequencies (see Table 2).
In a subsequent analysis, only non-overlapping time ranges (−200 to 100 ms; 100–400 ms; 400–700 ms; 700–1000 ms; 1000–1300 ms) were selected. Figure 5 shows that following the mouth opening gesture in vowel clips, at 700–1000 ms when the mouth was still open, the strength of the auditory response transiently decreased by 10% (~2 nAm). In other words, the general suppression effect observed during lipreading transiently increased. A paired Mann-Whitney U test showed a significant difference between time windows 400–700 ms vs. 700–1000 ms (p = 0.011) and a nearly significant difference between 700–1000 ms vs. 1000–1300 ms (p = 0.059), as depicted in Figure 5.
Latency differences were found depending on tone frequency: at low-frequency (125–500 Hz) tones and at the highest 8-kHz tone, N100m peaked 5–25 ms later, forming a U-shaped curve as a function of frequency (Fig. 6). At middle frequencies (1000–4000 Hz), N100m peaked at 93–97 ms. No consistent differences were found between the task conditions. The frequency dependency of latency was statistically significant (p < 0.001; p_4cond < 0.001). Further, N100m peaked on average 5 ms later in the left than right hemisphere across all conditions (p = 0.003), but this effect failed to reach significance in the subset of subjects tested on all four conditions (average latency prolongation for the left hemisphere 1.7 ms, p_4cond = 0.34).
Figure 7 shows that following the mouth opening in the video clip, N100m peak latencies were delayed by ~2 ms at the same 700–1000 ms range where the sources were significantly weaker (see Fig. 5). The Mann-Whitney U test showed a significant difference between the 400–700 ms vs. 700–1000 ms windows (p = 0.029).
The N100m source locations did not differ significantly across conditions (p > 0.95; Kruskal-Wallis test; see Supplemental Figure S4, for mean coordinates, see Supplemental Table S1). The mean locations of the dipoles corresponded to nonprimary auditory areas (Brodmann area 42) in the supratemporal plane. The dipole coordinates showed a trend to higher inter-subject variation in the left than in the right hemisphere but this effect did not reach statistical significance (Levene’s test).
The subjects detected targets similarly during the lipreading (mean hit rate (HR) ± SEM 83.7 ± 2.8%) and expanding-rings tasks (86.1 ± 3.0%), with no significant difference in performance. The subjects responded ~230 ms slower during the lipreading task (1293 ± 26 ms vs. 1066 ± 25 ms for expanding-rings; p < 0.001), when the reaction time was calculated from the video clip start. During the lipreading task, the opening of the mouth on the screen allowed vowels to be identified 40–80 ms (1–2 frames) later than when the ring deformation direction could be judged during the expanding-rings condition (see Methods). The reaction time (RT) difference, corrected for visual motion onset, was ~170 ms (879 ± 26 ms vs. 706 ± 25 ms; p < 0.001). The discrimination index d’ showed a statistically significant (p = 0.046) difference between the lipreading (d’ = 3.70 ± 0.17) and expanding-rings (d’ = 4.23 ± 0.17) task conditions. Because the d’ difference between task conditions was significant, we tested whether any behavioral measure (RT, HR, or d’) correlated with the N100m amplitude suppression, but no significant correlations emerged (highest correlation for d’ difference vs. amplitude suppression at 1000 Hz; p = 0.35, Spearman’s ρ = 0.226, explaining ~5% of the variance). In tests for possible effects of task difficulty on N100m suppression between the lipreading vs. expanding-rings condition, the results remained significant when the behavioral measures (RT, HR, d’) were entered as covariates in a separate ANCOVA test.
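For readers unfamiliar with the discrimination index, the following Python sketch shows the standard signal-detection definition, d’ = z(hit rate) − z(false-alarm rate). The text does not state how d’ was computed or report false-alarm rates, so both the formula's use here and the example rates are assumptions. The last lines reproduce the variance-explained figure from the reported Spearman correlation.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Standard signal-detection d': z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical rates, for illustration only (not data from the study).
example_d = d_prime(0.84, 0.01)

# Variance explained by the reported rank correlation (rho = 0.226).
rho = 0.226
variance_explained = rho ** 2   # about 0.05, i.e. ~5%
```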
In the present study, we observed that the auditory cortical neuromagnetic N100m response was robustly suppressed during lipreading as compared with a visual control task. The suppression was more prominent in the left than the right hemisphere of our right-handed subjects, and it involved all tested sound frequencies that ranged from 125 to 8000 Hz. Because the N100m response arises from the supratemporal plane, lipreading modulated auditory processing at a relatively low cortical level. Notably, the transient N100m suppression effect was time-locked to the mouth-opening gesture in the video clip (Fig. 5), and the N100m peak latencies were also prolonged 300–600 ms after mouth-opening (Fig. 7), implying a cross-modal inhibitory effect that is partially time-locked to the phase of articulation. Despite differences in species, stimuli, and methodology, the present results resemble those documented in nonhuman primates (Ghazanfar et al., 2005), where the multisensory integration effect (enhancement vs. suppression) seen in local field potentials (LFPs) depended on the voice-onset times relative to the visual stimulus. Ghazanfar et al. (2005) found response enhancement more likely with short voice-onset times and response suppression more likely with longer voice-onset times.
Convergence of multisensory information in early auditory cortices has important functional consequences (reviewed by Schroeder et al., 2003), as it can integrate information from different levels of cortical processing and enhance behavioral performance, for instance detection of speech in noise. What has remained obscure is the origin of the top-down inputs that cause N100/N100m suppression during lipreading. At least three possibilities exist: 1) Visual information is relayed to auditory cortex from the visual system, including the multisensory posterior superior temporal sulcus (Schroeder and Foxe, 2002; Cappe and Barone, 2005; Kayser and Logothetis, 2009); 2) the suppression effects during lipreading are due to an efference copy from the speech-production system (Sams et al., 2005; Skipper et al., 2007); 3) visual information is relayed via subcortical routes, e.g. via medial pulvinar or non-specific thalamic inputs, such as the medial interlaminar nuclei (Cappe et al., 2009a; for a review, see Sherman and Guillery, 2002; Hackett et al., 2007; Cappe et al., 2009b).
Previous human studies have documented suppressant effects of both overt and covert speech production on the N100m amplitude (Numminen and Curio, 1999; Curio et al., 2000; Houde et al., 2002), but they have not examined the relationship between lipreading and covert speech production. In the present study, N100m was similarly suppressed when the subjects were lipreading silent vowel articulations and when the subjects covertly self-produced the same vowels (Fig. 4). Thus, these results agree with the view that the suppression of auditory cortex is caused by an efference copy (Paus et al., 1996; Curio et al., 2000; Houde et al., 2002; Martikainen et al., 2005; Heinks-Maldonado et al., 2005, 2006; Christoffels et al., 2007) from the speech-production system. Tentatively, such an efference copy could arise during lipreading when the observers do not speak themselves but their inferior frontal gyrus is activated through “mirroring” of the other person’s actions (Rizzolatti and Arbib, 1998; Nishitani and Hari, 2002; Rizzolatti and Craighero, 2004). In this specific case, the efference copy could also increase the signal-to-noise ratio of auditory processing through modification of auditory cortex response patterns (Heinks-Maldonado et al., 2005, 2006) during both monitoring of own speech production and lipreading. It is important to realize that suppression of mass-action level responses such as the evoked responses recorded in the present study might reflect more selective and efficient responses from neurons with sparse population coding (Wang et al., 2005; Hromádka et al., 2008; Otazu et al., 2009).
In the present study, we estimated the source locations of the auditory cortical N100m responses using a fixed two-dipole model. The locations were in line with previous studies showing N100m generation in the posterior supratemporal plane (for a review, see Hari, 1990). Obviously, a fixed two-dipole model is an oversimplification, as the neuromagnetic N100m is generated by multiple, both temporally and spatially overlapping, distributed sources (Sams et al., 1993). Further, the applied identical source location for tones of different frequencies simplifies the underlying functional organization of the auditory cortex where multiple tonotopic fields are known to exist (Kaas and Hackett, 2000; Rauschecker et al., 1995; Pantev et al., 1995; Lütkenhöner et al., 2003; Talavage et al., 2004). Since N100m reflects an auditory processing stage that occurs after brainstem and middle-latency cortical responses, we cannot even exclude the possibility of contributions from lower levels of the auditory pathway (Papanicolaou et al., 1986; Musacchia et al., 2006).
The effects observed here for the neuromagnetic N100m response were for incongruent stimuli, as the lipreading task had no relevance for the asynchronously presented tones with different frequencies. This finding contradicts previous findings where audiovisual interactions were observed only when visual motion preceded the sound presentation (Stekelenburg and Vroomen, 2007). The suppressions of auditory-evoked N100m responses were probably not caused by visual attention or visual motion (Hertrich et al., 2009) alone, as shown by the lack of suppression during the control task with expanding rings compared with the silent lipreading task. Further, a concurrent visual task has previously been shown to have no effect on N100 amplitude to auditory stimuli, but the effect has been restricted to visually-evoked responses (Woods et al., 1992) or enhancements of auditory responses at latencies over 200 ms (Busse et al., 2005). Interestingly, a memory task using visual stimuli actually increases the N100m amplitude to task-irrelevant tones (Valtonen et al., 2003). Together, these results suggest that visual attention itself or an increase in visual attentional demand should actually enhance rather than suppress tone-evoked responses.
Our analysis on temporal asynchrony between visual articulations and tones showed that the auditory cortex was transiently modulated by the dynamic mouth opening gesture, in addition to an ongoing suppression by the lipreading task. The time scale of this effect was in line with the relatively long temporal window of integration of several hundred milliseconds for visual speech input (e.g., Massaro et al., 1996; van Wassenhove et al., 2007). Importantly, this suppression effect could in part explain discrepancies between previous studies showing either auditory response enhancement (Giard and Peronnet, 1999; Hertrich et al., 2007), suppression (Klucharev et al., 2003; Besle et al., 2004; Jääskeläinen et al., 2004; van Wassenhove et al., 2005), or no effect (Miki et al., 2004) during audiovisual stimulation compared with auditory-only responses. Some of the recent studies in humans have addressed this issue by presenting audiovisual stimuli with variable asynchrony in addition to simultaneous presentation (Stekelenburg and Vroomen, 2005; van Atteveldt et al., 2007). Our findings support the notion that when analyzing the audiovisual integration effects elicited by both congruent and incongruent stimuli the synchrony of the auditory and visual stimuli should be carefully controlled. Theoretically, the timing-dependent suppression effect could also be due to reduced synchrony of neuronal signaling underlying the N100m response, in line with recent results in monkeys demonstrating resetting of the phase of ongoing auditory cortex activity by somatosensory input (Lakatos et al., 2007; Schroeder et al., 2008). However, the present study was not designed to effectively address this hypothesis.
In conclusion, the observed transient modulation after mouth opening, together with the similarity of the suppressant effects caused by covert speech and lipreading, suggests that an efference copy signal from the speech-production system underlies the N100m response suppression during lipreading.
This study was financially supported by the Academy of Finland (National Programme for Centers of Excellence 2006–2011, grant nos 213464, 213470, 213938, FiDiPro program), the Finnish Graduate School of Neuroscience, the Emil Aaltonen Foundation, the U.S. National Science Foundation (BCS-0519127) and the U.S. National Institutes of Health (R01 NS052494).