Seeing the image of a newscaster on a television set causes us to think that the sound coming from the loudspeaker is actually coming from the screen. How images capture sounds is mysterious because the brain uses different methods for determining the locations of visual vs. auditory stimuli. The retina senses the locations of visual objects with respect to the eyes, whereas differences in sound characteristics across the ears indicate the locations of sound sources referenced to the head. Here, we tested which reference frame (RF) is used when vision recalibrates perceived sound locations.
Visually guided biases in sound localization were induced in seven humans and two monkeys who made eye movements to auditory or audio-visual stimuli. On audio-visual (training) trials, the visual component of the targets was displaced laterally by ~5°. Interleaved auditory-only (probe) trials served to evaluate the effect of experience with mismatched visual stimuli on auditory localization. We found that the displaced visual stimuli induced a ventriloquism aftereffect in both humans (~50% of the displacement size) and monkeys (~25%), but only for locations around the trained spatial region, showing that audio-visual recalibration can be spatially specific.
We tested the reference frame in which the recalibration occurs. On probe trials, we varied eye position relative to the head to dissociate head- from eye-centered RFs. Results indicate that both humans and monkeys use a mixture of the two RFs, suggesting that the neural mechanisms involved in ventriloquism occur in brain region(s) employing a hybrid RF for encoding spatial information.
The ventriloquism effect involves the perception that a sound arises from the location of a visual stimulus, even when the two cues are actually in different places (Jack and Thurlow, 1973; Alais and Burr, 2004). In the ventriloquism aftereffect, repeated pairings of spatially mismatched visual and auditory stimuli produce a shift in perceived sound location that persists when the sound is presented alone (Canon, 1970; Recanzone, 1998; Woods and Recanzone, 2004). These effects pose a computational puzzle because the brain uses different methods for localizing visual and auditory stimuli: the retina provides a map of the visual scene with respect to the eyes, whereas differences in sound loudness and arrival time across the two ears indicate the locations of sounds with respect to the head (Brainard and Knudsen, 1995; Razavi et al., 2007). Here, we tested which of these two reference frames (RFs) is used by the brain when visual stimuli recalibrate perceived sound locations.
Persistent visually driven biases in perceived sound location were induced in seven humans and two monkeys. Analogous experimental procedures were used in order to assess the similarity of audio-visual recalibration across species. Such comparisons are important for determining whether physiological studies in non-human primates can provide insight into multisensory spatial processing in humans. Subjects made eye movements to audio-visual or auditory-only stimuli (Knudsen and Knudsen, 1985). On audio-visual (training) trials, the visual component of the stimuli was displaced laterally. Interleaved auditory-only (probe) trials served to evaluate how exposure to mismatched audio-visual stimuli affected sound localization.
First, we tested whether training in a sub-region of audio-visual space causes local, but not global changes in localization. We used one initial eye fixation position on training trials and presented the discrepant audio-visual stimuli from a restricted spatial range (upper panel of Fig. 1A). Because the aftereffect was spatially specific, we could test the reference frame of the recalibration by shifting fixation on probe trials. Specifically, on interleaved auditory-only probe trials, we varied initial eye position with respect to the head (which was fixed) and presented sounds from locations spanning both the same head-centered locations and the same eye-centered locations as on the training trials (lower panel of Fig. 1A).
If visually induced spatial plasticity occurs in a brain area using a head-centered RF, then shifts in perceived sound location should occur only for sounds at the same head-centered locations (in Fig. 1B, solid blue line matches the red line). Conversely, if plasticity occurs in an eye-centered RF, then visually induced shifts should occur only for sounds at the same eye-centered locations (dotted blue line is shifted to the left of the red line). A third possibility is that the neural mechanism involves an intermediate mixture of both RFs (a “hybrid” frame). The predicted outcomes for head- and eye-centered RFs are displayed in the lower panel of Fig. 1B, which summarizes the potential effect as the difference between the induced bias on trials involving the training fixation and the induced bias on trials involving the non-training fixation point.
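The two predictions can be illustrated with a small sketch. The bell-shaped bias profile and its peak, width, and center are illustrative assumptions, not values from the study; only the fixation azimuths (±11.8°, the human setup, with the training fixation on the right) come from the text. An `eye_weight` of 0 gives the head-centered prediction, 1 gives the eye-centered prediction, and intermediate values correspond to a hybrid frame.

```python
import math

# Hypothetical profile parameters (assumptions for illustration only).
PEAK_DEG = 2.5     # peak aftereffect at the center of the trained region
WIDTH_DEG = 10.0   # spatial spread of the aftereffect
CENTER_DEG = 0.0   # head-centered center of the trained region


def trained_bias(azimuth_deg):
    """Bias at a head-centered azimuth with the eyes at the training fixation."""
    z = (azimuth_deg - CENTER_DEG) / WIDTH_DEG
    return PEAK_DEG * math.exp(-0.5 * z * z)


def predicted_bias(azimuth_deg, fixation_deg, train_fixation_deg, eye_weight):
    """Predicted bias when the eyes start at fixation_deg.

    The trained profile is displaced by eye_weight times the change in
    fixation: 0 keeps it fixed to the head, 1 moves it fully with the eyes.
    """
    shift = eye_weight * (fixation_deg - train_fixation_deg)
    return trained_bias(azimuth_deg - shift)


TRAIN_FP, OTHER_FP = 11.8, -11.8  # fixation azimuths in degrees (humans)
for az in (-22.5, -15.0, -7.5, 0.0, 7.5):
    head = predicted_bias(az, OTHER_FP, TRAIN_FP, 0.0)
    eye = predicted_bias(az, OTHER_FP, TRAIN_FP, 1.0)
    print(f"{az:6.1f}  head-centered={head:.2f}  eye-centered={eye:.2f}")
```

Under the eye-centered setting, the peak of the profile moves leftward by the full separation between the two fixation points; under the head-centered setting, it does not move.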
Subjects made eye movements from a visual fixation point to a broadband noise delivered from loudspeakers in darkness. On training trials (upper panel in Fig. 1A), visual stimuli were presented simultaneously with the sounds, using light-emitting diodes displaced from the locations of the speakers. On randomly interleaved probe trials (lower panel in Fig. 1A), only the auditory stimuli were presented (50% of all trials).
Seven human subjects (4F, 3M) and two adult male rhesus monkeys participated. The human and animal experimental protocols were approved by the institutional review committees at Boston University and Duke University, respectively.
Subjects were seated in a quiet darkened room in front of an array of speakers and LEDs (Fig. 1). To keep the head-centered RF fixed, the subjects’ heads were restrained (humans: chin rest; monkeys: implanted head post). Subjects’ behavior was monitored and responses were collected by an infrared eye-tracker (humans) or implanted scleral eye coil (monkeys). The eye-tracking system was calibrated using visually-guided saccades to selected target locations at the beginning of each session.
Sounds were broadband noises with 10 ms on/off ramps [humans: 100 ms, 0.2 – 6 kHz, 70 dB SPL(A), monkeys: ~500–1000 ms, 0.5 – 18 kHz, 50 dB SPL(A)] presented from speakers mounted on the horizontal plane approx. 1.2 m (humans) or 1.45 m (monkeys) from the center of the listener’s head. Spacing between speakers was 7.5° (humans) or 6° (monkeys). The LEDs for the AV stimuli were mounted so that they were either horizontally aligned with the speakers or displaced (either to the left or to the right) by 5° (humans) or 6° (monkeys). They were turned on and off in synchrony with the corresponding speakers. Two additional LEDs 10° (humans) or 8° (monkeys) below the speaker array served as fixation locations (azimuths of +/−11.8° in humans, +/−8° in monkeys).
Trials began with the onset of one of the two fixation LEDs. After subjects fixated the LED for 150 ms (humans) or 500 ms (monkeys), the fixation LED was turned off and the AV or A-only stimulus was presented. The subjects performed a saccade to the perceived location of the stimulus (humans were instructed to look to the location of the auditory component of the stimulus; monkeys were rewarded for a saccade that ended within a 16°-wide rectangular window centered on the auditory component and covering the visual component on the AV trials). Training (AV) and probe (A-only) trials were randomly interleaved at a ratio of 1:1 (in the monkeys, 12.5% of the total trials were AV-aligned and presented from the ±30° locations, just outside the range of the A-only test trials, to keep the monkeys aware of the possible stimulus range and to reinforce spatial specificity of the induced aftereffect). Trials were run in blocks with a consistent AV pairing (leftward, rightward, or no shift). For the monkeys, multiple blocks were conducted per session, with shifts in a particular direction for that session interleaved with no-shift blocks. For the humans, each session contained only one block and the order of blocks was random across the subjects. Each monkey performed a total of 128–160 blocks of about 600 trials each. Each human performed 12 sessions of about 720 trials each.
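As a rough illustration of the monkey reward criterion, the sketch below checks only the horizontal extent of the 16°-wide window. The function name, the inclusive window edge, and the omission of the vertical dimension are assumptions; the text does not give the window height.

```python
WINDOW_WIDTH_DEG = 16.0  # horizontal window width stated in the text


def in_reward_window(saccade_end_deg, auditory_deg, width_deg=WINDOW_WIDTH_DEG):
    """Return True if the saccade endpoint lies within the window
    centered on the auditory component (horizontal check only)."""
    return abs(saccade_end_deg - auditory_deg) <= width_deg / 2.0


# A visual component displaced by 6 deg (monkey setup) still falls inside
# the window, so a saccade to either component would be rewarded.
print(in_reward_window(6.0, 0.0))   # endpoint on the displaced LED
print(in_reward_window(8.5, 0.0))   # endpoint just outside the window
```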
Data from the first quarter of each block were excluded from the analysis to remove any rapid auditory localization adjustments observed at the start of each block. Basic analysis of the temporal profile of the aftereffect is provided in supplementary Figure S1.
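The exclusion step can be sketched as follows, assuming trials within a block are stored in presentation order and that the quarter is rounded down to a whole number of trials (the rounding rule is an assumption, as is the function name).

```python
def drop_first_quarter(block_trials):
    """Return the trials of one block with the first quarter removed."""
    n_skip = len(block_trials) // 4  # rounding down is an assumption
    return block_trials[n_skip:]


# Approximate block sizes from the text: ~600 trials (monkeys), ~720 (humans).
print(len(drop_first_quarter(list(range(600)))))
print(len(drop_first_quarter(list(range(720)))))
```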
One noticeable difference between the humans and the monkeys was that the monkey responses to the peripheral targets were centrally biased (by 2 to 6°; Figs. 2B and 2C) while no such bias was observed in the humans (Fig. 2A). Both the relatively large response bias and the larger response variability (Suppl. Fig. S2) of the monkey results compared to the human results may help explain the weaker aftereffect in the monkey data. Previous reports involving auditory saccades in monkeys have suggested that monkeys sometimes make two saccades to reach an auditory target (Jay and Sparks, 1990), and this would appear in our results as a central bias for targets peripheral to the fixation positions. In our study, monkeys sometimes but not always made more than one saccade toward the target. Therefore, for the monkeys, the first saccade (or the second saccade if the delay between the first and second saccades was less than 300 ms) was considered the response. Because this 300 ms criterion was conservative, many second saccades were rejected, resulting in an overall central bias in the monkey responses.
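The response-selection rule for the monkeys can be written as a short sketch. The `Saccade` record and the choice to measure the delay from the end of the first saccade to the start of the second are assumptions; the text specifies only the 300 ms limit.

```python
from typing import NamedTuple, Optional, Sequence


class Saccade(NamedTuple):
    """Hypothetical record of one detected saccade."""
    start_ms: float
    end_ms: float
    endpoint_deg: float


def response_endpoint(saccades: Sequence[Saccade],
                      max_delay_ms: float = 300.0) -> Optional[float]:
    """Endpoint of the second saccade if it follows the first within
    max_delay_ms, otherwise the endpoint of the first saccade."""
    if not saccades:
        return None
    first = saccades[0]
    if len(saccades) > 1:
        second = saccades[1]
        # Delay measured from the end of the first saccade (an assumption).
        if second.start_ms - first.end_ms < max_delay_ms:
            return second.endpoint_deg
    return first.endpoint_deg


quick = [Saccade(200, 240, 14.0), Saccade(400, 430, 20.0)]  # 160 ms gap
slow = [Saccade(200, 240, 14.0), Saccade(600, 630, 20.0)]   # 360 ms gap
print(response_endpoint(quick), response_endpoint(slow))
```

In the second case the corrective saccade is rejected, so the recorded response falls short of the target, which is how the criterion can produce a central bias.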
Each experimental session started with a control block on which auditory-only (humans) or auditory-only and visual-only (monkeys) stimuli were presented in random order from different target locations. Performance on these control trials provided baseline data on the performance of both the monkeys and the humans on the auditory localization task (Supplementary Figure S2). The average standard deviation of the A-only responses was 3.0° for the humans and 4.3° and 5.0°, respectively, for monkeys F and W.
An almost-complete ventriloquism effect was observed in the AV training trials in both the humans and the monkeys. A connected triplet of green symbols at the top of each panel of Fig. 2 represents the responses to the AV training stimuli with a single target speaker and the three different visual adaptor locations (the actual target speaker location is not explicitly shown in the figure but can be easily determined by finding, for each circle, the nearest tick mark along the x-axis). For clarity, the symbols are offset vertically, so that the visually-induced shift appears as a tilt in the triplet of symbols for each target location. In both species and all conditions, the triangles are displaced towards the visual adaptor, with the magnitude of the displacement at least 80% of the imposed offset of the visual adaptor relative to the auditory stimulus.
Experience with spatially mismatched AV stimuli caused both humans and monkeys to mis-localize sounds in the direction of the previously presented visual stimuli. The red and blue symbols in Fig. 2 show responses to A-only targets starting at the training and non-training FP, respectively. As for the AV responses, the responses to the same A-only targets form triplets in which the triangles are vertically displaced from the corresponding circles for clarity. In the training region and with eyes at the training fixation, the effect of interleaved, mismatched AV stimuli was to shift the saccade endpoints to auditory-only stimuli by up to 2.7° (or 54% of the AV displacement) in humans and 1.4° (or 23%) in monkeys. Graphically, this can be seen by comparing the horizontal positions of the red triangles to the corresponding red circles in the gray training regions in Fig. 2 (also see Fig. 3, discussed below).
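The percentages above follow from dividing the shift in saccade endpoints by the audio-visual displacement (5° for humans, 6° for monkeys). A minimal check of that arithmetic, with a hypothetical helper name:

```python
def aftereffect_percent(shift_deg, displacement_deg):
    """Aftereffect expressed as a percentage of the AV displacement."""
    return 100.0 * shift_deg / displacement_deg


print(round(aftereffect_percent(2.7, 5.0)))  # humans: 2.7 deg shift, 5 deg displacement
print(round(aftereffect_percent(1.4, 6.0)))  # monkeys: 1.4 deg shift, 6 deg displacement
```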
The first question we asked was whether the ventriloquism aftereffect generalized to locations that were not presented in training trials. We found that in both species, the interleaved ventriloquism trials affected localization judgments adjacent to the trained region, but only modestly so. For example, in the humans, one or more of the target locations directly adjacent to the set of targets in the training region also showed a shift in saccade endpoints, but the effect diminished with increasing distance from the trained region. Graphically, the red triplets are more vertically aligned the farther they are from the gray area in Fig. 2. In other words, training with mismatched auditory-visual spatial cues affected localization judgments locally, rather than globally. This spatial specificity was similar in both humans and monkeys, despite the fact that, unlike the monkeys, the humans received no “reinforcing” trials with coincident AV stimuli from locations at the edges of the test range. This consistent spatial specificity enabled us to explore the spatial frame in which auditory-visual stimuli are aligned.
These results were confirmed by performing two separate 3-way repeated measures ANOVAs (one on the human and one on the monkey A-only data), with the factors of target speaker location (9 levels), fixation point of the trials (Training vs. Non-training FP), and the direction of induced shift (Left vs. Right). The results of this analysis, summarized in Table 1, show that the main effect of location was always significant, confirming that the ventriloquism aftereffect is spatially specific and does not automatically generalize to the whole audio-visual field. The Location × FP interaction was also significant in both species, confirming that the reference frame of visual-auditory recalibration is not purely head-centered. However, visual inspection of the data shown in Fig. 2 shows that the reference frame is not purely eye-centered either. Specifically, if ventriloquism arose in this reference frame, it would produce a much stronger aftereffect at the three left-most locations in the non-training fixation data (blue symbols), as predicted by the blue dotted line in Fig. 1B; however, this was not observed.
To analyze the effect that moving the eyes from the training to the non-training initial fixation position has on the reference frame of recalibration, the upper panels of Fig. 3 show the magnitude of the aftereffect after collapsing the data across the two directions of the induced shift (note that no main effect or interaction involving the direction factor was significant in the ANOVA; Table 1). In both species, the peaks of the induced shift became smaller and moved leftwards when the initial fixation moved from the right, training FP to the left, non-training FP (compare the red and blue traces in the upper panels of Fig. 3). Thus, the observed results are inconsistent with visual-auditory recalibration occurring in solely auditory, and head-centered, brain regions. On the other hand, the leftward shift of the blue vs. the red traces was never as large as the angular distance between the two fixation points, as would be expected if the representation was purely eye-centered.
To compare the current results more directly to the predictions of the two models, a difference between the shift magnitudes from the two FPs was computed (black traces in Fig. 3) and compared to predictions based on the two models (orange traces in Fig. 3). Again, the results fall between the predictions of the two models, suggesting that both the head- and eye-centered signals contribute to visual calibration of auditory space, resulting in a mixed reference frame representation.
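The difference measure can be sketched as a per-location subtraction. The shift values below are made-up numbers for nine target locations, used only to show the computation; they are not data from the study, and the function name is hypothetical.

```python
def fp_difference(shift_train_fp, shift_other_fp):
    """Per-location difference: shift at the training FP minus shift at
    the non-training FP."""
    if len(shift_train_fp) != len(shift_other_fp):
        raise ValueError("both lists must cover the same target locations")
    return [a - b for a, b in zip(shift_train_fp, shift_other_fp)]


# Made-up shift magnitudes (deg) at nine target locations, left to right.
train = [0.1, 0.3, 1.0, 2.4, 2.5, 2.3, 0.9, 0.2, 0.0]
other = [0.2, 0.6, 1.2, 1.9, 1.8, 1.5, 0.5, 0.1, 0.0]

observed = fp_difference(train, other)
# A purely head-centered frame predicts identical shifts at both fixation
# points, so its predicted difference is zero at every location.
head_centered_prediction = fp_difference(train, train)
print([round(d, 2) for d in observed])
```

A purely eye-centered frame would instead predict a difference equal to the trained profile minus a copy of it displaced by the full separation between the fixation points; a result between zero and that prediction indicates a mixed frame.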
Since the monkey ANOVA in Table 1 only had two subjects, two additional one-way ANOVAs were performed, one for each monkey, on the difference data shown by triangles in Fig. 3B. In these analyses, the only factor was the target location and the data from each block were treated as a repeat. Again, these ANOVAs found a significant effect in both monkeys.
Here we show that when humans or monkeys repeatedly perform saccades to an auditory target presented simultaneously with a spatially displaced visual adaptor, a short-term adaptation takes place. This adaptation causes auditory location judgments to be biased towards the visual adaptor location, even on interleaved trials on which no visual adaptor is present. Specifically, saccades to auditory-only targets presented in an approximately 20°-wide horizontal sub-region of space centered on the locations trained with interleaved audiovisual targets were shifted by up to 1.5° (monkeys) or 2.5° (humans). The similarity in these across-species results, despite minor methodological differences (such as differences in the duration of the auditory stimuli, which were up to 1000 ms for the monkeys and only 100 ms for the humans; the presence of “reinforcing” trials at the test range edges for monkeys, but not humans; etc.), suggests that the mechanisms underlying audiovisual spatial calibration are similar in monkeys and humans. This in turn suggests that physiological studies of audiovisual spatial integration in monkeys can provide insight into human perception and behavior.
Overall, the size of the aftereffect, corresponding to 25 to 50% of the audiovisual adaptor displacement, is consistent with previous human studies involving either head pointing responses (Recanzone, 1998) or identifying target locations via a categorical button-press response (Bertelson et al., 2006). These past studies report adaptation from 30% (Bertelson et al., 2006) to 85% of the induced visual-auditory discrepancy (Recanzone, 1998). The similarity in these results is striking, given methodological differences. For instance, the current study and these past studies each differed in the training region width and in the spatial sampling of the training region (current study: 3 locations in a 20° region, Bertelson et al.: 5 locations in a 100° region, Recanzone: 15 locations in a 60° region). Comparison of these results suggests that while it is possible to induce a local ventriloquism aftereffect (as in the current study), the magnitude of the effect is weaker than when a large region of audiovisual space is trained using a dense spatial sampling of training locations (as in Recanzone, 1998).
Saccade shifts were observed for sounds originating near to, not just within, the training region. The shift magnitudes gradually decreased with increasing separation from the training region, showing that the ventriloquism aftereffect can cause a spatially specific recalibration. For our monkeys, the presence of the “reinforcing” AV trials from the edges of the test region (on 12.5% of the trials) may have contributed to the spatial specificity we found. However, we observed similar specificity in the humans, who were not presented with reinforcing trials. This, to our knowledge, is the first report showing that the ventriloquism aftereffect is spatially specific and that it can be induced in a sub-region of the audio-visual space. In contrast, a previous study in which bias was induced by compressing vision (Zwiers et al., 2003) found that shifts generalized to locations outside the adapted region without any decrease in the shift magnitude outside the trained region. It may be that spatial specificity is more likely to occur when audio-visual targets are only presented in a narrow range of space, such as we used, compared to the larger range used by Zwiers et al. (2003). Another possible explanation of this difference is that the plasticity induced in the current study was short-term (on the order of minutes to hours), not days (Zwiers et al., 2003); it may even be that different neural structures are affected by short- versus long-term training (Shinn-Cunningham, 2001).
In both species, the direction of eye gaze (i.e., the FP location) influenced the pattern of induced biases on the probe trials. It was not possible to align the eye-dependent bias patterns in either a head-centered or an eye-centered RF. Thus, the results support the interpretation that visually guided spatial adaptation occurred in a mixed RF. This kind of mixed representation can arise in various ways (e.g., multiple structures might undergo adaptation, each using a different frame; the adapted neural structure might receive both head-centered and eye-centered signals; etc.). Moreover, if multiple structures undergo adaptation, the character of the representation may change over time. Consistent with this idea, supplementary Fig. S1 shows that the aftereffect arises in a predominantly head-centered representation early during training and a more mixed representation as training progresses.
Plasticity underlying this adaptation could in principle occur in the auditory pathway, association areas, the oculomotor pathway, or some combination of the above. The number of potential sites encompassed in this list is large, but information on the multimodal properties and reference frame is only available for a limited subset of the list. Some form of hybrid representation or mixed auditory and visual signals has been reported in several areas of the auditory pathway, the posterior parietal cortex, and two areas responsible for planning saccades. Specifically, signals relevant to the integration of visual and auditory space such as overt visual responses (Porter et al., 2007) as well as eye-position dependent modulations of auditory responses (Groh et al., 2001; Zwiers et al., 2004; Porter and Groh, 2006) have been found in the inferior colliculus (IC) in the primate. Visual and eye position signals have also been reported in auditory cortex (Werner-Reiss et al., 2003; Fu et al., 2004; Brosch et al., 2005; Ghazanfar et al., 2005). In both the IC and A1, eye position modulation of neural response was sufficient to cause the representation to be classified as a hybrid of head- and eye-centered information, in conflict with classical views that auditory information is generally encoded in a head-centered reference frame. However, there are no studies of the reference frame of earlier areas on the auditory pathway, so it is not known how early auditory signals might be transformed from the native head-centered frame of auditory spatial cues (interaural time and level differences) into a reference frame more appropriate for integration with visual information.
In the parietal cortex, both visual (e.g., Andersen and Mountcastle, 1983) and auditory (Stricanne et al., 1996; Schlack et al., 2004; Cohen et al., 2005) space are represented. However, that representation is also a hybrid representation that reflects a mixture of head- and eye-centered information (Stricanne et al., 1996; Duhamel et al., 1997; Mullette-Gillman et al., 2005, 2009). The superior colliculus (SC) and frontal eye fields also contain both visual and auditory signals (Meredith and Stein, 1983; Wallace et al., 1993; Wallace and Stein, 1994), and are thought to be essential for the generation of all saccades (Schiller et al., 1979). Jay and Sparks (1987a; 1987b) examined visual and auditory sensory activity in the SC and reported that visual signals were eye-centered and auditory signals showed partially-shifting receptive fields (a type of hybrid reference frame). It is not known whether the auditory saccade-related activity (as opposed to sensory activity) is also hybrid or whether it is eye-centered, as is the case for visual saccades. However, the ventriloquism aftereffect is not likely to consist purely of saccade adaptation (e.g., as has been studied by Desmurget et al., 1998, or Hopp and Fuchs, 2004) because 1) previous studies have observed the aftereffect in paradigms that did not involve saccades (Recanzone, 1998; Zwiers et al., 2003; Bertelson et al., 2006) and 2) the main effect of a purely oculomotor adaptation would likely depend on whether the induced shift results in longer or shorter saccades (Hopp and Fuchs, 2004), which we did not observe.
Overall, the current results suggest that in both human and monkey, auditory-visual spatial recalibration occurs in a hybrid reference frame, after auditory spatial information has been partially transformed from a head-centered representation. Additional behavioral and neurophysiological studies (e.g., looking at the temporal profile of the ventriloquism aftereffect) are necessary to fully understand the mechanism and brain areas underlying the recalibration.
Thanks to Jessi Cruger for her help with the monkey data collection, and Nate Greene and Abigail Underhill for their help with preliminary studies. This work was supported by the following grants: Monkey experiments were supported by NEI R01EY016478, NSF 0415634, NINDS R01 NS50942 to JMG. Human experiments were supported by NIH R01 DC05778 to BGSC. NK received additional support for travel on this collaboration from NIH R03 TW007640 and KEGA 3/7300/09.