Objects and events in the real world generally comprise multiple sensory attributes, and vertebrate nervous systems have evolved to process information from different sensory modalities, and even sub-modalities, independently. This sensory information is nonetheless combined to give rise to a unified percept. How this unification is accomplished has been investigated since early psychophysical studies, yet it remains unclear. While early sensory cortical areas have traditionally been considered 'unimodal', more recent evidence clearly shows that neurons in these areas are influenced by more than one modality (e.g. see Schroeder and Foxe, 2005; Ghazanfar and Schroeder, 2006; Lakatos et al., 2007, 2008; Kayser et al., 2009), indicating that multisensory processing begins in early cortical areas. It is also unclear how, and to what extent, sub-cortical regions of the brain process multisensory information, for example the non-lemniscal auditory thalamus, the anterior dorsal thalamus, and the basal ganglia. This article reviews in some detail several psychophysical experiments in human subjects integrating two sensory modalities, as well as single cortical neuron physiology experiments in macaque monkeys that explore potential neuronal mechanisms underlying multisensory integration.
In primates, the auditory and visual systems are two sensory modalities that have distinct cortical representations and provide information about the external environment. These sensory signals are often associated with the same objects and events, and binding the two stimuli together is done naturally and effortlessly. One curious aspect of integrating these signals lies in the relative timing of the two modalities. Light travels approximately six orders of magnitude faster than sound in air, so light strikes the retina virtually instantaneously in physiological terms, whereas sound does not begin to vibrate the tympanic membrane for several milliseconds at reasonable distances (e.g. about 15 msec at 5 meters). However, physiological processing is much faster in the auditory system than in the visual system. The transduction process and subsequent processing, from the outer segments of the photoreceptors to the first action potential in a retinal ganglion cell, can require tens of milliseconds. In contrast, transduction is much faster in the cochlea, with spiral ganglion cells firing action potentials within milliseconds. While there is considerable variance in first-spike latency within both primary visual and primary auditory cortex, in general primary auditory cortical neurons have much shorter latencies (e.g. Recanzone et al., 2000a) than primary visual cortical neurons (Maunsell and Gibson, 1992). These temporal disparities must be kept in mind when considering how the cerebral cortex ultimately integrates the two signals. Of interest are the relative spike latencies in multisensory cortical fields, such as those of the parietal lobe or the superior temporal sulcus; in these areas, the latencies are nearly identical (Barraclough et al., 2005; Cohen and Andersen, 2002; Linden et al., 1999; Mazzoni et al., 1996).
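The arrival-time disparity described above is simple to estimate. Below is a minimal sketch in Python (all physical constants are approximate round figures) computing the acoustic lag relative to light for a given viewing distance:

```python
SPEED_OF_SOUND = 343.0   # m/s, dry air at ~20 C (approximate)
SPEED_OF_LIGHT = 3.0e8   # m/s (approximate)

def arrival_delay_ms(distance_m: float) -> float:
    """Acoustic arrival delay relative to light, in milliseconds.

    The light travel time (distance / 3e8 s) is negligible at everyday
    distances, so the delay is dominated by the sound path.
    """
    sound_ms = distance_m / SPEED_OF_SOUND * 1000.0
    light_ms = distance_m / SPEED_OF_LIGHT * 1000.0
    return sound_ms - light_ms

# At 5 m the sound lags the light by roughly 15 ms, as in the text.
print(round(arrival_delay_ms(5.0), 1))
```

At 5 meters this yields a lag of about 14.6 msec, consistent with the "about 15 msec" figure above.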
This similarity in timing makes it tempting to speculate that, even though multisensory interactions are observable in primary cortical areas (Schroeder et al., 2001; Schroeder and Foxe, 2002; Bizley and King, 2008; see also Schroeder et al., this volume), multisensory processing is quite different between lower- and higher-order cortical areas.
Temporal integration between these two senses has been widely studied. For example, in one study normal human subjects were asked to determine whether an auditory and a visual stimulus were presented simultaneously. The results showed a differential effect depending on which stimulus was presented first (Slutsky and Recanzone, 2001; Figure 1A). Using a 200 msec duration flash of light and either a tone or broadband noise, the ability to segregate the visual from the auditory stimulus occurred at much smaller temporal disparities when the auditory stimulus was presented before the visual stimulus. This is evidenced by the asymmetry of the function, highlighted in Figure 1A by the short vertical lines corresponding to temporal disparities of 50 msec (top), 100 msec (middle), and 150 msec (bottom). This result is consistent with natural stimuli: as noted above, when the two modalities represent the same object or event, the visual information always arrives at the sensory epithelium first. Indeed, if the auditory stimulus arrives before the visual stimulus, it cannot be from the same natural object or event, and the nervous system would do well to segregate the two signals. A second feature of natural scenes is the spatial overlap of auditory and visual stimuli that arise from the same object or event. How these temporal effects are influenced by differences in spatial location is shown in Figure 1B. In this case, the temporal disparities were limited to auditory-leading (the 'unnatural' condition), but the subjects were asked whether the auditory and visual stimuli came from the same location, regardless of whether they occurred at the same time. There was no influence of temporal disparity at large spatial separations, which are easily discriminated by normal subjects.
However, with smaller spatial separations the ability to tell if the two stimuli were at the same or different locations was degraded as the temporal disparity was increased. These studies indicate that both time and space influence the percept of multisensory objects or events, and these interactions must be accounted for when investigating integration at the single neuron level.
Illusions are one tool that can be used to probe how multisensory integration occurs. There are multiple cases in which one sensory modality will dominate the percept of a multisensory object or event. A classic example is the ventriloquism effect, where the percept of an auditory stimulus is 'captured' by the spatial location of a visual stimulus (Howard and Templeton, 1966; Welch and Warren, 1980; Jack and Thurlow, 1973; Radeau and Bertelson, 1977, 1978). Similar spatial illusions have been described between the visual and somatosensory systems (e.g. Hay et al., 1965; see Spence et al., this volume). Early studies investigated the factors that make the illusion relatively strong or weak, and three key parameters were revealed: the 'compellingness' of the stimuli, their relative timing, and their spatial disparity. For compellingness, it is important both that the ventriloquist does not appear to be talking (e.g. the lips don't move) and that the puppet's 'voice' is appropriate for its appearance. Timing is also critical, as the movement of the puppet's mouth must be consistent with what is heard. Finally, there must be a relatively small spatial disparity between the ventriloquist and the puppet. These three factors interact in complicated ways. For example, one can have a greater spatial disparity if complex and compelling stimuli are used, such as elaborate puppets and complex speech signals. In contrast, when using simple tones and small spots of light, the visual stimulus must be well within the spatial acuity of the auditory system.
These observations have led to two similar ideas of how sensory integration occurs. The first is the modality specificity hypothesis (Welch and Warren, 1980; Welch, 1999), which posits that the sensory modality with the greater acuity for the discrimination to be made will dominate the percept with respect to that discrimination. The spatial acuity of the visual system is quite high, much less than one degree depending on the task and stimulus parameters (e.g. see Cavonius and Robbins, 1973). In contrast, while it has been shown that individual humans can detect differences as small as one degree of acoustic space (e.g. Stevens and Newman, 1936), the ability to identify the precise location of a stimulus (e.g. to point to it) is much coarser (e.g. Recanzone et al., 1998) and depends on both the stimulus spectrum (Brown et al., 1980; Recanzone et al., 1998, 2000b) and the stimulus intensity (Altshuler and Comalli, 1975; Comalli and Altshuler, 1976; Recanzone and Beckerman, 2004; Su and Recanzone, 2001; Sabin et al., 2005; Miller and Recanzone, 2009). Thus, it is not surprising that the visual stimulus dominates the percept of the spatial location of the auditory stimulus. A second notion is that stimulus integration follows Bayesian probabilities (Alais and Burr, 2004; Burr and Alais, 2006; Sato et al., 2007). This is conceptually similar but has the strength that more rigorous computational methods can be applied and directly tested.
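The Bayesian account can be made concrete with the standard reliability-weighted fusion rule for two independent Gaussian cues (as in Alais and Burr, 2004). The sketch below is illustrative only; the stimulus locations and noise levels are hypothetical values, not measured ones:

```python
def fuse_estimates(s_v, sigma_v, s_a, sigma_a):
    """Reliability-weighted (maximum-likelihood) fusion of two Gaussian cues.

    Each cue is weighted by its inverse variance, so the more reliable
    modality dominates the combined location estimate.
    """
    w_v = 1.0 / sigma_v ** 2
    w_a = 1.0 / sigma_a ** 2
    s_hat = (w_v * s_v + w_a * s_a) / (w_v + w_a)
    sigma_hat = (1.0 / (w_v + w_a)) ** 0.5  # fused estimate is less noisy
    return s_hat, sigma_hat

# Sharp vision (sigma = 1 deg) vs. coarse hearing (sigma = 5 deg):
# the fused estimate sits close to the visual location ("visual capture").
s_hat, _ = fuse_estimates(s_v=0.0, sigma_v=1.0, s_a=8.0, sigma_a=5.0)
print(round(s_hat, 2))   # close to 0 deg, the visual location

# Blur the visual cue (sigma = 10 deg) and the auditory cue dominates.
s_hat2, _ = fuse_estimates(s_v=0.0, sigma_v=10.0, s_a=8.0, sigma_a=5.0)
print(round(s_hat2, 2))  # close to 8 deg, the auditory location
```

The second call illustrates the prediction discussed below: degrading the reliability of the visual cue should let the auditory stimulus capture the percept.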
The ventriloquism effect has been described in a large number of contexts and generally supports both the modality specificity and Bayesian probability hypotheses. If these conceptual ideas are true, then auditory stimuli should capture visual stimuli when the visual stimulus is less salient. For example, by making the spatial location of the visual stimulus less reliable (more blurred) than that of the auditory stimulus, the auditory percept could dominate. This is a technically challenging experiment in normal human subjects, but the available evidence suggests that this could be the case (e.g. see Burr and Alais, 2006). An alternative approach is to test individuals who show greater auditory spatial acuity than visual acuity. This has been tested directly in a brain-damaged patient who suffered from Balint's syndrome (Phan et al., 2000). This subject, RM, had bilateral parietal lobe lesions as the result of two strokes separated by many months. Although he had 20/20 vision, he was functionally blind due to an inability to see more than one object at a time. This subject was tested in a simple "same/different" task, where two stimuli were presented one second apart and the subject was asked whether they were from the same or different locations. The results from RM are shown in Figure 2A for visual stimuli (solid circles) and compared to age- and gender-matched (geriatric male) controls (open circles). It is clear that this subject had both qualitatively and quantitatively different spatial acuity. The age-matched controls never missed a trial on this task, whereas RM showed clear deficits on the left side and severe deficits on the right side. In contrast, his localization accuracy was much better for auditory stimuli (Figure 2B).
In this case, there remained a deficit relative to age-matched controls, but the ability to localize auditory stimuli on the right was better than his ability to localize visual stimuli over the same region of space.
We therefore considered whether presenting an auditory stimulus at the same location as the visual stimulus would improve his ability to localize visual stimuli. When asked to localize the visual stimuli and ignore the auditory stimuli, he showed a clear improvement on the right side (Figure 3A) in contrast to the age-matched controls, who showed no differences. In contrast, when subjects localized auditory stimuli and ignored the visual stimuli, there was a limited improvement in the age-matched controls and essentially no difference in RM (Figure 3B). Thus, subject RM showed a reverse pattern of capture as the auditory stimuli dominated the percept of the visual location and the visual stimuli had essentially no influence on auditory spatial perception.
One final finding from this study that was quite interesting was the difference between the age-matched (56–64 years, mean 59.8) and the younger (25–28 years, mean 26.3) controls. The data from the younger controls are shown in Figure 3C. As with the older controls, there was essentially no influence of auditory stimuli on their visual localization. However, there was a much greater visual capture of their auditory spatial acuity compared to the older controls, indicating that there may be age-related effects on this spatial integration. Previous studies have indicated that natural aging results in poorer sound localization (Abel et al., 2000); in this study, however, there was no statistically significant difference between the younger and older control subjects. There was a significant difference in the size of the visual capture, suggesting that this may be one of the early signs of degradation of spatial localization as a function of normal aging.
One interesting feature of the ventriloquism effect is that it can be long lasting (Radeau and Bertelson, 1977, 1978; Recanzone, 1998; Lewald, 2002). The aftereffect can be induced in a relatively short time and has been demonstrated in a number of paradigms that create a mismatch between auditory and visual stimuli. An example of this phenomenon is shown in Figure 4, using relatively simple stimuli in normal human subjects (Recanzone, 1998). In this case, the subject sits in the dark and is asked to point their head toward the location of a 200 msec duration, 750 Hz or 1500 Hz tone. The long ovals show the final head positions for five representative locations tested in a naïve subject. This individual's localization performance is representative of all subjects studied: accuracy was best for midline locations, and the estimates tended to undershoot the most peripheral locations. Immediately following this procedure, the subjects performed a 'training' task. Remaining in the dark, they listened to a series of tones, each accompanied by a simultaneously presented visual stimulus (a red LED). The subjects were instructed to ignore the visual stimuli and to press a button when the auditory stimulus was presented at a quieter intensity level. The auditory (and visual) stimulus varied in location across the region of space tested earlier. After approximately 20 minutes, the subjects repeated the head-pointing task using only the auditory stimuli as before. The manipulations in this experiment are the frequency of the auditory stimulus and the spatial disparity between the visual and auditory stimuli during the training period. If one uses a relatively small spatial disparity and the same stimulus frequency during the training period, there is a shift in the spatial representation of auditory stimuli in those subjects. The small ovals beneath the larger ones in Fig.
4A show this result for a spatial disparity in which the visual stimulus was presented 8 degrees to the right of the auditory stimulus during the training period. After approximately 20 minutes of such exposure, there is a clear rightward shift in the location estimates made in complete darkness. This occurs across locations, and the undershoot of the estimates is no longer evident for the rightward locations in the post-training session. These data are summarized in Figure 4B, where the post +8 degree training data are shown across all locations and three subjects. The regression line has a slope of nearly one and a y-intercept of nearly +8, consistent with the spatial disparity during the training period. It is important to keep in mind that it takes approximately 20 minutes to derive the post-training localization functions, and there was no evidence that the effect was fading over this time course. This long-lasting illusion is quite different from other aftereffect illusions, such as the waterfall illusion in the visual system (see Wade and Ziefel, 2008), in which adaptation to visual motion in one direction leads to the percept of motion in the opposite direction once the visual stimulus is stationary. The ventriloquism aftereffect differs in that it is in the same, not the opposite, direction as the inducing stimulus, indicating that it is not the result of adaptation. A second difference is that it can last for at least tens of minutes, as opposed to the much shorter durations of the waterfall and similar illusions. This can be interpreted to mean that the disparity results in a shift in the representation of acoustic space. Interestingly, this aftereffect does not transfer across frequencies (Recanzone, 1998; Lewald, 2003). This is shown across three subjects in Figure 4C, where the same +8 degree disparity training was used for the same amount of time, but the localization and training frequencies differed by an octave.
In this case, the slope of the regression line was not statistically different from 1.0, nor was the y-intercept different from 0. However, other laboratories have shown that there is transfer across frequency (Frissen et al., 2005). In that study, using a larger number of subjects and a larger spatial disparity, training at 400 Hz or 6400 Hz resulted in transfer to these and intermediate frequencies. Differences between the two studies include the size of the disparity (8 vs. 18 deg) and the size of the aftereffect (approximately 8 and 3 degrees for Recanzone, 1998 and Frissen et al., 2005, respectively). It may be that larger disparities, which generate smaller aftereffects, produce a less salient and reliable auditory percept, which in turn could evoke quite different mechanisms, or alter the balance between mechanisms (e.g. bottom-up vs. top-down), resulting in these different outcomes. Future experiments will be necessary to resolve this apparent conflict.
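The slope-and-intercept analysis applied to Figures 4B and 4C can be sketched as an ordinary least-squares fit of post-training location estimates against target locations: a slope near one with an intercept near the trained disparity indicates a uniform shift of the spatial representation, while an intercept near zero indicates no transfer. The target locations and estimates below are hypothetical, not the published values:

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept (pure Python)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical target azimuths (deg) and post-training estimates shifted
# uniformly rightward by 8 deg, as after same-frequency training.
targets = [-24, -12, 0, 12, 24]
post_same_freq = [t + 8 for t in targets]
print(fit_line(targets, post_same_freq))   # slope ~1, intercept ~+8

# After training at a different frequency, no shift is expected:
post_diff_freq = list(targets)
print(fit_line(targets, post_diff_freq))   # slope ~1, intercept ~0
```

A real analysis would of course fit noisy single-trial estimates and test the slope and intercept statistically, as in the studies discussed above.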
The investigation of the neural mechanisms of this perceptual change has been challenging. Functional imaging techniques have been useful in defining different multisensory areas (e.g. see Kayser and Petkov, this volume); however, they do not have the combined spatial and temporal resolution to probe the basic neuronal mechanisms underlying this class of illusions. For example, it can be shown that auditory cortical fields have more activity in one hemisphere than the other when the illusory percept is that the sound is in the contralateral hemifield (Bonath et al., 2007), but determining how activity is altered, and within which of several potential cortical fields, is beyond the spatial resolution of ERP techniques. Similarly, since acoustic space is represented as a population code across auditory cortical areas, particularly the caudal belt fields (Miller and Recanzone, 2009), fMRI and MEG will likely be unable to reveal subtle changes in single-neuron firing rates across the extent of an entire cortical area. In order to probe the underlying neural mechanisms of these illusions at the single-neuron level, a good animal model in which invasive studies can be conducted needs to be identified. One problem in developing such an animal model is how to reward the animal for making 'correct' responses when the investigator predicts that it will experience the illusion. For example, one could have an animal make an eye movement to the perceived location, similar to the paradigm described above, and then test the responses after the training period. The prediction would be that the eye movements would be biased toward one side of the actual stimulus location. The issue, of course, is whether to reward the animal when it makes an error that is predicted by the illusion, or to reward only technically correct trials.
If the animal does not experience the illusion and one does the former, the animal could learn within the first few trials to look slightly to one side or the other in order to obtain the reward, making a convincing case that the illusion was produced when in fact it had not been. Alternatively, the animal would be continuously unrewarded when accurately reporting its percept and would likely stop performing the task. A similar problem exists if the experimenter rewards only technically correct responses and the animal does experience the illusion. In either case, one cannot be sure whether the animal actually experiences the illusion. One way around this problem is to change the paradigm such that there are multiple technically correct responses that are not influenced by the illusion, and to present the illusory trials as probe trials. This has been done in the macaque monkey (Woods and Recanzone, 2004a,b), and the results are consistent with those seen in human subjects. In this case, the monkey fixated a central location and a single stimulus was presented in frontal space. The fixation point was then extinguished and two target lights appeared, one on either side of the midline. The animal was required to make an eye movement to the target on the same side of the midline from which it had perceived the auditory stimulus to originate. In this way, multiple locations could be presented at disparities greater than the training disparity (in this case 4 degrees), and the two locations nearest the midline could be rewarded regardless of the monkey's choice. A typical example is shown in Figure 5A. The open symbols show the pre-training data and the closed symbols the post-training data. The two points marked with arrows show the locations that were rewarded regardless of the response.
There is a clear effect after training with the light to the left: the monkey was more likely to make leftward responses (fewer rightward responses), shifting the psychometric function to the right. The same lack of transfer with a different training frequency was also noted. In this particular experiment, the post-training data shown in Figure 5B were collected immediately after the training period at a different frequency, and there was no shift in the function. The aftereffect was still apparent more than 20 minutes after the training period ended (Figure 5A). Figure 5C shows the pooled data across several sessions, and again it is clear that there is little if any transfer across a one-octave frequency difference. These data indicate that macaques experience the same aftereffect illusion as humans, and open the door to using the macaque monkey as an animal model to probe where along the cortical hierarchy neurons go from responding to the physical stimulus to responding to the actual percept during illusion trials.
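The shift of the psychometric function in Figure 5 can be quantified as a change in the point of subjective equality (PSE), the azimuth yielding 50% rightward responses. A minimal sketch, using hypothetical response proportions rather than the published data, estimates the PSE by linear interpolation:

```python
def pse(locations, p_rightward):
    """Point of subjective equality: the location at which the proportion
    of 'rightward' responses crosses 0.5 (linear interpolation)."""
    for (x0, p0), (x1, p1) in zip(zip(locations, p_rightward),
                                  zip(locations[1:], p_rightward[1:])):
        if p0 <= 0.5 <= p1:
            return x0 + (0.5 - p0) * (x1 - x0) / (p1 - p0)
    raise ValueError("function does not cross 0.5")

locs = [-8, -4, 0, 4, 8]                  # stimulus azimuth (deg)
pre  = [0.05, 0.20, 0.50, 0.80, 0.95]     # hypothetical pre-training data
post = [0.02, 0.08, 0.25, 0.55, 0.90]     # fewer rightward responses

# Training with the light to the left reduces rightward responses,
# shifting the function (and the PSE) to the right.
shift = pse(locs, post) - pse(locs, pre)
print(round(shift, 2))
```

A positive shift here corresponds to the rightward displacement of the psychometric function described above; fitting a sigmoid rather than interpolating would be the more rigorous approach.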
Given that the aftereffect does not transfer across frequency, its underlying neuronal correlates are likely carried by neurons that are not influenced by both frequencies. The primate auditory cortex is organized in a core-belt-parabelt fashion (Rauschecker and Tian, 2001; Kaas and Hackett, 2001), with the core composed of the primary auditory cortex (A1), the rostral field (R), and the rostrotemporal field (RT). These areas in turn project to belt cortical areas, which then project to the parabelt. The core contains the most sharply spectrally tuned neurons, with much more broadly tuned neurons in the belt and parabelt (Recanzone et al., 2000a). During the past decade there has been an increase in reports of visual modulation of auditory cortical neurons, yet it remains unclear how these stimuli would interact in a normal environment (e.g. see Kayser et al., 2009; see also Petkov et al., this volume). Recent studies in the alert macaque have indicated that, while neuronal spectral tuning broadens from the core to the belt (e.g. Recanzone et al., 2000a), spatial tuning improves in the belt, particularly in the caudomedial (CM) and caudolateral (CL) areas (Recanzone et al., 2000b; Woods et al., 2006). Figure 6 shows representative spectral tuning functions for two neurons in A1 (Fig. 6A) and two neurons from CM (Figs. 6B and 6C) in a monkey performing a simple behavioral task. These examples were chosen because they are responsive to one or both of the frequencies used in the ventriloquism aftereffect paradigms described above (750 and 1500 Hz). For the A1 neurons, the spectral tuning is sharp enough that it would be rare for the same neuron to respond to both frequencies, as expected for a region underlying an aftereffect that does not transfer across frequencies. For CM neurons this was much less likely to be the case. Although sharply tuned neurons are encountered that would correlate well with the psychophysics (e.g. Fig.
6B), more commonly the neurons are broadly spectrally tuned (Fig. 6C, see also Rauschecker et al., 1997; Recanzone et al., 2000a). This implies that CM neurons are not good candidates as major contributors to the aftereffect illusion.
This spectral analysis must be interpreted with caution, however, as the spatial tuning functions of neurons in A1 and the belt areas, particularly CM and CL, would favor the belt fields as more likely to underlie the aftereffect illusion. Figure 7 shows the spatial tuning functions of a representative A1 neuron (Fig. 7A) and a representative CM neuron (Fig. 7B), taken from Recanzone et al. (2000b). In these plots, each circle represents the firing rate on a single trial. For both cells, but particularly the A1 neuron, there is considerable variability across trials, such that large spatial separations (up to 60 degrees across frontal space) can still yield similar firing rates. The same could be said of the CM neuron, although there is a clear separation in firing rates between the greatest distances tested. If one considers the difference in firing rate expected for a 4 or 8 degree shift in perception, it is difficult to see how this could be accomplished at the level of the single neuron. Comparing across the entire population of neurons, it is also clear that only a small subset of cells could potentially differentiate a shift of 8 degrees. Figures 7C and 7D show the frequency distributions of the slopes of regression lines through data plots such as those in Figs. 7A and 7B. The vast majority of neurons have slopes corresponding to a 2% change in firing rate per degree or less, which translates to only an 8% or 16% difference for 4 and 8 degree separations, respectively. Given that the standard deviation of the response is approximately 50% of the firing rate (Recanzone et al., 2000b), reliably discriminating an 8 degree separation demands a slope on the order of 6% per degree, which is beyond the upper limit measured in CM neurons.
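The arithmetic behind this argument can be written out explicitly: expressing a neuron's rate-azimuth slope in percent change per degree and taking the trial-to-trial standard deviation as roughly 50% of the mean rate, the discriminability (d') of a given separation follows directly. A back-of-the-envelope sketch:

```python
def dprime(slope_pct_per_deg, separation_deg, sd_pct=50.0):
    """Discriminability (d') of a spatial separation from a single neuron's
    rate-azimuth slope, assuming trial-to-trial standard deviation of
    ~50% of the mean firing rate (Recanzone et al., 2000b)."""
    delta_pct = slope_pct_per_deg * separation_deg  # rate change, % of mean
    return delta_pct / sd_pct

# A typical 2%/deg neuron signals an 8-deg shift only weakly (d' = 0.32)...
print(round(dprime(2.0, 8.0), 2))

# ...whereas d' = 1 for an 8-deg separation requires ~6.25%/deg.
print(round(50.0 / 8.0, 2))
```

This is the sense in which single neurons with the measured slopes fall short of accounting for a 4-8 degree perceptual shift.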
At least two alternatives remain that could account for the neuronal correlate of these illusory shifts in the representation of acoustic space. The first is that populations of neurons could accurately encode acoustic space in a way that is not apparent at the single-neuron level (Werner-Reiss and Groh, 2008). Population encoding of acoustic space has been demonstrated as a function of stimulus intensity across 360 degrees of space using a maximum likelihood algorithm (Miller and Recanzone, 2009). This algorithm was as accurate as human observers when applied to populations of neurons in the caudolateral field (CL), but not when applied to A1 neurons; area CM neurons were very close in accuracy but not as good as the observers. Thus, either the initial processing is done in A1 and then transmitted in a frequency-specific manner to the caudal belt fields, or the caudal belt fields process spatial information independently of this select pathway, which results in normal localization for untrained frequencies. If the latter is the case, then the processing must occur in other cortical areas. Likely candidates include the multisensory areas of the superior temporal sulcus, such as the rostral region of STP, which contains both auditory and visual responses (Barraclough et al., 2005; Poremba et al., 2003; Beauchamp, 2005). It remains to be seen whether the spatial tuning of these neurons is amenable to study and whether shifts in their spatial receptive fields can be measured at the appropriate spatial scales. It could also be that populations of STP neurons encode the frequency-specific shift in the spatial representations, but again this possibility remains experimentally untested.
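A maximum-likelihood population decoder of the general kind used by Miller and Recanzone (2009) can be sketched as follows, assuming Poisson spiking; the toy tuning curves and spike counts below are invented for illustration, not recorded data:

```python
import math

def ml_decode(spike_counts, tuning_curves, locations):
    """Maximum-likelihood decoding of sound azimuth from a population.

    tuning_curves[i][j] is neuron i's mean spike count at locations[j].
    Spike counts are assumed Poisson, so the log-likelihood of location j
    is sum_i (n_i * log f_ij - f_ij), dropping terms constant over j.
    """
    best_j = max(
        range(len(locations)),
        key=lambda j: sum(n * math.log(f[j]) - f[j]
                          for n, f in zip(spike_counts, tuning_curves)),
    )
    return locations[best_j]

# Toy population of three broadly tuned neurons over three azimuths.
locations = [-30, 0, 30]        # degrees
tuning_curves = [
    [20.0, 10.0, 5.0],          # neuron preferring -30 deg
    [8.0, 18.0, 8.0],           # neuron preferring 0 deg
    [5.0, 10.0, 20.0],          # neuron preferring +30 deg
]
# Observed counts closest to the 0-deg expectations decode to 0 deg.
print(ml_decode([9, 17, 11], tuning_curves, locations))
```

Broad single-neuron tuning is no obstacle here: the population jointly constrains the estimate, which is why such decoders can match psychophysical accuracy even when individual spatial tuning curves are shallow.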
One prediction from both the modality specificity hypothesis and the Bayesian probability model is that auditory stimuli should dominate the percept of visual stimuli when the reliability of the auditory percepts and estimates is greater than that of the visual percepts. In normal subjects, such stimulus parameters are found in the temporal domain, where the auditory system has much greater acuity than the visual system. One example is the 'flicker-fusion' threshold, the rate at which a flashing visual stimulus comes to be perceived as continuous illumination. The auditory correlate is the 'flutter-fusion' threshold, at which individual clicks or tone pips become perceived as a single buzz-like sound. Early studies showed that one could improve the flicker-fusion threshold by pairing the visual stimuli with auditory stimuli (Shipley et al., 1964). A host of other studies demonstrate a strong influence of auditory stimuli on the timing of visual events (e.g. Knox, 1945; Ogilvie, 1956a, b; Gebhard and Mowbray, 1959; Welch et al., 1986; Stein et al., 1996; Sekuler et al., 1997). Another class of such studies is termed the temporal ventriloquism effect (e.g. Vroomen et al., 2005; Freeman and Driver, 2008; Parise and Spence, 2008; Getzmann, 2007; see Spence, this volume).
A recent study of auditory-visual interactions in the temporal domain concentrated on how auditory stimuli could dominate the percept of visual temporal rate (Recanzone, 2003). In this study, subjects were presented with a sequence of four tone bursts or light flashes over a span of one second, followed by a one-second pause and then a second four-stimulus sequence; the subject was required to indicate whether the rate of presentation of the second sequence was slower or faster than the first. Typical results are shown in Figure 8. On interleaved sessions, the subjects were instructed to discriminate either the visual or the auditory stimuli and to ignore the other modality. On subsets of trials the attended modality was presented alone, as shown in Figure 8A. Using a 4 Hz base rate, which is well within the acuity of both visual and auditory temporal discrimination, human subjects were much better at discriminating the timing of auditory sequences than visual sequences. For the auditory case, every subject was correct on every trial at the four most different temporal rates, and only the two rates nearest the 4 Hz base showed any decrease in performance. In contrast, every subject missed at least one trial at every temporal rate in the visual modality.
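The two-sequence paradigm can be sketched as a simple stimulus schedule. The code below generates onset times for the two sequences; details beyond the text (e.g. the exact alignment of the pause and the event durations, and the helper names themselves) are assumptions for illustration:

```python
def sequence_times(rate_hz, n_events=4, t0=0.0):
    """Onset times (s) of n_events presented at rate_hz, starting at t0."""
    period = 1.0 / rate_hz
    return [t0 + i * period for i in range(n_events)]

def trial_schedule(base_hz, test_hz, pause_s=1.0):
    """Two four-event sequences separated by a pause, loosely following the
    rate-discrimination task of Recanzone (2003)."""
    first = sequence_times(base_hz)
    second_start = first[-1] + pause_s
    second = sequence_times(test_hz, t0=second_start)
    return first, second

# 4 Hz base sequence followed by a slightly faster 4.5 Hz test sequence;
# the subject reports whether the second sequence was slower or faster.
first, second = trial_schedule(4.0, 4.5)
print(first)
```

On the critical probe trials described below, the visual events would follow one schedule and the auditory events another, creating the rate conflict between the modalities.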
The goal of the study, however, was to determine how auditory stimuli could influence the percept of the visual flashes. To accomplish this, on randomly interleaved trials while the subjects were attending to the visual stimuli, the auditory stimulus was presented at the same time (Figure 8B). If there were no influence, the discrimination performance should be the same as when no auditory stimulus was presented (thin gray line of Fig. 8B). If there were a complete 'capture' of the visual stimulus, the points should lie along the same function as on trials when the subjects attended the auditory stimulus presented in isolation (thick gray line). The data clearly support the latter: these subjects showed statistically indistinguishable performance when discriminating visual stimuli in the presence of simultaneous auditory stimuli and when discriminating auditory stimuli alone (open circles).
A second trial type that was randomly interleaved was one in which the auditory stimulus varied across the different rates but the visual stimulus was presented at 4 Hz during both sequences (Fig. 8C). In spite of there being no change in the visual temporal rate, subjects did not perform at chance; rather, they responded as though they were discriminating the auditory stimulus. A related condition was one in which the visual stimulus did change but the auditory stimulus was presented at 4 Hz in both sequences. In this case, the performance of the subjects decreased dramatically, with several subjects responding at chance. This was true even for rates at which most subjects normally responded correctly most of the time (3.5 and 4.5 Hz), giving rise to much flatter performance functions. The results of these experiments indicate that the auditory stimulus has a profound effect on visual temporal rate perception. In post-session interviews, subjects reported that they never noticed the two stimuli being presented at different rates, indicating that for these slight changes the auditory influence was complete.
The opposite effect was not observed, however, when the subjects were attending the auditory stimuli. There was no discernible difference between the responses to auditory stimuli regardless of what visual stimulus was presented. This effect was very robust across subjects (the error bars are on the order of the symbol size) and it persisted even under conditions where the subjects were well aware that the auditory and visual stimuli were coming from different sources. For example, the effects were the same if both stimuli were presented from directly in front of the subjects as when the visual stimulus was presented at 8 degrees to the left and the auditory stimulus at 90 degrees to the right. This is well beyond the spatial range of the ventriloquism effect, yet the temporal effect persisted. Interestingly, if there was a small spatial separation between the visual and auditory stimulus and a change in the rate between the two stimuli, subjects perceived the auditory stimulus to be at the location of the visual stimulus, and the visual stimulus rate to be the same as the auditory rate. Thus, both illusions can occur simultaneously, implying that different neural circuits underlie the temporal and spatial channels.
Several control experiments were also performed to rule out the possibility that the subjects were covertly attending to the auditory stimulus when they were instructed to attend to and discriminate the visual stimulus. A second set of subjects was tested, but these subjects were never instructed to attend to the auditory stimulus. These subjects were also allowed several sessions with feedback so that they could recognize that the auditory and visual information was sometimes incongruent. They were further shown specifically that the auditory stimulus could be different from the visual stimulus by using the 3.5 and 4.5 Hz rates when the visual stimulus remained at 4 Hz, and the reverse. Nonetheless, when tested in sessions without feedback immediately after these demonstrations, the data were indistinguishable from those of the first set of subjects. Thus, it seems unlikely that the subjects were attending to the auditory stimulus, yet their perception of the visual temporal rate was still strongly influenced by the auditory stimuli.
Given that this effect was so strong, it raised the question of what could reduce or eliminate the auditory influence on the visual percept. Two additional sessions were performed to address this question. The first presented the auditory stimulus at a much slower temporal rate, nominally 0.5 Hz, as the stimulus remained on for the entire one-second stimulus sequence. This manipulation showed that the simple presence of an auditory stimulus was not sufficient to alter the perception of visual temporal rate (Figure 9A). Similarly, if the auditory stimulus rate was much faster than the visual rate (in this case 8 Hz), again there was no influence (Fig. 9B). Thus, these effects cannot be attributed to arousal or any generalized phenomenon, but are specific to auditory stimuli presented at rates near the visual rate.
A second series of experiments was then performed to serve two functions. The first was to ensure that the subjects were attending to the stimuli as asked, and the second was to probe over how great a range of rates the auditory stimuli could influence the percept of the visual temporal rate. In this case, the subjects were asked to attend to both the auditory and visual stimuli and to determine whether they were presented at the same or different rates. Only three stimulus rates were used in which the auditory and visual stimuli were presented simultaneously: 3.0, 4.0 and 5.4 Hz. There were eleven other trial types in which the auditory stimulus was presented at 4.0 Hz and the visual stimulus was presented within the range of 3.0 – 5.4 Hz (Figure 10A). The results of this experiment showed that the auditory stimulus could influence visual temporal rate perception over a considerable range. Figure 10A shows the results from the visual-alone trials (Fig. 8A) as a grey line. When visual stimuli were presented alone, most subjects responded correctly when the visual stimulus was presented at 4.5 Hz. In contrast, when the auditory stimulus was simultaneously presented at 4.0 Hz, the visual temporal rate that was correctly discriminated as different 50% of the time was around 5 Hz, a 25% increase in rate. For the lower temporal rates the 50% threshold was closer to 3.5 Hz, about a 12% decrease in rate. This difference is likely due to the better temporal acuity of visual temporal rate perception at lower rates. When subjects were asked what they perceived after the session, they reported that they felt that the two stimuli were presented at the same rate on most of the trials, indicating that they generally perceived the visual rate to be the same as the auditory rate.
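The 50% thresholds quoted above can be recovered from discrimination data by locating where the proportion of ‘different’ responses crosses 0.5. A minimal sketch of that computation, using simple linear interpolation and made-up illustrative proportions (not the study's actual data):

```python
def threshold_50(rates, p_different):
    """Linearly interpolate the stimulus rate at which the proportion of
    'different' responses crosses 50% (a crude psychometric threshold)."""
    for i in range(len(p_different) - 1):
        lo, hi = p_different[i], p_different[i + 1]
        if (lo - 0.5) * (hi - 0.5) <= 0:          # 50% point bracketed here
            frac = (0.5 - lo) / (hi - lo)
            return rates[i] + frac * (rates[i + 1] - rates[i])
    return float("nan")                            # never crossed 50%

# Hypothetical proportions: visual test rates judged against a
# simultaneous 4.0 Hz auditory standard (illustration only)
rates  = [4.0, 4.5, 5.0, 5.5]        # Hz
p_diff = [0.05, 0.20, 0.50, 0.90]    # proportion judged 'different'
thr = threshold_50(rates, p_diff)
print(f"50% threshold: {thr:.2f} Hz, {100 * (thr - 4.0) / 4.0:.0f}% above 4 Hz")
```

In practice a full psychometric function (e.g. a logistic or cumulative Gaussian) would be fit to all the data rather than interpolating between two points, but the threshold definition is the same.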
A final aspect of this temporal ventriloquism is that aftereffects can also be induced. In this paradigm, subjects were asked to determine whether the stimuli were presented at the same or different rates, as shown in Fig. 10A. Then, similar to the ventriloquism aftereffect training described above, subjects were asked to fixate a central spot and attend to a series of four light-flash sequences. They were further instructed to make a response when the visual stimuli appeared to change in luminance. They were told that auditory stimuli would also be presented, but to ignore those stimuli. After approximately 20 minutes the subjects were asked to perform the pre-training paradigm again. The results of these experiments are shown in Figure 10B. In this case, the training period consisted of the auditory stimulus being presented at a higher rate than the visual stimulus. In the post-training session there was a clear rightward shift of the psychometric function, consistent with the subjects perceiving the visual stimulus as presented at a faster rate than it really was. As with the ventriloquism aftereffect, this shift was long-lasting and showed no signs of decay over the period in which these data were collected. Similarly, the effect was not observed when the temporal disparity during the training period was zero. Thus, it is likely that similar, if not identical, neuronal mechanisms underlie this temporal aftereffect and the spatial aftereffect.
Auditory and visual stimuli are effectively integrated by both humans and non-human primates under most naturalistic conditions. Illusions can be produced by altering the naturally occurring relationships between the two stimuli, and are largely driven by the sensory modality that has the highest acuity, or largest signal-to-noise ratio. Thus, the visual system drives spatial percepts and the auditory system drives temporal percepts. These illusions can be used to probe potential underlying neuronal mechanisms, but directly relating the illusions to the percepts has proven quite difficult, largely for technical reasons. Functional imaging techniques have yet to achieve the spatial and temporal resolution necessary to address these issues at the level of detail that is likely required to fully understand these mechanisms. Similarly, animal models are only now being developed, and single-neuron studies may also have technical shortcomings. The influences on single neurons may be slight and not statistically significant given the constraints of recording in alert, behaving animals, and populations of neurons may more appropriately underlie the encoding of these multisensory objects and events.
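The principle that the modality with the highest acuity dominates is often formalized in the psychophysics literature as maximum-likelihood (inverse-variance-weighted) cue combination. The article does not commit to this particular model, so the sketch below is purely illustrative, with made-up reliability values:

```python
import math

def ml_combine(mu_a, sigma_a, mu_v, sigma_v):
    """Inverse-variance-weighted (maximum-likelihood) combination of two
    cues; the cue with the smaller sigma (higher reliability) dominates."""
    w_a, w_v = 1.0 / sigma_a ** 2, 1.0 / sigma_v ** 2
    mu_c = (w_a * mu_a + w_v * mu_v) / (w_a + w_v)
    sigma_c = math.sqrt(1.0 / (w_a + w_v))  # combined estimate is more precise
    return mu_c, sigma_c

# Temporal rate: audition is far more precise than vision, so the combined
# (perceived) rate is pulled almost entirely toward the auditory rate --
# the 'capture' seen in the flash-rate experiments. Sigmas are assumed.
mu_c, sigma_c = ml_combine(mu_a=4.5, sigma_a=0.1, mu_v=4.0, sigma_v=1.0)
print(f"perceived rate: {mu_c:.2f} Hz (auditory 4.5 Hz, visual 4.0 Hz)")
```

Swapping the sigmas reproduces the spatial case, where the more precise visual estimate captures the perceived auditory location.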