The brain should integrate related but not unrelated information from different senses. Temporal patterning of inputs to different modalities may provide critical information about whether those inputs are related or not. We studied effects of temporal correspondence between auditory and visual streams on human brain activity with fMRI. Streams of visual flashes with irregularly jittered timing (mean rate 4Hz) could appear on the right or left, with or without a stream of auditory tones that either coincided perfectly when present (highly unlikely by chance); or were non-coincident with vision (different erratic pattern with same mean rate); or an auditory stream appeared alone. fMRI revealed BOLD-increases in multisensory superior temporal sulcus (mSTS), contralateral to a visual stream when coincident with an auditory stream, and BOLD-decreases for non-coincidence relative to unisensory baselines. Contralateral primary visual cortex and auditory cortex were also activated by audio-visual temporal correspondence, as confirmed in individuals. Connectivity analyses indicated enhanced influence from mSTS upon primary sensory areas, rather than vice-versa, during audio-visual correspondence. Temporal correspondence between auditory and visual streams affects a network of both multisensory (mSTS) and sensory-specific areas, including even primary visual and auditory cortex, with stronger responses for corresponding and thus related audio-visual inputs.
Among the many signals entering our senses, some inputs to one sense (e.g. audition) may relate temporally and/or spatially to inputs entering another sense (e.g. vision) when they originate from the same object. Ideally the brain should integrate just those multisensory inputs that reflect a common external source, as may be indicated by spatial, temporal or semantic constraints (Stein and Meredith, 1993; Calvert et al., 2004; Spence and Driver, 2004; Macaluso and Driver, 2005; Schroeder and Foxe, 2005). Many neuroscience and human neuroimaging studies have investigated possible spatial constraints on multisensory integration (e.g. Wallace et al., 1996; Macaluso et al., 2000; McDonald et al., 2000; McDonald et al., 2003; Macaluso et al., 2004); or factors that may be more ‘semantic’, e.g. for integration of matching speech-sounds and lip-movements (Calvert et al., 1997), or of visual objects with matching environmental sounds (Beauchamp et al., 2004a, 2004b; Beauchamp, 2005a).
Here we focus on possible constraints from temporal correspondence only (see Stein et al., 1993; Calvert et al., 2001; Bischoff et al., 2007; Dhamala et al., 2007). We used streams of non-semantic stimuli (visual transients and beeps) to isolate purely temporal influences. We arranged that audio-visual temporal relations should convey strong information that auditory and visual streams were related, or unrelated, by using erratic rapid temporal patterns which either matched perfectly between audition and vision (very unlikely by chance) or mismatched substantially, but with the same average rate. We anticipated increased brain activations for temporally-coincident audio-visual streams (as compared with non-coincident or unisensory streams, see Methods) in multisensory superior temporal sulcus (mSTS). This region is known to receive converging auditory and visual inputs (Kaas and Collins, 2004). Moreover, mSTS is thought to play some role(s) in multisensory integration (Benevento et al., 1977; Bruce et al., 1981; Cusick, 1997; Beauchamp et al., 2004b), and was influenced by audio-visual synchrony in some prior fMRI studies that used very different designs than here and/or more semantic stimuli (e.g. Calvert et al., 2001; Atteveldt et al., 2006; Bischoff et al., 2007; Dhamala et al., 2007).
There have been several recent proposals that multisensory interactions may affect not only established multisensory brain regions (such as mSTS), but also brain areas (or evoked responses) traditionally considered sensory-specific (e.g. Brosch and Scheich, 2005; Foxe and Schroeder, 2005 for reviews), though some past ERP-examples had proved somewhat controversial (see Teder-Sälejärvi et al., 2002). Here we sought to test directly with fMRI whether audio-visual correspondence in purely temporal patterning might affect (contralateral) sensory-specific visual and auditory cortices. This may be expected if temporal coincidence can render visual and auditory stimuli more salient or perceptually intense, as suggested by some psychophysical work (Stein et al., 1996; Frassinetti et al., 2002; Lovelace et al., 2003). Here we provide an unequivocal demonstration that audio-visual correspondence in temporal patterning can indeed affect even primary visual and auditory cortex (V1 and A1), as well as contralateral mSTS.
Twenty-four neurologically normal subjects (10 female, mean age 24) participated after written informed consent in accord with local ethics. Visual stimulation was in the upper left hemifield for 12 subjects, in the upper right for the other 12. This was presented at the top of the MR-bore via clusters of 4 optic fibres arranged into a rectangular shape, and 5 interleaved fibres arranged into a cross shape, 2° above the horizontal meridian at an eccentricity of 18°. Visual stimuli were presented peripherally, which may maximise the opportunity for interplay between auditory and visual cortex (see Falchier et al., 2002), and also allowed us to test for any contralaterality in effects for one visual field or the other. The peripheral fibre-optic endings could be illuminated red or green with a standard luminance of 40 cd/m² and were 1.5° in diameter (see Fig 1c for schematics of the resulting colored ‘shapes’). Streams of visual transients were produced by switching between the differently colored cross and rectangle shapes (red and green respectively in Figure 1c, but shape-color was counterbalanced across subjects). Throughout each experimental run, subjects fixated a central fixation cross of ~0.2° in diameter. Eight red-green (cross/square) reversals occurred in a 2 second interval, with the SOA between each successive color-change ranging in a pseudorandom fashion from 100 to 500 ms (mean reversal rate of 4 Hz, with rectangular distribution from 2 to 10 Hz, but note that reversal rate was never constant for successive transients), to produce a uniquely jittered timing for each 2 sec segment. Auditory stimuli were presented via a piezo-electric speaker inside the scanner, just above fixation. Each auditory stimulus was a clearly audible 1 kHz sound-burst with duration of 10 ms at ~70 dB.
Identical temporally-jittered stimulation sequences within vision and/or audition were used in all conditions overall (fully counterbalanced), so that there was no difference whatsoever in temporal statistics between conditions, except for the critical temporal relation between auditory and visual streams during multisensory trials (unisensory conditions were also included, see below).
The experimental stimuli (for the visual-only baseline, auditory-only baseline, and for audio-visual temporal correspondence (AVC) or non-correspondence (NC)) were all presented during silent periods (2 s) interleaved with scanning (3 s periods of fMRI acquisition) to prevent scanner-noise interfering with our auditory stimuli or perception of their temporal relation with visual flashes. In the AVC condition, a tone burst was initiated synchronously with every visual transient (see Fig 1a) and thus had exactly the same pseudorandom temporal pattern. During the NC condition (Fig 1b), tone bursts occurred with a different pseudo-random temporal pattern (but always having the same overall temporal statistics, including mean rate of 4 Hz within a rectangular distribution from 2 to 10 Hz), with a minimal protective ‘window’ of 100 ms now separating each sound from onset of a visual pattern-reversal (Fig 1b).
This provided clear information that the two streams were either strongly related, as in the AVC condition (such perfect coincidence for the erratic temporal patterns is exceptionally unlikely to arise by chance); or were unrelated, as for the NC condition. During the latter non-coincidence, up to two events in one stream could occur before an event in the second stream had to occur. The mean 4 Hz stimulation rate used here, together with the constraints (protective window, see Fig 1b) implemented to avoid any accidental synchronies in the non-corresponding condition, should optimise detection of audio-visual correspondence versus non-correspondence (see Fujisaki et al., 2006), while making these bimodal conditions otherwise identical in terms of the temporal patterns presented overall to each modality. All sequences were created individually for each subject using Matlab 6.5. Piloting confirmed that the correspondence versus non-correspondence relation could be discriminated readily when requested (mean percent correct 93.8%), even with such peripheral visual stimuli. Irregular stimulus trains were chosen, as this makes an audio-visual temporal relation much less likely to arise by chance alone, and hence (a)synchrony typically becomes easier to detect than for regular frequencies, or for single auditory and visual events rather than stimulus trains (see also Slutsky and Recanzone, 2001; Noesselt et al., 2005).
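The timing constraints described above can be illustrated with a minimal Python sketch. It is not the authors' Matlab code. It assumes integer-millisecond SOAs drawn uniformly between 100 and 500 ms, a first onset at 0 ms, and rejection of any draw that does not fit within the 2 s segment; the function names are hypothetical. The sketch generates one jittered stream and checks whether a pair of streams respects the 100 ms protective window required for the NC condition.

```python
import random

N_EVENTS = 8                        # transients per 2 s segment (from the text)
SEGMENT_MS = 2000                   # duration of one silent stimulation period
SOA_MIN_MS, SOA_MAX_MS = 100, 500   # SOA range given in the text
PROTECT_MS = 100                    # minimal audio-visual separation for NC

def jittered_onsets(rng):
    """Draw 8 onsets (ms) with SOAs in [100, 500] ms that fit in 2 s.

    Assumption: uniform integer SOAs, first onset at 0, rejection sampling.
    """
    while True:
        onsets = [0]
        for _ in range(N_EVENTS - 1):
            onsets.append(onsets[-1] + rng.randint(SOA_MIN_MS, SOA_MAX_MS))
        if onsets[-1] < SEGMENT_MS:
            return onsets

def min_crossmodal_gap(auditory, visual):
    """Smallest absolute onset difference between any tone and any flash."""
    return min(abs(a - v) for a in auditory for v in visual)

def is_non_coincident(auditory, visual, window_ms=PROTECT_MS):
    """True if every tone is at least window_ms away from every flash."""
    return min_crossmodal_gap(auditory, visual) >= window_ms

visual = jittered_onsets(random.Random(0))
auditory_avc = list(visual)         # AVC: tones share the visual timing exactly
```

In the AVC case the minimal crossmodal gap is zero by construction. A candidate NC stream would be a second, independent draw that is kept only if `is_non_coincident` returns true.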
Two unisensory conditions (i.e. visual or auditory streams alone) were also run. These allowed our fMRI analysis to distinguish candidate multisensory brain regions (responding to either type of unisensory stream) from sensory-specific regions (visually- or auditorily-selective); see below.
Throughout each experimental run, participants performed a central visual monitoring task requiring detection of occasional brief (1 ms) brightening of the fixation point via button press. This could occur at random times (average rate 0.1 Hz) during both stimulation and scan periods. Participants were instructed to perform this fixation-monitoring task, and that auditory and peripheral visual stimuli were task-irrelevant. We chose this fixation-monitoring task to avoid the different multisensory conditions being associated with changes in performance that might otherwise have contaminated the fMRI data; because we were interested in stimulus-determined (rather than task-determined) effects of audio-visual temporal correspondence; and so as to minimize eye movements. Eye-position was monitored online during scanning (Kanowski et al., 2007).
fMRI data were collected in 4 runs with a neuro-optimized 1.5 T GE scanner equipped with a head-spine-coil. A rapid sparse-sampling protocol was used (136 volumes per run with 30 slices covering whole brain; TR of 3 s; silent pause of 2 s; TE of 40 ms; flip angle of 90°; resolution of 3.5×3.5 mm; 4 mm slice thickness; FOV was 20 cm). Experimental stimuli were presented during the silent scanner periods (2 s scanner pauses). Each mini-block lasted 20 s per condition, containing 8 s (4 × 2) of stimulation (with each successive 2 s segment of stimuli then separated by 3 s of scanning). These mini-blocks of experimental stimulation in one of the four conditions or another (random sequence) were each separated by 20 s blocks, in which only the central fixation task was presented (unstimulated blocks).
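The sparse-sampling arithmetic can be checked with a short Python sketch. The acquisition time, pause, segment count and volume count come from the text; the run duration is an assumption that each acquired volume is paired with exactly one silent pause.

```python
ACQUISITION_S = 3            # scanning period per volume (TR in the text)
SILENT_PAUSE_S = 2           # silent period used for stimulation
SEGMENTS_PER_MINIBLOCK = 4   # 2 s stimulus segments per mini-block
VOLUMES_PER_RUN = 136

cycle_s = ACQUISITION_S + SILENT_PAUSE_S                  # one pause plus one volume
miniblock_s = SEGMENTS_PER_MINIBLOCK * cycle_s            # mini-block duration
stimulation_s = SEGMENTS_PER_MINIBLOCK * SILENT_PAUSE_S   # stimulation per mini-block
run_s = VOLUMES_PER_RUN * cycle_s                         # assumption: one pause per volume

print(cycle_s, miniblock_s, stimulation_s, run_s)
```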
After pre-processing for motion correction, normalisation, and 6mm smoothing, data were analysed in SPM2 by modelling the 4 conditions and the intervening unstimulated baselines with box-car functions. Voxel-based group-effects were assessed with a second-level random-effects analysis, identifying candidate multisensory regions (responding to both auditory and visual stimulation); sensory-specific regions (difference between visual minus auditory, or vice-versa); and the critical differential effects of coincident minus non-coincident audio-visual presentations.
Conjunction analyses assessed activation within sensory-specific and multisensory cortex (thresholded at p<0.001), within areas that also showed a significant modulation of the omnibus F-test at p<0.001 (see Beauchamp et al., 2005b) for clusters of more than 20 contiguous voxels. To confirm localization to a particular anatomical region (e.g. calcarine sulcus) in individuals, we extracted beta-estimates of BOLD-modulation for each condition, from their local maxima for the comparison AVC>NC, within regions-of-interest (ROIs) comprising early visual and auditory cortex, and within STS. These ROIs were initially identified via a combination of anatomical criteria (calcarine sulcus; medial part of anteriormost Heschl’s gyrus; posterior STS) and functional criteria in each individual (i.e. sensory-specific responses to our visual or auditory stimuli, for calcarine sulcus or Heschl’s gyrus respectively; or multisensory response to both modalities in the case of mSTS). We then tested the voxels within these individually-defined ROIs for any impact of the critical manipulation (which was orthogonal to the contrasts identifying those ROIs) of audio-visual correspondence minus non-correspondence. We also compared each of those two multisensory conditions to the unimodal baselines for the same regions on the extracted data.
Finally, we used connectivity analyses to assess possible influences (or ‘functional coupling’) between mSTS, V1 and A1, for the fMRI data. We first used the established ‘psychophysiological interaction’ (Friston et al., 1997) or PPI approach, which is relatively assumption-free. This assesses condition-specific covariation between a seeded brain area and any other regions, for the residual variance that remains after mean BOLD-effects due to condition have been discounted. Data from the LVF-group were left-right flipped to allow pooling with the RVF group for this, and to assess any effects that generalised across hemispheres (Lipschutz et al., 2002). PPI analyses can serve to establish condition-dependent functional-coupling (or ‘effective connectivity’) between brain regions, but do not provide information about the predominant direction of influence of information transfer. Accordingly, we further assessed potential influences between mSTS, V1 and A1 with a directed information transfer (DIT) measure, as recently developed (Hinrichs et al., 2006). DIT assesses predictability of one time-series from another, in a data-driven approach that makes minimal assumptions. If the joint time-series for, say, regions A and B predict future signals in time-series B, better than B does alone, this is taken to indicate that A influences B with a strength indicated by the corresponding DIT measure. If DIT from A to B is larger than vice-versa, this indicates directed information flow from A to B. Our DIT analysis used 96 time points (4 runs of 4 blocks with 6 points per block) per condition and region. From these data we derived the DIT values from the current samples of A and B to the subsequent sample of B, and vice versa, then averaged over all 96 samples. Here we used the DIT approach to assess possible pairwise relations between mSTS, V1 and A1 for their extracted time-series, assessing DIT-measures for all pairings between these (i.e. V1-A1, or V1-STS, or A1-STS) with paired t-tests.
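The logic of asking whether A and B jointly predict the next sample of B better than B does alone can be sketched in Python. This is not the DIT measure of Hinrichs et al. (2006). It is a simplified linear stand-in (a one-lag, Granger-style log ratio of residual variances), and the function names are hypothetical. It assumes NumPy is available.

```python
import numpy as np

def residual_variance(predictors, y):
    """Mean squared residual of a least-squares fit of y on predictors plus intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ beta) ** 2))

def directed_predictability(source, target):
    """Linear proxy for directed influence from source to target (one-sample lag).

    Compares predicting target[t+1] from target[t] alone with predicting it
    from target[t] and source[t] together. Larger values mean that the source
    adds more predictive information about the target's next sample.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    past_target, past_source, future = target[:-1], source[:-1], target[1:]
    restricted = residual_variance(past_target[:, None], future)
    full = residual_variance(np.column_stack([past_target, past_source]), future)
    return 0.5 * np.log(restricted / full)
```

Computing this quantity in both directions for a pair of regions, per subject and condition, would yield values that could then be compared across subjects with a paired t-test, in the spirit of the analysis described above.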
Subjects performed the monitoring task on the fixation point (see Fig 1, plus Methods) equally well (mean 83% accuracy) in all conditions (all p > .2). Maintenance of central fixation was also equally good across conditions (less than 2° deviation in 98% of trials), as expected given the task at central fixation.
For fMRI analyses, the random-effect SPM analysis confirmed that unisensory visual streams activated sensory-specific occipital visual cortex; while auditory streams activated auditory core, belt and parabelt regions in temporal cortex (see Table 1). Candidate multisensory regions, activated by both the unisensory visual and unisensory auditory streams, included bilateral posterior STS, posterior parietal and dorsolateral prefrontal areas. However, within these candidate multisensory regions only STS showed the critical effects of audio-visual temporal correspondence (see Tab 2a and Figure 2a). Within the functionally-defined multisensory regions, audio-visual temporal correspondence (AVC) minus non-correspondence (NC) specifically activated (at p<.001) the contralateral mSTS (i.e. right mSTS for LVF group, peak at 60, -48, 12; left mSTS for RVF group, peak at -54, -52, 8; see Fig. 2a). Further tests on individually-defined maxima within mSTS confirmed that, contralateral to the visual stream, responses to AVC were significantly elevated not only relative to the NC condition, but also relative to either unisensory stream alone (p<0.03). Non-coincidence led instead to a reliably decreased response relative to either unisensory baseline (p<0.01; see bar graph for mSTS in Fig 2a). All individual subjects showed this pattern (see Fig 3a for illustrative single subject; all others available on request).
Importantly, an analogous pattern was found within sensory-specific cortices. For visual cortex, we found increased BOLD-responses for the AVC>NC comparison near the contralateral calcarine fissure (peaks at -12, -76, 0 and 12, -82, 12 for RVF and LVF groups respectively; both p<0.001, Fig. 2b and Tab 2b). Again, this effect was found for each individual subject, in the anterior lower lip of their calcarine fissure (see Figure 3b for illustrative single subject, all others available on request) representing the contralateral peripheral upper visual quadrant, where the visual stimuli appeared.
Finally, enhanced BOLD-response for AVC>NC stimulation was found also within sensory-specific auditory cortex, in the vicinity of Heschl’s gyrus, also peaking contralateral to the coincident visual hemifield (peaks at -48, -20, 10 and 50, -16, 8 for RVF and LVF group, both p’s<0.001; see Fig 2c and Tab 2c). Some bilateral activations were also found, but the peak was systematically contralateral. We found this pattern in 23 of the 24 individual subjects, within the medial part of anteriormost Heschl’s gyrus (typically considered as primary auditory cortex), often extending into posterior insula and planum temporale (see Fig 3c for illustrative single subject).
Mean parameter estimates (SPM ‘betas’, proportional to percent signal change) from individual peaks in contralateral calcarine sulcus and contralateral Heschl’s gyrus are plotted in Figures 2b and 2c (bar graphs) respectively. In addition to the clear AVC>NC effect, AVC also elicited a higher BOLD-signal than the relevant unisensory baseline (i.e. vision for calcarine sulcus, auditory for Heschl’s gyrus, each at p<0.008 or better); while the NC condition was significantly lower than those unisensory baselines (p<.007 or better).
Although our main focus was on comparing audio-visual correspondence minus non-correspondence (AVC>NC), the plots in Fig 2 show that AVC also elicited higher activity than either unisensory baseline in mSTS; while NC was lower than both these baselines there. This might reflect corresponding auditory and visual events becoming ‘allies’ in neural representation (due to their correspondence), while non-corresponding instead become ‘competitors’, leading to the apparent suppression observed for them in mSTS. A similar account might hold for the A1 and V1 results, where the most relevant unisensory baseline (audition or vision, respectively) was again significantly below the AVC condition, yet significantly above the NC. Alternatively one might argue that the level of activity for NC in A1 or V1 may correspond to the mean of the separate auditory and visual baselines for that particular area (NC did not differ from that mean for V1 and A1; though it did for STS). But this would still imply that combining non-corresponding sounds and lights can reduce activity in primary sensory cortices, relative to the preferred modality alone, even though temporally corresponding audiovisual stimulation boosts activity in both V1 and A1.
A potentially related finding was recently reported for audiotactile pairings in a study (Lakatos et al., 2007) that measured current-source-density distributions (CSD) in macaque primary auditory cortex. Responses to combined audiotactile stimuli, that differed from summed unisensory tactile and auditory responses, were found to indicate a modulatory influence of tactile stimuli upon auditory. The stimulus onset asynchronies producing either response enhancement or suppression for multisensory stimulation hinted at a phase-resetting mechanism in neural oscillations (see also Discussion).
In general, several different contrasts and analysis-approaches have been introduced in prior multisensory research. Although the present study focuses on fMRI measures, Stein and colleagues conducted many influential single-cell studies on the superior colliculus and other structures (Stein, 1978; Stein and Meredith, 1990; Stein et al., 1993; Wallace et al., 1993; Wallace and Stein, 1994, 1996; Wallace et al., 1996; Wallace and Stein, 1997). They suggested that, depending on the relative timing and/or location of multisensory inputs, neural responses can sometimes exceed (or fall below) the sum of the responses for each unisensory input (see also Lakatos et al., 2007).
Non-linear analysis criteria have also been applied to EEG data in some multisensory studies, that typically manipulated the presence/absence of co-stimulation in a second modality (e.g. Giard and Peronnet, 1999; Foxe et al., 2000; Fort et al., 2002; Molholm et al., 2002; Molholm et al., 2004; Murray et al., 2005), rather than a detailed relation in temporal patterning as here. Similar non-additive criteria have even been applied to fMRI data (e.g. Calvert et al., 2001). On the other hand, such criteria have been criticized for some (ERP) situations (e.g. see Teder-Sälejärvi et al., 2002). Moreover, Stein and colleagues subsequently reported that some of the cellular phenomena that originally inspired such criteria may in fact more often reflect linear rather than nonlinear phenomena (Stein et al., 2004), when considered at the population level.
Such considerations have led to proposals of revised criteria for fMRI studies of multisensory integration, including suggestions that a neural response significantly different from the maximal unisensory response may be taken to signify a multisensory effect (Beauchamp, 2005b). But most importantly, we note that the critical fMRI results we report here cannot merely reflect summing (or averaging) of two entirely separate BOLD responses to incoming auditory and visual events, as otherwise the outcome should have been comparable for corresponding and non-corresponding conditions. Recall that the auditory and visual stimuli themselves were equivalent and fully counterbalanced across our AVC and NC conditions; only their temporal relationship varied. Hence our critical effects must reflect multisensory effects that depend on the temporal correspondence of incoming auditory and visual temporal patterns.
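The criteria discussed above can be compared side by side with a small Python sketch. The function name and the example values are hypothetical, and the comparison uses point estimates only, whereas an actual analysis would test each difference statistically.

```python
def multisensory_criteria(av, auditory, visual):
    """Evaluate common multisensory criteria on point estimates (e.g. betas).

    av: response to the audio-visual condition
    auditory, visual: responses to the two unisensory conditions
    """
    return {
        # classic non-linear criterion: response exceeds the unisensory sum
        "superadditive": av > auditory + visual,
        # maximum criterion: response exceeds the larger unisensory response
        "exceeds_max_unisensory": av > max(auditory, visual),
        # suppression: response falls below the smaller unisensory response
        "below_min_unisensory": av < min(auditory, visual),
    }

# Hypothetical values for illustration only, not data from this study.
print(multisensory_criteria(av=1.2, auditory=0.5, visual=1.0))
```

With these illustrative numbers the response exceeds the maximal unisensory response without being superadditive, which shows that the two criteria can dissociate.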
Given the activation results, we seeded our PPI-connectivity-analysis at mSTS (see blue region in Fig 4d) in a spherical region (diameter 4 mm) surrounding the maximum found in the main analyses for each individual (see Fig. 4a for coordinates of the group average). Remarkably, this PPI analysis revealed that functional coupling of seeded mSTS, contralateral to the crossmodal coincidence, was specifically enhanced (showed stronger covariation) with early visual cortex (mean peak coordinates +/-4, -82, 6; p<0.008; see Fig 4e) and auditory cortex (+/-44, 22, 6; p<0.02; see Fig 4f) ipsilaterally to the mSTS-seed, in the context of audio-visual coincidence (versus non-coincidence). This modulation is not redundant with the overall BOLD activations reported above, since it reflects condition-dependent covariation between brain regions, after mean activations by condition for each region have been discounted (see Methods; see also Friston et al., 1997). Nevertheless, these connectivity results closely resembled the activation pattern, in terms of the brain regions implicated (compare Fig 4a-c with Figs 4d-f; see also Fig 2), providing further evidence to highlight an mSTS-A1-V1 interconnected network, for the present effects of audio-visual temporal correspondence. Although several studies now implicate a role for multisensory thalamic nuclei in cortical response profiles (Baier et al., 2006; Lakatos et al., 2007), we did not observe any BOLD effects in the thalamus with the human fMRI method used here, only cortically.
The highly specific pattern of condition-dependent functional coupling with mSTS was found in visual cortex for all 24 individual subjects and in auditory cortex for 23 of 24 subjects (see Fig 5 for a representative subject, plus supplementary material for every single individual).
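The core of a PPI analysis, in which coupling with the seed is tested as a condition-dependent term after main effects are discounted, can be sketched as a small regression in Python. This is a simplified illustration and not the SPM2 procedure used here: it omits haemodynamic deconvolution and nuisance regressors, and it assumes a condition code of +1 for AVC and -1 for NC. The function names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def ppi_design(seed, psych):
    """Design matrix with columns: interaction, seed, condition, intercept."""
    seed = np.asarray(seed, dtype=float)
    psych = np.asarray(psych, dtype=float)
    seed_centred = seed - seed.mean()
    interaction = seed_centred * psych   # the psychophysiological interaction term
    return np.column_stack([interaction, seed_centred, psych, np.ones(len(seed))])

def fit_ppi(target, seed, psych):
    """Return the regression weight of the interaction term for a target region.

    A positive weight means the target covaries more strongly with the seed
    under the condition coded +1 (here assumed to be AVC) than under -1 (NC),
    over and above the main effects of seed activity and condition.
    """
    X = ppi_design(seed, psych)
    beta, *_ = np.linalg.lstsq(X, np.asarray(target, dtype=float), rcond=None)
    return float(beta[0])
```

Because the seed time-series and the condition regressor are included alongside their product, the interaction weight is not redundant with mean activation differences between conditions.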
Since PPI analyses are non-directional in nature (see Methods), we further assessed possible influences between mSTS, V1 and A1 for the BOLD data, using a directed information transfer (DIT) measure (Hinrichs et al., 2006). Data from A1 and V1 were derived from the subject-specific maxima for overlap between the basic activation analysis and the PPI analysis (mean coordinates: V1: +/-8.9, -78.4, 6.9; A1: +/-47.6, 19.7, 7.1). Inferred information-flow from mSTS towards V1 and towards A1 was significantly higher than the opposite direction during audiovisual temporal coincidence (p<.05 in both cases, see Fig. 6), relative to temporal non-coincidence. No reliable condition-specific differences were found for any direct A1-V1 influences.
Thus, visual and auditory cortices not only showed activation by audio-visual temporal correspondence in the present fMRI data; over and above this, they also showed some functional coupling with mSTS, as shown when seeding the PPI analysis there revealed condition-specific effective connectivity with A1 and V1. Moreover, DIT analysis suggested a significantly increased influence from mSTS upon A1 and V1 specifically during audio-visual temporal correspondence, rather than direct A1-V1 influences, for these fMRI data.
We found that audio-visual temporal correspondence (AVC) can affect not only brain regions traditionally considered to be multisensory, as for contralateral mSTS; but also sensory-specific visual and auditory cortex, including even primary cortices. This impact of AVC on multisensory and sensory-specific regions was systematically contralateral to the peripheral stimuli, ruling out non-specific explanations such as higher arousal in one condition than another. Contralateral preferences for STS accord with some animal single-cell work (e.g. Barraclough et al., 2005).
A role for STS in audio-visual integration, as suggested in a highly specific manner by the present data, would accord more generally with single-cell studies (Benevento et al., 1977; Bruce et al., 1981; Barraclough et al., 2005), lesion data (Petrides and Iversen, 1978) and other recent human neuroimaging work (Miller and D’Esposito, 2005; Atteveldt et al., 2006; Watkins et al., 2006) that typically used more complex or semantic stimulus materials than here. But to our knowledge no previous human study has observed the systematic contralaterality found here; nor the clear effects upon primary visual and auditory cortex in addition to mSTS, due solely to temporal correspondence between simple flashes and beeps; nor the informative pattern of functional coupling that we observed.
Calvert et al. (2001) were among the first to implicate human STS in audio-visual integration via neuroimaging, when using analysis criteria derived from classic electrophysiological work. Several human fMRI studies have now used other criteria to relate STS to audio-visual integration, for semantically related objects and sounds (e.g. Beauchamp et al., 2004a; Miller and D’Esposito, 2005; Atteveldt et al., 2006). But here we manipulated only temporal correspondence between meaningless flashes and beeps, while ensuring all other temporal factors were held constant (unlike studies that compared, say, rhythmic to arhythmic stimuli). Atteveldt et al. (2006) varied temporal alignment plus semantic congruency between visual letter symbols and auditory phonemes, reporting effects in anterior STS with audio-visual temporal offsets of several hundred msec. However, their paradigm did not assess crossmodal relations in rapid temporal-patterning (i.e. their letters did not correspond in temporal structure to their speech sounds). Moreover, their study could not reveal the systematic contralaterality observed here, nor functional coupling between areas. Several prior imaging studies (Bushara et al., 2001; Bischoff et al., 2007; Dhamala et al., 2007) used tasks that explicitly required subjects to judge temporal audiovisual (a)synchrony, versus some other task, but may thereby have activated task-related networks, rather than highlighting stimulus-driven modulation as here. While our results converge with a wide literature in implicating STS in audio-visual interactions, they go well beyond this in showing specifically contralateral activations, determined solely by audio-visual temporal correspondence for simple non-semantic stimuli, while also identifying inter-regional functional coupling.
Several previous single-cell studies have considered the temporal ‘window of integration’ for a range of brain areas (e.g. Meredith et al., 1987; Avillac et al., 2005). Here, the average stimulus rate was always 4 Hz (rectangular distribution of 2-10 Hz), and the minimal protective temporal window separating auditory and visual events when non-corresponding was 100 ms. Such temporal constraints were evidently sufficient to modulate mSTS (plus A1 and V1, see below) in a highly systematic manner (see also Lakatos et al., 2007, for potentially related audio-tactile rather than audio-visual findings, thereby hinting at a rather general multisensory phenomenon). Having established the present robust effects of audio-visual temporal correspondence, using streams comprising 8 successive items within each modality, future research could examine whether our effects would remain with streams comprising fewer items, and/or whether these effects grow as the stream and any correspondence continues (as might be studied at higher temporal resolution with EEG/MEG, in addition to fMRI).
In addition to mSTS, we found that sensory-specific visual and auditory cortex (including calcarine sulcus and Heschl’s gyrus) showed effects of audio-visual temporal correspondence, primarily contralateral to the visual stream. Remarkably this pattern was confirmed in each of the 24 individuals (except one for Heschl’s gyrus). This provides a particularly clear example that multisensory factors can affect sensory brain regions traditionally considered to be unisensory. This has become an emerging theme in recent multisensory work, using different neural measures (cf. Giard and Peronnet, 1999; Macaluso et al., 2000; Molholm et al., 2002; Brosch et al., 2005; Miller and D’Esposito, 2005; Watkins et al., 2006; Kayser et al., 2007), although some past ERP-examples were critiqued (Teder-Sälejärvi et al., 2002).
Several aspects of neuroanatomical architecture have been considered as potentially contributing to multisensory interplay (e.g. Schroeder et al., 2003), including feedforward thalamo-cortical, direct cortical-cortical links between modality-specific areas, or feedback cortico-cortical connections. The V1 and A1 effects observed here might reflect back-projections from mSTS, for which there is anatomical evidence in animals (Falchier et al., 2002). Alternatively, they might in principle reflect direct connections between V1 and A1 (or thalamic modulation, though we found no significant thalamic effects using fMRI here). Some evidence for A1-V1 connections has now been found in animals, though these appear sparse in comparison with connections involving mSTS (Falchier et al., 2002). Some ERP evidence for early multisensory interactions involving auditory (and tactile) stimuli that may arise in sensory-specific cortices has been reported (Murray et al., 2005), as have some fMRI-modulations in high-resolution monkey studies (Kayser et al., 2007).
Here we approached the issue of inter-regional influences, for our human fMRI data, using two established analysis-approaches to functional coupling or ‘connectivity’ between regions: the psychophysiological interaction (PPI) approach, and the more directional directed-information-transfer (DIT) approach. The PPI-analysis revealed significantly enhanced coupling of seeded mSTS with ipsilateral V1 and A1, specific to the audio-visual temporal correspondence condition (AVC). The DIT-analysis revealed significantly higher ‘information-flow’ from mSTS to both A1 and V1 than in the opposite direction, during the AVC condition relative to the NC condition. DIT measures for ‘direct’ influences between A1 and V1 showed no significant impact of audio-visual temporal correspondence versus non-correspondence. This pattern appears consistent with mSTS modulating A1 and V1 when auditory and visual inputs correspond temporally. Future research could address this issue using neural measures with better temporal resolution (e.g. EEG/MEG, or invasive recording in a similar paradigm); and might also examine whether possible ‘attention capture’ by corresponding streams could contribute to feedback influences predominating. Any audio-visual temporal correspondence was always task-irrelevant here, but increasing attentional load for the central task (Lavie, 2005) might nevertheless conceivably modulate the present effects. Finally, feedback influences from mSTS upon A1 in particular were also hypothesized in a recent monkey single-cell study by Ghazanfar et al. (2005), who reported increased neuronal firing rates within monkey A1 when congruent (and thus temporally corresponding) monkey lip movements and vocal expressions were used. Those authors hypothesized that the A1 enhancement might reflect feedback projections from STS, as is also suggested by the very different type of evidence here.
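The logic of a PPI analysis, as referred to above, can be illustrated with a simplified sketch: the target region's time course is regressed on the seed time course, the psychological variable, and their product, and the weight on the product term indexes condition-specific coupling. This is not the analysis pipeline used here. Standard implementations form the interaction at the neural level after deconvolving the BOLD signal, which this sketch omits. The function names and the synthetic data are assumptions for illustration only.

```python
import random


def center(xs):
    """Subtract the mean from a list of numbers."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ppi(target, seed, psych):
    """Ordinary least squares fit of
    target ~ seed + psych + seed*psych + constant.
    Returns [b_seed, b_psych, b_ppi, b_const]."""
    s, p = center(seed), center(psych)
    ppi = [a * b for a, b in zip(s, p)]  # interaction regressor
    cols = [s, p, ppi, [1.0] * len(s)]
    XtX = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(u * y for u, y in zip(ci, target)) for ci in cols]
    return solve(XtX, Xty)


# Synthetic illustration (all values hypothetical): coupling between the
# seed and the target is stronger in blocks where psych == 1.
rng = random.Random(0)
n = 240
psych = [1.0 if (i // 10) % 2 == 0 else 0.0 for i in range(n)]
seed = [rng.gauss(0, 1) for _ in range(n)]
target = [0.2 * s + 0.8 * s * (p - 0.5) + rng.gauss(0, 0.1)
          for s, p in zip(seed, psych)]
betas = fit_ppi(target, seed, psych)
print("PPI (interaction) weight:", round(betas[2], 3))
```

A positive interaction weight indicates that the seed-target relationship is stronger in the condition coded 1. Like any PPI, this says nothing about direction of influence, which is why a directional measure such as DIT is needed in addition.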
Animal work suggests that visual input into auditory-belt areas arrives at the supragranular layer, in apparent accord with a feedback-loop, although other neighbouring areas in and around auditory cortex evidently do receive some direct visual and somatosensory afferents, plus inputs from multisensory thalamic nuclei (Schroeder and Foxe, 2002; Ghazanfar and Schroeder, 2006; Lakatos et al., 2007).
For the present human paradigm, the idea of feedback influences from mSTS upon visual and auditory cortex might be put to further direct test, by combining our new fMRI paradigm with selective lesion/TMS work. If mSTS does indeed impose the effects upon A1 and V1, a lesion in mSTS (on the appropriate side) should then presumably eliminate the present effects within intact A1 and V1. By contrast, if direct A1-V1 connections (or thalamo-cortical circuits) are involved, the observed influence of audio-visual temporal correspondence on V1/A1 could presumably survive an mSTS lesion. Finally, since our new paradigm uses simple non-semantic stimuli (flashes and beeps), and does not require any task other than monitoring fixation-point brightening, it could readily be applied to non-human primates, to enable more invasive measures to identify the pathways responsible for the observed effects. A recent monkey study on audiovisual integration (Kayser et al., 2007) introduced promising imaging methods for such an approach, but did not use the relatively subtle manipulation of correspondence in temporal patterning between modalities as introduced here (instead they varied presence/absence or salience of input in a second modality, rather than manipulating exactly how this corresponded or not in temporal patterning when present).
Our fMRI results reveal a systematic pattern, replicated across two separate groups of subjects and for the individuals within these, whereby audio-visual correspondence in temporal patterning modulates contralateral mSTS, A1 and V1. This provides a compelling example that multisensory relations can affect not only conventional multisensory brain structures (as for STS), but also primary sensory cortices, when auditory and visual inputs have a related temporal structure that is very unlikely to arise by chance alone, and therefore is highly likely to reflect a common source in the external world.
TN was funded by SFB-TR-31/TPA8; JR by DFG-ri-1511/1-3; HJH and HH by BMBF CAI-0GO0504; JD by the Medical Research Council (UK) and the Wellcome Trust. JD holds a Royal Society-Leverhulme Trust Senior Research Fellowship.
Fig. S1a: Enhanced BOLD-response of STS in presence of contralateral visual stimuli that are temporally corresponding (minus non-corresponding) with the auditory stream (leftmost pair of columns = left hemisphere; rightmost pair of columns = right hemisphere). S1b: Overlap of enhanced BOLD-response in the calcarine sulcus (V1) contralateral to the coincident visual stimuli, with the enhanced functional coupling (PPI) between visual cortex and seeded STS in this multisensory context (see S1a), for each individual subject (average maximum at +/-8.4, -78.8, 7.0, p<0.004). Leftmost pair of columns = left hemisphere; rightmost pair of columns = right hemisphere. S1c: Overlap of enhanced BOLD-response in Heschl’s gyrus (A1) contralateral to the visual stimuli that may or may not correspond temporally to the auditory stream, with the enhanced functional coupling (PPI) between Heschl’s gyrus and seeded STS (see S1a) in the corresponding multisensory context, for individual subjects (average maximum at +/-47.5, 20.3, 6.5, p<0.01). Leftmost pair of columns = left hemisphere; rightmost pair of columns = right hemisphere.