Here we report early cross-sensory activations and audiovisual interactions in both A1 and V1 in humans. The current study is, to our knowledge, the first to utilize both MEG and fMRI in the same subjects for this purpose, which has the advantage of offering spatiotemporally accurate estimates; individually, each method compromises between spatial and temporal accuracy. The delay from sensory-specific to cross-sensory activity was clearly asymmetrical: 55 ms in the auditory and 10 ms in the visual cortex. This timing pattern reflects the fact that sensory-specific activations start earlier in the auditory (23 ms) than in the visual (43 ms) cortex, and is thus consistent with the idea that the cross-sensory activations originate in the sensory cortex of the opposite stimulus modality, with a conduction delay of about 30–35 ms between the two areas. Audiovisual interactions were observed after both sensory-specific and cross-sensory inputs converged on the sensory cortex.
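The timing argument above can be laid out as simple bookkeeping (a minimal sketch using only the latencies quoted in the text; the variable names are ours):

```python
# Onset-latency bookkeeping for the values reported above (all in ms).
A1_SENSORY = 23   # auditory-cortex onset to auditory stimuli
V1_SENSORY = 43   # visual-cortex onset to visual stimuli
A1_TO_CROSS = 55  # delay from sensory-specific to cross-sensory activity in A1
V1_TO_CROSS = 10  # delay from sensory-specific to cross-sensory activity in V1

# Cross-sensory onsets implied by the delays above:
a1_cross = A1_SENSORY + A1_TO_CROSS  # visual input reaching auditory cortex
v1_cross = V1_SENSORY + V1_TO_CROSS  # auditory input reaching visual cortex

# Conduction delay = cross-sensory onset minus the sensory-specific onset
# in the opposite-modality cortex, where the signal presumably originates:
print(v1_cross - A1_SENSORY)  # 30 ms for the A1 -> V1 direction
print(a1_cross - V1_SENSORY)  # 35 ms for the V1 -> ... -> A1 direction
```

This makes explicit why the asymmetry of the sensory-specific onsets (23 vs. 43 ms) accounts for the asymmetry of the cross-sensory delays, given a roughly constant 30–35 ms conduction time between the areas.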
Since MEG detects synchronous activity of thousands of neurons, the relationship between anatomical distance and conduction delay is not necessarily straightforward. Therefore, with the ~30 ms delay, the cross-sensory activations could utilize direct cortico-cortical connections between the auditory and visual cortices, connect through a subcortical relay, or travel through an association cortical area such as STP/STS (e.g., Raij et al., 2000). Under the last option, one would additionally expect activity in STS before cross-sensory activity appears in A1/V1. The analysis is complicated by the fact that, based on intracranial data from primates (Schroeder & Foxe, 2002; Schroeder et al., 2003) and EEG recordings in humans (Foxe & Simpson, 2002), visual stimuli would be expected to activate STS starting only about 8 ms after V1 onset, thus largely overlapping with the cross-sensory activations in the auditory cortex. In our data, STS was strongly activated in the right hemisphere at the same time as the cross-sensory auditory cortex activation, consistent with the possibility of the signal traveling through STS; in the left hemisphere, however, no clear STS activation was observed. Hence, STS seems unlikely to play a key role. An additional factor to take into account is that the conduction delay showed a small asymmetric trend: 30 ms for auditory stimuli, which have a known monosynaptic connection A1→V1, and 35 ms for visual stimuli, which have a known, somewhat longer pathway V1→V2→A1. Hence, it appears plausible that the earliest cross-sensory activations may utilize the A1→V1 and V1→V2→A1 pathways. Future studies utilizing dynamic causal modeling (Lin et al., 2009; Schoffelen & Gross, 2009) might provide additional insight.
As described in the Introduction, another possibility is that subcortical pathways send direct cross-sensory inputs to the sensory cortices. If the subcortical structures have a delay between auditory and visual processing similar to that between A1 and V1, then latency data alone cannot distinguish between cortico-cortical and subcortico-cortical cross-sensory influences. However, no such audiovisual pathways are currently known. Clearly, correct interpretation of functional connectivity analyses greatly benefits from accurate anatomical connectivity information.
The current results could mistakenly be interpreted to suggest that the earliest audiovisual interactions can occur only after the cross-sensory inputs arrive at the sensory cortex. This would put a lower limit of 53 ms in the visual cortex and 75 ms in the auditory cortex for audiovisual interactions to start, which is in fact what was observed in the present MEG data. Yet, there is strong EEG evidence of audiovisual interactions in humans occurring earlier, starting at about 40 ms and maximal over posterior areas (Giard & Peronnet, 1999; Molholm et al., 2002; Teder-Sälejärvi et al., 2002; Molholm et al., 2004). We suggest three possible explanations for why these early interactions were not observed in the present MEG study. First, EEG may receive a somewhat stronger contribution from subcortical generators than MEG (Goldenholz et al., 2009), which is consistent with the idea that the early interactions in EEG may be generated in subcortical structures participating in multisensory processes. Second, the subcortical parts of the afferent pathways leading to sensory cortex could be modulated by subcortical multisensory influences, which would allow audiovisual interactions to occur from the very beginning of the cortically generated “sensory-specific” responses. However, this scenario would predict that the early interactions should be equally visible in EEG and MEG. Third, because MEG is sensitive mainly to tangentially oriented currents, we could have missed some earlier components if they were radial. However, this is unlikely given that it has been estimated that only about 10% of the cortical surface (thin strips at the crests of gyri) is radial enough to generate currents undetectable with MEG (Hillebrand & Barnes, 2002); further, source orientation differences would be expected to influence all activations and interactions equally, because the source areas were kept constant in the present study. Therefore, the most likely explanation is that the early interactions are generated in subcortical structures. EEG/MEG source localization accuracy for deep generators is poor, so these methods are not well suited for more accurate localization of the subcortical structures.
Unexpectedly, fMRI detected strong cross-sensory activations in the calcarine fissure, whereas in Heschl’s gyri these were almost absent. In the present data, some voxels in Heschl’s gyri were significantly (albeit weakly) activated by visual stimuli at the typical BOLD signal peak latency (see ), while the majority were not, rendering the reliability of this observation inconclusive. Previous fMRI studies have shown that at least some classes of visual stimuli (such as lip movements) may robustly activate A1 (Pekkola et al., 2005). Even simple stimuli such as those employed in the current study have been reported to result in cross-sensory activations (Martuzzi et al., 2007). One possible explanation is that the acoustical EPI scanner noise dampened evoked responses in the auditory cortex through neuronal adaptation. It is also possible that, again due to the acoustical scanner noise, the BOLD signal saturated before the neurons did (Bandettini et al., 1998). A possible reason why our study may have been affected by this more than those mentioned above is that the acoustic noise is EPI-parameter dependent: our faster scanning could have increased the noise. This interpretation is supported by the MEG results, recorded in a completely quiet environment, which showed clear cross-sensory responses in the supratemporal auditory cortex.
It is unclear what the functional roles of the early cross-sensory activations might be. Behaviorally, for complex processing such as audiovisual speech, asynchrony as large as 250 ms can go unnoticed (Miller & D’Esposito, 2005). Moreover, in realistic stimulus environments the auditory input lags the visual input depending on the distance from the source (a 9 ms increase for every 3 m of distance), which influences the relative timing of the auditory and visual inputs. Possibly the early cross-sensory influences play a role in lower-order processing, where synchrony requirements may be tighter than for audiovisual speech. There is also evidence that these activations may be task dependent (Wang et al., 2008). Plausibly, early cross-sensory activations could serve to facilitate later processing stages and reaction times by enhancing top-down processing and speeding up the exchange of signals between brain areas (Bar et al., 2006; Raij et al., 2008; Sperdin et al., 2009).
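The distance-dependent acoustic lag mentioned above follows directly from the speed of sound; a minimal sketch (assuming ~343 m/s in air at room temperature, and treating light travel time as negligible):

```python
# Acoustic travel-time lag of the auditory input relative to the
# (effectively instantaneous) visual input from the same source,
# assuming a speed of sound of ~343 m/s in air at room temperature.
SPEED_OF_SOUND_M_PER_S = 343.0

def auditory_lag_ms(distance_m: float) -> float:
    """Milliseconds for sound to travel distance_m."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

print(round(auditory_lag_ms(3.0), 1))   # ~8.7 ms at 3 m, i.e. ~9 ms per 3 m
print(round(auditory_lag_ms(30.0), 1))  # ~87.5 ms at 30 m
```

At conversational distances the lag is thus well under the ~250 ms asynchrony tolerated for audiovisual speech, but it grows to behaviorally relevant magnitudes for distant sources.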
As a technical finding, a very high SNR was necessary to detect onset latencies accurately. The present results were achieved by using a low-noise MEG instrument, a high-quality shielded room, and a large number of stimuli. The averaged responses in the current study consisted of about 300 individual responses per subject, which was not quite sufficient for sensor-space analysis at the individual-subject level, but quite sufficient for grand-average analysis (about 2100 individual responses, or twice as many when additionally averaging across hemispheres). However, compared with the sensor data, extracting time courses from the auditory and visual sensory cortices by dSPM source analysis greatly improved SNR at the individual level (more robust onsets and less interindividual variability), hence giving more accurate results that also agreed well with the grand-average values. Moreover, in both sensor and source space, we present two different across-subjects analyses: onset latencies picked (i) from grand-average (N=7) responses ( and ) and (ii) from individual subjects’ responses ( and ). The latter were useful for testing the statistical significance of latency differences across areas. However, the grand-average response consists of the largest number of epochs and therefore has by far the best SNR, consequently showing slightly earlier onsets than those picked from the individual subjects’ responses (e.g., compare and ). Yet, grand-average responses could also be biased toward early onsets if some subjects show earlier onsets than the others. In our data this bias appears to be quite small, as most latencies are similar across and . Still, differentiating between the boost given by improved SNR and the possible bias caused by subjects with faster onsets is difficult. The two analyses complement each other and offer slightly different interpretations. The grand-average analysis is well suited for finding the earliest onset latencies across the subject pool. The means across values from individual subjects, in turn, show slightly longer onset latencies but are better protected against individual bias. An additional possibility would be to use bootstrapping to synthesize multiple grand averages and, after picking onsets from each, study their means and variances. The results, shown in Supporting Table S1, may offer a compromise between the two analyses. With the present data, all three analyses give quite similar results and lead to the same conclusions; to improve comparability with earlier studies, we here focus on reporting the results obtained with the most widely used methods.
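The bootstrap variant mentioned above could be sketched along these lines (an illustrative sketch only: the threshold-crossing onset criterion and the synthetic step-function data are our assumptions, not the study’s actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_onset(trace: np.ndarray, times_ms: np.ndarray, thresh: float) -> float:
    """First time point at which the response exceeds thresh.
    (Illustrative criterion; the study's actual onset definition may differ.)"""
    above = np.nonzero(trace > thresh)[0]
    return float(times_ms[above[0]]) if above.size else float("nan")

def bootstrap_onsets(subject_traces: np.ndarray, times_ms: np.ndarray,
                     thresh: float, n_boot: int = 1000) -> np.ndarray:
    """Resample subjects with replacement, form a synthesized grand average
    per resample, and pick an onset from each."""
    n_subj = subject_traces.shape[0]
    onsets = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_subj, size=n_subj)
        grand = subject_traces[idx].mean(axis=0)
        onsets[b] = pick_onset(grand, times_ms, thresh)
    return onsets

# Toy demonstration: 7 synthetic "subjects" with step responses near 50 ms.
times = np.arange(0.0, 200.0, 1.0)
subjects = np.array([np.where(times > 50 + d, 1.0, 0.0)
                     for d in rng.normal(0, 3, 7)])
onsets = bootstrap_onsets(subjects, times, thresh=0.5)
print(onsets.mean(), onsets.std())  # mean onset near 50 ms, with its spread
```

The mean and standard deviation over the bootstrapped onsets then estimate the central tendency and variability of the grand-average onset, combining the SNR advantage of averaging with an explicit measure of across-subject uncertainty.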
The current results are not directly comparable with studies where stimuli or tasks in one modality precede the other. For example, in audiovisual speech, the visual input (lip movements) typically starts 100–300 ms before the auditory stimulus onset, and may therefore modulate the incoming auditory signals at multiple levels, including the secondary auditory cortex (Besle et al., 2008) and even the central auditory pathways (Musacchia et al., 2006). Similarly, auditory evoked responses can be modulated by visuomotor processes such as gaze direction as early as the inferior colliculus (Groh et al., 2001). As yet another example, attention may modulate responses and interactions in primary sensory cortices through top-down mechanisms as soon as they appear (Talsma et al., 2007; Poghosyan & Ioannides, 2008; Karns & Knight, 2009). The flash-sound illusion would also appear to belong in this category (Shams et al., 2002; Shams et al., 2005; Watkins et al., 2006; Mishra et al., 2007).
In the current study, visual stimuli were presented foveally. However, anatomical studies have shown that areas in the calcarine fissure representing peripheral vision may be more strongly connected with the auditory cortex than areas representing the fovea (Falchier et al., 2002; Wang et al., 2008). It is therefore plausible that cross-sensory latencies could be shorter for peripherally than for foveally presented visual stimuli, although some previous studies have found the opposite effect (Talsma & Woldorff, 2005; Talsma et al., 2007).
The late negative BOLD undershoots for auditory stimuli in the calcarine cortex () are consistent with an earlier block-design fMRI study reporting cross-sensory negative BOLD activations (Laurienti et al., 2002); however, due to their study design, the authors could not investigate the time courses of the BOLD responses. Our BOLD time course analysis shows that the cross-sensory responses in the visual cortex have a small initial positive component, followed by a clearly stronger negative deactivation component. Temporal summation of such events in a block design would be expected to result in a net negative BOLD effect.
These findings are consistent with previously reported sensory-specific and cross-sensory activations (see Introduction). For example, the A1 onset for our auditory stimuli at 23 ms is only ~8 ms slower than the earliest responses to clicks recorded from the human auditory cortex intracranially (Celesia, 1976) or by MEG (Parkkonen et al., 2009), and the V1 onset at 43 ms is simultaneous with the earliest reported responses from V1 (see Foxe & Schroeder, 2005; Musacchia & Schroeder, 2009, for reviews). The observed cross-sensory onset latencies are, to our knowledge, the fastest reported in humans. This was made possible by the good SNR in our data and the extraction of source-specific amplitudes. Audiovisual interactions were observed only after the uni- and cross-sensory inputs converged on the sensory cortex, but once this happened the interactions appeared almost instantaneously (3–21 ms after convergence). The findings contribute to the understanding of cross-sensory activations and interactions in sensory cortices by establishing lower limits to the latencies at which they can be expected to occur. The results have implications regarding the possible pathways that cross-sensory activations utilize, and suggest that audiovisual interactions occurring before the cross-sensory signals arrive (for simultaneous stimuli, 53 ms in the visual cortex and 75 ms in the auditory cortex) are most likely of subcortical origin; interactions after these latencies could be either cortically or subcortically generated.