Fundamental observations and principles derived from traditional physiological studies of multisensory integration have been difficult to reconcile with computational and psychophysical studies that share the foundation of probabilistic (Bayesian) inference. We review recent work on multisensory integration, focusing on experiments that bridge single-cell electrophysiology, psychophysics, and computational principles. These studies show that multisensory (visual-vestibular) neurons can account for near-optimal cue integration during perception of self-motion. Unlike the nonlinear (super-additive) interactions emphasized in some previous studies, visual-vestibular neurons accomplish near-optimal cue integration through sub-additive linear summation of their inputs, consistent with recent computational theories. Important issues remain to be resolved, including the observation that variations in cue reliability appear to change the weights that neurons apply to their different sensory inputs.
A fundamental aspect of our sensory experience is that information from different modalities is often seamlessly integrated into a unified percept. How multiple sources of sensory information are combined by neural circuits has been a subject of intense research for many years. Among the first to investigate this problem at the level of single neurons, Stein and colleagues [1,2•] characterized cells in the superior colliculus using visual, auditory and somatosensory stimuli. The prevailing conclusions from these pioneering studies are that multimodal responses (1) are generally greater than the largest unimodal response (an observation referred to as ‘enhancement’) and (2) are often larger than the sum of the unimodal responses (a property referred to as ‘super-additivity’). Weak stimuli generally produce greater response enhancement and stronger super-additivity [3,4], a property referred to as ‘the principle of inverse effectiveness’. Motivated by these findings in the superior colliculus, more recent studies, including human neuroimaging work, have focused on super-additivity and the principle of inverse effectiveness as hallmark properties of multisensory integration [e.g. 5]. Such multisensory effects have also been described in sensory-specific cortical areas [reviewed in 6]. Other studies have cautioned against placing too much emphasis on super-additivity [7-9].
In contrast to the emphasis on non-linearities (super-additivity) in neurophysiological and neuroimaging studies, psychophysicists and theorists have emphasized linear cue combination. Numerous psychophysical studies have shown that humans integrate different sensory cues in a manner consistent with a weighted linear combination of perceptual estimates from the individual cues [10-15]. The weights that a subject applies to each cue are proportional to the relative reliability of the cues, such that a less reliable cue is given less weight in perceptual estimates. In parallel, theorists have proposed that optimal cue integration at the level of behavior can be accomplished by populations of neurons that simply sum their unimodal inputs, provided that they exhibit a common form of spiking statistics [16••, see also 17].
Going forward, it is of considerable interest to relate descriptions of multisensory integration at the level of single neurons (enhancement, additivity, etc.) to descriptions of cue integration at the perceptual level. This requires linking single-unit activity to behavioral performance, examining how multimodal integration by neurons depends on cue reliability, and understanding how populations of multimodal neurons represent probabilistic information defined by multiple cues. Here we summarize some recent developments and advances in this area, and we highlight remaining challenges.
Probability theory provides a mathematical framework within which to consider the problem of cue integration. Fundamental to this approach is the notion that there is inherent uncertainty in the information available to our senses, as well as in the encoding of that information by our sensory systems. Consequently, the neural representations that guide behavior are probabilistic. Consider the problem of estimating an environmental variable, S, based on observations involving two sensory cues, r1 and r2. For example, r1 and r2 can be thought of as sets of firing rates from populations of neurons selective for cues 1 and 2. The possible values of S, given the observations of r1 and r2, are described by the probability density function, P(S|r1,r2). Under the reasonable assumption of independent noise for the two sensory modalities, it follows that P(S|r1,r2) can be computed via Bayes’ theorem [15,18]:

$$P(S \mid r_1, r_2) \propto \frac{P(r_1 \mid S)\,P(r_2 \mid S)\,P(S)}{P(r_1)\,P(r_2)} \qquad (1)$$
In this formulation, P(S|r1,r2) is called the ‘posterior’ probability density, which can be thought of as containing both an ‘estimate’ of S (e.g., the mean) and the uncertainty associated with that estimate (e.g., the variance, σ²). On the right-hand side, P(r1|S) and P(r2|S) represent the sensory ‘likelihood’ functions for cues 1 and 2, respectively, as illustrated in Fig. 1a and 1b. The likelihood quantifies the probability of acquiring a particular set of sensory evidence (e.g., a set of firing rates in a population of neurons) for each possible value of S. For an ideal observer, the goal is often to find the value of S that maximizes the posterior probability, which is referred to as the ‘maximum a posteriori’ or ‘MAP’ estimate. Note that the estimate of S derived from the posterior distribution also relies on P(S), which is called the ‘prior’ distribution. For example, P(S) can embody knowledge of the statistical properties of experimentally or naturally occurring stimuli. The terms P(r1) and P(r2) are independent of S, and thus can be ignored in the optimization process.
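The posterior computation described above can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions: the likelihoods are taken to be Gaussian functions of S, the prior is flat unless one is supplied, and the heading values, widths, and function names (`gaussian`, `posterior_on_grid`) are hypothetical rather than taken from the studies reviewed here.

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_on_grid(grid, mu1, sigma1, mu2, sigma2, prior=None):
    """Normalized posterior over `grid`: likelihood 1 x likelihood 2 x prior.

    Assumption: each likelihood, viewed as a function of S, is Gaussian and
    centered on the estimate provided by that cue.
    """
    if prior is None:
        prior = [1.0] * len(grid)  # flat prior (assumption)
    unnormalized = [gaussian(s, mu1, sigma1) * gaussian(s, mu2, sigma2) * p
                    for s, p in zip(grid, prior)]
    total = sum(unnormalized)  # stands in for the S-independent denominator
    return [u / total for u in unnormalized]

# Hypothetical example: cue 1 suggests -2 deg (sigma 1), cue 2 suggests +4 deg (sigma 2).
grid = [i * 0.01 for i in range(-2000, 2001)]
post = posterior_on_grid(grid, mu1=-2.0, sigma1=1.0, mu2=4.0, sigma2=2.0)

# MAP estimate: the grid value with the highest posterior probability.
map_estimate = grid[max(range(len(grid)), key=post.__getitem__)]
```

In this example the MAP estimate falls between the two single-cue estimates and closer to the more reliable cue, which is the qualitative behavior the Bayesian formulation predicts.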
If one further assumes Gaussian likelihoods (Fig. 1a and 1b) and a uniform (or at least very broad) prior probability P(S), then the (bimodal) MAP estimate of S can be expressed as a weighted sum of the estimates based on each sensory cue alone, with weights that are proportional to the relative reliabilities of the cues (where reliability is given by inverse variance, 1/σ²; [19,20]):

$$\hat{S}_{\mathrm{BIMODAL}} = w_1 \hat{S}_1 + w_2 \hat{S}_2, \qquad w_1 = \frac{1/\sigma_1^2}{1/\sigma_1^2 + 1/\sigma_2^2}, \quad w_2 = \frac{1/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2} \qquad (2)$$

According to equation 2, more reliable cues have a greater weighting in the posterior estimate (Fig. 1a and 1b). In addition, the variance of the bimodal estimate is predicted to be lower than that of either unimodal estimate, according to:

$$\sigma_{\mathrm{BIMODAL}}^2 = \frac{\sigma_1^2\,\sigma_2^2}{\sigma_1^2 + \sigma_2^2} \qquad (3)$$

According to equation 3, the largest predicted improvement in sensitivity (a decrease in the standard deviation of the bimodal estimate by a factor of √2, equivalent to a halving of its variance) occurs when the two cues have equal reliability (i.e., σ1 = σ2). In the extreme case where one cue is much more reliable than the other (σ1 << σ2), behavior is ‘captured’ by that cue such that σBIMODAL ≈ σ1.
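The two limiting cases described above (equal reliability and ‘capture’ by one cue) can be checked with a few lines of arithmetic. The following is a minimal sketch of reliability-weighted combination under the Gaussian, flat-prior assumptions stated in the text; the function name `combine_cues` and the example numbers are hypothetical.

```python
import math

def combine_cues(s1, sigma1, s2, sigma2):
    """Reliability-weighted combination of two unimodal estimates.

    Returns (bimodal estimate, bimodal standard deviation, w1, w2).
    """
    rel1, rel2 = 1.0 / sigma1 ** 2, 1.0 / sigma2 ** 2  # reliability = inverse variance
    w1 = rel1 / (rel1 + rel2)
    w2 = rel2 / (rel1 + rel2)
    s_bimodal = w1 * s1 + w2 * s2
    var_bimodal = (sigma1 ** 2 * sigma2 ** 2) / (sigma1 ** 2 + sigma2 ** 2)
    return s_bimodal, math.sqrt(var_bimodal), w1, w2

# Equal reliability: weights are 0.5 each and sigma shrinks by a factor of sqrt(2).
equal = combine_cues(0.0, 2.0, 0.0, 2.0)

# Very unequal reliability: the bimodal sigma stays close to that of the better cue.
captured = combine_cues(0.0, 1.0, 5.0, 20.0)
```

With equal unimodal standard deviations of 2, the bimodal standard deviation is 2/√2 (about 1.41); with standard deviations of 1 and 20, it remains just below 1 and nearly all of the weight goes to the more reliable cue.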
Multiple human psychophysical experiments, using a number of different paradigms, have shown that human behavior is governed by these principles [10,12,19-21]. The basic result is quite consistent across studies: when combining multiple sensory cues, humans typically perform as Bayesian optimal observers. Recently, these findings have been extended to macaque monkeys [22••], providing for the first time an animal model for combined behavioral and neurophysiological investigations of cue integration. Gu et al. [22••] employed a heading discrimination task in which subjects reported their perceived direction of self-motion relative to straight ahead in a two-alternative forced-choice (2AFC) design (i.e., monkeys reported heading ‘leftward’ or ‘rightward’ on each trial). Macaques were trained to perform this task based on either visual (optic flow) or non-visual (vestibular) cues alone, as well as with both cues presented synchronously and congruently.
Gu et al. found that the behavior of macaques follows the predicted (equation 3) decrease in discrimination threshold under bimodal stimulus conditions (Fig. 1c and 1d). Psychometric functions became steeper for bimodal as compared to unimodal stimuli (Fig. 1c, green versus red/blue data). On average, bimodal thresholds were reduced by ~30% under cue combination and were similar to predictions from equation 3 (Fig. 1d, green versus black bars). When visual and vestibular cues were put in conflict while varying the reliability (motion coherence) of the optic flow cue, macaques adjusted the relative weighting of the two cues on a trial-by-trial basis, in agreement with the predictions of equation 2 [23, see also 24]. Thus, macaques, like humans, can combine sensory cues in a statistically optimal manner.
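The behavioral predictions discussed above can be sketched numerically. The example below assumes that the psychometric function is a cumulative Gaussian whose standard deviation serves as the discrimination threshold, and it adopts a hypothetical conflict convention in which the visual heading is offset by +delta/2 and the vestibular heading by -delta/2; neither detail is specified in the text, and all function names are illustrative.

```python
import math

def p_rightward(heading, threshold, bias=0.0):
    """Cumulative-Gaussian psychometric function (assumed form)."""
    return 0.5 * (1.0 + math.erf((heading - bias) / (threshold * math.sqrt(2.0))))

def predicted_bimodal_threshold(t_vis, t_ves):
    """Optimal bimodal threshold predicted from the two unimodal thresholds."""
    return math.sqrt((t_vis ** 2 * t_ves ** 2) / (t_vis ** 2 + t_ves ** 2))

def predicted_conflict_bias(t_vis, t_ves, delta):
    """Predicted shift of the point of subjective equality under cue conflict.

    Assumed convention: visual heading = h + delta/2, vestibular = h - delta/2.
    """
    w_vis = (1.0 / t_vis ** 2) / (1.0 / t_vis ** 2 + 1.0 / t_ves ** 2)
    w_ves = 1.0 - w_vis
    return -(w_vis - w_ves) * delta / 2.0

# Equal unimodal thresholds give the maximal predicted improvement (about 29%).
t_bimodal = predicted_bimodal_threshold(2.0, 2.0)
reduction = 1.0 - t_bimodal / 2.0

# A more reliable visual cue (lower threshold) pulls the percept toward vision.
bias = predicted_conflict_bias(t_vis=1.0, t_ves=2.0, delta=4.0)
```

The predicted reduction for matched unimodal thresholds is close to the ~30% average improvement reported for the monkeys, and the predicted bias changes sign when the relative reliabilities of the two cues are reversed.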
These behavioral experiments establish a robust animal model for exploring the neural basis of Bayesian inference [22••]. For perception of heading, candidate multisensory neurons are found in the dorsal medial superior temporal area (MSTd) [25-29], as well as in other brain areas (e.g., area VIP) not considered further here. Spiking discharges of multimodal MSTd neurons were quantified while the animal performed the heading discrimination task. For a subgroup of MSTd neurons with congruent visual and vestibular heading preferences (‘congruent’ neurons; [22••]), heading tuning became steeper in the combined condition (Fig. 2a, green). Signal detection theory was then used to construct ‘neurometric’ functions (Fig. 2b) that quantify the ability of an ideal observer to discriminate heading based on the activity of a single neuron. For congruent cells, such as the neuron in Fig. 2a, neuronal thresholds were smallest in the combined condition, indicating that the neuron could discriminate smaller variations in heading when both cues were provided. Another class of neurons in area MSTd has opposite heading preferences for visual and vestibular stimuli (‘opposite’ cells). For these neurons, the neurometric function was generally flatter under cue combination [22••].
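One common way to build a neurometric function from signal detection theory is receiver operating characteristic (ROC) analysis of spike counts. The sketch below is an illustration only: it assumes that responses at each heading are compared against responses at a straight-ahead reference, which may differ from the procedure used in the original study, and the spike counts and function names are invented for the example.

```python
def roc_area(signal, noise):
    """Area under the ROC curve, built by sweeping a spike-count criterion."""
    criteria = sorted(set(signal) | set(noise))
    # Each point is (false-alarm rate, hit rate) for the rule "count >= criterion".
    points = [(sum(x >= c for x in noise) / len(noise),
               sum(x >= c for x in signal) / len(signal)) for c in criteria]
    points.append((0.0, 0.0))  # criterion above every observed count
    area = 0.0
    for (fa0, hit0), (fa1, hit1) in zip(points, points[1:]):
        area += (fa0 - fa1) * (hit0 + hit1) / 2.0  # trapezoid rule
    return area

def neurometric(counts_by_heading, reference=0.0):
    """ROC area for each heading versus the reference heading (assumed design)."""
    ref = counts_by_heading[reference]
    return {h: roc_area(c, ref)
            for h, c in sorted(counts_by_heading.items()) if h != reference}

# Hypothetical spike counts for a neuron that prefers rightward (positive) headings.
counts = {
    -4.0: [8, 9, 10, 9],
    -1.0: [11, 12, 10, 12],
    0.0: [12, 13, 11, 12],
    1.0: [13, 12, 14, 13],
    4.0: [16, 15, 17, 16],
}
curve = neurometric(counts)
```

Each ROC value can be read as the proportion of ‘rightward’ decisions an ideal observer would make from that neuron alone; fitting a cumulative Gaussian to these values, as for a psychometric function, would yield a neuronal threshold.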
Overall, the extent to which the activity of MSTd neurons followed the prediction of cue-integration theory (equation 3) depended systematically on the congruency of visual and vestibular tuning for heading (Fig. 2c). Congruent cells (Fig. 2c, filled circles) had thresholds close to the optimal prediction (horizontal dashed line). In contrast, thresholds for opposite cells were generally much higher than those predicted from optimal cue integration (Fig. 2c, open circles). Thus, the average sensitivity of congruent cells increased under bimodal stimulation by an amount similar to the monkeys’ behavior, suggesting that congruent cells could provide a neural substrate for the behavioral effects seen in Fig. 1.
The neuronal threshold results suggest that monkeys should selectively monitor the activity of congruent cells to perform the heading discrimination task. In support of this idea, choice probabilities (CPs) were also found to depend significantly on congruency of visual/vestibular tuning (Fig. 2d; [22••]). CP quantifies whether trial-to-trial fluctuations in neural firing rates are correlated with fluctuations in the monkeys’ perceptual choices, independent of stimulus variations. A significant CP>0.5 indicates that the monkey tends to choose the neuron’s preferred sign of heading (leftward vs. rightward) when the neuron fires more strongly. Such a result is thought to reflect a functional link between the neuron and perception [32-34]. Congruent cells tended to have positive CPs (>0.5) whereas opposite cells did not (Fig. 2d), consistent with the hypothesis that monkeys selectively monitored congruent cells to perform the task.
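Choice probability is conventionally computed as an ROC area between firing rates sorted by the animal's choice, for trials with an identical stimulus. The sketch below uses the equivalent pairwise-comparison form of that area; it assumes a single fixed stimulus condition, and the firing rates and the function name `choice_probability` are hypothetical.

```python
def choice_probability(rates_pref_choice, rates_null_choice):
    """Probability that a random 'preferred choice' trial has a higher rate
    than a random 'null choice' trial, with ties counted as one half.

    Both lists must come from trials with the same stimulus, so that any
    difference reflects choice-related rather than stimulus-related variation.
    """
    if not rates_pref_choice or not rates_null_choice:
        raise ValueError("need at least one trial for each choice")
    score = 0.0
    for a in rates_pref_choice:
        for b in rates_null_choice:
            if a > b:
                score += 1.0
            elif a == b:
                score += 0.5
    return score / (len(rates_pref_choice) * len(rates_null_choice))

# Hypothetical rates (spikes/s) at one ambiguous heading, split by the monkey's choice.
cp = choice_probability([10, 12, 14], [9, 11, 13])
```

A value above 0.5, as in this toy example, means the neuron tended to fire more on trials in which the animal chose the neuron's preferred direction; a value of 0.5 means firing carried no information about the upcoming choice.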
In summary, by simultaneously monitoring neural activity and behavior in the range of psychophysical threshold, it has been possible to study neural mechanisms of multisensory integration under conditions in which cue integration also takes place perceptually. In addition to demonstrating near-optimal cue integration by monkeys, a population of cortical neurons has been identified that could account for improvement in psychophysical performance under cue combination. Notably, Gu et al. found that these congruent MSTd neurons exhibited sub-additive interactions during performance of the heading discrimination task, indicating that super-additivity is not necessary to achieve near-optimal cue integration. These findings establish a model system for studying the detailed mechanisms by which neurons combine different sensory signals and dynamically re-weight these signals to optimize performance as the reliability of cues varies.
In addition to predicting increased sensitivity (equation 3), the Bayesian formulation predicts that monkeys dynamically re-weight sensory cues on a trial-by-trial basis according to their relative reliabilities (equation 2). Linear combination at the level of perceptual estimates makes no clear prediction for the underlying neuronal combination rule. Recently, however, theorists have proposed that neurons could accomplish Bayesian integration via linear summation of unimodal inputs [16••]. In this scenario, neurons combine their inputs with fixed weights that do not change with cue reliability; changes in cue reliability are reflected in the bimodal response simply due to weaker input from a less reliable cue [16••]. Alternatively, changes in cue reliability could alter the weights with which a single neuron integrates its multisensory inputs.
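The fixed-weight summation scenario can be illustrated with a toy population code. This is a sketch under strong assumptions that go beyond the text: independent Poisson variability, identical Gaussian tuning curves in the two input populations, dense uniform tiling of preferred headings, response gain standing in for cue reliability, and noise-free mean counts used in place of sampled spike counts. All names and parameter values are hypothetical.

```python
import math

PREFERRED = [float(c) for c in range(-90, 91)]  # preferred headings in deg (hypothetical)
WIDTH = 10.0                                    # tuning width in deg (hypothetical)

def tuning(s):
    """Unit-gain Gaussian tuning curves evaluated at stimulus s."""
    return [math.exp(-0.5 * ((s - c) / WIDTH) ** 2) for c in PREFERRED]

def log_likelihood(counts, s):
    """Poisson log likelihood of s, dropping terms that do not depend on s.

    With dense uniform tiling the summed tuning curves are roughly constant
    in s, so only the term that is linear in the counts remains.
    """
    return sum(r * (-0.5 * ((s - c) / WIDTH) ** 2)
               for r, c in zip(counts, PREFERRED))

def decode(counts, grid):
    """Grid value that maximizes the log likelihood."""
    return max(grid, key=lambda s: log_likelihood(counts, s))

g1, g2 = 20.0, 5.0                       # gains: cue 1 is the more reliable cue
r1 = [g1 * h for h in tuning(-2.0)]      # cue 1 signals -2 deg
r2 = [g2 * h for h in tuning(4.0)]       # cue 2 signals +4 deg
r_sum = [a + b for a, b in zip(r1, r2)]  # fixed-weight (1:1) summation

grid = [i * 0.05 for i in range(-200, 201)]
estimate = decode(r_sum, grid)
```

Because the log likelihood is linear in the spike counts, decoding the summed activity is equivalent to adding the two unimodal log likelihoods, and the decoded heading lands near the gain-weighted average of the two cues even though the summation weights themselves never change.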
It is therefore of considerable interest to characterize how neurons combine their inputs when cue reliability varies. While experiments are currently underway in trained animals, a simpler experiment was performed in which reliability of the visual cue (coherence of optic flow) was varied while recording neural activity in a passively fixating animal [35••]. Bimodal responses of MSTd neurons were well fit by a weighted linear sum of vestibular and visual unimodal responses, typically accounting for ~90% of the variance in bimodal responses. Interestingly, the weights required to fit the bimodal responses changed with the relative reliabilities of the two cues; vestibular weights increased and visual weights decreased as the coherence of the visual stimulus declined from 100% to 50% and 25% [35••]. Thus, MSTd neurons appear to give less weight to their visual inputs when optic flow is degraded, a property that might contribute to similar findings in behavior (equation 2) [10-12].
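A weighted linear sum of this kind can be fit by ordinary least squares. The sketch below is a simplified illustration rather than the fitting procedure of the original study: it omits any constant offset term, uses invented mean firing rates, and the function names are hypothetical.

```python
def fit_weights(r_ves, r_vis, r_bi):
    """Least-squares weights for r_bi ~ w_ves * r_ves + w_vis * r_vis.

    Solves the 2x2 normal equations directly; no offset term (assumption).
    """
    a = sum(x * x for x in r_ves)
    b = sum(x * y for x, y in zip(r_ves, r_vis))
    d = sum(y * y for y in r_vis)
    p = sum(x * z for x, z in zip(r_ves, r_bi))
    q = sum(y * z for y, z in zip(r_vis, r_bi))
    det = a * d - b * b
    if det == 0:
        raise ValueError("unimodal responses are collinear; weights are not unique")
    return (p * d - q * b) / det, (q * a - p * b) / det

def variance_accounted_for(r_ves, r_vis, r_bi, w_ves, w_vis):
    """Fraction of variance in the bimodal responses explained by the fit."""
    predicted = [w_ves * x + w_vis * y for x, y in zip(r_ves, r_vis)]
    mean_bi = sum(r_bi) / len(r_bi)
    sse = sum((z - p) ** 2 for z, p in zip(r_bi, predicted))
    sst = sum((z - mean_bi) ** 2 for z in r_bi)
    return 1.0 - sse / sst

# Hypothetical mean rates (spikes/s) across four headings.
r_ves = [10.0, 20.0, 30.0, 20.0]
r_vis = [40.0, 10.0, 20.0, 30.0]
r_bi = [22.0, 16.0, 26.0, 24.0]  # constructed as 0.6 * r_ves + 0.4 * r_vis

w_ves, w_vis = fit_weights(r_ves, r_vis, r_bi)
vaf = variance_accounted_for(r_ves, r_vis, r_bi, w_ves, w_vis)
```

Repeating such a fit separately at each motion coherence would show whether the recovered weights stay fixed or shift toward the vestibular input as the visual cue is degraded.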
These results are not necessarily in conflict with theoretical predictions [16••] for at least two reasons. First, MSTd neurons may not adhere to the assumptions of the model (e.g., Poisson-like firing statistics and multiplicative effects of stimulus reliability). Indeed, the effect of motion coherence on visual heading tuning in MSTd does not appear to be purely multiplicative [35••,36]. Second, the model has not considered the effects of interactions at the network level, such as divisive normalization, that cause neural responses to saturate with increasing intensity. It remains to be determined whether common network-level nonlinearities, such as divisive normalization, could reconcile the experimental and theoretical observations.
How do findings of Bayes-optimal cue integration relate to the notions of super-additivity and inverse effectiveness pioneered by Stein and colleagues? Because the superior colliculus is thought to play important roles in orienting behavior where detection of stimuli is paramount, the original studies by Stein and colleagues emphasized near-threshold stimuli. However, this may bias outcomes toward a non-linear (super-additive) operating range in which multisensory interactions are strongly influenced by the nonlinear relationship between membrane potential and firing rate [38,39]. More recently, Stein’s group has shown that interactions in the superior colliculus cease to be super-additive as stimulus strength increases [3,4,7], and others have seen sub-additive effects in colliculus neurons using stronger stimuli in behaving animals [40,41]. Thus, the sub-additivity seen in MSTd by Gu et al. [22••] and Morgan et al. [35••] is likely due to use of supra-threshold stimuli and is consistent with results in other cortical areas and systems [42-47].
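The dependence of additivity on stimulus strength can be illustrated with a toy neuron whose inputs sum linearly before passing through a static output nonlinearity. The sigmoidal form, its parameters, and the function names below are assumptions chosen for illustration; they are not taken from the cited studies.

```python
import math

def firing_rate(drive, r_max=100.0, theta=10.0, k=2.0):
    """Sigmoidal mapping from summed input drive to firing rate (assumed form)."""
    return r_max / (1.0 + math.exp(-(drive - theta) / k))

def evoked(drive):
    """Response above the spontaneous rate (rate at zero drive)."""
    return firing_rate(drive) - firing_rate(0.0)

def interaction(drive_a, drive_b):
    """Return (percent enhancement, additivity ratio) for a bimodal stimulus.

    Inputs are assumed to add linearly before the output nonlinearity.
    Enhancement is relative to the larger unimodal response; an additivity
    ratio above 1 is super-additive and below 1 is sub-additive.
    """
    a, b = evoked(drive_a), evoked(drive_b)
    both = evoked(drive_a + drive_b)
    enhancement = 100.0 * (both - max(a, b)) / max(a, b)
    additivity = both / (a + b)
    return enhancement, additivity

weak = interaction(3.0, 3.0)      # near-threshold inputs
strong = interaction(12.0, 12.0)  # supra-threshold inputs
```

In this sketch weak inputs fall on the accelerating part of the nonlinearity and combine super-additively with large proportional enhancement, whereas strong inputs approach saturation and combine sub-additively with smaller enhancement, mirroring the principle of inverse effectiveness.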
Rowland et al. [48••] have recently shown that there is a transient super-additive phase of neuronal responses followed by additivity or sub-additivity at longer time scales. Such early manifestations of super-additivity might decrease reaction times in bimodal orienting behaviors [49,50]. Despite its utility for orienting behaviors and detection of weak stimuli, super-additivity may play relatively little role in other types of behaviors, such as discrimination among supra-threshold stimuli. As summarized here, sub-additive responses of congruent MSTd neurons appear to be able to account for multisensory integration in heading perception. Thus, super-additivity is not a general hallmark of multisensory integration.
The past decade has seen a dramatic increase in our understanding of the computational principles that characterize human multisensory perception. Yet, relatively little is known about the neural mechanisms that underlie multisensory integration and probabilistic (Bayesian) inference in general. Concepts of super-additivity and inverse effectiveness that have dominated thinking in the field for many years do not necessarily lead toward an understanding of behavior. Rather, we suggest that studies focus on whether and how multisensory integration improves the signal-to-noise ratio of neural representations, and how changes in cue reliability bias decoding of neural population responses. Recent experiments on the neural basis of visual/vestibular cue integration for heading perception suggest that this alternative approach may be an effective way to link multisensory integration at the level of neurons and behavior. Although some critical experiments have yet to be done, results to date suggest that macaques and humans use similar computational principles for combining multiple sensory cues, and that these principles may be accounted for by the properties of individual neurons in multisensory cortical areas.
Yet, important questions remain: How distributed are these representations of multisensory signals in the brain? What are the mechanisms by which neurons re-weight their inputs according to reliability? How do populations of sensory neurons represent probability distributions? How are these population signals decoded to make optimal decisions, and how much of the necessary computation takes place in sensory representations versus decision-making networks? Future work should address these questions to further bridge the gap between traditional physiological descriptions of multisensory integration (e.g., super-additivity) and recent behavioral and computational descriptions of cue integration that are built upon the foundation of Bayesian inference.
Supported by NIH EY017866 and EY019087 (to DEA) and NIH EY016178 (to GCD).