Although multisensory integration has been well modeled at the behavioral level, the link between these behavioral models and the underlying neural circuits is still not clear. This gap is even greater for the problem of sensory integration during movement planning and execution. The difficulty lies in applying simple models of sensory integration to the complex computations that are required for movement control and to the large networks of brain areas that perform these computations. Here I review psychophysical, computational, and physiological work on multisensory integration during movement planning, with an emphasis on goal-directed reaching. I argue that sensory transformations must play a central role in any modeling effort. In particular the statistical properties of these transformations factor heavily into the way in which downstream signals are combined. As a result, our models of optimal integration are only expected to apply “locally”, i.e. independently for each brain area. I suggest that local optimality can be reconciled with globally optimal behavior if one views the collection of parietal sensorimotor areas not as a set of task-specific domains, but rather as a palette of complex, sensorimotor representations that are flexibly combined to drive downstream activity and behavior.
Multiple sensory modalities often provide “redundant” information about the same stimulus parameter, for example when one can feel and see an object touching one’s arm. Understanding how the brain combines these signals has been an active area of research. As described below, models of optimal integration have been successful at capturing psychophysical performance in a variety of tasks. Furthermore, network models have shown how optimal integration could be instantiated in neural circuits. However, strong links have yet to be made between these bodies of work and neurophysiological data.
Here we address how models of optimal integration apply to the context of a sensory-guided movement and its underlying neural circuitry. This paper focuses on the sensory integration required for goal-directed reaching and how that integration is implemented in the parietal cortex. We show that the models developed for perceptual tasks and simple neural networks cannot, on their own, explain behavioral and physiological observations. These principles may nonetheless apply at a “local level” within each neuronal population. Lastly, the link between local optimality and globally optimal behavior is considered in the context of the broad network of sensorimotor areas in parietal cortex.
The principal hallmark of sensory integration should be the improvement of performance when multiple sensory signals are combined. In order to test this concept, we must choose a performance criterion by which to judge improvement. In the case of perception for action, the goal is often to estimate a spatial variable from the sensory input, e.g. the location of the hand or an object in the world. In this case, the simplest and most commonly employed measure of performance is the variability of the estimate. It is not difficult to show that the minimum variance combination of two unbiased estimates of a variable x is given by the expression:
x̂_integ = (σ_2² x̂_1 + σ_1² x̂_2) / (σ_1² + σ_2²),   σ_integ² = σ_1² σ_2² / (σ_1² + σ_2²)   (Equation 1)

where x̂_i, i = 1,2, are the unimodal estimates and σ_i² are their variances. In other words, the integrated estimate x̂_integ is the weighted sum of the two unimodal estimates, with weights inversely proportional to the respective variances. Importantly, the variance of the integrated estimate, σ_integ², is always less than either of the unimodal variances. While Equation 1 assumes that the unimodal estimates x̂_i are scalar and independent (given x), the solution is easily extended to correlated or multidimensional signals. Furthermore, since the unimodal estimates are often well approximated by independent, normally distributed random variables, x̂_integ can also be viewed as the Maximum Likelihood (ML) integrated estimate (Ernst and Banks, 2002; Ghahramani et al., 1997). This model has been tested psychophysically by measuring performance variability with unimodal sensory cues and then predicting either variability or bias with bimodal cues. Numerous studies have reported ML-optimal or near-optimal sensory integration in human subjects performing perceptual tasks (for example, Ernst and Banks, 2002; Ghahramani et al., 1997; Jacobs, 1999; Knill and Saunders, 2003; van Beers et al., 1999).
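As a concrete sketch of Equation 1 (the numbers below are invented for illustration, not taken from any of the cited studies):

```python
def ml_integrate(x1, var1, x2, var2):
    """Minimum-variance (ML) combination of two independent, unbiased
    estimates of the same quantity: weights are inversely proportional
    to each estimate's variance, and the integrated variance,
    1 / (1/var1 + 1/var2), is below either input variance."""
    w1 = (1.0 / var1) / (1.0 / var1 + 1.0 / var2)
    x_integ = w1 * x1 + (1.0 - w1) * x2
    var_integ = 1.0 / (1.0 / var1 + 1.0 / var2)
    return x_integ, var_integ

# Illustrative numbers only: a precise visual estimate (variance 1 cm^2)
# and a noisier proprioceptive estimate (variance 4 cm^2) of hand position.
x_hat, var_hat = ml_integrate(10.0, 1.0, 12.0, 4.0)
# The integrated estimate (10.4) lies closer to the more reliable cue,
# and its variance (0.8) is smaller than either unimodal variance.
```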
Sensory integration is more complicated for movement planning than for a simple perceptual task. The problem is that movement planning and execution rely on a number of different computations, and estimates of the same spatial variable may be needed for several of these. For example, there is both psychophysical (Rossetti et al., 1995) and physiological (Batista et al., 1999; Buneo et al., 2002; Kakei et al., 1999, 2001) evidence for two separate stages of movement planning, as illustrated in Figure 1. First, the movement vector is computed as the difference between the target location and the initial position of the hand. Next, the initial velocity along the planned movement vector must be converted into joint angle velocities (or other intrinsic variables such as muscle activations), which amounts to evaluating an inverse kinematic or dynamic model. This evaluation also requires knowing the initial position of the arm.
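The two stages can be made concrete for a planar two-link arm; the link lengths, joint angles, and target below are invented for illustration, and the biological inverse model is of course far richer than an inverse Jacobian:

```python
import numpy as np

L1, L2 = 0.30, 0.30              # upper-arm and forearm lengths (m), assumed
q = np.array([0.5, 1.0])         # current shoulder and elbow angles (rad)

def hand_position(q):
    """Forward kinematics: joint angles -> planar hand position."""
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])

def jacobian(q):
    """d(hand position)/d(joint angles) at configuration q."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

target = np.array([0.20, 0.40])

# Stage 1: movement vector = target minus initial hand position (extrinsic).
move_vec = target - hand_position(q)

# Stage 2: convert the desired initial hand velocity into joint-angle
# velocities via the inverse kinematic (Jacobian) model. Note that both
# stages require an estimate of the initial arm configuration q.
v_hand = move_vec                # desired initial hand velocity (1-s reach)
qdot = np.linalg.solve(jacobian(q), v_hand)
```

Both stages consume an estimate of initial arm position, but in different coordinates, which is why noise in the sensory transformations can affect them differently.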
When planning a reaching movement, humans can often both see and feel the location of their hand. The ML model of sensory integration would seem to predict that the same weighting of vision and proprioception should be used for both of the computations illustrated in Figure 1. However, we have previously shown that when reaching to visual targets, the relative weighting of these signals was quite different for the two computations: movement vector planning relied almost entirely on vision of the hand, and the inverse model evaluation relied more strongly on proprioception (Sober and Sabes, 2003). We hypothesized that the difference was due to the nature of the computations. Movement vector planning requires comparing the visual target location to the initial hand position. Since proprioceptive signals would first have to be transformed, this computation favors vision. Conversely, evaluation of the inverse model deals with intrinsic properties of the arm, favoring proprioception. Indeed, when subjects are asked to reach to a proprioceptive target (their other hand), the weighting of vision is significantly reduced in the movement vector calculation (Sober and Sabes, 2005). We hypothesized that these results are consistent with “local” ML integration, performed separately for each computation, if sensory transformations inject variability into the transformed signal.
In order to make this hypothesis quantitative, we must understand the role of sensory transformations during reach planning and their statistical properties. We developed and tested a model for these transformations by studying patterns of reach errors (McGuire and Sabes, 2009). Subjects made a series of interleaved reaches to visual targets, proprioceptive targets (the other hand, unseen), or bimodal targets (the other hand, visible), as illustrated in Figure 2A. These reaches were made either with or without visual feedback of the hand prior to reach onset, and in particular during an enforced delay period after target presentation (after movement onset, feedback was extinguished in all trials). We took advantage of a bias in reaching that naturally occurs when subjects fixate a location distinct from the reach target. In particular, when subjects reach to a visual target in the peripheral visual field, reaches tend to be biased further from the fixation point (Bock, 1993; Enright, 1995). This pattern of reach errors is illustrated in the left-hand panels of Figure 2B: when reaching left of the fixation point a leftward bias is observed, and similarly for the right. Thus, these errors follow a retinotopic pattern, i.e. the bias curves shift with the fixation point. The bias pattern changes, but remains retinotopic, when reaching to bimodal targets (Figure 2C) or proprioceptive targets (Figure 2D). Most notably, the sign of the bias switches for proprioceptive reaches: subjects tend to reach closer to the point of fixation. Finally, the magnitude of these errors depends on whether visual feedback of the reaching hand is available prior to movement onset (compare the top and bottom panels of Figures 2B–D; see also Beurze et al. (2007)).
While these bias patterns might seem arbitrary, they suggest an underlying mechanism. First, the difference in the sign of errors for visual and proprioceptive targets suggests that the bias arises in the transformation from a retinotopic (or eye-centered) representation to a body-centered representation. To see why, consider that in its simplified one-dimensional form, the transformation requires only adding or subtracting the gaze location (see the box labeled “Transformation” in Figure 2E). This might appear to be a trivial computation. However, the internal estimate of gaze location is itself an uncertain quantity. We argued that this estimate relies on current sensory signals (proprioception or efference copy) as well as an internal prior that “expects” gaze to be coincident with the target. Thus, the estimate of gaze would be biased toward a retinally peripheral target. Since visual and proprioceptive information about target location travels in different directions through this transformation, a biased estimate of gaze location results in oppositely signed errors for the two signals, as observed in Figure 2B,D. Furthermore, because the internal estimate of gaze location is uncertain, the transformation adds variability to the signal (see also Schlicht and Schrater, 2007), even if the addition or subtraction operation itself can be performed without error (not necessarily the case for neural computations, Shadlen and Newsome, 1994). One consequence of this variability is that access to visual feedback of the hand would improve the reliability of an eye-centered representation (upper pathway in Figure 2E) more than it would improve the reliability of a body-centered representation (lower pathway in Figure 2E), since the latter receives a transformed, and thus more variable, version of the signal.
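A one-dimensional sketch makes the sign flip concrete. All numbers below are invented, and the linear shrinkage of the gaze estimate toward the target is an assumed stand-in for the Bayesian prior described in the text:

```python
a = 0.2          # strength of the "gaze coincides with target" prior (assumed)
G = 0.0          # true fixation location (body-centered, deg)
T = 10.0         # target location, 10 deg to the right of fixation

G_hat = (1 - a) * G + a * T      # internal gaze estimate, biased toward T

# Visual target: encoded eye-centered, then transformed to body-centered
# coordinates by ADDING the (biased) gaze estimate.
T_eye = T - G                    # true eye-centered target location
reach_visual = T_eye + G_hat     # biased AWAY from fixation (overshoot)

# Proprioceptive target: encoded body-centered, transformed to eye-centered
# coordinates by SUBTRACTING the gaze estimate; the resulting plan is then
# executed relative to the true gaze direction.
T_eye_hat = T - G_hat
reach_proprio = T_eye_hat + G    # biased TOWARD fixation (undershoot)
```

Because the two signals traverse the same biased transformation in opposite directions, the resulting reach errors are equal in magnitude and opposite in sign.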
Therefore, if the final movement plan were constructed from the optimal combination of an eye-centered and body-centered plan (rightmost box in Figure 2E), the presence of visual feedback of the reaching hand should favor the eye-centered representation. This logic explains why the visual feedback of the reaching hand decreases the magnitude of the bias for visual targets (when the eye-centered space is unbiased; Figure 2B) but increases the magnitude of the bias for proprioceptive targets (when the eye-centered space is biased; Figure 2D).
Together, these ideas form the Bayesian integration model of reach planning with ‘parallel representations’, illustrated in Figure 2E. In this model, all sensory inputs related to a given spatial variable are combined with weights inversely proportional to their local variability (Equation 1), and a movement vector is then computed. This computation occurs simultaneously in an eye-centered and a body-centered representation. The two resultant movement vectors have different uncertainties, depending on the availability and reliability of the sensory signals they receive in a given experimental condition. The final output of the network is itself a weighted sum of these two representations. We fit the four free parameters of the model (corresponding to values of sensory variability) to the reach error data shown in solid lines in Figure 2B–D. The model captures those error patterns (dashed lines in Figure 2B–D), and predicts the error patterns from two similar studies described above (Beurze et al., 2007; Sober and Sabes, 2005). In addition, the model predicts the differences we observed in reach variability across experimental conditions (Figure 2F).
These results challenge the idea that movement planning should begin by mapping the relevant sensory signals into a single common reference frame (Batista et al., 1999; Buneo et al., 2002; Cohen et al., 2002). The model shows that the use of two parallel representations of the movement plan yields a less variable output in the face of variable and sometimes missing sensory signals and noisy internal transformations. It is not clear whether or how this model can be mapped onto the real neural circuits that underlie reach planning. For example, the two parallel representations could be implemented by a single neuronal population (Pouget et al., 2002; Xing and Andersen, 2000; Zipser and Andersen, 1988). Before addressing this issue, though, we consider the question of how single neurons or populations of neurons should integrate their afferent signals.
Stein and colleagues have studied multimodal responses in single neurons in the deep layers of cat superior colliculus, and have found both enhancement and suppression of multimodal responses (Meredith and Stein, 1983; Meredith and Stein, 1986; Stein and Stanford, 2008). Based on this work, they suggest that the definition of sensory integration at the level of the single unit is for the responses to be significantly enhanced or suppressed relative to the preferred unimodal stimulus (Stein et al., 2009). However, this definition is overly broad, and includes computations that are not typically thought of as integration. For example, Kadunce et al. (1997) showed that cross-modal suppressive effects in the superior colliculus often mimic those observed for paired within-modality stimuli. These effects are most likely due not to integration, but rather to competition within the spatial map of the superior colliculus, similar to the process seen during saccade planning in primate superior colliculus (Dorris et al., 2007; Trappenberg et al., 2001).
The criterion for signal integration should be the presence of a shared representation that offers improved performance (e.g., reduced variability) when multimodal inputs are available. Here, a “shared” representation is one that encodes all sensory inputs similarly. Using the notation of Equation 1, the strongest form of a shared representation is one in which neural activity is a function only of x̂_integ and σ_integ², rather than being a function of the independent inputs x̂_1, σ_1², x̂_2, and σ_2². The advantage of such a representation is that downstream areas need not know which sensory signals were available in order to use the information.
Ma et al. (2006) suggest a relatively simple approach to achieving such an integrated representation. They show that a population of neurons that simply adds the firing rates of independent input populations (or their linear transformations) effectively implements ML integration, at least when firing rates have Poisson-like distributions. This result can be understood intuitively for Poisson firing rates. The variance of the ML decode from each population is inversely proportional to its gain. Therefore, summing the inputs yields a representation with the summed gains, and thus with variance that matches the optimum defined in Equation 1 above. Furthermore, because addition preserves information about variability, this operation can be repeated hierarchically, a desirable feature for building more complex circuits like those required for sensory-guided movement.
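The gain-addition argument can be checked numerically. In the sketch below (tuning widths, gains, and grids are all arbitrary choices, not taken from Ma et al.), two input populations share Gaussian tuning curves and differ only in gain; summing their Poisson spike counts and decoding at the summed gain reproduces, exactly, the product of the two unimodal posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)

prefs = np.linspace(-20, 20, 41)      # preferred stimuli of the neurons
s_grid = np.linspace(-10, 10, 201)    # hypothesis grid for decoding
width = 5.0                           # shared tuning width

def rates(s, gain):
    """Mean firing rates of a population with the given gain."""
    return gain * np.exp(-(prefs - s) ** 2 / (2 * width ** 2))

def log_posterior(counts, gain):
    """Poisson log-posterior over s_grid (flat prior), max-normalized."""
    lam = gain * np.exp(-(prefs[None, :] - s_grid[:, None]) ** 2
                        / (2 * width ** 2))
    ll = (counts[None, :] * np.log(lam) - lam).sum(axis=1)
    return ll - ll.max()

g1, g2, s_true = 8.0, 4.0, 2.0        # unimodal gains and true stimulus
n1 = rng.poisson(rates(s_true, g1))   # spike counts, population 1
n2 = rng.poisson(rates(s_true, g2))   # spike counts, population 2

# Summing the counts and decoding at the summed gain is equivalent to
# multiplying the two unimodal posteriors (i.e. ML integration).
post_sum = log_posterior(n1 + n2, g1 + g2)
post_prod = log_posterior(n1, g1) + log_posterior(n2, g2)
post_prod -= post_prod.max()
```

The equality holds because gain enters the Poisson log-likelihood only through terms independent of the stimulus, so a degraded cue (lower gain) automatically contributes less to the combined posterior.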
It remains unknown whether real neural circuits employ such a strategy, or even if they combine their inputs in a statistically optimal manner. In practice, it can be difficult to quantitatively test the predictions of this and similar models. For example, strict additivity of the inputs is not to be expected in many situations, such as in the presence of inputs that are correlated or non-Poisson, or if activity levels are normalized within a given brain area (Ma et al., 2006; Ma and Pouget, 2008). These difficulties are compounded for recordings from single neurons. In this case, biases in the unimodal representations of space would lead to changes in firing rates across modalities even in the absence of integration-related changes in gain.
Nonetheless, several hallmarks of optimal integration have been observed in the responses of bimodal (visual and vestibular) motion encoding neurons in macaque area MST: bimodal activity is well-modeled as a weighted linear sum of unimodal responses; the visual weighting decreases when the visual stimulus is degraded; and the variability of the decoding improves in the bimodal condition (Morgan et al., 2008). Similarly, neurons in macaque Area 5 appear to integrate proprioceptive and visual cues of arm location (Graziano et al., 2000). In particular, the activity for a mismatched bimodal stimulus is between that observed for location-matched unimodal stimuli (weighting), and the activity for matched bimodal stimuli is greater than that observed for proprioception alone (variance reduction).
In the context of the neural circuits for reach planning, locally optimal integration of incoming signals could be sufficient to explain behavior, as illustrated in the parallel representations model of Figure 2. Here we ask whether the principles in this model can be mapped onto the primate cortical reach network.
Volitional arm movements in primates involve a large network of brain areas with a rich pattern of inter-area connectivity. Within this larger circuit, there is a subnetwork of areas, illustrated in Figure 3, that appear to be responsible for the complex sensorimotor transformations required for goal-directed reaches under multisensory guidance. Visual information primarily enters this network via the parieto-occipital area, particularly area V6 (Galletti et al., 1996; Shipp et al., 1998). Proprioceptive information primarily enters via Area 5, which receives direct projections from primary somatosensory cortex (Crammond and Kalaska, 1989; Kalaska et al., 1983; Pearson and Powell, 1985). These visual and proprioceptive signals converge on a group of parietal sensorimotor areas in or near the intraparietal sulcus (IPS): MDP and 7m (Ferraina et al., 1997a; Ferraina et al., 1997b; Johnson et al., 1996), V6a (Galletti et al., 2001; Shipp et al., 1998), and MIP and VIP (Colby et al., 1993; Duhamel et al., 1998). The parietal reach region (PRR), characterized physiologically by Andersen and colleagues (Batista et al., 1999; Snyder et al., 1997), includes portions of MIP, V6a, and MDP (Snyder et al., 2000a). These parietal areas project forward to the dorsal premotor cortex (PMd) and, in some cases, the primary motor cortex (M1), and they all exhibit some degree of activity related to visual and proprioceptive movement cues, the pending movement plan (“set” or “delay” activity), and the ongoing movement kinematics or dynamics.
While the network illustrated in Figure 3 is clearly more complex than the simple computational schematic of Figure 2E, there is a suggestive parallel. Both Area 5 and MIP integrate multimodal signals and project extensively to the rest of the reach circuit, yet they differ in their anatomical proximity to their visual versus proprioceptive inputs: Area 5 is closer to somatosensory cortex, and MIP is closer to the visual inputs to the reach circuit. Furthermore, Area 5 uses more body- or hand-centered representations compared to the eye-centered representations reported in MIP (Batista et al., 1999; Buneo et al., 2002; Chang and Snyder, 2010; Colby and Duhamel, 1996; Ferraina et al., 2009; Kalaska, 1996; Lacquaniti et al., 1995; Marconi et al., 2001; Scott et al., 1997). Thus, these areas are potential candidates for the parallel representations predicted in the behavioral model.
To test this possibility, we recorded from Area 5 and MIP (Figure 4A) as macaque monkeys performed the same psychophysical task that was illustrated in Figure 2A for human subjects (McGuire and Sabes, 2011). One of the questions we addressed in this study is whether there is evidence for parallel representations of the movement plan in body- and eye-centered reference frames. We performed several different analyses to characterize neural reference frames; here we focus on the tuning-curve approach illustrated in Figure 4B. Briefly, we fit a tuning curve to the neural responses for a range of targets with two different fixation points (illustrated schematically as the red and blue curves in Figure 4B). Tuning was assumed to be a function of T – δE, where T and E are the target and eye locations in absolute (or body-centered) space and δ is a dimensionless quantity. If δ = 0, the firing rate depends on the body-centered location of the target (left panel of Figure 4B), and if δ = 1, the firing rate depends on the eye-centered location of the target (right panel of Figure 4B).
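A sketch of this shift analysis (all tuning parameters below are invented): a cell whose firing depends on T – δE shifts its tuning curve by δ·ΔE when fixation moves by ΔE, so δ can be read off from the peak shift across the two fixation points:

```python
import numpy as np

def rate(T, E, delta, pref=0.0, width=8.0, gain=30.0):
    """Hypothetical Gaussian tuning to the shifted variable T - delta*E."""
    return gain * np.exp(-(T - delta * E - pref) ** 2 / (2 * width ** 2))

targets = np.linspace(-20, 20, 401)   # fine grid of target locations (deg)
E1, E2 = -10.0, 10.0                  # the two fixation points
true_delta = 0.6                      # an "intermediate" reference frame

# The tuning peak sits at pref + delta*E, so comparing peak locations
# across the two fixations recovers delta.
peak1 = targets[np.argmax(rate(targets, E1, true_delta))]
peak2 = targets[np.argmax(rate(targets, E2, true_delta))]
delta_hat = (peak2 - peak1) / (E2 - E1)
```

Unlike a forced choice between δ = 0 and δ = 1, this estimate allows a continuum of intermediate reference frames.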
We found that there is no difference in the mean or distribution of shift values across target modality for either cortical area (Figure 4C,D), i.e. these are shared (modality-invariant) representations. Although some evidence for other shared movement-related representations has been found in the parietal cortex (Cohen et al., 2002), many studies of multisensory areas in the parietal cortex and elsewhere have found that representations are determined, at least in part, by the representation of the current sensory input (Avillac et al., 2005; Fetsch et al., 2007; Jay and Sparks, 1987; Mullette-Gillman et al., 2005; Stricanne et al., 1996). Shared representations such as these have the advantage that downstream areas do not need to know which sensory signals are available in order to make use of the representation.
We also observed significant differences in the mean and distribution of shift values across cortical areas, with MIP exhibiting a more eye-centered representation (mean δ = 0.51) and Area 5 a more body-centered representation (mean δ = 0.25). In a separate analysis, we showed that more MIP cells encode target location alone, compared to Area 5, where more cells encode both target and hand location (McGuire and Sabes, 2011). These inter-area differences parallel observations from Andersen and colleagues of eye-centered target coding in PRR (Batista et al., 1999) and eye-centered movement vector representation in Area 5 (Buneo et al., 2002). However, where those papers report consistent, eye-centered reference frames, we observed a great deal of heterogeneity in representations within each area, with most cells exhibiting “intermediate” shifts between 0 and 1. We believe this discrepancy stems primarily from the analyses used: the shift analysis does not force a choice between alternative reference frames, but rather allows for a continuum of intermediate reference frames. When an approach very similar to ours was applied to recordings from a more posterior region of the IPS, a similar spread of shift values was obtained, although the mean shift value was somewhat closer to unity (Chang and Snyder, 2010).
While we did not find the simple eye- and body-centered representations that were built into the parallel representations model of Figure 2E, the results can nonetheless be interpreted in light of that model. We found that both Area 5 and MIP use modality-invariant representations of the movement plan, an important feature of the model. Furthermore, there are multiple integrated representations of the movement plan within the superior parietal lobe, with an anterior-to-posterior gradient in the magnitude of gaze-dependent shifts (Chang and Snyder, 2010; McGuire and Sabes, 2011). A statistically optimal combination of these representations, dynamically changing with the current sensory inputs, would likely provide a close match to the output of the model. The physiological recordings also revealed a great deal of heterogeneity in shift values, suggesting an alternate implementation of the model. Xing and Andersen (2000) have observed that a network with a broad distribution of reference frame shifts can be used to compute multiple simultaneous readouts, each in a different reference frame. Indeed, a broad distribution of gaze-dependent tuning shifts has been observed within many parietal areas (Avillac et al., 2005; Chang and Snyder, 2010; Duhamel et al., 1998; Mullette-Gillman et al., 2005; Stricanne et al., 1996). Thus, parallel representations of the movement plan could also be implemented within a single heterogeneous population of neurons.
We have adopted a simple definition of sensory integration, namely, improved performance when multiple sensory modalities are available – whether in a behavioral task or with respect to the variability of neural representations. This definition leads naturally to criteria for optimal integration such as the minimum variance/ML model of Equation 1, and a candidate mechanism for achieving such optimality was discussed above (Ma et al., 2006). In the context of a complex sensorimotor circuit, a mechanism such as this could be applied at the local level to integrate the afferent signals at each cortical area, independently across areas. However, these afferent signals will include the effects of the specific combination of upstream transformations, and so such a model would only appear to be optimal at the local level. It remains an open question as to how locally optimal (or near-optimal) integration could lead to globally optimal (or near-optimal) behavior.
The parietal network that underlies reaching is part of a larger region along the IPS that subserves a wide range of sensorimotor tasks (reviewed, e.g., in Andersen and Buneo, 2002; Burnod et al., 1999; Colby and Goldberg, 1999; Grefkes and Fink, 2005; Rizzolatti et al., 1997). These tasks make use of many sensory inputs, each naturally linked to a particular reference frame (e.g. visual signals originate in a retinotopic reference frame), as well as an array of kinematic feedback signals needed to transform from one reference frame to another. In this context, it seems logical to suggest a series of representations and transformations, e.g., from eye-centered to hand-centered space, as illustrated in Figure 5A. This schema offers a great degree of flexibility, since the “right” representation would be available for any given task. An attractive hypothesis is that a schema such as this could be mapped onto the series of sensorimotor representations that lie along the IPS, e.g., from the retinotopic visual maps in area V6 (Fattori et al., 2009; Galletti et al., 1996) to the hand-centered grasp-related activity in AIP.
The pure reference frame representations illustrated in the schema of Figure 5A are not consistent with the evidence for heterogeneous “intermediate” representations. However, the general schema of a sequence of transformations and representations might still be correct, since the neural circuits implementing these transformations need not represent these variables in the reference frames of their inputs, as illustrated by several network models of reference-frame transformations (Deneve et al., 2001; Salinas and Sejnowski, 2001; Xing and Andersen, 2000; Zipser and Andersen, 1988). The use of network models such as these could reconcile the schema of Figure 5A with the physiological data (Pouget et al., 2002; Salinas and Sejnowski, 2001).
While this schema is conceptually attractive, it has disadvantages. As described above, each transformation will inject variability into the circuit. This variability would accrue along the sequence of transformations, a problem that could potentially be avoided by “direct” sensorimotor transformations such as those proposed by Buneo et al. (2002). Furthermore, in order not to lose fidelity along this sequence, all intermediate representations require comparably sized neuronal populations, even representations that are rarely directly used for behavior. Ideally, one would be able to allocate more resources to a retinotopic representation, for example, than to an elbow-centered representation.
An alternative schema is to combine many sensory signals into each of a small number of representations; in the limit, a single complex representational network could be used (Figure 5B). It has been shown that multiple reference frames can be read out from a single network of neurons when those neurons use “gain-field” representations, i.e. when their responses are multiplicative in the various input signals (Salinas and Abbott, 1995; Salinas and Sejnowski, 2001; Xing and Andersen, 2000). More generally, non-linear basis functions create general-purpose representations that can be used to compute (at least approximately) a wide range of task-relevant variables (Pouget and Sejnowski, 1997; Pouget and Snyder, 2000). In particular, this approach would allow “direct” transformations from sensory to motor variables (Buneo et al., 2002) without the need for intervening sequences of transformations. However, this schema also has limitations. In order to represent all possible combinations of variables, the number of required neurons increases exponentially with the number of input variables (the “curse of dimensionality”). Indeed, computational models of such generic networks show a rapid increase in errors as the number of input variables grows (Salinas and Abbott, 1995). This limitation becomes prohibitive when the number of sensorimotor variables approaches a realistic value.
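The gain-field idea can be sketched numerically (population size, tuning widths, and stimulus ranges below are arbitrary choices): a single population of units multiplicative in eye-centered target location and eye position supports a linear readout of either the eye-centered location x or the body-centered location x + e:

```python
import numpy as np

rng = np.random.default_rng(1)

x_prefs = np.linspace(-20, 20, 9)     # preferred eye-centered locations
e_prefs = np.linspace(-20, 20, 9)     # preferred eye positions

def gain_field(x, e, width=8.0):
    """81 conjunctive units: Gaussian in x multiplied by Gaussian in e."""
    fx = np.exp(-(x - x_prefs) ** 2 / (2 * width ** 2))
    ge = np.exp(-(e - e_prefs) ** 2 / (2 * width ** 2))
    return np.outer(fx, ge).ravel()

# Random eye-centered target locations and eye positions (training data).
X = rng.uniform(-15, 15, 500)
E = rng.uniform(-15, 15, 500)
B = np.array([gain_field(x, e) for x, e in zip(X, E)])

# Two least-squares linear readouts from the SAME population activity:
w_eye, *_ = np.linalg.lstsq(B, X, rcond=None)        # eye-centered frame
w_body, *_ = np.linalg.lstsq(B, X + E, rcond=None)   # body-centered frame

rmse_eye = float(np.sqrt(np.mean((B @ w_eye - X) ** 2)))
rmse_body = float(np.sqrt(np.mean((B @ w_body - (X + E)) ** 2)))
```

Both readouts achieve small residual error from one shared basis, but representing all conjunctions of more input variables would require exponentially many such units.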
A solution to this problem, illustrated in Figure 5C, is to have a large number of networks, each with only a few inputs or encoding only a small subspace of the possible outputs. These networks would likely have complex or “intermediate” representations of sensorimotor space that would not directly map either to particular stages in the kinematic sequence (Figure 5A) or to the “right” reference frames for a set of tasks. Instead, the downstream circuits for behavior would draw upon several of these representations. This schema is consistent with the large continuum of representations seen along the IPS (reviewed, for example, in Burnod et al., 1999), and with the fact that the anatomical distinctions between nominal cortical areas in this region are unclear and remain a matter of debate (Cavada, 2001; Lewis and Van Essen, 2000). It is also consistent with the fact that there is a great deal of overlap in the pattern of cortical areas that are active during any given task, e.g. saccade and reach activity have been observed in overlapping cortical areas (Snyder et al., 1997, 2000b) and grasp-related activity can be observed in nominally reach-related areas (Fattori et al., 2009). This suggests that brain areas around the IPS should not be thought of as a set of task-specific domains (e.g., Andersen and Buneo, 2002; Colby and Goldberg, 1999; Grefkes and Fink, 2005), but rather as a palette of complex, sensorimotor representations.
This schema suggests a mechanism by which locally optimal integration could yield globally optimal behavior, essentially a generalization of the parallel representations model of Figure 2E. In both the parallel representations model and the schema of Figure 5C, downstream motor circuits integrate overlapping information from multiple sensorimotor representations of space. For any specific instance of a behavior, the weighting of these representations should depend on their relative variability, perhaps determined by gain (Ma et al., 2006), and this variability would depend on the sensory and motor signals available at that time. If each of the representations in this palette contains a locally optimal mixture of its input signals, optimal weighting of the downstream projections from this palette could drive statistically efficient behavior.
This work was supported by the National Eye Institute (R01 EY-015679) and the National Institute of Mental Health (P50 MH77970). I thank John Kalaska, Joseph Makin, and Matthew Fellows for reading and commenting on earlier drafts of this manuscript.