

J Cogn Neurosci. Author manuscript; available in PMC 2010 September 17.

Published in final edited form as:

PMCID: PMC2941160

NIHMSID: NIHMS226323

Fredrik Bissmarck, ATR International, Kyoto, Japan;

Address correspondence to: Fredrik Bissmarck, ATR Computational Neuroscience Labs, 2-2-2 Hikaridai Keihanna Science City, Seika, Soraku, Kyoto 619-0288, JAPAN, Phone: +81 774 95 1205, Fax: +81 774 95 1259, Email: xfredrik@atr.jp


Feedback signals may differ in modality, latency and accuracy. The feedback available for learning and controlling a motor task may be redundant, so it is not necessary to rely on every accessible feedback loop. Which feedback loops, then, should be utilized? In this article, we propose that latency is a critical factor in determining which signals are influential at different learning stages. We use a computational framework to study the role of feedback modules with different latencies in optimal motor control. Instead of explicit gating between modules, the reinforcement learning algorithm learns to rely on the more useful module. We tested our paradigm in two implementations, which confirmed our hypothesis. In the first, we examined how feedback latency affects the competitiveness of two otherwise identical modules. In the second, we examined an example of visuomotor sequence learning, in which a plastic, faster somatosensory module interacts with a preacquired, slower visual module. We found that the overall performance depended on the latency of the faster module alone, while the relative latency determined how independent the faster module became of the slower one. In the second implementation, the somatosensory module with shorter latency overtook the slower visual module and realized better overall performance. The visual module played different roles in early and late learning: first, it guided the exploration of the somatosensory module; then, once learning had converged, it contributed robustness against system noise and external perturbations. Overall, these results demonstrate that our framework successfully learns to utilize the most useful available feedback for optimal control.

For motor control and learning, the brain relies on feedback signals of different modalities, such as vision and somatosensation, and appears to use them selectively depending on the task demands and the extent of learning. For example, to play the piano, the novice must first rely on both visual and somatosensory feedback for finger movement. With practice, she can gradually reduce her reliance on visual feedback, and as an expert she does not need to look at the keyboard at all. How does the reliance on feedback change with learning? In this paper, we consider the hypothesis that feedback delay is a major factor in the selection of feedback modality, and test whether appropriate feedback pathways can be selected through reinforcement learning to achieve the best real-time motor performance.

Recently, the optimal feedback control paradigm (Todorov & Jordan, 2002; Kording & Wolpert, 2006) has successfully predicted human motor behaviour (Todorov & Jordan, 2002; Kording & Wolpert, 2004; Liu & Todorov, 2007). Under this framework, execution is preceded by state estimation, which integrates the available feedback modalities with internal models. The current state must be inferred from the delayed feedback in a recursive manner. In the context of well-learned, specialized motor skills, which are characterized by fast execution and minimum effort, this may be computationally expensive and time-consuming. In the present study, we propose an alternative, model-free architecture for the learning and control of motor skills, in which motor commands are computed in parallel by a modular circuit for each modality. Motor commands are mapped directly onto the crude, delayed sensory feedback, and then integrated with the outputs of the other modalities. This way, the quickest feedback is directly available to the controller for exploitation.

The actor-critic (Barto, 1995) is a reinforcement learning architecture proposed to be implemented by the basal ganglia-thalamocortical (BG-TC) system (Houk, Adams, & Barto, 1995; Montague, Dayan, & Sejnowski, 1996; Doya, 1999). Our framework is an actor-critic architecture with multiple actors (Nakahara, Doya, & Hikosaka, 2001), where each actor corresponds to a modality or submodality. We propose that the feedback latency constrains the utility of each actor. Further, we propose that the reinforcement learning algorithm plays a critical role in gating between inputs: only the better feedback signals, presumably those with shorter latency, are reinforced. Thus, once learned, modular inputs are gated implicitly by their latency. In our framework, the gating is realized by a combination of population-coded outputs, sharpened by a softmax function in favour of the module with the highest confidence. This mechanism differs from explicit gating (Jacobs, Jordan, & Barto, 1991; Haruno, Wolpert, & Kawato, 2001), where explicit signals are computed to weight the influence of modules on the combined output.

The paper is outlined as follows: first, we present the general framework of our model (section 2). Then, we describe two implementations for validation of our model, “Experiment I” and “Experiment II” (section 3), with the results of the simulations presented in the following section (section 4). In Experiment I, we studied a very simple system to clearly understand the effect of feedback delays. Two modules, identical except for their feedback delays, were trained until convergence on a simple arm reaching task. We found 1) that performance was constrained by the latency of the faster module, and 2) that the contribution of a module depended on its latency relative to the other module. The faster module could effectively learn the task without interference from the slower module. In Experiment II, we studied the interaction between vision and somatosensation in a sequential reaching task. Here, we assumed that a “somatosensory module”, corresponding to a motor skill, is learned under the assistance of a “visual module”, a pre-acquired, general but suboptimal controller guided by visual feedback. We found that for the somatosensory module to become independent of the visual module, it is critical that the somatosensory module has the shorter latency. During learning, despite its longer latency, the visual module still functioned as a teacher for the somatosensory module. Once the motor skill was sufficiently acquired by the somatosensory module, it could execute the movement alone, independently of the visual module. At this mature stage of learning, the visual module still contributed by maintaining robustness against unexpected perturbations. Given these results, we propose that our framework, which combines reinforcement learning modules by a softmax combination of population codes, realizes flexible learning and robust motor control by utilizing the best available feedback. We discuss its implications in section 5.

As a simple model of learning control using multiple delayed feedback channels, we consider a modular architecture as shown in Figure 1. The state **x**(*t*) of the physical environment evolves depending on the motor command **u**(*t*). The state is monitored through different sensory channels **y**^{m}(*t*), each subject to a modality-specific feedback delay *τ*^{m} (Equation 1).

Below we outline the operation of the feedback control modules, combination of their outputs, and the learning algorithm. The architecture presented here is a modification of an earlier draft (Bissmarck et al., 2005).

Each module *m* has a characteristic feedback signal

$${\mathbf{y}}^{m}(t)={\text{f}}^{m}(\mathbf{x}(t-{\tau}^{m}))$$

(1)

where f^{m}() is an observation function and *τ*^{m} is the feedback delay of module *m*. Each module then computes a population-coded output

$${\mathbf{a}}^{m}(t)=\text{g}({\mathbf{y}}^{m}(t);{\mathbf{w}}^{m})$$

(2)

where g() is a function approximator with a set of trainable parameters **w**^{m}. Each element
${a}_{j}^{m}$ (*j* = 1, 2, .., *J*) represents the activation that module *m* assigns to the *j*-th output unit of the population code.

The motor command **u** ∈ *R*^{D} is represented by a combination of the population-coded outputs of all modules with a softmax function:

$${\pi}_{j}(t)=\frac{exp\left(\beta {\sum}_{m=1}^{M}{a}_{j}^{m}(t)+{n}_{j}(t)\right)}{{\sum}_{j=1}^{J}exp\left(\beta {\sum}_{m=1}^{M}{a}_{j}^{m}(t)+{n}_{j}(t)\right)}$$

(3)

where *β* is a constant that regulates the overlap of population codes. The noise term *n*_{j}(*t*) is added for exploration (see section 3.4). The motor command is then given by the probability-weighted sum of preferred outputs:

$$\mathbf{u}(t)=\sum _{j=1}^{J}{\pi}_{j}(t){\overline{\mathbf{u}}}^{j}.$$

(4)
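As a concrete sketch of the combination step in Equations 3 and 4, the softmax over summed module outputs and the weighted sum of preferred commands can be written in a few lines (a hypothetical NumPy illustration; the function name and array shapes are our own):

```python
import numpy as np

def combine_modules(a, u_pref, n, beta=10.0):
    """Softmax combination of population-coded module outputs (Eqs. 3-4).

    a      : (M, J) module outputs a_j^m, one row per module
    u_pref : (J, D) preferred motor commands u-bar_j
    n      : (J,)   exploration noise n_j
    beta   : inverse temperature regulating the sharpness of the code
    """
    logits = beta * a.sum(axis=0) + n            # sum over modules, plus noise (Eq. 3)
    logits = logits - logits.max()               # shift for numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()   # selection probabilities pi_j
    u = pi @ u_pref                              # motor command u (Eq. 4)
    return pi, u
```

With all module outputs at zero and no noise, the probabilities are uniform, so the command is the mean of the preferred outputs; any module that sharpens its output around one unit pulls the command toward that unit's preferred command.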

The modular outputs
${a}_{j}^{m}$ can be interpreted as the log-probability of selecting the output **ū*** _{j}*,

$${a}_{j}^{m}(t)=log(P({\overline{\mathbf{u}}}_{j}(t)\mid {\mathbf{y}}^{m}(t-{\tau}^{m}),{\mathbf{w}}^{m})).$$

(5)

Summing over all modules, adding noise and exponentiating gives, after normalization, the full probability *P*(**ū**_{j}(*t*)) of selecting output **ū**_{j} (Equation 3).

Our model implements a form of the continuous actor-critic (Doya, 2000). The goal of learning is to maximize the cumulative future rewards:

$$E\left[{\int}_{0}^{\infty}{e}^{-\frac{s}{{\tau}^{TD}}}r(t+s)ds\right],$$

(6)

where *τ ^{TD}* determines how far into the future returns should be considered.

The role of the critic is to estimate the cumulative future reward from each state **x**(*t*) in the form of state value function:

$$V(\mathbf{x}(t))=E\left[{\int}_{0}^{\infty}{e}^{-\frac{s}{{\tau}^{TD}}}r(t+s)ds\right]$$

(7)

for each state **x**(*t*). The critic learns to estimate the value function from available feedback:

$$V({\mathbf{y}}^{1}(t),{\mathbf{y}}^{2}(t),\mathrm{..},{\mathbf{y}}^{M}(t);{\mathbf{w}}^{c})$$

(8)

where **w*** ^{c}* is a set of trainable parameters.

Learning of the critic and the feedback control modules is based on the temporal difference (TD) error

$${\delta}^{TD}(t)=r(t)-\frac{1}{{\tau}^{TD}}V(t)+\stackrel{.}{V}(t),$$

(9)

which signals the deviation of the reward prediction. See appendix A for the update equations for the critic parameters **w**^{c} and the controller parameters **w**^{m}.
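In discrete time, Equation 9 can be evaluated per simulation step by approximating the time derivative of the value with a finite difference (a minimal sketch; the backward-difference discretisation and names are our own choices):

```python
def td_error(r, V, V_prev, dt, tau_td=0.2):
    """Continuous-time TD error delta = r - V/tau_td + dV/dt (Eq. 9),
    with the time derivative approximated by a backward difference
    over one Euler step of length dt."""
    V_dot = (V - V_prev) / dt
    return r - V / tau_td + V_dot
```

When the value estimate is constant and no reward arrives, the error reduces to −V/τ^{TD}, penalising unfulfilled reward predictions.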

We test the effects of different sensory feedback delays in two simulated experiments of arm reaching. In Experiment I, we used two somatosensory feedback control modules with different delays for a simple reaching task. The aim is to see how the minimal delay affects control performance and how the relative feedback delay affects the selection of modules by learning. In Experiment II, we used both visual and somatosensory feedback modules for a sequential reaching task. The aim is to see whether and how the transition from slow, task-independent visual control to fast, task-dependent somatosensory control happens under different feedback delays.

Figures 2 and 5 show the implementations of Experiments I and II, respectively.

The implementation of Experiment I. The agent controls a 2DOF arm by applying joint torques to the shoulder and elbow joints, with angles *θ*_{1} and *θ*_{2} respectively (*ξ*_{1} and *ξ*_{2} define the Cartesian coordinates). **...**

The implementation of Experiment II. Here, with the arm as in Experiment I, the goal is to press targets 1, 2 and 3, presented in consequent order, starting from S. Reward is given only at the time when a key is pressed. The agent consists of two modules **...**

We use a 2DOF arm, where each link is 0.3 m long, 0.1 m in diameter, and weighs 1 kg (see Figure 2). The state is defined by the shoulder and elbow joint angles *θ*_{1} and *θ*_{2} and the angular velocities *θ̇*_{1} and *θ̇*_{2}. The Cartesian hand position is *ξ*^{hand}(*t*). The torque actually applied at each joint is perturbed by motor noise:

$${u}_{d}^{\mathit{actual}}(t)=(1+{n}_{d}(t)){u}_{d}(t)\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}(d=1,2)$$

(10)

where *n*_{d}(*t*) is a multiplicative motor noise term.

In Experiment I, the goal is to move the hand as quickly and accurately as possible to the target position T given the start position S. The reward signal is given by an exponential function of the distance of the hand to the target

$$r(t)=aexp(-b||{\mathit{\xi}}^{\mathit{hand}}(t)-{\mathit{\xi}}^{\mathit{target}}||)+c$$

(11)

where *a* = 6, *b* = 20 and *c* = −0.3. Each trial lasts for 1.0 second.
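Equation 11 amounts to a sharply peaked reward around the target with a small constant cost elsewhere; a direct transcription (the function name is ours):

```python
import numpy as np

def reward(hand, target, a=6.0, b=20.0, c=-0.3):
    """Reward as an exponential function of hand-target distance (Eq. 11),
    with the parameters a = 6, b = 20, c = -0.3 given in the text."""
    dist = np.linalg.norm(np.asarray(hand) - np.asarray(target))
    return a * np.exp(-b * dist) + c
```

At the target the reward is *a* + *c* = 5.7, while far from the target the exponential vanishes and the agent collects the negative constant *c*, so lingering away from the target is penalised.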

In Experiment II, the task is to press three targets in consecutive order; they always appear one at a time at the same positions, marked 1, 2 and 3 in Figure 5. A target is pressed when the hand distance ||*ξ*^{hand}(*t*) − *ξ*^{target}|| falls below a threshold.

In Experiment I, we use two somatosensory feedback controllers, while in Experiment II, we use somatosensory and visual feedback controllers.

The somatosensory control module uses a population code representing joint angles **θ** and angular velocities **θ̇**:

$${y}_{k}^{m}(t)=\frac{1}{Z}exp\left(-\frac{1}{2}\left\{\sum _{d}{\left(\frac{{\theta}_{d}(t-{\tau}_{m})-{\overline{\theta}}_{kd}}{{\sigma}_{kd}}\right)}^{2}+\sum _{d}{\left(\frac{{\stackrel{.}{\theta}}_{d}(t-{\tau}_{m})-{\overline{\omega}}_{kd}}{{\sigma}_{kd}^{\prime}}\right)}^{2}\right\}\right)$$

(12)

where *k* = 1, 2, .., *K* indexes the input units, ${\overline{\theta}}_{kd}$ and ${\overline{\omega}}_{kd}$ are the preferred joint angles and angular velocities of unit *k*, ${\sigma}_{kd}$ and ${\sigma}_{kd}^{\prime}$ are the corresponding tuning widths, and *Z* is a normalization constant.
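The Gaussian population code of Equation 12 can be sketched as follows (a hypothetical NumPy illustration; taking the normalization constant *Z* to be the summed activity is our assumption, as are the names):

```python
import numpy as np

def population_code(theta, theta_dot, theta_pref, omega_pref, sigma, sigma_v):
    """Gaussian population code over delayed joint state (Eq. 12).

    theta, theta_dot       : (D,) joint angles and velocities at time t - tau_m
    theta_pref, omega_pref : (K, D) preferred angles/velocities of the K units
    sigma, sigma_v         : (K, D) tuning widths
    """
    e = ((theta - theta_pref) / sigma) ** 2 \
      + ((theta_dot - omega_pref) / sigma_v) ** 2
    y = np.exp(-0.5 * e.sum(axis=1))     # Gaussian activation of each unit
    return y / y.sum()                   # 1/Z normalisation (our assumption)
```

The unit whose preferred angle and velocity best match the delayed joint state is the most active, giving a smooth, localised representation of the state.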

The output of module *m* is given by another population code

$${\mathbf{a}}^{m}(t)={\mathbf{W}}^{m}{\mathbf{y}}^{m}(t)$$

(13)

where **W*** ^{m}* are trainable weight matrices. Initially all weights are zero. See appendix C for more details.

The input for the visual feedback controller is the Cartesian positions of the hand and the target

$${\mathbf{y}}^{v}(t)=\{{\stackrel{\sim}{\mathit{\xi}}}^{\mathit{hand}}(t),{\mathit{\xi}}^{\mathit{target}}(t-{\tau}^{v})\}.$$

(14)

While the target position is subject to the feedback delay *τ*^{v}, we assume that an estimate ${\stackrel{\sim}{\mathit{\xi}}}^{\mathit{hand}}(t)$ of the present hand position is available to the controller.

We assume that the feedback control of the visual module (indexed by *v*) is pre-acquired, and we use a linear feedback controller with inverse dynamics compensation and output smoothing (see appendix D). The controller produces a bell-shaped velocity profile similar to that of natural hand movement.

To promote effective exploration, we use a low-pass filtered noise **n** in the action output (Equation 3)

$${\tau}^{n}\stackrel{.}{\mathbf{n}}(t)=-\mathbf{n}(t)+\nu \mathbf{N}(t)$$

(15)

where the time constant *τ*^{n} = 50 ms, and **N**(*t*) is Gaussian white noise whose magnitude is scaled by *ν*.
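One Euler step of the low-pass filtered noise process in Equation 15 can be sketched as follows (an illustrative implementation; the names and the default *ν* are placeholders, since the excerpt does not give a value for *ν*):

```python
import numpy as np

def noise_step(n, dt, rng, tau_n=0.05, nu=1.0):
    """One Euler step of the exploration noise (Eq. 15):
    tau_n * dn/dt = -n + nu * N(t), with N(t) white Gaussian noise.
    tau_n = 50 ms as in the text; nu is a placeholder magnitude."""
    N = rng.standard_normal(n.shape)       # fresh white-noise sample
    return n + (dt / tau_n) * (-n + nu * N)
```

The low-pass filtering keeps the exploration noise temporally correlated, so the perturbations of the action output are smooth rather than jittery; with the drive switched off the noise simply decays toward zero with time constant *τ*^{n}.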

The critic takes the population-coded somatosensory feedback as its input and uses a linear weighting to produce the value estimate

$$V(t)={\mathbf{w}}^{c}\mathbf{y}(t),$$

(16)

where **y**(*t*) = (**y**^{1}(*t*), **y**^{2}(*t*)) in Experiment I and **y**(*t*) = **y**^{s}(*t*) in Experiment II.

In order to compare the control performance under different settings of feedback delays, we use a number of performance measures, namely, the hand trajectory, the hand velocity profile, the cumulative reward, and the performance time.

The cumulative reward is given by

$$R={\int}_{0}^{T}r(t)dt,$$

(17)

where *T* is the length of a trial (T=1 sec in Experiment I, T=5 sec in Experiment II).

To compare the relative contribution of different modules, we define the actor weight ratio, the output deviation, and the relative output proximity. The actor weight ratio (AWR) we define as the ratio of the absolute sum of actor weights of respective trained module:

$$\text{AWR}=\frac{{\sum}_{k}{\sum}_{j}\mid {w}_{jk}^{1}\mid}{{\sum}_{k}{\sum}_{j}\mid {w}_{jk}^{2}\mid},$$

(18)

i.e. a value AWR > 1 indicates that the actor of module 1 is relatively more influential. We also define the output deviation

$${\text{d}}^{\text{m}}(\mathbf{x}(\text{t}))=\mathbf{u}(\mathbf{x}(\text{t}))-{\mathbf{u}}^{\text{m}}(\mathbf{x}(\text{t}))\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}(\text{m}=\text{v},\text{s})$$

(19)

which shows how much module *m*’s output differs from the agent’s at time *t*. Its time average over a trajectory is given by

$$<{\text{d}}^{\text{m}}>=\frac{1}{\text{T}}{\int}_{0}^{\text{T}}\mid {\mathbf{d}}^{\text{m}}(\text{t})\mid \text{dt}.$$

(20)

for a trial terminating at *T*. From the output deviation, we define the relative output proximity

$${p}^{v}(t)=1-\frac{{d}^{v}(t)}{{d}^{v}(t)+{d}^{s}(t)}$$

(21)

$${p}^{s}(t)=1-{p}^{v}(t)$$

(22)

which measures the relative influence of the visual and somatosensory modules on the agent’s output, respectively. Both *p*^{v}(*t*) and *p*^{s}(*t*) lie between 0 and 1 and sum to 1; the closer a module’s proximity is to 1, the closer its output is to the agent’s combined output.

In both experiments the same learning parameters were used: inverse temperature *β* = 10, time constants *τ ^{TD}* = 200 ms,

All differential equations were approximated with the Euler forward method with a time step small enough not to affect the results (10 ms).
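The Euler forward scheme used throughout the simulations amounts to the following (a minimal sketch; the function name and the exponential-decay test dynamics are our own):

```python
def euler_integrate(f, x0, t_end, dt=0.01):
    """Integrate dx/dt = f(x, t) from t = 0 to t_end with the Euler
    forward method, x(t + dt) = x(t) + dt * f(x, t), as used for all
    differential equations in the simulations (dt = 10 ms)."""
    x, t = x0, 0.0
    while t < t_end - 1e-12:     # small guard against float drift
        x = x + dt * f(x, t)
        t += dt
    return x
```

For instance, integrating the decay ẋ = −x from x(0) = 1 over one second yields approximately e^{−1} ≈ 0.368, with the discretisation error shrinking as the step is reduced.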

In Experiment I (Figure 2), we investigate how feedback latencies affect learning and control in the proposed framework. We use a simple implementation with two modules which are identical, except for their feedback latencies. The task is to learn a simple reaching movement with a 2DOF arm. We train the networks with different pairs of feedback latencies and compare their performance and relative contribution of modules after learning has converged.

Figure 3 shows examples of hand trajectories generated by the architecture with four different settings of feedback latencies. The 10 trajectories in the top row (Figure 3A) were generated with both modules after 100,000 training trials. Effective reaching movement is achieved by all four latency pairs in a robust manner. In the case of (*τ*_{1}, *τ*_{2}) = (0, 50), the variability is higher than in the other examples. Since the reward function (see Section 3.1) does not explicitly penalize variability, this performance is still close to that of the other well-performing agents (see below). However, in the cases of (*τ*_{1}, *τ*_{2}) = (0, 0) and (0, 50) ms, the movement is much faster than with (50, 50) and (50, 100) ms, as can be seen in the hand velocity plots in the bottom row (Figure 3D, solid lines; mean velocity of the samples in Figure 3A). This shows that the shortest feedback delay is critical for performance. This is not a trivial finding, as the output of the module with the longer feedback delay could interfere with the feedback command generated by the module with the shorter delay.

Trajectory samples generated by four different settings of latencies (*τ*_{1}, *τ*_{2}) = (0, 0), (0, 50), (50, 50), and (50, 100) milliseconds (ms) after 100,000 trials of training. S - start, T - target. (A) 10 trajectory samples generated **...**

In order to see the relative contribution of the two modules, we compared the trajectories generated by either one of the modules alone (Figure 3B–C), with the other module’s output set to **a**^{m}(*t*) = **0**.

To verify the critical role of the minimum feedback delay in control performance and the role of the relative feedback delay in module selection, we measured the cumulative reward *R* and the actor weight ratio of trained agents for 24 different pairs of feedback delays. Under the condition *τ*_{1} ≤ *τ*_{2}, we plot those measures in the parameter space of *τ*_{min} = *τ*_{1} and Δ*τ* = *τ*_{2} − *τ*_{1}.

Figure 4A shows how the cumulative reward depends on *τ*_{min} and Δ*τ*.

Latency-dependent performance measures, displayed as surface plots constrained by 24 latency pairs (black dots) over latency space (*τ*_{min}, Δ*τ*). (A) Cumulative reward *R*, indicating behavioural performance of the agent. (B) Actor weight **...**

These results confirm that the performance of the modular learning control architecture is mostly determined by the module with the shortest latency. This is achieved by the softmax combination of the population-coded outputs of the modules (section 2.2) and the tuning of the modular outputs by actor-critic learning. It is noteworthy that the potential problem of the slower feedback module contaminating the good output of the faster module is avoided by this scheme.

In Experiment II (Figure 5), we introduce a more realistic, complex implementation of a visuomotor sequence task. In motor skill acquisition, there is substantial evidence for a shift in cortical activity with experience, from prefrontal areas to motor areas (Petersen, Mier, Fiez, & Raichle, 1998; Jueptner, Frith, Brooks, Frackowiak, & Passingham, 1997; Doyon, Owen, Petrides, Sziklas, & Evans, 1996; Hikosaka, Nakamura, Sakai, & Nakahara, 2002a; Floyer-Lea & Matthews, 2005). Analogously, there should be a shift in modalities of feedback subserving these cortical areas; from extrinsic (visual) feedback needed for anticipation and proceduralization of task dynamics to intrinsic (somatosensory) feedback needed for optimization of motor control (Nakahara et al., 2001).

Here, we study the transfer between these two systems, a “visual module” and a “somatosensory module”, in a task of reaching a stereotyped sequence of three targets. The visual module relies on a general-purpose controller which regulates a single reach to a given visual target. We assume that this module is preacquired and is not optimized for any particular target sequence. The somatosensory module relies on somatosensory feedback and becomes optimized for repeated motor sequences. Architectures with different latency pairs *τ*^{v} and *τ*^{s} were trained and compared.

Figure 6A–B compare the reaching trajectories before and after learning ((*τ*^{v}, *τ*^{s}) = (100, 0) ms).

Performance before and after learning. (A) 5 sample trajectories before learning. (B) 5 sample trajectories after learning. (C) Performance times of 12 latency pairs before learning (black bars) and after learning (white bars), compared with equal levels **...**

Figure 6C compares the performance time (the time it takes to complete one trial) before (black bars) and after (white bars) 100,000 trials of learning for 12 different latency pairs. Clearly, sequence-specific learning by the somatosensory module contributes to the reduction of performance time. Its potential to do so is primarily constrained by *τ*^{s}: performance time decreases over *τ*^{s} = 100, 50 and 0 ms for any latency *τ*^{v} of the visual module.

To elucidate the contribution of the somatosensory module, we compared the joint torque outputs of single modules (computed as in Experiment I) with the joint torque output of the agent. Figure 7A–B shows trajectories of generated joint torques over time (one trial), for the two extremes of relative latency in our study, (*τ*^{v}, *τ*^{s}) = (100, 0) and (0, 100) ms.

A comparison of contribution to joint torque outputs between the visual and somatosensory modules. (A–B) Example trajectories of shoulder (top) and elbow (bottom) torques over time for the latency pairs (*τ*^{v}, *τ*^{s}) = (100, 0) (A) and **...**

We then investigated how the learned behaviour is driven by the somatosensory module. We compared the normal behaviour of the learned agent with a condition in which the visual module is inactive. Figure 8A shows examples of hand trajectories for four latency pairs in the two conditions. With both modules, all agents are always successful. When the visual module is inactive, the ability to control the movement depends on the relative latency Δ*τ* = *τ*^{v} − *τ*^{s}.

(A) Learned behaviour of the agent in normal execution (“both modules”, top row, solid lines) and in execution with the visual module inactive (“somatosensory module only”, middle row, dashed lines) for four different latency **...**

Figure 8C compares the mean performance time of the two conditions. Two of the agents ((50, 0) and (100, 50)) perform on average almost as well in the somatosensory-only condition, but note the smaller variance in the normal condition. The visual module thus provides robustness even late in learning.

To further evaluate the robustness of the composite system, we perturbed a behaving, trained agent ((*τ*^{v}, *τ*^{s}) = (100, 0) ms) with an external force impulse applied to the hand (Figure 9).

External force perturbation imposed on the end effector (hand), for a trained agent (*τ*^{v}, *τ*^{s}) = (100, 0) ms. (A–B) Example trajectory when an impulse (400 N, 50 ms) perturbs the composite system. (A) Spatial movement trajectory. **...**

In summary, these results indicate that in this visuomotor sequence task, as learning progresses, the somatosensory module with the presumably shorter latency becomes dominant in motor control. After learning, the visual module provides stability when the effector ends up outside the well-trained regime. The memory transfer, or the degree of control by different modalities, critically depends on the difference in latencies between the visual and somatosensory modules.

We have examined how feedback latency affects the relative importance of modules for the learning and control of real-time motor skills. With softmax combination of the population-coded outputs of multiple control modules, we demonstrated in simulations how the modules with shorter latency attain dominance in motor control. Although the result may sound straightforward, there are potential problems with conflicts between multiple modules, e.g., the longer-latency output pulling back a movement driven by the shorter-latency module. It is noteworthy that appropriate module selection was achieved without any explicit gating, simply by reinforcement of the output of the module that best contributed to the performance.

The use of population codes contributes strongly to the robustness and stability of our framework. At the level of perception, the robustness of population coding to sensor noise (Georgopoulos et al., 1988; Pouget et al., 2000) gives sensory estimates of high certainty, enhanced by the redundancy of multiple modalities. At the level of action selection, the softmax combination treats the sharper distributions of modular outputs as signal, while the impact of flatter distributions is suppressed as noise. In Experiment II, this explains the demonstrated robustness of the composite system to noise and force perturbations. The somatosensory module dominates the output within its experienced, expert regime, but as soon as noise or a mechanical force pushes the arm out of that range, the visual module resumes control.

In dealing with delayed or noisy sensory signals, a recently popular paradigm is to use recursive Bayesian filters to estimate the hidden state (Todorov, 2004). Such a model may also explain a larger weight on the faster module, which is more informative about the current state. However, Bayesian inference requires models of the physical dynamics, sensory delay and noise, and demands heavy on-line computation, except for linear Gaussian systems where Kalman filtering is possible. Here we instead pursued the much simpler approach of training feedback controllers specialized for given delays. An analysis of the pros and cons of these approaches, and of their possible integration, is a subject of our future study.

The effects of feedback delays are a relatively little-investigated aspect of motor control and learning (Miall & Jackson, 2006). Some tracking experiments have been conducted, all showing worse performance for artificially imposed delays of 100 ms and longer (Miall et al., 1985; Foulkes & Miall, 2000; Miall & Jackson, 2006; Ogawa et al., 2007). Kitazawa et al. also showed that the learning speed and performance of prism adaptation in a reaching task worsened with delays in the visual feedback of the end-point error, both for humans (Kitazawa et al., 1995) and monkeys (Kitazawa & Yin, 2002).

In Experiment II, faster movements were learned even though reward was given only for key presses, regardless of time expenditure. The time discounting of reward used in our actor-critic learning algorithm naturally explains why performance in numerous skill learning tasks (e.g. Anderson, 1983) speeds up, although speed is not an explicit performance criterion.

The mechanism of transfer from declarative to procedural memories is poorly understood (Doyon & Benali, 2005; Hikosaka et al., 2002b). In our framework, modules with shorter latency become dominant with learning. As demonstrated in experiment II, this allows specialized motor skills based on fast, intrinsic feedback loops to emerge under general purpose controllers based on slow, extrinsic feedback like vision or audition. If the difference in feedback latency is long enough, the faster modality will eventually become independent of the slower modality, which can then be used for other purposes.

There are two analogies between our framework and the BG-TC system: 1) its organization into modular circuits (Alexander & Crutcher, 1990), and 2) the actor-critic architecture (Houk et al., 1995). In previous experimental (Hikosaka et al., 1999, 2002a) and computational (Nakahara et al., 2001) work, we have proposed that prefrontal and motor BG-TC loops cooperate in motor sequence learning, encoding sequences in visual and motor coordinates, respectively.

The success of this rather simple modular learning control framework motivates future studies with agents comprising more complex, heterogeneous features, such as different sensor noise levels, learning speeds, or the inclusion of feedforward components. For example, given a slow, low-noise module and a fast, noisy module, the former would be used for precision tasks and the latter for speed tasks. To further test the generality of this prediction, delayed auditory feedback could be added as a third modality, and modality dependence could be tested under different pairs of feedback delays.

The brain receives possibly thousands of sensory signals, from which it has to make a sensible response. Biological reinforcement learning may not just be about selecting actions, but also about selecting sensory input. In this context, feedback latencies may be a critical factor for which input and output connections are formed.

FB wishes to thank Mitsuo Kawato, Erhan Oztop and Jun Morimoto for comments on an earlier draft. This research was funded by Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency.

Our model implements a form of the continuous actor-critic (Doya, 2000). The function of the critic is to estimate the cumulative sum of expected future reward *r*(*t*), i.e. to learn the value function. For a given policy, the continuous value function is defined as

$$V(\mathbf{x}(t))=E\left[{\int}_{0}^{\infty}{e}^{-\frac{s}{{\tau}^{TD}}}r(t+s)ds\right]$$

(23)

for each state **x**(*t*). The time constant *τ ^{TD}* determines how far into the future returns should be considered. The critic implements a function approximator to estimate the value function from available feedback:

$$V=V({\mathbf{y}}^{1}(t-{\tau}^{1}),{\mathbf{y}}^{2}(t-{\tau}^{2}),\mathrm{..},{\mathbf{y}}^{M}(t-{\tau}^{M});{\mathbf{w}}^{c})$$

(24)

where **w*** ^{c}* is a set of trainable parameters. The temporal difference (TD) error

$${\delta}^{TD}(t)=r(t)-\frac{1}{{\tau}^{TD}}V(t)+\stackrel{.}{V}(t).$$

(25)

The TD error is used to update the parameters of the critic (see below), and converges to zero when the estimate (Equation 24) equals the true value function (Equation 23). The TD error is also used to improve the policy ${\widehat{\pi}}(t)$ of the actor, where the circumflex denotes the noise-free, deterministic policy. The action deviation signal

$${E}_{j}(t)=\frac{{({\pi}_{j}(t)-{\widehat{\pi}}_{j}(t))}^{2}}{2}$$

(26)

is the difference between the learned action and the action that was actually selected. The TD error reinforces or penalizes this deviation to update the policy ${\widehat{\pi}}$. To control the time scale of the states and actions to be updated, we use eligibility traces:

$${\stackrel{.}{e}}_{k}^{c}(t)=-\frac{1}{{\tau}^{ET}}{e}_{k}^{c}+\frac{\partial V}{\partial {w}_{k}^{c}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{\stackrel{.}{e}}_{kj}^{m}(t)=-\frac{1}{{\tau}^{ET}}{e}_{kj}^{m}+\frac{\partial {E}_{j}(t)}{\partial {w}_{kj}^{m}}$$

(27)

for the critic and actor modules respectively. The parameters are indexed by *k*, and *τ ^{ET}* is a time constant. For the actor trace, the gradient of the action deviation is

$$\frac{\partial {E}_{j}(t)}{\partial {w}_{kj}^{m}}=({\pi}_{j}(t)-{\widehat{\pi}}_{j}(t))\frac{\partial {\pi}_{j}(t)}{\partial {w}_{kj}^{m}}.$$

(28)

The parameters are updated by gradient descent as

$${\stackrel{.}{w}}_{k}^{c}=\alpha {\delta}^{TD}(t){e}_{k}^{c}(t)\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{\stackrel{.}{w}}_{kj}^{m}=\alpha {\delta}^{TD}(t){e}_{kj}^{m}(t)$$

(29)

where *α* denotes the learning rate.
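For a linear critic, *V* = **w**^{c} · **φ**, the gradient ∂*V*/∂*w_k^{c}* is simply the feature *φ_k*, and one Euler step of Equations 25, 27 and 29 can be sketched as follows (the features, gains and constants are illustrative, not taken from the paper):

```python
import numpy as np

def critic_step(w_c, e_c, phi, r, v_prev, dt,
                tau_td=1.0, tau_et=0.1, alpha=0.5):
    """One Euler step of the critic's learning rule for a linear
    value function V = w_c . phi (Eqs. 25, 27, 29).
    Returns updated weights, updated trace, current value, TD error."""
    v = float(w_c @ phi)
    delta = r - v / tau_td + (v - v_prev) / dt   # TD error (Eq. 25)
    e_c = e_c + dt * (-e_c / tau_et + phi)       # eligibility trace (Eq. 27)
    w_c = w_c + dt * alpha * delta * e_c         # gradient update (Eq. 29)
    return w_c, e_c, v, delta
```

Holding the feature and reward constant, the weights drift toward the point where the TD error vanishes, i.e. *V* → *τ^{TD} r*. The actor update (Equation 29, right) has the same form, with ∂*E_j*/∂*w* in the trace instead of ∂*V*/∂*w*.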

In both experiments the population codes were the same. In the somatosensory modules, units were tuned to preferred joint angles $\overline{\theta}_{kd}$ and angular velocities.

The preferred joint torques **ū***_{j}* correspond to action unit *j*.

The somatosensory module in experiment II also included “context units”: 3 tapped delay lines, each corresponding to a key in the sequence task. Each delay line had 8 units, i.e. 24 context units in all. For the *k*-th unit in the *n*-th delay line (*k* > *K*_{0}, *k* ≠ *K*_{0} + 8(*n* − 1) + 1):

$${\stackrel{.}{y}}_{k}^{m}(t)=-\frac{1}{{\tau}^{C}}{y}_{k}^{m}(t)+{y}_{k-1}^{m}(t)$$

(30)

where *τ ^{C}* = 30 ms. Each delay line is initiated by the input at its first unit, *k* = *K*_{0} + 8(*n* − 1) + 1:

$${y}_{k}^{m}(t)=\delta (t-{\tau}_{n}^{\mathit{keypress}})$$

(31)

where *δ* is the Dirac delta function, and
${\tau}_{n}^{\mathit{keypress}}$ is the instant the *n*-th key was pressed.
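The behavior of the delay lines can be checked with a small Euler simulation of Equations 30 and 31, approximating the Dirac impulse by a one-step pulse of area one (the step size and duration below are illustrative):

```python
import numpy as np

def simulate_delay_line(n_units=8, tau_c=0.030, dt=0.001, t_end=0.5,
                        t_keypress=0.0):
    """Euler simulation of one tapped delay line (Eqs. 30-31).
    The first unit receives a unit impulse at the keypress time;
    each later unit low-pass filters the previous unit's activity."""
    n_steps = int(t_end / dt)
    y = np.zeros(n_units)
    history = np.zeros((n_steps, n_units))
    for i in range(n_steps):
        t = i * dt
        # approximate the Dirac delta by a one-step pulse of area 1
        y[0] = 1.0 / dt if abs(t - t_keypress) < dt / 2 else 0.0
        y[1:] += dt * (-y[1:] / tau_c + y[:-1])
        history[i] = y
    return history
```

Successive units along the line peak at successively later times, so together they tile the interval after each keypress, giving the actor a spatial code for elapsed time.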

The feedback signal **y***^{v}* to the visual module consists of the hand kinematics.

The joint torques are first computed by

$${\stackrel{.}{\mathbf{u}}}^{\mathit{visual}}(t)=-\frac{1}{{\tau}^{CT}}{\mathbf{u}}^{\mathit{visual}}(t)+\lambda {{\mathbf{u}}^{\mathit{visual}}}^{\prime}({\ddot{\stackrel{\sim}{\mathit{\xi}}}}^{\mathit{hand}},{\stackrel{.}{\stackrel{\sim}{\mathit{\xi}}}}^{\mathit{hand}},\mathbf{e})$$

(32)

where *τ ^{CT}* is a time constant and ${{\mathbf{u}}^{\mathit{visual}}}^{\prime}$ is given by

$${{\mathbf{u}}^{\mathit{visual}}}^{\prime}(t)={\mathbf{J}}^{T}(\mathbf{M}({\ddot{\stackrel{\sim}{\mathit{\xi}}}}^{\mathit{hand}}+{\mathbf{K}}_{1}{\stackrel{.}{\stackrel{\sim}{\mathit{\xi}}}}^{\mathit{hand}}-{\mathbf{K}}_{2}\mathbf{e})+\mathbf{C}{\stackrel{.}{\stackrel{\sim}{\mathit{\xi}}}}^{\mathit{hand}})$$

(33)

in Cartesian coordinates, where **J** is the Jacobian $\partial {\mathit{\xi}}^{\mathit{hand}}/\partial \mathit{\theta}$, **M** is the inertia matrix, **C** the Coriolis term, and **K**_{1}, **K**_{2} are feedback gain matrices.
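The feedback law of Equation 33 can be sketched directly; the matrices below are placeholders for illustration (identity inertia, zero Coriolis term), not the arm model used in the paper:

```python
import numpy as np

def visual_torque(J, M, C, xi_ddot, xi_dot, e, K1, K2):
    """Cartesian-space feedback law of Eq. 33, mapped to joint
    torques through the Jacobian transpose J^T."""
    f = M @ (xi_ddot + K1 @ xi_dot - K2 @ e) + C @ xi_dot
    return J.T @ f
```

Because the Jacobian transpose maps Cartesian forces to joint torques, the same law works for any arm geometry once **J**, **M** and **C** are supplied.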

The module output is an expansion of the joint torque **u***^{visual}* on a population vector:

$${a}_{j}^{v}(t)=\frac{1}{Z}exp(-\frac{1}{2}\{\sum _{d}{(\frac{{u}_{d}^{\mathit{visual}}(t)-{\overline{u}}_{jd}}{{\sigma}_{jd}^{\prime \prime}})}^{2}\})$$

(34)

where *Z* is the normalization term, *ū _{jd}* is the preferred joint torque of unit *j* for Cartesian dimension *d*, and ${\sigma}_{jd}^{\prime \prime}$ is the corresponding tuning width.
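Equation 34 can be sketched as follows, assuming *Z* normalizes the activities to sum to one (the preferred torques and widths below are illustrative):

```python
import numpy as np

def population_code(u, u_pref, sigma):
    """Gaussian population coding of a torque vector (Eq. 34).
    u      : torque vector, shape (D,)
    u_pref : preferred torques of the units, shape (J, D)
    sigma  : tuning widths, shape (J, D)
    Returns normalized activities of shape (J,), summing to 1."""
    z2 = ((u - u_pref) / sigma) ** 2         # squared, scaled distances
    a = np.exp(-0.5 * z2.sum(axis=1))        # Gaussian tuning
    return a / a.sum()                       # Z: normalization term
```

The unit whose preferred torque lies closest to the input is the most active, so the code is a smooth, distributed representation of the commanded torque.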

The parameters of Equations 32 and 33 were *τ ^{CT}* = 50 ms,

Fredrik Bissmarck, ATR International, Kyoto, Japan.

Hiroyuki Nakahara, RIKEN Brain Science Institute, Saitama, Japan.

Kenji Doya, Okinawa Institute of Science and Technology, Okinawa, Japan.

Okihide Hikosaka, National Eye Institute, NIH, Maryland.

- Alexander GE, Crutcher MD. Functional architecture of basal ganglia circuits: neural substrates of parallel processing. Trends Neurosci. 1990;13(7):266–71. [PubMed]
- Barto A. Adaptive critics and the basal ganglia. In: Houk J, Davis J, Beiser D, editors. Models of information processing in the basal ganglia. Cambridge, MA: MIT Press; 1995. pp. 215–232.
- Bissmarck F, Nakahara H, Doya K, Hikosaka O. Responding to modalities with different latencies. In: Saul LK, Weiss Y, Bottou L, editors. Advances in neural information processing systems. Vol. 17. Cambridge, MA: MIT Press; 2005. pp. 169–176.
- Doya K. What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Netw. 1999;12(7–8):961–974. [PubMed]
- Doya K. Reinforcement learning in continuous time and space. Neural Comput. 2000;12(1):219–45. [PubMed]
- Doyon J, Benali H. Reorganization and plasticity in the adult brain during learning of motor skills. Curr Opin Neurobiol. 2005;15(2):161–7. [PubMed]
- Doyon J, Owen AM, Petrides M, Sziklas V, Evans AC. Functional anatomy of visuomotor skill learning in human subjects examined with positron emission tomography. Eur J Neurosci. 1996;8(4):637–48. [PubMed]
- Floyer-Lea A, Matthews PM. Distinguishable brain activation networks for short- and long-term motor skill learning. J Neurophysiol. 2005;94(1):512–8. [PubMed]
- Foulkes AJ, Miall RC. Adaptation to visual feedback delays in a human manual tracking task. Exp Brain Res. 2000;131(1):101–10. [PubMed]
- Haruno M, Wolpert DM, Kawato M. Mosaic model for sensorimotor learning and control. Neural Comput. 2001;13(10):2201–20. [PubMed]
- Hikosaka O, Nakahara H, Rand MK, Sakai K, Lu X, Nakamura K, et al. Parallel neural networks for learning sequential procedures. Trends Neurosci. 1999;22(10):464–71. [PubMed]
- Hikosaka O, Nakamura K, Sakai K, Nakahara H. Central mechanisms of motor skill learning. Curr Opin Neurobiol. 2002;12(2):217–22. [PubMed]
- Houk J, Adams J, Barto A. A model of how the basal ganglia generate and use neural signals that predict reinforcement. In: Houk J, Davis J, Beiser D, editors. Models of information processing in the basal ganglia. Cambridge, MA: MIT Press; 1995. pp. 249–270.
- Jacobs R, Jordan M, Barto A. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science. 1991;15:219–250.
- Jueptner M, Frith CD, Brooks DJ, Frackowiak RS, Passingham RE. Anatomy of motor learning. ii. subcortical structures and learning by trial and error. J Neurophysiol. 1997;77(3):1325–37. [PubMed]
- Kitazawa S, Kohno T, Uka T. Effects of delayed visual information on the rate and amount of prism adaptation in the human. J Neurosci. 1995;15(11):7644–52. [PubMed]
- Kitazawa S, Yin PB. Prism adaptation with delayed visual error signals in the monkey. Exp Brain Res. 2002;144(2):258–61. [PubMed]
- Kording KP, Wolpert DM. Bayesian integration in sensori-motor learning. Nature. 2004;427(6971):244–7. [PubMed]
- Kording KP, Wolpert DM. Bayesian decision theory in sensorimotor control. Trends Cogn Sci. 2006;10(7):319–26. [PubMed]
- Liu D, Todorov E. Evidence for the flexible sensorimotor strategies predicted by optimal feedback control. J Neurosci. 2007;27(35):9354–68. [PubMed]
- Miall RC, Jackson JK. Adaptation to visual feedback delays in manual tracking: evidence against the smith predictor model of human visually guided action. Exp Brain Res. 2006;172(1):77–84. [PubMed]
- Miall RC, Weir DJ, Stein JF. Visuomotor tracking with delayed visual feedback. Neuroscience. 1985;16(3):511–20. [PubMed]
- Montague PR, Dayan P, Sejnowski TJ. A framework for mesencephalic dopamine systems based on predictive hebbian learning. J Neurosci. 1996;16(5):1936–47. [PubMed]
- Nakahara H, Doya K, Hikosaka O. Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuomotor sequences - a computational approach. J Cogn Neurosci. 2001;13(5):626–47. [PubMed]
- Ogawa K, Inui T, Sugio T. Neural correlates of state estimation in visually guided movements: an event-related fmri study. Cortex. 2007;43(3):289–300. [PubMed]
- Petersen SE, van Mier H, Fiez JA, Raichle ME. The effects of practice on the functional anatomy of task performance. Proc Natl Acad Sci U S A. 1998;95:853–860. [PubMed]
- Pouget A, Dayan P, Zemel RS. Inference and computation with population codes. Annu Rev Neurosci. 2003;26:381–410. [PubMed]
- Todorov E. Optimality principles in sensorimotor control. Nat Neurosci. 2004;7(9):907–15. [PMC free article] [PubMed]
- Todorov E, Jordan MI. Optimal feedback control as a theory of motor coordination. Nat Neurosci. 2002;5(11):1226–35. [PubMed]
- Weiss Y, Fleet D. Velocity likelihoods in biological and machine vision. In: Rao R, Olshausen B, Lewicki M, editors. Statistical theories of the cortex. Cambridge, MA: MIT Press; 2002. pp. 77–96.
