A decision-making, reinforcement learning model

The model is based in part on a pooling scheme that relates the activity of a population of MT-like neurons to a perceptual decision about motion direction, and in part on a reinforcement-learning rule that evaluates and adjusts this pooling process (which for weak stimuli represents a “partially observable” or “hidden” state) at the end of each trial according to the reward outcome^{20-22}.

The pooling model is composed of three stages that correspond to the sensory representation in MT, the pooling of the MT responses, and the formation of the direction decision based on the pooled response in LIP (orange arrows; see Supplemental Fig. 1 for alternative implementations). We modelled MT as a population of 7,200 neurons with 36 different direction tuning functions distributed uniformly around the full 360°, with trial-by-trial responses to the motion stimulus and interneuronal correlations based on previously measured values^{4,23,24}. The cumulative, motion-driven responses from each MT neuron (*x*_{i}) were then pooled by LIP as a weighted sum,

*y* = Σ_{i} *w*_{i} *x*_{i},   (Eq. 1)

where *w*_{i} is the pooling weight assigned to the *i*^{th} MT neuron. As in previous studies, we assumed that this pooling process is not perfect and corrupted the pooled response, *y*, with Gaussian noise^{21}. The direction decision was made based on the arithmetic sign of the pooled response: rightward (i.e., between -90° and 90°) for *y* > 0, leftward otherwise.
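One trial of this pooling-and-decision stage can be sketched as follows; the population size matches the model, but the noise level and the toy response statistics are illustrative placeholders, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)

N_NEURONS = 7200   # MT population size used in the model
NOISE_SD = 1.0     # pooling (decision) noise; illustrative value only

def pooled_decision(x, w, rng):
    """Pool MT responses as a weighted sum (Eq. 1), corrupt the pooled
    response with Gaussian noise, and decide direction from its sign."""
    y = w @ x + rng.normal(0.0, NOISE_SD)
    choice = 1 if y > 0 else -1   # +1 rightward, -1 leftward
    return choice, y

# Toy usage with arbitrary weights and responses
w = rng.normal(0.0, 1.0, N_NEURONS)
x = rng.normal(10.0, 2.0, N_NEURONS)
choice, y = pooled_decision(x, w, rng)
```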

The pooling weights from MT to LIP were random initially and then adjusted according to a reinforcement-learning rule after each trial (green arrows). We used a simple delta rule that adjusts the pooling weights on trial *n*+1 (*w*_{n+1}) based on the weights on trial *n* (*w*_{n}) and an update term (*Δw*) that depends on a reward prediction error^{20,25,26},

*w*_{n+1} = *w*_{n} + *Δw*, with *Δw* = *α* *C* (*r* - *m* *E*[*r*]) (*x* - *n* *E*[*x*]),   (Eq. 2)

where *α* is the learning rate (which, in principle, could change with time or differ for correct and error trials, although we do not do so here), *C* describes the choice (-1 for leftward, 1 for rightward), *r* is the reward outcome for the current trial (1 if a reward was given, 0 otherwise), *E*[*r*] is the predicted probability of making a correct (rewarded) choice given the pooled response (*y*) for that trial, *x* is the vector of MT responses, *E*[*x*] is the vector of baseline MT responses (which we modelled as the responses to 0% coherence motion; the population mean could also be used), and *m* and *n* are binary variables that determine the exact form of the rule used (in the main text we use *m*=1 and *n*=0; for alternatives see Supplemental Fig. 2). The difference between actual and predicted reward is called the reward prediction error and is equal to *r* - *E*[*r*]. In general, weight adjustments are based on the correlation between the reward prediction error and *x*; *C* determines the sign of the adjustments. Neurons that respond more strongly to rightward motion will tend to have more positive weights because the reward prediction error tends to be positive on correct, rightward-choice trials (*C*(*r* - *E*[*r*]) = 1 - *E*[*r*] > 0) and incorrect, leftward-choice trials (*C*(*r* - *E*[*r*]) = *E*[*r*] > 0). Conversely, neurons that respond more strongly to leftward motion will tend to have more negative weights.

**Table 1.** Model variants. Bold indicates the version used in the main text. The statistical test asks whether the model-generated lapse rates and discrimination thresholds have single-exponential time constants that match the data (t-test based on linear regression parameters).

This formulation assumes that a single LIP neuron (or pool of neurons) forms the decision variable, the sign of which determines choice. Alternatively, one can consider two pools of decision neurons corresponding to each of the two choices^{27}. In this case, the decision variable is computed as the difference between the two pooled responses (with independent noise sources), and Eq. 3 is used to update two separate sets of weights (see **Supplemental Methods**). This formulation gives results comparable to the single decision pool (Supplemental Fig. 3), and therefore we use the simpler scheme here, which allows us to present and analyze a single set of weights for each simulation.

The term *E*[*r*] can be thought of as an estimate of the subject’s confidence, at the time of the decision, that the decision is correct and a reward will be obtained^{28}. We computed this estimate using the pooled response *y*: the sign of *y* determined the decision, whereas the absolute magnitude of *y* reflected confidence in that decision. Specifically, *y* was modelled after LIP and therefore assumed to be proportional to the log-odds ratio that rightward is the correct choice, given the sensory evidence *x* and equal priors^{17,29,30}. Thus, the estimated probability of a correct decision was computed as 1/(1+e^{-*β*|*y*|}), where *β* is a proportionality constant that can be estimated using a sequential estimation technique (or directly from the recent history of choices; see Supplemental Fig. 2)^{31}.
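The confidence mapping 1/(1+e^{-β|y|}) can be written directly:

```python
import numpy as np

def predicted_reward(y, beta):
    """E[r] = 1 / (1 + exp(-beta * |y|)): the estimated probability that
    the choice is correct, assuming y is proportional to the log-odds
    of the chosen direction being correct (equal priors)."""
    return 1.0 / (1.0 + np.exp(-beta * np.abs(y)))
```

Note that because the magnitude of *y*, not its sign, drives the estimate, confidence is at chance (0.5) when *y* = 0 and grows symmetrically for strong leftward or rightward pooled responses.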

After each update, we normalized *w* by (Σ*w*_{i}^{2}/*w*_{amp})^{1/2} to keep the vector length of *w* constant throughout training (so that Σ*w*_{i}^{2} = *w*_{amp}). This weight normalization, which is thought to be a common feature of cortical plasticity^{32}, enhanced the stability of the model by preventing any of the pooling weights from growing without bound. Moreover, it prevented the model from learning by indiscriminately increasing the overall magnitude of the weights, and therefore the magnitude of the pooled response (*y* in Eq. 1), to overcome the effects of the decision noise. The value of *w*_{amp} was chosen to give a pooled response similar to responses of LIP neurons in trained monkeys (Supplemental Fig. 9e).
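The normalization step amounts to a single rescaling of the weight vector:

```python
import numpy as np

def normalize_weights(w, w_amp):
    """Divide w by sqrt(sum(w_i^2) / w_amp) so that sum(w_i^2) == w_amp,
    fixing the vector length of w while preserving its direction."""
    return w / np.sqrt(np.sum(w**2) / w_amp)

# Toy usage: a weight vector of squared length 25 rescaled to length 1
w_scaled = normalize_weights(np.array([3.0, 4.0]), 1.0)
```

Because only the overall scale changes, the relative pattern of weights across the population, which is what carries the learned readout, is untouched.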

We simulated trial-by-trial performance of the model using the exact sequences of stimulus conditions (i.e., motion directions, coherences, and viewing durations for each monkey) used for training in our previous study ^{4}. We then compared the model to the real behavioural, MT, and LIP data.

Comparison of the model and behavioural data

The monkeys learned from experience to associate a given direction of motion with a particular eye-movement response. We quantified this form of associative learning using the lapse rate (errors for high-coherence stimuli; gray symbols). We assume that these errors reflect, in part, incomplete knowledge of the sensory-motor association. They do not reflect limitations in perceptual processing, because the high-coherence stimuli are easily discriminable even for naïve subjects, nor failures to meet the general task requirements, because this analysis included only trials in which the monkey maintained fixation during motion viewing and then responded with an appropriate eye movement (87.4% of all trials for monkey C, 88.7% for monkey Z). For both monkeys, the lapse rate declined rapidly over the first week or so of training, quickly reaching an asymptotic value near zero.

The monkeys also became increasingly sensitive to weak motion signals over many months of training. We quantified these improvements in perceptual sensitivity using the discrimination threshold (the motion coherence corresponding to ~81% correct responses at a viewing duration of 1 s; black symbols). Threshold was measured concurrently with lapse rate for all but the earliest training sessions, when performance was poor and only high-coherence stimuli were used. Thresholds decreased gradually, starting when low-coherence stimuli were introduced and continuing well after the monkeys had acquired the visuomotor association (i.e., near-zero lapse rates)^{4}.

The model can account for both the associative and perceptual changes. Using the same sequences of trials experienced by the monkeys, we used the model to generate a simulated sequence of choices during learning. We then computed the lapse rate (gray symbols; 250-trial blocks) and discrimination threshold (black symbols; 1,000-trial blocks) from the simulated behavioural data. As for the monkey data, lapse rates declined rapidly to near zero (time constants, *τ*_{la}, of exponential fits to the behavioural lapse rate, mean [68% confidence intervals, or CI] in number of trials for monkey C: 1,082 [1,080 1,175] for real data, 2,532 [1,839 3,658] for simulated data; monkey Z: 9,205 [9,139 9,564] for real data, 6,132 [6,000 6,539] for simulated data). Discrimination threshold improved more gradually, eventually reaching lower asymptotes comparable to those reached by the monkeys (time constants, *τ*_{th}, of exponential fits to behavioural threshold, mean [68% CI] in number of trials for monkey C: 28,317 [26,945 29,839] for real data and 26,841 [26,170 27,326] for simulated data; monkey Z: 18,422 [17,339 20,500] for real data, 22,467 [22,016 22,620] for simulated data). The lower asymptotes were also similar (mean [68% CI] in percent coherence for monkey C: 13.2 [12.7 13.7] for real data and 11.8 [11.7 12.0] for simulated data; monkey Z: 21.5 [19.7 22.5] for real data and 21.0 [20.3 21.5] for simulated data).

The time course of learning for the model depended critically on the learning rate (*α* in Eq. 2). We simulated performance of the model for both monkeys for a range of learning rates and estimated *τ*_{la} and *τ*_{th} from exponential fits. Both parameters decreased as the learning rate increased (black symbols), indicating that simulated performance improved more rapidly with training when larger learning rates were used. The values of *τ*_{la} and *τ*_{th} estimated from the monkeys’ behavioural data (solid lines; asterisks) were consistent with the relationship between *τ*_{la} and *τ*_{th} of the model for a particular learning rate for monkey C (linear regression, H_{0}: [*τ*_{la}, *τ*_{th}]_{monkey} = [*τ*_{la}, *τ*_{th}]_{model}, *p*=0.56). The match was not as good for monkey Z (*p*<0.05), whose lapse rate declined more slowly than for any of the other monkeys we have trained on this task^{4,33}, possibly reflecting factors other than knowledge of the sensory-motor association, such as distractibility (e.g., monkey Z took longer to attain fixation and broke fixation more often than monkey C during the first ~50 training sessions; see Supplementary Table 1 in ref. ^{4}). Below we use the learning rate that provides the best-matching *τ*_{la} and *τ*_{th} between the model and the monkeys (*α* = 7×10^{-7} for monkey C, and *α* = 1×10^{-7} for monkey Z).

These results were robust to a variety of pooling schemes and reinforcement rules used in the model (**Supplemental Figs. 1 and 2**). In brief, similar results were found using one or two pools of decision neurons; additive, multiplicative, or both kinds of pooling noise; multiplicative, subtractive, or no normalization of the linear weights; and linear or non-linear pooling. Likewise, similar results were found using different learning rules, as long as they were based on a correlation between sensory input and a reward prediction error. A qualitatively different scheme, in which pooling weights remained constant but a decision bound on the pooled signal varied with training, was unable to reproduce the pattern of behavioural results.

Changes in pooling weights optimize behaviour

The improvements in simulated discrimination performance resulted from changes in the pooling weights. The association between motion direction and decision direction was established early in training, when lapse rates had fallen to near zero but performance for weaker motion signals was still near chance (1^{st} arrow). At this early stage, pooling weights tended to be strongest, but of opposite signs, near the two directions of motion used in the discrimination task (second panel). As training progressed and simulated performance improved (2^{nd} and 3^{rd} arrows), the pooling weights continued to evolve such that weights to more sensitive neurons tuned to ~0° became more positive, and weights to more sensitive neurons tuned to ~180° became more negative (third and fourth panels). Thus, the improvements in sensitivity to weak motion appeared to result from an increasingly selective readout of the more sensitive sensory neurons with training.

The learning rule (Eq. 2) appeared to guide the pooling weights to a form of optimal linear readout at the end of training. We computed the optimal pooling weights for the population of sensory neurons in our simulations using Fisher’s linear discriminant analysis, which maximizes percent correct for the discrimination task for any linear decoder without knowledge about the correlation structure in MT (see also **Supplemental Methods** and Supplemental Fig. 4). We compared the pooling weights of the model (gray symbols) to the optimal pooling weights (red lines) for neurons with two different levels of sensitivity (top row, mean neurometric threshold = 8.7% coherence; bottom row, 16.5% coherence). In both cases, the pooling weights approached the forms predicted by the optimal scheme, with the largest absolute weights assigned to neurons tuned to the two discriminated directions of motion and then falling off gradually as a function of direction tuning and motion sensitivity. These optimal weights define a decision boundary that most effectively separates the high-dimensional, population MT responses for leftward versus rightward motion (Supplemental Fig. 5).
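In its textbook form, the Fisher-optimal linear readout can be computed from the class means and the shared response covariance; this is a generic sketch of that computation, not the paper’s exact estimation procedure:

```python
import numpy as np

def fisher_weights(mean_right, mean_left, cov):
    """Fisher's linear discriminant: w proportional to cov^-1 (mu_R - mu_L).

    mean_right, mean_left : mean population responses to the two directions
    cov                   : shared response covariance across the population
    """
    return np.linalg.solve(cov, mean_right - mean_left)

# Toy usage: two neurons, only the first carries a direction signal
w_opt = fisher_weights(np.array([1.0, 0.0]),
                       np.array([-1.0, 0.0]),
                       np.eye(2))
```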

Relationship with choice probability in MT

Choice probability is a measure of the relationship between trial-by-trial fluctuations in the activity of individual neurons and choice behaviour^{34,35}. For neurons in area MT, choice probability tends to be near chance early in training and then progresses steadily to values slightly but reliably above chance after training, consistent with a noisy contribution to the direction decision^{4,21,35}. The model shows a similar, steady increase of choice probability with training for neurons tuned to the two directions of motion, as the pooling weights of these sensory neurons are adjusted to drive the decision process more effectively.

However, the changes in pooling weights alone were not sufficient to account for another key feature of the real MT choice-probability data: a selective increase with training for the most sensitive neurons^{4}. As noted previously, the strength of readout (here implemented as the pooling weights) only partially determines choice probability. Another key factor is interneuronal correlations, which must be present for the activity of any one neuron in the population to be predictive of behaviour^{21}. We examined how choice probability was affected by the structure of interneuronal correlations, which we assumed remained static throughout training but could depend on the similarity of response properties of pairs of sensory neurons^{4,36,37}. These assumptions had little effect on the pooling weights learned by the model (Supplemental Fig. 4) but substantial effects on choice probability. When correlation strength depended on the similarity of direction tuning but not sensitivity between pairs of neurons, simulated choice probability was insensitive to neurometric sensitivity throughout training. In contrast, when the strength of correlations between pairs of neurons in our model decreased as their direction tuning and sensitivity became less similar, simulated choice probability matched the MT measurements and increased selectively for the most sensitive neurons. Thus, dynamic pooling weights and static interneuronal correlations that both depended on neuronal sensitivity could together account for changes in the relationship between MT activity and choice behaviour throughout training.
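Choice probability for a simulated neuron can be computed with the standard ROC comparison of its responses conditioned on the two choices (a generic sketch of the analysis, not the paper’s exact code):

```python
import numpy as np

def choice_probability(resp_right, resp_left):
    """Probability that a random response drawn from rightward-choice
    trials exceeds one drawn from leftward-choice trials (area under
    the ROC curve); 0.5 indicates no choice-related modulation."""
    r = np.asarray(resp_right, float)[:, None]
    l = np.asarray(resp_left, float)[None, :]
    return np.mean(r > l) + 0.5 * np.mean(r == l)
```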

Relationship with decision-related activity in LIP

The selective changes in the pooling weights caused improvements in the pooled response (*y* in Eq. 1) that were somewhat similar to changes in motion-driven responses of individual LIP neurons measured during training^{4}. As in LIP, the simulated pooled response reflected the direction decision throughout training, increasing from zero to more positive values with increasing viewing time on trials in which a rightward decision was made, and decreasing from zero to more negative values on trials in which a leftward decision was made. With training, the pooled response also became increasingly dependent on motion strength, increasing more steeply as a function of viewing time for rightward decisions and decreasing more steeply for leftward decisions.

However, there were two important differences between the pooled response in our model and LIP activity. First, the pooled response in our model grew roughly linearly as a function of viewing duration. In contrast, LIP activity tended to increase early in motion viewing but then saturate at a threshold level^{4}. This saturation is thought to represent the termination of a decision process, particularly for response-time versions of the task^{12,30}. However, for our task, in which we controlled the viewing duration, a decision model with no such termination is sufficient to account for behavioural performance over the range of viewing durations used^{4,38}. Moreover, an alternative learning model based on a dynamic termination process could not account for the behavioural data (Supplemental Fig. 1).

The second difference was that our model tended to overestimate the signal-to-noise ratio (SNR; the difference in mean responses to the two directions of motion divided by their common standard deviation) of individual LIP neurons throughout training. We computed the SNR of the pooled response of the model as a function of training for two coherence levels for each monkey (black symbols). We also computed the SNR of individual LIP responses using a receiver operating characteristic (ROC) analysis that quantified how well the responses distinguished between the two direction choices, assuming Gaussian noise (red symbols). The SNRs of both the simulated and real LIP responses grew in a similar manner with training (correlation coefficients between SNRs of the simulated and real LIP responses for monkey C: *r*_{51.2%}=0.65, H_{0}: *r*=0 using Fisher’s Z transformation, *p*<10^{-15}; *r*_{12.8%}=0.32, *p*<10^{-6}; monkey Z: *r*_{99.9%}=0.41, *p*<10^{-3}; *r*_{25.6%}=0.17, *p*=0.15), approaching the upper bound achieved by the optimal pooling weights (dotted line). However, the real LIP responses tended to have a slightly lower SNR than the simulated responses (paired t-test, H_{0}: SNR_{data}-SNR_{model}=0; *p*_{51.2%}=0.05 and *p*_{12.8%}=0.10 for monkey C; *p*_{99.9%}<10^{-22} and *p*_{25.6%}<10^{-10} for monkey Z). One likely cause of this discrepancy is that our model represents a pooled LIP signal, which is less noisy than individual neurons. In addition, the simulated response represents the difference in activity between two populations of LIP neurons supporting leftward and rightward decisions, respectively, and thus is less noisy than either component measured separately. Accordingly, an alternative model with separate LIP populations corresponding to the two direction decisions had SNRs that more closely matched the data (Supplemental Fig. 3f).
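The SNR definition used above (difference in means over the common standard deviation, i.e., a d′-like quantity) can be written as a short sketch:

```python
import numpy as np

def snr(resp_right, resp_left):
    """Difference in mean responses to the two directions divided by
    their common (pooled) standard deviation."""
    r = np.asarray(resp_right, float)
    l = np.asarray(resp_left, float)
    pooled_sd = np.sqrt(0.5 * (r.var(ddof=1) + l.var(ddof=1)))
    return (r.mean() - l.mean()) / pooled_sd
```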

Perceptual learning on a fine discrimination task

The coarse discrimination task from our previous study, modelled above, used two alternative motion directions separated by 180°. In this case, the most informative neurons are those that respond most strongly to those directions (red and blue triangles)^{39,40}. We tested whether the model can also account for improved perceptual sensitivity on a fine discrimination task, in which the two alternative motion directions are separated by a much smaller amount (red and blue triangles). In this case, the most informative neurons are not those that respond most strongly to the presented stimuli but rather those tuned ~40° away from those directions^{34,39,40}.

We performed simulations similar to those above, using the same learning rate as for the coarse discrimination task (*α*=7×10^{-7}) but training the model with ±10° motion stimuli instead of 0° and 180°. Unlike the weight profiles for the coarse task, changes in the pooling weights for this task were not centred on neurons tuned to the direction of motion of the stimulus (±10°). Instead, the strongest weights developed around neurons tuned to values offset from the stimuli by ~±40°, consistent with previous reports and reflecting the direction tuning width of individual MT neurons (Supplemental Fig. 7)^{34,39,40}. By the end of training, which for this more difficult task took approximately twice as many trials as the coarse task, the weights were similar to the optimal readout. Thus, the learning model is not specific to the coarse discrimination task but instead can find the most informative neurons - which are not necessarily the most responsive neurons - to solve at least two different types of discrimination task.
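The flank offset has a simple intuition: for fine discrimination, a neuron’s usefulness scales with the slope of its tuning curve at the discriminated directions, which for a Gaussian-shaped curve peaks roughly one tuning bandwidth away from the preferred direction. A toy illustration with an assumed 40° bandwidth (an illustrative parameter, not a fitted one):

```python
import numpy as np

SIGMA = 40.0  # illustrative direction-tuning bandwidth (deg)

def tuning(theta, pref):
    """Gaussian direction-tuning curve (arbitrary response units)."""
    return np.exp(-0.5 * ((theta - pref) / SIGMA) ** 2)

# Slope of each neuron's tuning curve at the stimulus direction (0 deg),
# as a function of its preferred direction: largest for flank neurons.
prefs = np.linspace(-180.0, 180.0, 721)
eps = 1e-3
slopes = np.abs(tuning(eps, prefs) - tuning(-eps, prefs)) / (2 * eps)
best_pref = prefs[np.argmax(slopes)]  # falls near +/- SIGMA, not 0
```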

Specificity of learning

A key feature of many forms of perceptual learning is that the improvements tend to be specific to the stimulus configuration used during training, including for coarse and fine direction discrimination tasks^{4,41,42}. This specificity helps to distinguish improvements in perceptual sensitivity from associative and other forms of learning. The stimulus configuration we used to train the monkeys on the coarse task tended to use a roughly horizontal axis of motion but varied somewhat from session to session, depending on the direction and spatial tuning properties of the MT and/or LIP neuron being recorded. Under those conditions, improvements in performance tended to be largest for sessions in which the axis of motion was similar to values used in previous sessions^{4}. Our model, trained on the same sequence of motion axes, showed similar specificity.

We examined the specificity of learning in more detail for both the coarse and fine discrimination tasks. After training the model using a single pair of simulated motion directions, we measured both lapse rate and discrimination threshold using different directions. For the coarse task, lapses were mostly absent except near 90° from the trained directions, consistent with weight profiles that distinguish the two alternatives across a broad range of directions. In contrast, perceptual sensitivity degraded steadily as a function of distance from the trained directions, comparable to the decline in weights over that range. For the fine task, both lapse rate and discrimination threshold showed a higher degree of stimulus specificity, reflecting narrower weight profiles and the overall greater difficulty of that task.

Predictions

Our model makes several testable predictions. First, these forms of perceptual learning are driven by a reward prediction error. According to the model, this error signal depends on both motion coherence (which generates the prediction) and reward feedback, and it should be present even after acquisition of the sensory-motor association. We predict that this error signal is encoded by brainstem dopaminergic neurons, which encode a similar error signal during conditioning tasks^{6,7}. Second, interneuronal correlations in MT should depend on neurometric sensitivity. Third, the specificity of learning on the coarse and fine discrimination tasks should depend on the direction axis in the manner described above. Fourth, learning should be fastest if strong motion stimuli are used early in training and weaker stimuli later (consistent with ref. ^{43}), because the error signal depends on the value of the pooled MT response and is therefore noisier for weaker motion, particularly early in training (Supplemental Fig. 8).