Reward expectation and reward prediction errors are thought to be critical for dynamic adjustments in decision-making and reward-seeking behavior, but little is known about their representation in the brain during uncertainty and risk-taking. Furthermore, little is known about what role individual differences might play in such reinforcement processes. In this study, it is shown that behavioral and neural responses during a decision-making task can be characterized by a computational reinforcement learning model and that individual differences in learning parameters in the model are critical for elucidating these processes. In the fMRI experiment, subjects chose between high- and low-risk rewards. A computational reinforcement learning model computed expected values and prediction errors that each subject might experience on each trial. These outputs predicted subjects’ trial-to-trial choice strategies and neural activity in several limbic and prefrontal regions during the task. Individual differences in estimated reinforcement learning parameters proved critical for characterizing these processes, because models that incorporated individual learning parameters explained significantly more variance in the fMRI data than did a model using fixed learning parameters. These findings suggest that the brain engages a reinforcement learning process during risk-taking and that individual differences play a crucial role in modeling this process.
In order to maximize rewards during decision-making, organisms can estimate expected rewards, or value, of various decision options and continually update their expectations according to outcomes of their decisions. In the past half century, reinforcement learning theory has emerged as a powerful tool to characterize how organisms acquire such reward expectations and how they can use outcomes of their decisions to adjust those expectations (Sutton and Barto, 1998; Camerer, 2003; Schultz, 2004). In typical reinforcement learning models, ‘weights’ represent expected outcomes of each decision option, and thus decision options with stronger weights become preferred and are more likely to be chosen than are decision options with relatively weaker weights. The difference between the expected outcome (e.g. reward) and the received outcome is termed a prediction error, and can be used to adjust decision option weights so they better reflect the true reward value of the chosen decision. Thus, these two variables—weights and prediction errors—are distinct but related, and together form a simple mechanism by which organisms can dynamically adjust their decision-making based on reinforcements (Cohen and Ranganath, under review).
Neuroscientists have suggested that reward prediction errors are encoded in structures including midbrain dopamine regions, the cingulate cortex and ventral striatum. In particular, phasic increases in activity are observed when reinforcements are better than expected (a positive prediction error), and phasic decreases in activity are observed when reinforcements are worse than expected or not given (a negative prediction error) (Schultz et al., 1997; Waelti et al., 2001; Daw et al., 2002; Holroyd and Coles, 2002; O’Doherty et al., 2003; Schultz, 2004; Seymour et al., 2004; Rodriguez et al., 2005; Abler et al., 2006).
In contrast, neural representations of expected rewards (termed ‘weights’ in reinforcement learning models) are thought to be housed in the orbitofrontal cortex and amygdala: activity in these regions is sensitive to the relative preference of rewards, suggesting that these regions might encode the expected values or relative motivational significance of different decision options (Tremblay and Schultz, 1999; Hikosaka and Watanabe, 2000; Hollerman et al., 2000; Kringelbach et al., 2003). Together, these findings suggest a neuroanatomical distinction between prediction errors and expected rewards (Haruno and Kawato, 2006). Thus, the first goal of this study was to test whether a reinforcement learning model could be used to uncover representations of expected rewards and reward prediction errors in an environment that involved decision-making under uncertainty.
The second goal of this study was to test the role of individual differences in these processes. Specifically, reinforcement learning models have learning rates that describe how the prediction error adjusts the weights (equations provided in the ‘Methods’ section): a large learning rate means that the prediction error strongly influences the adjustment of the weight, whereas a small learning rate (e.g. close to 0) means that the prediction error only slightly influences the weights. These parameters are typically selected a priori and fixed across subjects (e.g. O’Doherty et al., 2003; Seymour et al., 2004). These models have provided powerful insights into the neural computations of a prediction error, although they are traditionally tested either in passive learning or in simple choice tasks in which there is a ‘best’ or correct response. However, fixing these parameters to be constant across all subjects might not be appropriate in more complex situations, such as those that involve decision-making under uncertainty or risk, in which different individuals might interpret the same reinforcement in different ways. For example, after losing a high-risk gamble, some people might avoid another high-risk gamble, whereas others might continue seeking high-risk gambles (Cohen and Ranganath, 2005). Reinforcement learning models with fixed learning parameters do not capture this inter-subject variability because fixed learning parameters assume that all subjects interpret and use reinforcements in the same way to update weights of decision options. However, these parameters can be empirically estimated for each subject based on their behavioral data and used to characterize behavioral and neural processes (e.g. Paulus and Frank, 2006). Here, the performance of models that use fixed learning rates is compared with that of models that use individually estimated learning rates, in order to determine the importance of individual differences in reinforcement learning processes.
Seventeen subjects (aged 22–27 years, eight males) were scanned while engaged in a decision-making task in which on each trial they chose either a high-risk (40% chance of $2.50 and 60% chance of $0.00) or a low-risk (80% chance of $1.25 and 20% chance of $0.00) decision option. Subjects were told the probabilities and amounts of each decision option prior to the start of the experiment, and they practiced for several minutes before scanning began. This training minimized early learning and guessing processes that may have affected performance and brain activity during the early phases of the task. Thus, this task is useful for studying how reinforcements are used to adjust behavior on the trial-by-trial level rather than examining how learning optimal response patterns occurs over a longer time scale.
On each trial, subjects first saw a visual cue for 400 ms that indicated that the trial began. They indicated their decision to choose the high- or low-risk decision option either by pressing a button or withholding a response, depending on the shape of the cue (press to indicate a high-risk decision if the cue was a square, or withhold a response to indicate a high-risk decision if the cue was a circle). This was done to prevent subjects from planning their motor responses before the trial began. Results did not differ according to this manipulation, and these conditions were thus collapsed. Additional control trials were included in which subjects simply made a response (i.e. no decision was involved). These trials are not discussed in the present article. An inter-trial interval of 2–8 s (jittered) separated each trial. There were 300 trials spaced over eight scanning runs. Other, nonoverlapping results from this data set are reported elsewhere (Cohen and Ranganath, 2005).
MRI data were collected on a 1.5 T GE Signa scanner at the UC Davis Research Imaging Center. Functional imaging was done with a gradient echo planar imaging (EPI) sequence (TR = 2000, TE = 40, FOV = 220, 64 × 64 matrix, voxel size = 3.475 × 3.475 × 5 mm³, 22 oblique axial slices). Coplanar and high-resolution T1-weighted images were acquired from each subject. EPI data were realigned to the first volume, coregistered with the anatomical scan, spatially normalized to Montreal Neurological Institute (MNI) space (Brett et al., 2002), resampled to 3.5 mm isotropic voxels, and spatially smoothed with an 8 mm FWHM kernel using SPM99 software.
The model contains the following components: (1) Weights for each decision option (whigh-risk and wlow-risk for high- and low-risk decision options). Weights are thought to index expected rewards or subjective values, but are here termed weights for consistency with the machine learning literature (Sutton and Barto, 1998); (2) A prediction error signal (δ) generator. The prediction error node takes as input the weight of the chosen decision option and the actual reward received, and sends the difference between these two as output back to the weights (equation in the following paragraph). Thus, outcomes that are ‘better than expected’ yield positive prediction errors and increase the weight of the chosen decision option, and outcomes that are ‘worse than expected’ yield negative prediction errors and thus decrease the weight of the chosen decision option.
The model adjusts its weights as follows: The weight on trial t + 1 is the weight on trial t plus the prediction error on trial t: w(t + 1) = α × w(t) + η × δ(t). Thus, when the prediction error is positive (which occurs after a reward is received), the weight on the next trial (w (t + 1)) increases. Importantly, the weight is scaled by α, a discount parameter (sometimes called a ‘forgetting’ parameter), and the prediction error is scaled by η, the learning rate. These parameters can be estimated based on subjects’ behavioral data (see the following text). The learning rate associated with each weight can take on one of three values on each trial: 0 when the decision option was not chosen, and, when the decision option was chosen, ηreward and ηnon-reward for trials in which subjects received or did not receive a reward, respectively. Having separate parameters provides flexibility for the model to respond to different outcomes in different ways. In other words, high-risk wins need not be treated as equal to low-risk wins. Values of 1.25, 2.5 and 0 were used to represent low-risk rewards, high-risk rewards and non-rewards, respectively. Although the relative scaling of the two rewards is important (because the magnitude of the high-risk reward is twice as much as that of the low-risk reward), the actual numerical values are arbitrary with respect to the fMRI analyses, and the results would not be different if reward values were, for example, 125 and 250.
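The update rule above can be expressed compactly in code. The following is a minimal sketch (in Python rather than the MATLAB implementation described later; function and variable names are illustrative, not from the original implementation). Note that it applies the discount α to both weights and the learning rate η only to the chosen option, which is one reading of the equation together with the statement that the learning rate is 0 for unchosen options.

```python
import numpy as np

def update_weights(w, choice, reward, eta_reward, eta_nonreward, alpha):
    """One trial of the weight update: w(t+1) = alpha * w(t) + eta * delta(t).

    w      : length-2 array of weights, e.g. [w_high_risk, w_low_risk]
    choice : index (0 or 1) of the chosen decision option
    reward : outcome received (0 for non-reward, 1.25 or 2.5 for rewards)
    """
    w = np.asarray(w, dtype=float)
    delta = reward - w[choice]                    # prediction error
    eta = eta_reward if reward > 0 else eta_nonreward
    w_next = alpha * w                            # discount ('forgetting') both weights
    w_next[choice] += eta * delta                 # learning rate is 0 for the unchosen option
    return w_next, delta
```

A positive δ (an outcome better than expected) raises the chosen option's weight on the next trial; a negative δ lowers it, exactly as described above.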
Three models were compared: a model in which all parameters were estimated individually for each subject (the ‘individual differences’ model); a model that used the average parameters across all subjects (the ‘group’ model), such that parameters were empirically estimated but were fixed across all subjects (parameters were: high-risk/reward: 0.033; high-risk/nonreward: 0.213; low-risk/reward: 0.201; low-risk/nonreward: 0.137; discount: 0.753); and finally a model that used fixed, a priori selected parameters for all subjects (the ‘fixed’ model). For the fixed model, α was set to 0.99 and η was set to 0.7. These parameters have been used previously (O’Doherty et al., 2003). The purpose of comparing these models was to evaluate the results that would be obtained if one used the model in different ways.
To estimate these parameters for each subject, an iterative maximum likelihood minimization procedure (Luce, 1999; Barraclough et al., 2004; Cohen and Ranganath, 2005) was implemented in MATLAB. On each iteration, the model takes the behavioral choices and outcomes for each subject and computes the probability of the subject choosing the high-risk decision on each trial as a function of the difference of the logarithms of the weights.
The procedure uses the nonlinear, unconstrained Nelder–Mead simplex method (Lagarias et al., 1998) to find values of the learning parameters that maximize the sum of p(t)high-risk or p(t)low-risk across the experiment (depending on the decision made by the subject on trial t). Learning parameters are adjusted on each iteration until further iterations and adjustments do not improve the model. Weights are each set to 1 at the start of each iteration, and 0.5 is used as the starting value for all parameters, although the initial values had negligible effects on their final estimates. There was an average of 479.2 iterations (SD: 259.8, range: 218–1034) until convergence. Note that the criterion for optimizing learning rates does not involve directly comparing weights or prediction errors and actual decisions made by the subjects, and is completely orthogonal to the fMRI data, and so comparing results from the models is not redundant with how the parameters were estimated.
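The estimation procedure can be sketched as follows. This is a hedged reconstruction, not the original MATLAB code: the exact choice-probability equation is not reproduced above, so the sketch assumes a logistic function of the log-weight difference; it minimizes the negative log-likelihood (a common variant) rather than maximizing the sum of probabilities; and the absolute value on the learning rates and the clipping of weights to positive values are numerical safeguards not described in the original procedure. `scipy.optimize.minimize` with the Nelder–Mead method plays the role of the simplex search.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards):
    """Negative log-likelihood of one subject's choice sequence.

    params  : [eta_reward, eta_nonreward] per option, plus discount alpha
    choices : 0 (high-risk) or 1 (low-risk) on each trial
    rewards : reward received on each trial (0, 1.25 or 2.5)
    """
    eta = np.abs(np.asarray(params[:4])).reshape(2, 2)  # eta[option][0=reward, 1=nonreward]
    alpha = params[4]
    w = np.ones(2)                        # weights reset to 1 on each iteration
    nll = 0.0
    for c, r in zip(choices, rewards):
        # assumed choice rule: logistic of the difference of the log-weights
        p_high = 1.0 / (1.0 + np.exp(np.log(w[1]) - np.log(w[0])))
        p_choice = p_high if c == 0 else 1.0 - p_high
        nll -= np.log(max(p_choice, 1e-12))
        delta = r - w[c]                  # prediction error
        w = alpha * w
        w[c] += eta[c, 0 if r > 0 else 1] * delta
        w = np.clip(w, 1e-6, None)        # keep weights positive for the log
    return nll

# example fit on synthetic data, with 0.5 starting values as in the text
rng = np.random.default_rng(0)
choices = rng.integers(0, 2, 100)
rewards = np.where(rng.random(100) < 0.5, 0.0,
                   np.where(choices == 0, 2.5, 1.25))
fit = minimize(neg_log_likelihood, 0.5 * np.ones(5),
               args=(choices, rewards), method='Nelder-Mead')
```

Because the simplex search starts from the supplied values, the fitted likelihood can never be worse than that of the 0.5 starting point, mirroring the iterative improvement described above.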
To examine putative neural representations of prediction errors and weights, each model was fed the unique history of decisions and reinforcements from each subject, and calculated a reward prediction error and difference in the weights for the two decision options on each trial of the experiment.1 In this study, the difference between the weights, rather than the weights themselves, is used because decision options were not associated with unique behavioral responses, and the brain likely does not house separable representations for ‘high-risk’ and ‘low-risk’ decision options. Because the two decision options have equal mathematical expected values (i.e. the magnitude of reward times the probability of reward is one dollar for each option), this difference term may correspond to trial-by-trial changes in relative subjective value or motivational significance of the two decision options. These vectors of model outputs were then convolved with each subject's empirically derived hemodynamic response function (obtained from a separate visual-motor response task) (Aguirre et al., 1998; Handwerker et al., 2004) to produce a unique expected blood oxygenation-level dependent (BOLD) response to these terms for each subject. The procedure is illustrated in Figure 1. To the extent that the BOLD response in a particular voxel correlates with this independent variable, the voxel covers tissue in which activity may reflect or be modulated by prediction errors as defined by the model. This method has been previously used to study the putative neural correlates of prediction errors (O’Doherty et al., 2003; O’Doherty et al., 2004; Seymour et al., 2004; Tanaka et al., 2004; Glascher and Buchel, 2005; Haruno and Kawato, 2006).
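The regressor-construction step illustrated in Figure 1 can be sketched as follows. Note that the study convolved model outputs with each subject's empirically derived HRF; the canonical double-gamma HRF below is a generic stand-in for illustration only, and all function names and parameters are assumptions.

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr=2.0, duration=30.0):
    """Double-gamma HRF sampled at the TR (a generic stand-in for the
    subject-specific empirical HRFs used in the study)."""
    t = np.arange(0, duration, tr)
    peak = gamma.pdf(t, 6)           # positive response peaking ~6 s
    undershoot = gamma.pdf(t, 16)    # late undershoot
    hrf = peak - undershoot / 6.0
    return hrf / hrf.max()

def model_regressor(trial_values, n_scans, onsets_scans, tr=2.0):
    """Convolve trial-wise model outputs (e.g. prediction errors) with
    the HRF to form an expected BOLD regressor."""
    stick = np.zeros(n_scans)
    stick[onsets_scans] = trial_values            # place each value at its trial onset
    reg = np.convolve(stick, canonical_hrf(tr))[:n_scans]
    return reg - reg.mean()                       # mean-center for the GLM
```

The resulting vector is the independent variable whose correlation with a voxel's BOLD time course is tested in the GLM described next.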
In the analysis, all of the task variables (combinations of high- and low-risk rewards and nonreward and a no-decision control condition) and the vectors calculated by the model were included as independent variables. The task variables were included to remove any possible shared variance between normal task covariates and the prediction error and weight regressors. All variables were centered on a mean of zero. Separate general linear models (GLMs) were conducted for each model. Results of single-subject analyses were maps of statistical values, where the value at each voxel is the parameter estimate (unstandardized β) of the relation between the BOLD response in that voxel and the independent variable (e.g. prediction error). In the present analyses, two maps were of interest: the prediction error and difference in weights. Group-level analyses were conducted by entering these maps into a one sample t-test, in which the β-estimate at each voxel across subjects was tested against zero, and subject was treated as a random variable. Significant activations were identified with a two-tailed threshold of P < 0.001 and a cluster threshold of five contiguous voxels. In the fMRI behavior correlation, one subject was removed from analyses because the behavior β-value was over three SDs above the mean (Figure 2c). However, results with this subject included were very similar.
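At the single-voxel level, the parameter estimates described above amount to an ordinary least-squares fit of the BOLD time series on the design matrix. A minimal sketch (synthetic data; the real analysis included all task covariates alongside the model-derived regressors, and SPM99 additionally handles temporal autocorrelation and filtering):

```python
import numpy as np

def glm_betas(bold, regressors):
    """Unstandardized beta for each regressor at one voxel.

    bold       : (n_scans,) BOLD time series
    regressors : (n_scans, k) mean-centered design-matrix columns
    """
    X = np.column_stack([regressors, np.ones(len(bold))])  # add an intercept
    betas, *_ = np.linalg.lstsq(X, bold, rcond=None)
    return betas[:-1]                                      # drop the intercept
```

The per-subject beta maps produced this way are what enter the one-sample t-test at the group level.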
To examine the correspondence between the model and subjects’ behavioral choices, behavioral responses were compared with prediction errors and weights generated by the individual differences and fixed models. Responses were coded as 0 (safe decision) or 1 (risky decision) and were smoothed with a running-average filter with a 10-trial kernel to produce a continuous vector that reflects the local fraction of choices selected. Such methods are often used to examine correspondence between model predictions and behavioral selections (Sugrue et al., 2004; Bayer and Glimcher, 2005; Samejima et al., 2005). Because of autocorrelations induced by the smoothing, data were analyzed with autoregression, which estimates both the autocorrelation coefficient [using AR(1)] and the regression parameters that are independent of autocorrelation present in the data. Greenhouse–Geisser corrections to degrees of freedom were used in ANOVAs of behavioral fits.
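The smoothing step can be sketched with a simple boxcar filter (a sketch only; the exact edge handling used in the study is not specified, and `mode='same'` is an assumption):

```python
import numpy as np

def local_fraction(choices, kernel=10):
    """Running average of a 0/1 choice vector: the local fraction of
    risky choices over a 10-trial window."""
    k = np.ones(kernel) / kernel
    return np.convolve(np.asarray(choices, dtype=float), k, mode='same')
```

The smoothed vector is what the autoregression relates to the model's weights; the AR(1) term absorbs the serial correlation that this filtering induces.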
If subjects chose the decision option with the stronger weight, as reinforcement learning theory suggests, the model's calculated weights should correlate with subjects’ trial-to-trial choices. This was tested by computing the autoregression with each subject's local fraction of high-risk choices and the model's calculated weight of the high-risk option for each trial. This β-coefficient was significantly greater than zero across subjects (average β = 0.20, t16 = 2.6, P = 0.01) (Figure 2a and c). The average β for the analysis with the fixed model was smaller, although still significant across the group (average β = 0.09, t16 = 2.5, P = 0.01). The average β for the analysis with the group model was not different from zero (average β = −0.08, t16 = 2, P = 0.054). A repeated-measures ANOVA on these β-values using ‘model’ as factor revealed that these fits were significantly different (F1.4,22.8 = 7.04, P = 0.008), such that the individual differences model yielded greater β-values than those of the group model (P = 0.01) and the fixed model yielded greater β-values than those of the group model (P = 0.01).
It was further predicted that prediction error signals are used to guide decision-making (Cohen and Ranganath, under review). In particular, if negative prediction errors indicate that reinforcements are worse than expected, these negative prediction errors might signal a need for adjustments in behavior; larger prediction errors should therefore signal greater need for behavioral adjustments. If this is the case, the model's calculated prediction error on each nonreward trial should predict subjects’ choices in the subsequent trial. This was operationalized as whether, following each nonreward, subjects chose the same vs the opposite decision option on the following trial as on the current one (e.g. when not receiving a high-risk reward on trial n, does the subject choose another high-risk reward or a low-risk reward on trial n + 1?). The β-coefficient between this trial-to-trial strategy and the model's calculated prediction error on each of these trials was not significantly different from zero across subjects (average β = 0.78; t16 = 0.32). This occurred because for some subjects the β-coefficient was positive and for other subjects this β-coefficient was negative (Figure 2b and c). That is, for some subjects, larger prediction errors following nonrewards were associated with an increased probability of behavioral switches, whereas for others the opposite was the case. This seemingly counterintuitive variability is significantly related to individual variability in the neural correlates of the prediction error, as described in the following section. The fixed model also showed no significant β-coefficient across subjects (average β = −0.0003; t16 = −0.0008). The group model, however, showed a significant β-coefficient across subjects (average β = 0.581; t16 = 4.26, P = 0.001). 
A repeated-measures ANOVA revealed that these fits were significantly different (F1.3,21.4 = 5.8, P = 0.017) such that the group model yielded greater β-values than those of the individual differences model (P = 0.037), and of the fixed model (P < 0.001).
For the individual differences model, activations were observed in the right amygdala extending into the hippocampus, right orbitofrontal cortex extending into the ventral striatum, bilateral caudate, bilateral thalamus/putamen, bilateral dorsolateral prefrontal cortex and cerebellum (Figure 3d). Figure 4a displays an example BOLD time course and weight vector (convolved with a hemodynamic response) to illustrate the correlation. No deactivations (i.e. more activity for the weight of the low-risk option compared to the high-risk option) were observed. Table 1 lists activation foci for this and all other analyses reported here.
The group model yielded activations that were largely overlapping with those observed for the individual differences model: bilateral posterior orbitofrontal cortex/subgenual cingulate (BA 11/25) as well as anterior orbitofrontal cortex (BA 11), right ventrolateral prefrontal cortex (BA 47), bilateral thalamus, bilateral dorsal prefrontal cortex (BA 44/45) and parietal cortex (BA 39) (Figure 3e).
Finally, for the fixed model, activations were observed in right temporal cortex and left dorsolateral prefrontal cortex (BA 46) and right middle temporal gyrus and superior parietal gyrus.
Next, the performance of the models is formally compared by testing whether the difference between β-values at each voxel produced by different models (e.g. group model results − fixed model results) was significantly greater or less than zero. There were no differences between the individual differences and group maps. Comparing the individual differences and fixed maps revealed regions with significantly higher β's in the left cerebellum and right thalamus. Comparing the group and fixed models revealed several regions with higher β-values for the group model, including bilateral caudate, posterior orbitofrontal cortex, cerebellum, thalamus and bilateral prefrontal cortex (Figure 5a).
For the individual differences model, activity in several regions was significantly positively correlated with the reward prediction error signal, including the left midbrain (anatomically consistent with a source in the substantia nigra), dorsal cingulate cortex, bilateral prefrontal cortex (BA 6) and right cuneus (BA 18/19) (Figure 3a and b). Figure 4b displays an example BOLD time course and reward prediction error vector (convolved with a hemodynamic response) to illustrate the correlation. In addition to activations (i.e. positive correlations with the reward prediction error signal), there were also deactivations (i.e. inverse correlations with the reward prediction error signal) in the head of the right caudate, left middle temporal gyrus (BA 21) and left angular gyrus.
For the group model, activations were observed in bilateral amygdala, ventral striatum extending in the caudate in the left hemisphere, anterior cingulate and medial supplementary motor cortex, posterior cingulate and anterior prefrontal cortex (BA 10) (Figure 3c).
Finally, the fixed model produced no significant activations, but a deactivation was observed in ventrolateral prefrontal cortex (PFC) (BA 47).
Next the performance of the models was directly compared. There were no differences between the individual differences and group maps. However, comparing the individual differences and fixed maps revealed regions with significantly higher β's in the dorsal cingulate and ventrolateral prefrontal cortex (Figure 5c). Comparing the group and fixed models yielded largely similar results, although additional regions exhibited higher β-values for the group model including the right ventral striatum and orbitofrontal cortex (Figure 5b).
The variability in the relationship between prediction errors and behavioral strategies (Figure 2c) suggests that individuals differed in how they used the prediction error signal to guide decision-making. Thus, these individual differences might reflect differences in the neural representation of the prediction error signal. To test this, the β-value between the prediction error term and the local fraction of stay/switch choices of the subjects (e.g. the relationships depicted in Figure 2b) was used as an independent variable in a regression with the statistical brain activation maps of correlates of the prediction error. Cross-subject variability in each of these analyses reflects differences in the representation and use of prediction errors during the task, and thus significant brain activations in this analysis indicate that differences in how prediction errors guide behavior predict differences in how prediction errors might be represented in the brain. As seen in Figure 6, the behavioral correlation significantly predicted the model's fit to the fMRI data in bilateral ventral striatum, orbitofrontal cortex and prefrontal cortex.2
Next, this individual differences correlation analysis was run using β-coefficients between the group model's prediction errors and subjects’ stay/switch behaviors. In contrast to the findings obtained from the individual differences model, no activations were observed, even at a liberal threshold of P < 0.01, uncorrected. Finally, no significant activations were observed for the fixed model.
Here, evidence is provided suggesting that, during decision-making under uncertainty, two variables predicted by reinforcement learning theory and estimated using computational modeling—reward prediction errors and decision option weights—are encoded in a network of cortical and subcortical brain structures and used to guide decision-making. Consideration of individual differences proved central to elucidating the behavioral and neural correlates of reinforcement learning because the individual differences and group models explained significantly more variance in the fMRI data than did the fixed model.
Activity in several regions including the amygdala, orbitofrontal cortex/ventral striatum and caudate nucleus correlated with the model-derived estimate of the difference in weights. Neurons in these regions are known to encode the relative value of rewards as well as expectations of rewards. For example, orbitofrontal and amygdala neurons show increased firing rates to preferred rewards compared to less preferred rewards (Everitt et al., 1991; Tremblay and Schultz, 1999; Hikosaka and Watanabe, 2000; Baxter and Murray, 2002; Gilbert et al., 2003). Thus, these regions may compute online assessments of the relative value of competing decision options, and may guide behavior by indicating which option is the most valuable or worthy of pursuit. Consistent with this interpretation, patients with damage to orbitofrontal cortex and amygdala have impairments in reward-based decision-making and often continue to prefer risky decision options even when this behavior leads to long-term losses (Bechara et al., 1997, 2000, 2003). Among their impairments may be an inability to compute or utilize computations of relative value to guide their decision-making.
In this study, the difference between the weights of the high- and low-risk decision options is used, rather than the weights themselves, because in this study there were no unique behavioral responses associated with choosing high- vs low-risk decision options, and so specific motor representations of the high- and low-risk decision options could not be formed. Thus, what is represented by the difference vector, and what the brain activations may reflect, is not representations of decision options or value per se but of the difference in value or motivational significance between two competing decision options. Other studies have demonstrated that when specific decisions are linked with specific behaviors (e.g. eye movements or left- vs right-hand button presses), activity in neural structures that represent those behaviors is influenced by the value of that decision option (Schall, 1995; Gold and Shadlen, 2000; Sugrue et al., 2004; Samejima et al., 2005; Cohen and Ranganath, under review). Thus, value might be encoded in the brain both as strength of action representations and as relative activation of neurons in orbitofrontal cortex and amygdala, among other regions.
The prediction errors generated by the model correlated with activity in the midbrain, dorsal cingulate cortex and prefrontal cortex for the individual differences model, and the ventral striatum, amygdala, dorsal cingulate cortex and prefrontal cortex for the group model. These activations confirm those reported in previous studies (Schultz et al., 1997; Waelti et al., 2001; Daw et al., 2002; Holroyd and Coles, 2002; O’Doherty et al., 2003; Schultz, 2004; Seymour et al., 2004; Rodriguez et al., 2005; Abler et al., 2006). Although precise localization of activation in the midbrain is difficult, this activation appears to be centered in the substantia nigra, the origin of the nigrostriatal dopamine pathway. The location of this activation is also consistent with coordinates reported in previous fMRI studies of reinforcement learning processes (Seymour et al., 2004; O’Doherty et al., 2006) and with direct recordings of single unit activity in monkeys (Schultz, 1998; Bayer and Glimcher, 2005).
In the individual differences model, a deactivation (i.e. inverse correlations with the prediction error) was observed in the caudate nucleus. Although such deactivations have not been previously reported, previous investigations of the neural bases of reward prediction error signals did not test for deactivations, instead using one-tailed statistical tests that would only reveal positive correlations with the prediction error signal (O’Doherty et al., 2003; Seymour et al., 2004). Thus, deactivations would not have been identified even if they were present in the data. However, this finding seems consistent with the presumed role of the caudate as the ‘actor’ in actor–critic models of reinforcement learning (Montague et al., 1996; O’Doherty et al., 2004). Specifically, the ‘critic’ (thought to be the ventral striatum or midbrain) uses prediction errors to associate reinforcements with events or actions that preceded them, and the ‘actor’ uses prediction errors to guide appropriate behavioral responses. Thus, the actor may use an inverse prediction error term to help motivate behavior (i.e. larger negative prediction errors means more motivation to adjust behavior) (Joel et al., 2002; Worgotter and Porr, 2005).
The variance in the relation between the prediction errors and stay/switch strategies following losses suggested that different subjects used or calculated the prediction error differently. This seems counterintuitive because reinforcement learning theory suggests that larger prediction errors signal a greater need to change behavior. In other words, it appears as if in some cases, observed behavior is ‘opposite’ to what the model suggests behavior should be. Given that there is no optimal policy or correct strategy, it is possible that some subjects viewed choosing a nonrewarded decision a second time as a strategy ‘switch’, which would mean they actually were using prediction errors as reinforcement learning suggests, but that their conceptualization of strategy was different from how it was modeled. This could occur, for example, if some subjects thought that when a decision option did not provide a reward in the current trial, it would in the next trial. Regardless, this variance proved to be meaningful because the prediction error–behavioral strategy β-coefficients explained variance in the prediction error–brain activation correlations. Such relationships were observed primarily in the ventral striatum and orbitofrontal cortex, consistent with previous reports that these regions are sensitive to prediction errors (McClure et al., 2003; O’Doherty et al., 2003; Abler et al., 2006; Haruno and Kawato, 2006; Jensen et al., 2006). There were no similar correlations with the group and fixed models, even at more liberal statistical thresholds. This dissociation suggests that models incorporating individual differences provide maximal sensitivity to uncovering further individual differences. Interestingly, the regions that exhibited significant correlations with the individual differences model overlapped considerably with the regions identified in the group analysis, in particular the ventral striatum.
The importance of individual differences naturally leads to the question of their origins. Differences in risk-taking preferences have been linked to a number of neurobiological and psychosocial factors, such as the concentration of dopamine D2 receptors in the limbic system (Noble, 1998, 2003), socioeconomic status (Diala et al., 2004), and personality (Craig, 1979; Zuckerman and Kuhlman, 2000; Petry, 2001). Lee and colleagues (Barraclough et al., 2004; Lee et al., 2005) found that reinforcement learning parameters in monkeys are highly stable over many testing sessions of the same experiment, suggesting that these learning parameters reflect stable individual differences. Whether learning parameters remain stable across settings as well as over time is especially relevant to the present study, because the same individuals might have different learning rates in different tasks, such as tasks in which some strategies provide a greater cumulative reward in the long run. Regardless of their origins and generalizability, however, characterizing how individual differences modulate these processes may prove critical to elucidating the neural mechanisms of reinforcement learning and decision-making.
However, this is not to suggest that choosing learning parameters a priori to be the same for all subjects is incorrect or inappropriate. Indeed, without measuring choice behavior over time, it is impossible to estimate learning parameters empirically in the way they were estimated here. Fixed parameters might be appropriate in passive learning experiments or in simple decision-making situations in which the optimal response is always to maintain rewarded behaviors and avoid punished behaviors. However, in more complex situations in which different individuals evaluate and utilize reinforcements in different ways, models with a priori chosen parameters may not adequately characterize reinforcement learning processes.
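To make the contrast between fixed and individually estimated parameters concrete, the following sketch shows one common way a per-subject learning rate can be recovered from observed choices: score candidate learning rates by the likelihood they assign to the subject's choice sequence, and keep the best one. This is a generic illustration under simplified assumptions (two options, a fixed softmax inverse temperature, a coarse grid search); the study's actual estimation procedure may differ.

```python
import math

def neg_log_likelihood(choices, rewards, alpha, beta=3.0, n_options=2):
    """Negative log-likelihood of a choice sequence under a simple RL model
    with learning rate alpha and softmax inverse temperature beta
    (parameter values here are illustrative)."""
    w = [0.0] * n_options
    nll = 0.0
    for c, r in zip(choices, rewards):
        exps = [math.exp(beta * x) for x in w]
        nll -= math.log(exps[c] / sum(exps))   # penalize improbable choices
        w[c] += alpha * (r - w[c])             # prediction-error update of chosen option
    return nll

def fit_learning_rate(choices, rewards, grid=None):
    """Grid-search the learning rate that best explains one subject's data."""
    grid = grid or [i / 20 for i in range(1, 20)]   # 0.05 .. 0.95
    return min(grid, key=lambda a: neg_log_likelihood(choices, rewards, a))
```

Running this fit separately for each subject yields the kind of individually estimated parameters the text argues for, whereas a fixed-parameter model would impose one alpha on everyone.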
Nearly all reinforcement learning models contain the same basic components: representations of each decision option or stimulus (weights, in this study), and a means of adjusting those representations (typically a prediction error). Many variants of reinforcement learning models exist and could be related to behavioral and neuroimaging data, but differences among models are typically minor and relate more to the experimental paradigm than to the interpretation of model parameters and output (see Sutton and Barto, 1998, for an extensive comparison of the similarities and differences between various reinforcement learning models). The model used in the present study is of course not the only possible model that could be applied to this data set; indeed, one could propose a new model specifically designed to capture behavior in this task. However, the model used here was selected because (1) it is widely used in neuroscience to study reinforcement learning (see Montague and Berns, 2002, for a review) and (2) it has a proposed biological basis and is used to investigate neuroanatomical correlates of reinforcement learning and decision-making (Schultz et al., 1997; Barraclough et al., 2004; Montague et al., 2004). Applications of such models typically involve passive learning (Seymour et al., 2004) or very simple decision-making situations in which one response is optimal and the other suboptimal (O’Doherty et al., 2004). The fact that this simple reinforcement learning model is capable of modeling behavior and brain activity during more complex situations that involve risk and uncertainty with no optimal response is a strength of the reinforcement learning model approach to understanding dynamic changes in brain activity.
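The two basic components named above, option weights and a prediction-error update rule, can be written in a few lines. The sketch below is a generic Rescorla–Wagner-style learner with a softmax choice rule, offered as a minimal illustration of the shared structure of these models rather than a reproduction of the study's specific model; the learning rate `alpha` and inverse temperature `beta` are placeholder values.

```python
import math
import random

def softmax_choice(weights, beta=1.0, rng=random):
    """Pick an option with probability proportional to exp(beta * weight):
    options with stronger weights are more likely to be chosen."""
    exps = [math.exp(beta * w) for w in weights]
    r, cum = rng.random() * sum(exps), 0.0
    for i, e in enumerate(exps):
        cum += e
        if r <= cum:
            return i
    return len(weights) - 1

def update_weight(weights, choice, reward, alpha=0.2):
    """Move the chosen option's weight toward the received reward by a
    fraction (alpha) of the prediction error; return that error."""
    delta = reward - weights[choice]   # prediction error
    weights[choice] += alpha * delta
    return delta
```

On each trial, `softmax_choice` implements the preference for stronger weights, and `update_weight` implements the adjustment of those weights by the prediction error; together they form the simple choose–learn loop common to the model variants discussed here.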
This work was supported by an Extramural Research Grant from the Institute for Research on Pathological Gambling and Related Disorders. The author thanks Charan Ranganath and Chris Moore for their help.
Conflict of Interest
1 It would be ideal to separate the hemodynamic response from the decision and feedback phases of the trial, as prediction errors may be differentially represented during these phases. Unfortunately, the rapid event-related design combined with the sluggishness of the hemodynamic response precludes such a distinction from the present analyses. Thus, each trial was treated as a single event.
2 This relationship could not be explained by the use of a win/stay–lose/switch strategy, because the probability of using this strategy did not correlate with any of the learning parameters (all P > 0.05), nor did it correlate with activation in any of the regions identified in this analysis (all P > 0.5).