Contingency theories of goal-directed action propose that experienced disjunctions between an action and its specific consequences, as well as conjunctions between these events, contribute to encoding the action-outcome association. Although considerable behavioral research in rats and humans has provided evidence for this proposal, relatively little is known about the neural processes that contribute to the two components of the contingency calculation. Specifically, while recent findings suggest that the influence of action-outcome conjunctions on goal-directed learning is mediated by a circuit involving ventromedial prefrontal, medial orbitofrontal cortex and dorsomedial striatum, the neural processes that mediate the influence of experienced disjunctions between these events are unknown. Here we show differential responses to probabilities of conjunctive and disjunctive reward deliveries in the ventromedial prefrontal cortex, the dorsomedial striatum, and the inferior frontal gyrus. Importantly, activity in the inferior parietal lobule and the left middle frontal gyrus varied with a formal integration of the two reward probabilities, ΔP, as did response rates and explicit judgments of the causal efficacy of the action.
The capacity for goal-directed action depends critically on our ability to detect and represent the causal relationship between actions and their consequences. Evidence suggests that, while such judgments are biased by conjunctions, or pairings, of an action with its specific consequences, they are also highly sensitive to disjunctions; behavioral studies have found that judgments regarding the causal status of actions vary with the likelihood of the outcome occurring non-contingently (i.e., in the absence of the action and unsignaled; Shanks & Dickinson, 1991). However, there has been little research investigating the neural bases of the influence of non-contingent outcomes on the encoding of the action-outcome relationship.
Instrumental contingency theory formalizes the integration of response-contingent and non-contingent rewards by representing the strength of the action-reward relationship as the difference between two conditional probabilities: the probability of gaining a target reward (r) given that a specific action (a) is performed and the probability of gaining the reward in the absence of that action (~a) [i.e., ΔP=P(r|a) – P(r|~a), see Hammond, 1980]. Hence, according to this view, when the two probabilities are equal, the net action-reward contingency, and so the causal status of the action, is nil regardless of the number of experienced action-reward conjunctions. The sensitivity of goal-directed actions to the instrumental contingency has now been convincingly demonstrated in both humans (Shanks & Dickinson, 1991; Chatlosh et al., 1985) and rats (Hammond, 1980; Balleine & Dickinson, 1998): numerous studies have found evidence of a selective decrease in the performance of an action as the contingency is degraded by increasing P(r|~a) while keeping P(r|a) constant (e.g., Balleine & Dickinson, 1998) and, in humans, explicit causal judgments vary with the instrumental contingency across variations in both conditional probabilities (e.g., Chatlosh et al., 1985).
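As a minimal illustration of the ΔP rule described above, the following Python sketch computes the contingency from the two conditional probabilities. The function name `delta_p` and the range check are illustrative choices and are not taken from the study.

```python
def delta_p(p_r_given_a: float, p_r_given_not_a: float) -> float:
    """Instrumental contingency: P(r|a) - P(r|~a)."""
    for p in (p_r_given_a, p_r_given_not_a):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_r_given_a - p_r_given_not_a


# Equal probabilities give a nil contingency, however many
# action-reward conjunctions are experienced.
print(delta_p(0.5, 0.5))
# A reward that is more likely without the action gives a negative contingency.
print(delta_p(0.0, 0.25))
```

Contingency degradation, in these terms, corresponds to raising the second argument while holding the first constant.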
At a neural level, studies in rats suggest that the influence of contingency degradation is mediated by a circuit involving the prelimbic prefrontal cortex (mPFC) and the dorsomedial striatum (DMS) (Balleine & Dickinson, 1998; Corbit & Balleine, 2003; Yin et al., 2005). Consistent with these results, in a human imaging study, Tanaka, Balleine & O’Doherty (2008) found that activity in the medial prefrontal cortex (mPFC), the medial orbitofrontal cortex (mOFC) and the DMS increased with an increase in the contingency between pressing a button and receiving monetary reward. Activity in mPFC also scaled linearly with explicit causal judgments, which, in turn, were significantly correlated with the instrumental contingency. However, Tanaka et al. only assessed changes in P(r|a) and did not manipulate P(r|~a), which remained constant (at zero) across conditions. Thus, it is unknown how activity in these areas relates to the representation of non-contingent reward probabilities and their integration with response-contingent ones.
The goal of the current study was, therefore, to assess the neural basis of contingency detection in humans sampling across a broad contingency space in which we systematically varied both conditional probabilities, P(r|a) and P(r|~a), across blocks of training. Together with changes in neural activation we assessed the effects of these manipulations behaviorally, both on changes in performance and in explicit causal judgment.
Nineteen healthy right-handed volunteers (25 ± 4 years old, 8 females) participated in the study. The volunteers were pre-assessed to exclude those with a previous history of neurological or psychiatric illness. All subjects gave informed consent, and the study was approved by the Institutional Review Board of the California Institute of Technology.
At the beginning of the experiment, subjects were informed that they would be given the opportunity to earn a 25 cent reward by pressing a key but that, while they were free to press the key as often as they liked, each press would cost them 1 cent. They were further instructed that the relationship between pressing the key and receiving the 25-cent reward would vary across blocks and that, in some blocks, the 25-cent reward might be delivered in the absence of a response. To ensure some familiarity with the task, subjects completed a single 80 sec training block, in which P(r|a) equaled 0.18 and P(r|~a) equaled 0 (i.e., ΔP=0.18), before they entered the scanner.
Once in the scanner, each subject was presented with six different contingency conditions; the values of the two conditional probabilities for each of these conditions are listed in Table 1, together with the resulting ΔP (rows 1-3). Due to methodological constraints imposed by the fMRI method, each block of training was divided into three, 30 sec, “Respond” intervals during which responding was unconstrained, interleaved with three, 20 sec, “Rest” intervals, during which subjects were instructed to not respond (see Fig. 1). To ensure sufficient sampling of the response in each condition, and to avoid any carry-over of response suppression from low contingency blocks, the first “Respond” interval in each block, henceforth the baseline interval, always had the same, relatively high, contingency as that employed during pre-scanning practice (i.e., ΔP=0.18). The subsequent two “Respond” intervals within a block, henceforth the experimental intervals, shared one of the condition-specific contingencies shown in Table 1. For each subject, the entire set of six contingency blocks was presented in each of three sessions, separated by a 5 min break during which the subject remained in the scanner, with the order of the blocks randomized within sessions. Our task included a cost for responding, in order to encourage subjects to regulate their performance according to the instrumental contingency.
Each time the participant pressed the key or received a reward, a yellow fractal appeared. For non-rewarded responses, the duration of this fractal was 250 ms., whereas, whenever a reward was delivered, the yellow fractal remained on the screen for 1 sec., together with a depiction of a quarter, and the text “Reward, You Win!” (see Figure 1). While the yellow fractal, and associated information, was displayed no other events were generated; consequently, these brief periods imposed a constraint on how closely a non-contingent reward could occur to a response, in addition to the time-bin used to implement P(r|~a), discussed below. Finally, throughout each response interval, a running total of cents accumulated within the block was displayed in the top right corner of the screen.
The default time-bin for generating non-contingent rewards was 1000 ms; however, this time-bin was modified, for each subject, based on the average rate of responding. Specifically, for each subject, and in each session, the time-bin in each block, except the first, was equal to the average inter-response interval in the previous block, as long as this bin was not shorter than 500 ms. or longer than 2000 ms., in which case the default of 1000 ms. was used. Although, in some cases, this generated experienced reward probabilities that deviated substantially from programmed ones, it ensured that even subjects who responded at very high rates received non-contingent rewards. In addition to our primary behavioral measure of response rate, judgments of the causal relationship between pressing the key and receiving the 25-cent reward were collected at the end of each block, on a scale ranging from 0 (pressing the key never caused the reward to occur) to 100 (pressing the key always caused the reward to occur).
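The time-bin rule described above can be sketched as follows. This is a hedged illustration only: the function and constant names are hypothetical, and how inter-response intervals were measured (for example, whether intervals spanning "Rest" periods were excluded) is not specified in the text, so the sketch simply averages the gaps between successive press times it is given.

```python
DEFAULT_BIN_MS = 1000.0
MIN_BIN_MS = 500.0
MAX_BIN_MS = 2000.0


def next_time_bin_ms(prev_block_press_times_ms):
    """Return the non-contingent reward time-bin (ms) for the next block.

    prev_block_press_times_ms: press timestamps (ms) from the previous
    block, or None for the first block of a session.
    """
    if prev_block_press_times_ms is None or len(prev_block_press_times_ms) < 2:
        # No inter-response interval is available. Falling back to the
        # default here is an assumption for the fewer-than-two-presses case.
        return DEFAULT_BIN_MS
    times = sorted(prev_block_press_times_ms)
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_iri = sum(intervals) / len(intervals)
    # Outside the allowed range the default is used; the value is not clamped.
    if mean_iri < MIN_BIN_MS or mean_iri > MAX_BIN_MS:
        return DEFAULT_BIN_MS
    return mean_iri
```

Note that an out-of-range average reverts to 1000 ms rather than being clipped to 500 or 2000 ms, which follows the wording of the rule.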
A 3 Tesla scanner (MAGNETOM Trio; Siemens) was used to acquire structural T1-weighted images and T2*-weighted echoplanar images (repetition time = 2.65 s; echo time = 30 ms; flip angle = 90°; 45 transverse slices; matrix = 64 × 64; field of view = 192 mm; thickness = 3 mm; slice gap = 0 mm) with BOLD contrast. To recover signal loss from dropout in the medial orbitofrontal cortex (mOFC) (O’Doherty et al., 2002), each horizontal section was acquired at 30° to the anterior commissure–posterior commissure axis.
The conditional probabilities acted as parameters for a software probability generator and, consequently, the actual values of P(r|a) and P(r|~a) sometimes deviated from those listed in Table 1. To equate temporal assumptions across subjects and sessions, we specified a sampling period of 1 sec, coding each of the relevant events (i.e., responses, response-contingent rewards and non-contingent rewards) as either present or absent in each such period. We then computed ΔP based on the resulting event frequencies, collapsing across the two experimental intervals making up each block. Response rates associated with these objective contingency values were computed as presses per second, correcting for the time consumed by reward-deliveries (i.e., 1 sec per reward). Finally, the six blocks were ranked according to ΔP values, and response rates and causal ratings for the ranked blocks were entered into a Contingency × Session repeated measures analysis of variance. For a more fine-grained analysis we also computed contingencies and response rates for each 10 s period, across blocks and sessions, and assessed the correlation between these variables for each subject.
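One way to compute an objective ΔP from events coded as present or absent in 1 sec sampling periods is sketched below. The exact counting rule is not given in the text, so the sketch makes explicit assumptions: P(r|a) is the fraction of response periods that also contain a response-contingent reward, and P(r|~a) is the fraction of no-response periods that contain a non-contingent reward. The function name and the use of NaN for undefined probabilities are likewise illustrative.

```python
import math


def objective_delta_p(responded, contingent_reward, noncontingent_reward):
    """Return (P(r|a), P(r|~a), delta P) from per-period event codes.

    Each argument is a sequence with one truthy/falsy entry per 1 s period.
    """
    n = len(responded)
    if not (len(contingent_reward) == len(noncontingent_reward) == n):
        raise ValueError("all event sequences must have the same length")
    n_a = sum(1 for r in responded if r)
    n_not_a = n - n_a
    # Assumed counting rule: contingent rewards are tallied in response
    # periods, non-contingent rewards in periods without a response.
    hits_a = sum(1 for r, c in zip(responded, contingent_reward) if r and c)
    hits_not_a = sum(
        1 for r, f in zip(responded, noncontingent_reward) if not r and f
    )
    # A probability is undefined (NaN) when its conditioning event never occurs.
    p_r_a = hits_a / n_a if n_a else math.nan
    p_r_not_a = hits_not_a / n_not_a if n_not_a else math.nan
    return p_r_a, p_r_not_a, p_r_a - p_r_not_a


# Ten 1 s periods: 4 with a response (2 rewarded), 6 without (1 free reward).
responded = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
contingent = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
noncontingent = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
p_a, p_not_a, dp = objective_delta_p(responded, contingent, noncontingent)
```

In this toy example P(r|a) is 2/4 and P(r|~a) is 1/6, so ΔP is their difference, 1/3. The same function could be applied to 10 s windows for the fine-grained analysis, at the cost of noisier estimates.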
Image processing and statistical analyses were performed using SPM5 (http://www.fil.ion.ucl.ac.uk/spm). The first four volumes of images were discarded to avoid T1 equilibrium effects. All remaining volumes were corrected for differences in slice acquisition, realigned to the first volume, spatially normalized to the Montreal Neurological Institute (MNI) echoplanar imaging template, and spatially smoothed with a Gaussian kernel (8 mm, full width at half-maximum). We used a high-pass filter with cutoff = 128 s. Our imaging analysis focused on two basic questions: 1) is there a neural signal that maps onto the instrumental contingency across variations in both conjunctive and disjunctive reward probabilities and 2) are the two components of the contingency correlated with distinct patterns of neural activity? Following Tanaka et al., to assess changes in neural activity over time as a function of local fluctuations in the relevant variables, we constructed two sets of subject-specific fMRI design matrices, each with an onset regressor modeling a blood oxygen level-dependent (BOLD) response over 10 s periods. In one model the two conditional probabilities, computed for each period, were entered as parametric modulators, and in the other ΔP was entered as a single modulator. In both models, response rates and reward deliveries associated with each 10 s period were convolved with a canonical hemodynamic response function and entered, without orthogonalization, as regressors of no interest together with six additional regressors accounting for the residual effects of head motion. All regressors of interest were convolved with a canonical hemodynamic response function. Group-level random-effects statistics were generated by entering contrasts of parameter estimates for the different modulators into a between-subjects analysis.
Small volume corrections (SVC) were performed on three a priori regions of interest using a 10 mm sphere; the center coordinates were obtained by averaging across several studies (see Table 2). All of these areas have been identified in highly relevant studies assessing goal-directed instrumental action-selection: 1) Medial orbitofrontal cortex (x,y,z)=(3,33,-19), 2) Medial prefrontal cortex (x,y,z)=(7,51,-5) (Tanaka et al., 2008; Glaesher et al., 2009; Valentin et al., 2007; O’Doherty et al., 2003; Hampton et al., 2006) and 3) Right (x,y,z)=(13,11,11) and left (x,y,z)=(-9,7,4) caudate nucleus (CN) (Tanaka et al., 2008; O’Doherty et al., 2003; Tricomi et al., 2004). All other areas were reported at p < 0.05, using cluster size thresholding (CST) to adjust for multiple comparisons (Forman et al., 1995). AlphaSim, a Monte Carlo simulation program in AFNI, was used to determine cluster size and significance. With an individual voxel probability threshold of p=0.001, the simulation indicated that a minimum cluster size of 134 MNI-transformed voxels resulted in an overall significance of p < 0.05.
In order to separate effects due to the processing of conditional probabilities from those reflecting encoding of ΔP, we conducted an exclusion analysis, masking the contrasts of conditional probabilities with the ΔP contrast; specifically with a positive contrast for P(r|a) and a negative contrast for P(r|~a). Because this analysis involves accepting the null hypothesis that neural activity does not correlate with ΔP, we used a very liberal threshold of 0.1 for these masking contrasts.
To eliminate non-independence bias for plots of parameter estimates, a leave-one-subject-out (LOSO; Esterman et al., 2009) approach was used, in which 19 GLMs were run with one subject left out in each, and with each GLM defining the voxel cluster for the left out subject. The relevant local maxima were then used to extract beta weights for a range of modulator values for each subject and session, and these were averaged to plot overall effect sizes. Extreme modulator values for which more than 70% of data points were missing (≤-0.5 and >0.5 for the contingency modulator and >0.5 for the two conditional probabilities) were excluded from the plots.
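The logic of the leave-one-subject-out procedure can be sketched in a few lines. This is a simplified stand-in rather than the analysis actually run: it assumes one beta value per voxel per subject, and it uses the mean across the remaining subjects in place of a group-level GLM statistic to locate the peak. The function name is hypothetical.

```python
def loso_peak_values(betas):
    """Extract each subject's beta at a peak defined without that subject.

    betas: list of per-subject lists, each holding one value per voxel.
    Returns one extracted value per subject.
    """
    n_subjects = len(betas)
    if n_subjects < 2:
        raise ValueError("need at least two subjects")
    n_voxels = len(betas[0])
    extracted = []
    for left_out in range(n_subjects):
        others = [betas[s] for s in range(n_subjects) if s != left_out]
        # Stand-in for the group statistic: mean over the remaining subjects.
        group_mean = [
            sum(subject[v] for subject in others) / len(others)
            for v in range(n_voxels)
        ]
        # The peak voxel is chosen without looking at the left-out subject.
        peak = max(range(n_voxels), key=lambda v: group_mean[v])
        extracted.append(betas[left_out][peak])
    return extracted


# Three subjects, three voxels (toy values).
toy_betas = [[1, 5, 2], [2, 4, 9], [3, 7, 1]]
values = loso_peak_values(toy_betas)
```

Because each subject's data never contribute to selecting the voxel from which that subject's value is read, the extracted values are not inflated by the selection step.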
The mean causal ratings, and mean objective conditional probabilities (based on actually experienced event frequencies) are shown for each programmed condition in Table 1. Note that, for conditions across which P(r|~a) varies while P(r|a) remains constant and high, mean ratings also remain high and relatively unchanged, suggesting a bias towards P(r|a). However, these mean causal ratings more likely reflect individual differences in the objective variable values. Specifically, objective values of P(r|a) and P(r|~a) in individual blocks differed substantially from the mean objective values, computed across subjects and sessions, as well as from the programmed values. Furthermore, while large deviations in the objective values sometimes resulted in negative contingencies, the rating scale did not allow participants to indicate a preventive causal relationship, thus biasing judgments in a positive direction. Consistent with this interpretation, when mean causal ratings were computed solely based on blocks in which both objective conditional probabilities were close to the programmed values (within a 0.05 deviation), mean ratings equaled 46.2, 32.5 and 18.8 respectively for the conditions listed in columns 4 to 6 of Table 1, suggesting a much stronger influence of P(r|~a). All of our statistical analyses, and all subsequently reported descriptive statistics, are based on objective, rather than programmed, variable values.
As can be seen in the top panel of Figure 2, mean response rates clearly decrease with a decrease in mean objective contingency, F(1,18)=38.8, p<0.001. Simple contrasts revealed that all differences between adjacent contingency levels were significant, p’s<0.05, except for that between the 2nd and 3rd levels (for which p=0.07). The mean response rate and mean objective contingency for the baseline intervals were both high (1.4 and 0.22 respectively) relative to those in the majority of experimental conditions. Comparable results were observed with respect to the explicit causal judgments, shown in the bottom panel of Figure 2, which also decreased linearly as a function of objective contingency, F(1,18)=47.5, p<0.001, with all differences between adjacent contingency levels reaching significance, p’s<0.05, except for that between the 1st and 2nd levels (p=0.20). There was no significant effect of session, nor any significant interactions, for either response rates or causal judgments (all F’s < 1.0). The mean difference in response rate, across subjects and sessions, between the two experimental intervals within each block was 0.32 with a standard deviation of 0.36. Notably, such variations in response rate were likely due to the fact that the experienced contingency also varied across these two intervals; indeed, differences in response rates across the intervals within a block were highly correlated, across subjects and sessions, with concomitant differences in contingency, p<0.001. Finally, the correlations between ΔP and response rates, computed across 10 second bins for each subject, were highly significant for the vast majority of subjects; p<0.001 for 16 out of 19 subjects and p<0.05 for 2 of the remaining 3. In summary, our results replicate those of Shanks and Dickinson (1991) and Chatlosh et al. (1985); both response rates and explicit causal judgments showed a systematic decline with a decrease in objective contingency.
We first tested for areas showing changes in activity related to the instrumental contingency. Our results indicated that three distinct areas tracked contingency values: the left middle frontal gyrus and the left superior and inferior parietal lobules. Interestingly, we found that activity in these areas correlated negatively with this parametric modulator (all p’s < 0.05, CST) (see Figure 3a-b), such that their activity was greatest when subjects experienced low (including negative) contingencies, and was weakest when subjects experienced high contingencies. Bilateral activity was seen in all three areas at an uncorrected threshold of 0.005. No voxels survived our statistical threshold for the reverse contrast, testing for areas in which activity correlated positively with contingency.
We next tested for areas showing changes in activity related to P(r|~a) and found significant responses throughout the lateral frontal cortex bilaterally, the medial frontal cortex, the right posterior superior temporal gyrus, the right posterior intraparietal sulcus and the left posterior (Levitt et al., 2002) caudate (all p’s < 0.05, CST). Activity emerged bilaterally in all of these areas, except for that in the posterior caudate nucleus, at an uncorrected threshold of 0.005. Activity also emerged in the medial prefrontal cortex [(x,y,z)=(12, 42, 3)], although this did not quite reach significance (p<0.001, uncorrected). Moreover, only the effects found in the right inferior frontal gyrus (IFG) and the left posterior caudate nucleus (pCN) survived masking with an exclusive, negative, contrast for ΔP, thresholded at 0.1, suggesting that activity in these areas is specific to a representation of P(r|~a) (see Figure 4a). To rule out overall reward-rate as the source of observed neural activity, we also conducted an additional analysis in which we included this variable in the design matrix and found that the significant effects still emerged at the corrected threshold of 0.05. The test for areas showing changes in activity related to P(r|a) revealed significant effects in the right medial prefrontal cortex (mPFC) [p < 0.05, SVC], and the right anterior caudate nucleus (aCN) [p < 0.05, SVC]. An uncorrected threshold of 0.005 did not render these effects bilateral, and only the effects found in the caudate survived masking with an exclusive positive contrast for ΔP, again thresholded at 0.1 (see Figure 4b).
To further assess the effects found for ΔP in the inferior parietal lobule, we anatomically defined this region, in each hemisphere, using WFU PickAtlas (Maldjian et al., 2003). We then used MarsBar (Brett et al., 2002) to perform ROI analyses of a [low-high] contrast, and found significant effects in both the left and right IPL (both p’s < 0.05), confirming our previous finding that activity in this area decreases as the contingency increases. We also performed ROI analyses contrasting high with low values for the two conditional probabilities within our caudate ROIs, as well as within an ROI defined across the vmPFC (consisting of the mOFC and adjacent mPFC regions defined in Table 2). For P(r|~a), these analyses yielded significant effects in the left caudate nucleus (p < 0.05). For P(r|a), we found significant effects in both the left and right caudate, as well as in the vmPFC (all p’s < 0.05).
When trying to determine how effective an action is in producing some reward, it is important to consider two conditional probabilities: the probability that the action is followed by that reward, P(r|a), and the probability that the reward occurs in the absence of that action, P(r|~a). The behavioral influence of both response-contingent and non-contingent rewards on free operant responding has been convincingly demonstrated in both humans and rats, and has been central to claims about the role of causal knowledge in the performance of goal-directed actions (Balleine & Dickinson, 1998). However, while there has been extensive research on the neural processes underlying the influence of response-contingent rewards on action selection, what we know about the neural bases of processing non-contingent rewards, and their integration with response-contingent ones, is limited to the results of a relatively small body of studies in rodents (e.g., Yin et al., 2005). The current study used fMRI to investigate the neural substrates of action-outcome contingency learning in humans, with a focus on identifying areas responsible for integrating information about response-contingent and non-contingent reward probabilities. Consistent with previous results (e.g., Tanaka et al., 2008), we found that neural activity in the ventromedial prefrontal cortex (vmPFC) and the right anterior caudate (aCN) encoded the probability with which an action would be followed by reward. In contrast, information about the probability of non-contingent reward was processed by two separate neural circuits: activity in the inferior frontal gyrus (IFG) and the left posterior caudate (pCN) was found to vary with the probability of receiving reward in the absence of any action.
Finally, activity in the inferior and superior parietal lobules (I/SPL), and in the middle frontal gyrus (MFG), varied with instrumental contingency, a formal integration of the two reward probabilities, as did response rates and subjective causal judgments.
Our finding that neural activity in the right aCN and mPFC varied with the probability of response-contingent reward delivery, P(r|a), is consistent with that of Tanaka et al. (2008) and supports the suggestion that these structures are functionally homologous to the rodent DMS and prelimbic cortex respectively (Balleine & O’Doherty, 2009). Notably, the mPFC effects did not survive masking by the contingency contrast, suggesting that, rather than just encoding P(r|a), this area might contribute to contingency computations; indeed, activity in mPFC was also found for P(r|~a), albeit below our threshold for statistical significance. Additional evidence for a role of mPFC in reward integration comes from studies showing that activity in this area correlates with the average value of distinct stimuli (Wunderlich et al., 2010), with the subjective valuation of delayed rewards (Kable & Glimcher, 2007), and with the relative decision value between monetary and social rewards (Smith et al., 2010).
We also found that activity in the left pCN, but not aCN, varied with the probability of non-contingent rewards, indicating that distinct striatal areas may support estimation of the respective reward probabilities. Importantly, a similar dissociation between anterior and posterior dorsomedial striatum has previously been demonstrated in rodents; Yin et al. (2005) found that inactivation of the posterior DMS abolished sensitivity to contingency degradation (i.e., to the delivery of non-contingent rewards) while inactivation of the anterior DMS had no effect. Likewise, Corbit and Janak (2010) found that the posterior DMS was critical for the acquisition of both response-outcome and stimulus-outcome relationships, while the anterior DMS appeared to only be needed for response-outcome encoding. The current results suggest that a comparable heterogeneity might exist along the anterior-posterior axis of the human caudate nucleus (i.e., across the head and body of the caudate), providing converging evidence for the proposal that brain systems responsible for the modulation of goal-directed actions based on variations in instrumental contingency are highly conserved across species (Balleine & O’Doherty, 2010). Consistent with previous work (e.g., Shanks & Dickinson, 1991), our behavioral results, depicted in Figure 2, show a clear decrease in response rates with a decrease in the difference between the probabilities of response-contingent and non-contingent rewards (i.e., with instrumental contingency). To explore the neural basis of this response modulation we tested for regions correlating with the local contingency, computed over 10 s intervals. We found that activity increased with a decrease in contingency in the IPL and SPL and the MFG.
Although several recent neuroimaging studies have implicated the I/SPL and MFG in the representation of action-reward contingencies (Delgado et al., 2005; Koch et al., 2008; Schlund & Cataldo, 2005; Schlund & Ortu, 2010), they have primarily explored the role of these areas in reward predictability. For example, Delgado et al. (2005) and Koch et al. (2008) found that activity in the IPL increased as the probability of being correct, and thus of reward, given one of two alternative actions decreased from 1.0 to 0.5, across stimulus conditions. They interpreted these results as reflecting a recruitment of areas responsible for controlled cognitive processes due to the decrease in reward predictability. Without additional assumptions, it is difficult to apply the concept of reward predictability to the distinction between response-contingent and non-contingent rewards that is the focus of the current paper. Nonetheless, we note that this account fails to explain our finding that activity in the I/SPL appeared to decrease as P(r|a) increased toward 0.5, while P(r|~a) remained relatively unchanged. In other words, activity in these areas decreased with a decrease in the predictability of response-contingent reward (see last two bars and bottom rows of the effect-size plots in Figure 3b).
In the current task, the decision to respond or not is based on the relative probability of reward in the presence and in the absence of the action. Note that this is akin to choosing between two alternative actions based on their respective reward probabilities; in both cases, an integration of distinct sources of information is required in order for optimal response strategies to develop. In a recent neurophysiology study, Seo et al. (2009) updated the individual values of two alternative actions using reinforcement learning theory (Sutton & Barto, 1998) and modeled the difference between action value functions as a decision-variable in a free-choice task. Recording the activity of neurons in a sub-region of IPL (the lateral intraparietal cortex; LIP) in rhesus monkeys, they found that a substantial percentage of these neurons changed their activity according to the difference between the two action value functions. Interestingly, using a similar task but recording from the monkey striatum, Samejima et al. (2005) found that a greater number of striatal neurons were selective to reward-expectancies associated with one but not the other action than were tuned to the difference between action values. To our knowledge, the current results provide the first simultaneous demonstration of this anatomical distribution of distinct and integrated value functions across striatal and posterior parietal regions respectively. Further studies are needed to clarify the exact role of the implicated parieto-striatal circuit in goal-directed action selection, and how it relates to the fronto-striatal network that has been the focus of rodent lesion studies of instrumental contingency learning.
Although contingency computations are considered central to goal-directed learning, a couple of additional factors known to strongly influence instrumental performance likely contributed to the current findings. For example, it is possible that the immediacy of response-contingent reward delivery (i.e., the strong action-reward contiguity) employed here played a role in the estimation of P(r|a) (e.g., Shanks & Dickinson, 1991), and consequently in the observed correlation between this variable and activity in the anterior caudate. It is also important to note that instrumental contingency is closely related to the utility of performing an action; recall that, in the current study, whereas non-contingent rewards were free, response-contingent rewards (of equal magnitude) were associated with a small monetary cost and, presumably, also with an effort-based cost. The IPL has been previously implicated in the integration of reward and risk (Ernst et al., 2004) and in response-selection based on reward-maximization (Bush et al., 2002); it is possible, therefore, that the currently reported effects in this area reflect the incorporation of non-contingent rewards into a cost-benefit analysis, rather than the computation of contingency per se. Further research is needed to determine how neural correlates of instrumental contingency learning relate to the encoding of action-reward contiguity and to the estimation of utility.
This work was supported by Grant 56446 from the National Institute of Mental Health (NIMH) to B.W.B. and a grant from the NIMH (RO3MH075763) to J.O.D. We thank S. B. Ostlund for valuable comments on the manuscript and K. Wunderlich for technical support.