Our decisions are guided by information learnt from our environment. This information may come via personal experiences of reward, but also from the behaviour of social partners1, 2. Social learning is widely held to be distinct from other forms of learning in its mechanism and neural implementation; it is often assumed to compete with simpler mechanisms, such as reward-based associative learning, to drive behaviour3. Recently however, neural signals have been observed during social exchange reminiscent of signals seen in associative paradigms4. Here, we demonstrate that social information may be acquired using the same associative processes assumed to underlie reward-based learning. We find that key computational variables for learning in the social and reward domains are processed in a similar fashion, but in parallel neural processing streams. Two neighbouring divisions of the anterior cingulate cortex were central to learning about social and reward-based information, and for determining the extent to which each source of information guides behaviour. When making a decision, however, the information learnt using these parallel streams was combined within ventromedial prefrontal cortex. These findings suggest that human social valuation can be realised via the same associative processes previously established for learning other, simpler, features of the environment.
In order to compare learning strategies for social and reward-based information, we constructed a task in which each outcome revealed information both about likely future outcomes (reward-based information) and about the trust that should be assigned to future advice from a confederate (social information).
24 subjects performed a decision-making task requiring the combination of information from three sources (fig 1, methods and supplementary information): (i) the reward magnitude of each option (generated randomly at each trial); (ii) the likely correct response (blue or green) based on their own experience of rewards on each option; and (iii) the confederate’s advice, and how trustworthy the confederate currently was. When a new outcome was witnessed, subjects could use this single outcome to learn in parallel about the likely correct action, and the trustworthiness of the confederate.
The investigation resembles previous experiments that have compared animate and inanimate conditions in different trials or experiments5,6. Here, however, both sources of information were present on each trial outcome, but the relevance of each was manipulated continuously, allowing determination of both the fMRI signal and the behavioural influence associated with each source of information.
Optimal behaviour in this task requires the subject to track the probability of the correct action and the probability of correct advice independently, and to combine these two probabilities into an overall probability of the correct response (supplementary information). Computational models of reinforcement learning (RL) have had considerable success in predicting how such probabilities are tracked in learning tasks outside the social domain7. The simplest RL models integrate information over trials by maintaining and updating the expected value of each option. When new information is observed, this value is updated by the product of the prediction error and the learning rate7. In our task, there are two dissociable prediction errors: the reward prediction error (actual reward - expected value), for learning about the correct option, and the confederate prediction error (actual - expected fidelity), for learning about the trustworthiness of the confederate. The optimal learning rate depends on the volatility of the underlying information source8-10. In volatile conditions, subjects should give more weight to recent information, using a fast learning rate. In stable conditions, subjects should weigh recent and historical information almost equally, using a slow learning rate. By ensuring that the correct option and the confederate’s advice became volatile at different times, we ensured that the learning rate for these two sources of information varied independently. We used a Bayesian RL8 model (supplementary information) to generate the optimal estimates of prediction error, volatility and outcome probability separately for each source of information (fig 1b,c,d).
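The simple RL update described above (value changes by the learning rate times the prediction error) can be illustrated with a minimal sketch. This is not the Bayesian learner used in the study: the learning rate is fixed at an arbitrary value rather than derived from volatility, outcomes are coded simply as 1 or 0, and the function and variable names are hypothetical.

```python
def delta_update(expected, outcome, learning_rate):
    """Return (new_expected, prediction_error) after observing one outcome."""
    prediction_error = outcome - expected
    return expected + learning_rate * prediction_error, prediction_error

# Two independent estimates, both updated from the same trial outcome.
p_blue = 0.5    # current estimate that blue is the correct option
p_truth = 0.5   # current estimate that the confederate's advice is correct

# Hypothetical trial: blue was correct (1.0) and the advice was wrong (0.0).
p_blue, reward_pe = delta_update(p_blue, 1.0, learning_rate=0.2)
p_truth, confederate_pe = delta_update(p_truth, 0.0, learning_rate=0.2)

print(p_blue, reward_pe, p_truth, confederate_pe)
```

A larger learning rate would move each estimate further towards the most recent outcome, which is the behaviour expected under volatile conditions.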
We first sought to establish whether human behaviour matched predictions from the RL model. We used logistic regression to determine the degree to which subject choices were influenced by the optimally-tracked confederate and outcome probabilities, and by the difference in reward magnitudes between options. Parameter estimates for all three information sources were significantly greater than zero, and there was no significant difference in the degree to which subjects used reward and social information to determine their behaviour (fig 1e). Furthermore there was no significant effect either of subjects blindly following confederate advice without learning its value, or of subjects assuming that the confederate would behave in the same way as the previous trial (fig 1e). Hence subjects were able to integrate the fidelity of the confederate over many trials in an RL-like fashion.
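As an illustration of this kind of behavioural analysis, the sketch below fits a logistic regression to simulated choices with three predictors standing in for experience-based evidence, advice-based evidence and the difference in reward magnitude. The data, the generative weights and the fitting routine (plain gradient ascent, no intercept) are all assumptions made for the example; the regression actually used is described in the supplementary information.

```python
import math
import random

def fit_logistic(X, y, step=0.5, n_iter=500):
    """Fit logistic-regression weights by batch gradient ascent on the log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                grad[j] += (yi - p) * xj
        w = [wj + step * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Simulated data (hypothetical). Columns: experience-based evidence,
# advice-based evidence, reward-magnitude difference between the options.
rng = random.Random(0)
true_w = [1.5, 1.5, 1.0]  # assumed generative weights, not values from the study
X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(300)]
y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(true_w, xi)))) else 0
     for xi in X]

weights = fit_logistic(X, y)
print(weights)
```

Weights reliably above zero for all three columns would correspond to the pattern reported in figure 1e, where each information source influenced choice.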
We then investigated whether the FMRI signal reflected the model’s estimates of prediction error and volatility, for both social and reward information, when subjects witnessed new outcomes. In the reward domain, neural responses have been identified that encode these key parameters8, 11-16. Dopamine neurons in the ventral tegmental area (VTA) code reward prediction errors12, 13, 17. Similar signals are reported in the dopaminoceptive striatum11, 18 and even in the VTA itself, when specialized strategies are used in human fMRI studies19. FMRI correlates of the learning-rate in the reward domain have been reported in anterior cingulate sulcus (ACCs). If humans can learn from social information in a similar fashion, it should be possible to detect signals that co-vary with the same computational parameters, but in the social domain.
We observed BOLD correlates of the confederate prediction error in dorsomedial prefrontal cortex (DMPFC) in the vicinity of the paracingulate sulcus, right middle temporal gyrus (MTG), and in the right superior temporal sulcus at the temporoparietal junction (STS/TPJ) (figure 2a). Equivalent signals were present in the left hemisphere at the same threshold, but did not pass the cluster extent criterion; similar effects were also found bilaterally in the cerebellum (supplementary information). Notably, these regions showed a pattern of activation similar to known dopaminergic activity in reward learning13, but for social information. Activity correlated with the probability of a confederate lie after the subject decision but before the outcome was revealed (a prediction signal). When the subjects observed the trial outcome, activity correlated negatively with this same probability, but positively with the event of a confederate lie (Figure 2b). This signal reflects both components of a prediction error signal for social information: The outcome (lie or truth) minus the expectation (Figure 2b). These signals cannot be influenced by reward prediction errors as the two types of prediction error were decorrelated in the task design. The presence of this prediction error signal in the brain is a prerequisite for any theory of an RL-like strategy for social valuation.
We performed a similar analysis for prediction errors on reward information (reward minus expected reward). We found a significant effect of reward prediction error in the ventral striatum (figure 2c), the ventromedial prefrontal cortex, and anterior cingulate sulcus (see supplementary information). As in the social domain, we observed significant effects of all three elements of the reward prediction error (Figure 2d) (see supplementary information for discussion).
As previously demonstrated8, the volatility of action-outcome associations predicted BOLD signal in a circumscribed region of ACCs (figure 3a). This effect varied across people such that those whose behaviour relied more on their own experiences (supplementary information) showed a greater volatility related signal in this region (figure 3b). The volatility of confederate advice correlated with BOLD signal in a circumscribed region in the adjacent ACC gyrus (ACCg) (figure 3a). Subjects whose behaviour relied more on this advice showed greater signal change in this region (figure 3c). Notably, this double dissociation [reflected in a three way interaction between area (ACCs versus ACCg), volatility type (social versus outcome) and degree of reliance on social (F1,20=7.145, p=0.015) or experiential information (F1,20=5.379, p=0.031)] can be understood by reference to a dissociation in macaque monkeys. Selective lesions to ACCs but not ACCg impair reward-guided decision-making20. In the social domain, male macaques will forego food to acquire information about other individuals21, 22. Selective lesions to ACCg but not ACCs abolish this effect23. We found that BOLD signals in these two regions reflect the respective values of the same outcome for learning about the two different sources of information.
Learning about reward probability from vicarious and personal experiences recruits distinct neural systems, but subjects combine information across both sources when making decisions (figure 1e). A ventromedial portion of the prefrontal cortex (VMPFC) has been shown to code such an expected value signal for the chosen action24, 25 during decision-making.
We computed two probabilities of reward on the subject’s chosen option: one based only on experience and one based only on confederate advice. BOLD signal in the VMPFC was significantly correlated with both probabilities (figures 4a and S4). However, there was subject variability in whether the VMPFC signal better reflected the reward probability based on outcome history or on social information. The extent to which the VMPFC data reflected each source of information (at the time of the decision) was predicted by the ACCs/ACCg response to outcome/social volatility (at the time when the outcomes were witnessed) (figure 4b,c).
Here, we have shown that the weighting assigned to social information is subject to learning and continual update via associative mechanisms. We use techniques that predict behaviour when learning from personal experiences to show that similar mechanisms explain behaviour in a social context. Furthermore, we demonstrate fundamental similarities between the neural encoding of key parameters for reward-based and social learning. Despite employing similar mechanisms, distinct anatomical structures code learning parameters in the two domains. However, information from both is combined in ventromedial prefrontal cortex when making a decision.
By comparing the two sources of information, we find that social prediction error signals similar to those reported in dopamine neurons for reward-based learning are coded in the MTG, STS/TPJ and DMPFC. BOLD signal fluctuations in these regions are often seen in social tasks26, 27, and in tasks which involve the attribution of motive to stimuli28. Such activations have been thought critical in studies of theory of mind28. That these regions should code quantitative prediction and prediction error signals about a confederate lends more weight to the argument that social evaluation mechanisms are able to rely on simple associative processes.
A second crucial parameter in reinforcement learning models is the learning rate, reflecting the value of each new piece of information. In the context of reward-based learning, this parameter predicts BOLD signal fluctuations in ACC sulcus at the crucial time for learning8 - a finding that is replicated here. We further demonstrate that the exact same computational parameter, in the context of social learning, predicts BOLD fluctuations in the neighbouring ACC gyrus. This functional dissociation is mirrored by differences in the regions’ anatomical connectivity. In the macaque monkey, connections with motor regions lie predominantly in ACCs29, giving access to information about the monkey’s own actions. Connections with visceral and social regions, including the STS, lie predominantly in ACCg29, giving access to information about other agents. Nevertheless, that it is the same computational parameter that is represented in ACCs and ACCg, suggests that parallel streams of learning occur within ACC for social and non-social information.
It has been suggested that VMPFC activity might represent a common currency in which the value of different types of items might be encoded25, 30. Here we show that the same portion of the VMPFC represents the expected value of a decision based on the combination of information from social and experiential sources. However, the extent to which the VMPFC signal reflects each source of information during a decision is predicted by the extent to which the ACCs and ACCg modulate their activity at the point when information is learnt. If, as is suggested, the VMPFC response codes the expected value of a decision, then the ACCs response to each new outcome predicts the extent that this outcome will determine future valuation of an action; the ACCg response predicts the extent to which this outcome will determine future valuation of an individual.
Subjects performed a decision-making task whilst undergoing FMRI, repeatedly choosing between blue and green rectangles, each of which had a different reward magnitude available on each trial. The chance of the correct colour being blue or green depended on the recent outcome history. Prior to the experiment, subjects were introduced to a confederate. At each trial, the confederate would choose between supplying the subject with the correct or incorrect option, unaware of the number of points available. The subject’s goal was to maximise the number of points gained during the experiment. In contrast, the confederate’s goal was to ensure that the eventual score would lie within one of two pre-defined ranges, known to the confederate but not the subject. The confederate might therefore reasonably give consistently helpful or unhelpful advice, but this advice might change as the game progressed (supplementary information). During the experiment, the confederate was replaced by a computer that gave correct advice on a prescribed set of trials. Subjects knew that the trial outcomes were determined by an inanimate computer program, but believed that the social advice came from an animate agent’s decision.
We would like to acknowledge funding from the UK MRC (TEJB, MFSR), the Wellcome Trust (LTH) and the UK EPSRC (MWW). Thanks to Steven Knight for helping with data acquisition, and Kate Watkins for help with figure preparation.
Detailed analysis of the task, the learning model, the behavioural analysis, the data acquisition and pre-processing, and several further results and discussion can be found in the supplementary information. Here, we describe aspects of the FMRI modelling that may be relevant to the interpretation of our results. Further technical details can also be found in the supplementary information.
We performed two FMRI GLM analyses using FMRIB’s Software Library (FSL31). The first looked for learning-related activity (figures 2, 3 and S3), the second for decision-related activity (figures 4 and S4). In each case a general linear model was fit in pre-whitened data space (to account for autocorrelation in the FMRI residuals)32. Regressors were convolved and filtered according to FSL defaults (see supplement).
For the learning-related analysis, the following regressors (plus their temporal derivatives) were included in the model: 4 regressors defining the different times during the task (see figure 1 and supplement): CUE, SUGGEST, INTERVAL, MONITOR; 4 regressors defining key learning parameters when the outcomes are presented (see supplement): [MONITOR x REWARD HISTORY VOLATILITY], [MONITOR x CONFEDERATE VOLATILITY], [MONITOR x REWARD PREDICTION ERROR], [MONITOR x CONFEDERATE PREDICTION ERROR].
For the decision-related analysis, the following regressors (plus their temporal derivatives) were included in the model: 4 regressors defining the different times during the task (see figure 1 and supplement): CUE, SUGGEST, INTERVAL, MONITOR; 7 regressors defining key decision parameters at the times when they were available during the decision (see supplement): [CUE x EXPERIENCE-BASED PROBABILITY], [SUGGEST x EXPERIENCE-BASED PROBABILITY], [SUGGEST x CONFEDERATE-BASED PROBABILITY], [CUE x CHOSEN REWARD MAGNITUDE], [SUGGEST x CHOSEN REWARD MAGNITUDE], [CUE x UNCHOSEN REWARD MAGNITUDE], [SUGGEST x UNCHOSEN REWARD MAGNITUDE]. Note that probabilities were log-transformed such that their linear combination in the GLM would approximate the optimal combination for behaviour (see supplement). Figure 4a was generated using the mean ([1 1 1]) contrast of all probability-related regressors.
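One way to see why a log transform makes a linear model suitable here is sketched below. It assumes the two sources are independent and combined by a normalised product, which is a standard Bayesian form but is our assumption, not a formula taken from the supplement. Function names and example probabilities are hypothetical.

```python
import math

def combine(p_experience, p_advice):
    """Probability that an option is correct given two independent sources (assumed form)."""
    agree = p_experience * p_advice
    disagree = (1.0 - p_experience) * (1.0 - p_advice)
    return agree / (agree + disagree)

def log_evidence(p_experience, p_advice):
    """Log of the unnormalised product, which is a linear sum of log-probabilities."""
    return math.log(p_experience) + math.log(p_advice)

p_combined = combine(0.7, 0.8)
print(p_combined, log_evidence(0.7, 0.8))
```

Under this assumption the product of the two probabilities becomes a sum after the log transform, so a GLM with one regressor per log-probability captures the combination up to the normalising term.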
FMRI group analyses were carried out using a GLM with 3 regressors: A group mean, the weight for reward history information based on each subject’s behaviour (see supplement), the weight for confederate information based on each subject’s behaviour (see supplement).
The following processing steps are illustrated schematically and described in more detail in the supplement (figure S2). Individual subject data were taken from ROIs defined by the group clusters. Data from each trial were upsampled and re-aligned to points in the trial corresponding to the onset of the 4 trial stages. Data were Z-normalised across trials at each time point in the trial. We then performed 2 general linear models across trials for both reward and confederate prediction errors. This allowed us (i) to test at which points in the trial the data correlated with the prediction of reward, or the prediction of confederate fidelity, and (ii) to test at which points after the outcome the data correlated with the trial outcome, or actual confederate fidelity. A prediction error signal should comprise 3 parts: (i) a positive correlation with the prediction after the decision; (ii) a positive correlation with the trial outcome at the time of this outcome; (iii) a negative correlation with the prediction at the time of the outcome (as a prediction error is defined as the outcome minus the prediction).
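The logic of this across-trial analysis can be sketched on simulated data. The example below is a simplification under stated assumptions: it uses only two time points per trial (one after the decision, one at the outcome), a single regression with both prediction and outcome as regressors, and a synthetic ROI signal constructed to behave like a prediction error. None of the names or numbers come from the study.

```python
import random
import statistics

def zscore(values):
    """Z-normalise a list of values (here: one time point across trials)."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def two_regressor_fit(y, x1, x2):
    """Least-squares betas for y ~ x1 + x2 (all inputs assumed mean-centred)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(1)
n_trials = 200
prediction = [rng.uniform(0.1, 0.9) for _ in range(n_trials)]      # e.g. predicted lie probability
outcome = [1.0 if rng.random() < p else 0.0 for p in prediction]  # e.g. lie (1) or truth (0)

# Synthetic ROI signal per trial: [after decision, at outcome].
roi = [[p + rng.gauss(0, 0.2), (o - p) + rng.gauss(0, 0.2)]
       for p, o in zip(prediction, outcome)]

x_pred, x_out = zscore(prediction), zscore(outcome)
betas = []
for t in range(2):
    y = zscore([trial[t] for trial in roi])  # Z-normalise across trials at this time point
    betas.append(two_regressor_fit(y, x_pred, x_out))

(decision_pred, decision_out), (outcome_pred, outcome_out) = betas
print(decision_pred, outcome_pred, outcome_out)
```

In this simulation the three signatures listed above appear as a positive prediction beta after the decision, a positive outcome beta at the outcome, and a negative prediction beta at the outcome.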
We witnessed all 3 parts of the confederate prediction error as deflections in BOLD correlations at the relevant times. However, due to the nature of the haemodynamic response, it is difficult to test significance from just these deflections. We therefore fit a haemodynamic model to these correlation profiles in each subject (see supplement). The key test was whether the timecourse of correlations with the prediction could be accounted for by a positive haemodynamic impulse at the time of the decision and a negative haemodynamic impulse at the time of the outcome; and whether the timecourse of correlations with the outcome could be accounted for by a positive haemodynamic impulse at the time of the outcome. By fitting the haemodynamic model we were able to obtain a parameter estimate for each of these three haemodynamic impulses in each subject, and to perform random-effects t-tests to assess the statistical significance of each.