We measured pupil diameter in thirty human subjects while they performed an isoluminant version of a predictive–inference task^{2}. Below we describe task performance, summarize a nearly optimal model that captures key features of performance, demonstrate that certain aspects of pupil diameter encode key variables in the model that can be used to predict performance, and finally show that a task–independent manipulation of arousal and pupil diameter can lead to predictable changes in task performance.

Behavior

The predictive–inference task required subjects to minimize errors in predicting the next number (outcome) in a series. The outcomes were picked from a Gaussian distribution with a mean that changed at random intervals (change points) and a standard deviation (set to either 5 or 10) that was stable over each block of 200 trials (). After each prediction was recorded, the new outcome was shown using an isoluminant display for 2 s, during which time the subject maintained fixation and pupil diameter was measured (). After this interval, the outcome disappeared and the previous prediction reappeared, to be updated for the subsequent trial. Payment scaled inversely with the subject’s mean absolute error during the session^{2}.
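As a rough illustration of the generative process described above, the following sketch draws one block of outcomes. The hazard rate of 0.1, the range from which new means are drawn, and the function name are illustrative assumptions, not values taken from the task description.

```python
import numpy as np

def simulate_block(n_trials=200, noise_sd=5.0, hazard=0.1,
                   mean_range=(50.0, 250.0), seed=0):
    """Draw outcomes from a Gaussian whose mean jumps at random change points.

    `hazard` and `mean_range` are assumed values used only for illustration.
    """
    rng = np.random.default_rng(seed)
    means = np.empty(n_trials)
    change_points = np.zeros(n_trials, dtype=bool)
    mean = rng.uniform(*mean_range)
    for t in range(n_trials):
        # With probability `hazard`, resample the mean (a change point).
        if t > 0 and rng.random() < hazard:
            mean = rng.uniform(*mean_range)
            change_points[t] = True
        means[t] = mean
    # The noise standard deviation is fixed within the block.
    outcomes = rng.normal(means, noise_sd)
    return outcomes, means, change_points
```

Calling `simulate_block(noise_sd=10.0)` would give the high-noise condition under the same assumptions.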

We quantified the extent to which each new outcome influenced the subsequent prediction as the learning rate in a simple delta–rule model (Eq. 3)^{2}. The learning rate was equal to the magnitude of change in the prediction expressed as a fraction of the error made on the previous prediction. Thus, a learning rate of one indicated abandonment of the previous prediction in favor of the most recent outcome. A learning rate of zero indicated maintenance of the previous prediction despite a non–zero prediction error.
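The delta rule and the per-trial learning rate described above can be sketched as follows. The function names are hypothetical, and returning NaN for zero-error trials (where the ratio is undefined) is an assumption of this sketch.

```python
import numpy as np

def delta_rule_update(prediction, outcome, learning_rate):
    """New prediction = old prediction + learning_rate * prediction error."""
    return prediction + learning_rate * (outcome - prediction)

def empirical_learning_rates(predictions, outcomes):
    """Learning rate on trial t: change in prediction divided by the error.

    predictions[t] is the prediction made before outcomes[t] was shown.
    Trials with zero prediction error yield NaN (ratio undefined).
    """
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    errors = outcomes[:-1] - predictions[:-1]
    updates = predictions[1:] - predictions[:-1]
    rates = np.full(errors.shape, np.nan)
    nonzero = errors != 0
    rates[nonzero] = updates[nonzero] / errors[nonzero]
    return rates
```

For example, a prediction of 100 followed by an outcome of 120 and a new prediction of 110 corresponds to a learning rate of 0.5.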

Subjects tended to use variable learning rates that spanned the entire allowed range, from zero to one. Within this range, learning rates tended to be higher for larger errors, scaled by the noise of the generative distribution (). Learning rates also tended to be highest on the trial after a change point and then decay for several trials thereafter (). These basic trends were similar across subjects, although individual subjects used dramatically different distributions of learning rates ().

Reduced Bayesian model

The learning rates used by subjects were consistent with both a full and a simplified version of the optimal (Bayesian) model^{2,17–19}. One advantage of the reduced Bayesian model is that it updates beliefs according to a delta rule in which the learning rate depends on only two parameters computed per trial: change–point probability and relative uncertainty ().

Change–point probability approximates the posterior probability that the mean of the generative distribution changed since the previous trial, given all previous data. If the mean did change, then previous outcomes should be unrelated to future ones and not contribute to an updated prediction. Accordingly, the model uses learning rates that scale linearly towards one (thus discarding historical information) as change–point probability approaches one (). Change–point probability is computed by comparing the probability of each new outcome given either the current predictive distribution or the occurrence of a change point (Eq. 5). Its value increases monotonically as a function of the absolute difference between predicted and actual outcome, scaled according to the standard deviation of the generative distribution (Eq. 6, ).
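One way to sketch this comparison is shown below. It assumes that an outcome following a change point is uniformly distributed over a fixed range, that the predictive distribution is Gaussian, and that the hazard rate is 0.1; these choices, the range of 300, and the function name are assumptions of the sketch and are not taken from Eqs. 5 and 6.

```python
import math

def change_point_probability(outcome, prediction, predictive_sd,
                             hazard=0.1, outcome_range=300.0):
    """Posterior probability of a change point given the new outcome.

    Assumes a uniform outcome likelihood after a change point and a
    Gaussian predictive distribution otherwise (illustrative assumptions).
    """
    # Prior-weighted likelihood of the outcome if a change point occurred.
    p_change = hazard * (1.0 / outcome_range)
    # Prior-weighted likelihood under the current predictive distribution.
    z = (outcome - prediction) / predictive_sd
    gauss = math.exp(-0.5 * z * z) / (predictive_sd * math.sqrt(2.0 * math.pi))
    p_stay = (1.0 - hazard) * gauss
    return p_change / (p_change + p_stay)
```

Under these assumptions the returned value grows with the absolute prediction error and, for a fixed error, is larger when the predictive standard deviation is smaller.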

Relative uncertainty is a function of total uncertainty, which in our task arises from two sources. The first source, noise, reflects the unreliability with which a single sample can be predicted from a distribution with a known mean. The second source reflects the unreliability of the current estimate of the mean, which decreases as more data are observed from a distribution. Relative uncertainty is the magnitude of this second form of uncertainty as a fraction of total uncertainty, analogous to the gain in a Kalman filter. Relative uncertainty determines the learning rate when change–point probability is zero and sets the *y*–intercept of the relationship between change–point probability and learning rate otherwise (). The effects of relative uncertainty on model learning rates are greatest on the trials following a change point, when its value peaks at 0.5 and then decays over several trials (Eq. 7; ).
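The two properties stated above (the learning rate equals relative uncertainty when change–point probability is zero, and rises linearly to one as change–point probability approaches one) are satisfied by the simple combination sketched below. The function names are hypothetical, and the variance-ratio form of relative uncertainty is a direct reading of the verbal definition rather than a transcription of Eq. 7.

```python
def relative_uncertainty(mean_estimate_variance, noise_sd):
    """Uncertainty about the mean as a fraction of total uncertainty."""
    total_variance = mean_estimate_variance + noise_sd ** 2
    return mean_estimate_variance / total_variance

def model_learning_rate(change_point_prob, rel_uncertainty):
    """Linear in change-point probability, with y-intercept equal to
    relative uncertainty and a value of one when a change is certain."""
    return rel_uncertainty + (1.0 - rel_uncertainty) * change_point_prob
```

With a single observation after a change point, the variance of the mean estimate equals the noise variance, so `relative_uncertainty(25.0, 5.0)` gives 0.5 in this sketch.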

Like the human subjects, the model tended to compute learning rates that were highest just following a change point in the mean of the generative distribution and then decayed for several trials independently of noise. When applied to the exact same outcome sequences as the subjects, the model also tended to produce similar learning rates ().

We related change–point probability and relative uncertainty computed in the model to the mean pupil diameter (“pupil average”) and change in pupil diameter (“pupil change”) measured during the 2–s outcome–viewing period ( inset), using two linear regression models. The first, simpler model had four parameters: change–point probability and relative uncertainty computed from the reduced Bayesian model, the standard deviation of the generative distribution, and a binary variable describing whether or not the prediction error was exactly zero. The second model included all of these parameters, as well as several potential confounding factors such as eye position and velocity (see Methods). The models are complementary: the first avoids potential interactions between large numbers of parameters and thus has coefficients that are more readily interpretable, whereas the second avoids omitting the many factors that in principle could affect our pupil measurements. Both models captured a significant amount of variability in the pupil data (for pupil average/pupil change data, an *F*–test rejected the null model relative to the small model for 27/15 of the 30 subjects, and a nested *F*–test rejected the small model relative to the large model for 29/19 of the 30 subjects, *p*<0.05).
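A nested model comparison of this kind can be sketched with ordinary least squares as below. The function names are hypothetical and the sketch is not the analysis code used here; it returns only the *F* statistic and its degrees of freedom, from which a *p* value would be obtained using the *F* distribution.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients and RSS."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return beta, float(resid @ resid)

def nested_f_statistic(X_small, X_large, y):
    """F statistic for the large model against the nested small model.

    X_small's columns must be a subset of X_large's columns.
    """
    n = len(y)
    _, rss_small = fit_ols(X_small, y)
    _, rss_large = fit_ols(X_large, y)
    k_small = X_small.shape[1] + 1  # +1 for the intercept
    k_large = X_large.shape[1] + 1
    df1, df2 = k_large - k_small, n - k_large
    f_stat = ((rss_small - rss_large) / df1) / (rss_large / df2)
    return f_stat, df1, df2
```

In the setting above, `X_small` would hold the four predictors of the simpler model for one subject and `X_large` would add the confound regressors.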

Below we first report the most prominent effects from these regression analyses, which were similar for the two models and include roughly monotonic relationships between pupil change and change–point probability and between pupil average and relative uncertainty. We later show that these relationships were in fact slightly more complicated and included a dependence on baseline pupil diameter that helps us to interpret the results in terms of known properties of the arousal system.

Pupil change reflected change–point probability

The change in pupil diameter during the outcome–viewing period, like change–point probability in our model, tended to increase as a function of error magnitude, scaled as a function of noise (; compare to ). Accordingly, when computed by the model using the same sequence of outcomes experienced by each subject, change–point probability tended to be positively predictive of z–scored pupil change ( ordinate). The complement was also true: change–point probability varied systematically as a function of pupil change for data pooled across the population (). In contrast, there was no consistent relationship between change–point probability and pupil average ( abscissa).

One notable exception to the positive relationship between pupil change and error magnitude occurred for trials in which the error was exactly zero, which corresponded to relatively large pupil changes (left–most data in ). Accordingly, a binary variable added to the linear model that described whether or not the subject correctly predicted the outcome was related to pupil change (the mean value of the regression coefficient was 0.180 *z*_{PC} for the four–parameter regression model and 0.156 *z*_{PC} for the larger model; *p*<0.05 for *H*_{0}: mean=0 for each model) but not pupil average (mean regression coefficient=−0.076 and −0.092 *z*_{PA} for the smaller and larger regression models, respectively, *p*>0.05). Thus, pupil change reflected not only change–point probability, but also whether or not the subject correctly predicted the observed outcome.

Average pupil diameter reflected belief uncertainty

The average pupil diameter during the outcome–viewing period, like relative uncertainty in our model, tended to peak on the trial after a change point and then diminish in magnitude as more relevant information reinforced the existing belief (; compare to and ). Accordingly, when computed by the model using the same sequence of outcomes experienced by each subject, relative uncertainty tended to be positively predictive of pupil average ( abscissa). This result did not simply reflect differences in motor output following change points (e.g., longer button presses to choose a learning rate near one), because similar results were obtained in a control experiment in which subject predictions were reset using a learning rate of 0.5 on each trial, thus requiring the same motor act to choose a learning rate of either zero or one (mean regression coefficient=0.30 and 0.35 z_{PA}/RU for the smaller and larger regression models, respectively, *p*<0.05). The complement was also true: relative uncertainty varied systematically as a function of pupil average for data pooled across the population (). In contrast, there was no consistent relationship between relative uncertainty and pupil change ( ordinate).

Overall uncertainty in our task depends on not only relative uncertainty but also noise, which we manipulated by varying the standard deviation of the generative distribution in blocks (STD=5 or 10). Consistent with our model, in which noise is only used to compute change–point probability (Eqs. 5 and 6), these manipulations of noise were reflected in pupil change but only insofar as pupil change represented change–point probability (). These manipulations of noise did not have any other systematic effects on either pupil change or pupil average (*p*>0.1 for *H*_{0}: a mean value of zero for the regression coefficient describing the influence of noise on the given pupil measurement for both regression models). Thus, for this task pupil average did not appear to reflect overall uncertainty about a future outcome but rather a specific form of uncertainty that arises after change points and signals the need for rapid learning.

Pupil metrics reflected individual learning differences

As noted above (), there was a great deal of variability in the average learning rates used by individual subjects. These individual differences are thought to reflect biases that govern the extent to which subjects tend to interpret the cause of prediction errors in terms of either noise or change points^{2}. One advantage of our reduced model is that it can simulate these individual differences in terms of the subjective hazard rate, which is the expected rate at which change points will occur. Accordingly, fitting the model to behavioral data from individual subjects with subjective hazard rate as a single free parameter yielded fit values that varied systematically with average learning rates (*r*=0.93, *H*_{0}: *r*=0, *p*<0.001; ).

These individual differences in the inferred (fit) subjective hazard rates corresponded to individual differences in both the temporal dynamics and magnitude of outcome–locked pupil responses. We quantified the temporal dynamics using an index that related the pupil response on a given trial to a mean–subtracted version of the template shown in . This template describes the strength of the across–subject, linear relationship between pupil diameter and hazard rate in a sliding time window. This relationship was strongest soon after outcome onset, thus likely reflecting prior expectations about the newly arriving outcome. There was a positive relationship between the mean value of this index and fit hazard rate for individual subjects (*r*=0.51, *p*<0.01). In addition, there was a positive relationship between pupil average and fit hazard rate for individual subjects (*r*=0.40, *p*<0.05).

Based on these relationships, we constructed a linear regression model using the temporal–dynamics index and pupil average to explain individual differences in task performance. The model yielded strong, pupil–based predictions of per–subject values of both fit hazard rate (*r*=0.59, *p*<0.001) and average learning rate (*r*=0.59, *p*<0.001; ). Thus, individual differences in average learning rate, which can be described computationally as differing expectations about the rate of change–points, could be predicted from the temporal dynamics and average magnitude of pupil diameter measured during outcome viewing.

Pupil metrics predicted trial–by–trial learning rates

The relationships between pupil metrics and parameters of the reduced Bayesian model suggest that measurements of pupil diameter during the outcome–viewing period can be used to predict the subsequent learning rate. For example, we found positive relationships between pupil change and change–point probability () and between pupil average and relative uncertainty (). Thus, observing relatively high values of either pupil metric on a given trial should indicate that the subject will use a larger–than–average learning rate when adjusting beliefs according to the outcome observed on that trial. We tested this idea directly, as follows.

First, we examined the relationship between pupil change, pupil average, and learning rate for individual subjects. We used a regression model to describe learning rate (z–scored per subject) in terms of pupil change and pupil average. On average, this linear regression computed per subject yielded a positive coefficient for pupil change (mean=0.108 *z*_{LR}/*z*_{PC}, *p*<0.05 for *H*_{0}: mean=0) and a smaller, not statistically significant, positive coefficient for pupil average (mean=0.085 *z*_{LR}/*z*_{PA}, *p*=0.13; ).

Second, we used a simple, weighted sum of pupil change and pupil average to assess their combined predictive power across subjects. Using weights equal to the mean value of the per–subject regression coefficients from the previous analysis (), the weighted sum was weakly predictive of learning rate across all subjects (*r*=0.067, *p*<0.001). However, this analysis did not take into account a systematic, negative dependence of the sum of these per–subject coefficients (which is related to the overall ability of the weighted sum to account for learning rate) on subjective hazard rate predicted by pupil dynamics (). Subjects with low pupil–predicted hazard rates had pupil responses that were good predictors of learning rate. Subjects with increasingly high pupil–predicted hazard rates had pupil responses that were increasingly less predictive, and in some cases negatively predictive, of learning rate.

Third, we used a more complicated linear model that also included across–subject differences in pupil dynamics that related to subjective hazard rates, which markedly improved our overall ability to use pupil metrics to predict learning rates. This model had three terms: 1) the sum of pupil change and pupil average computed per trial, weighted according to average regression coefficients in ; 2) the pupil–predicted hazard rate, computed per subject (see ); and 3) the multiplicative interaction between these two variables. Using this model, pupil measurements could effectively predict learning rates for all data from all subjects (*r*=0.38, *p*<0.001). These predictions accounted for variations in learning rates both across () and within () subjects.
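The structure of this three-term model can be sketched as follows. The weights are the mean per-subject coefficients reported above (0.108 and 0.085); the function names, the added intercept column, and the use of ordinary least squares are assumptions of the sketch.

```python
import numpy as np

W_CHANGE, W_AVERAGE = 0.108, 0.085  # mean per-subject regression coefficients

def interaction_design(z_pupil_change, z_pupil_average, pupil_hazard):
    """Columns: intercept, weighted pupil sum (per trial), pupil-predicted
    hazard rate (per subject, repeated for each trial), and their product."""
    weighted = (W_CHANGE * np.asarray(z_pupil_change, dtype=float)
                + W_AVERAGE * np.asarray(z_pupil_average, dtype=float))
    hazard = np.asarray(pupil_hazard, dtype=float)
    return np.column_stack(
        [np.ones_like(weighted), weighted, hazard, weighted * hazard])

def fit_and_correlate(design, z_learning_rate):
    """Fit by least squares; return coefficients and the correlation
    between fitted and observed learning rates."""
    y = np.asarray(z_learning_rate, dtype=float)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    predicted = design @ beta
    return beta, float(np.corrcoef(predicted, y)[0, 1])
```

A negative coefficient on the interaction column would correspond to pupil metrics becoming less predictive of learning rate as the pupil-predicted hazard rate increases.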

Task–independent pupil manipulation altered behavior

To examine whether the correlations between pupil measures and learning behavior might reflect an underlying causal process, we used an arousal manipulation that affected pupil diameter and measured its effects on learning behavior. In particular, we occasionally and without warning switched the auditory cue that preceded fixation. Subjects were told that these auditory–cue switches were unrelated to the task and they therefore should ignore the specific sounds. Nevertheless, this manipulation led to increases in both pupil average and pupil change on trials in which the fixation cue was switched (; *t*–test for *H*_{0}: mean effect size=0, *p*<0.001 for both pupil average and pupil change). Thus, we caused consistent changes in the pupil measures that were correlated with the computational variables needed to solve the task.

This manipulation caused systematic changes in task performance that depended on baseline pupil diameter (). For trials with relatively small baseline diameter (i.e., less than its per–subject median value), individual subjects tended to use larger learning rates on auditory–switch trials than otherwise ( abscissa; mean across subjects=0.113, *t*–test for *H*_{0}: mean=0, *p*<0.01). For trials with relatively large baseline diameter, subjects used slightly smaller learning rates on auditory–switch trials than otherwise, although this trend was not statistically significant ( ordinate; mean=−0.037, *p*=0.35). The average difference in the size of these effects from small– versus large–diameter trials was >0, implying that the effects of this manipulation depended on baseline pupil diameter ( diagonal; paired *t*–test, *p*<0.001). These effects did not result from systematic differences in task conditions for switch versus non–switch trials, because the same three analyses yielded no effects when applied to learning rates computed by our reduced Bayesian model (*p*>0.5).
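The per-subject median split described above can be sketched as follows. The function and argument names are hypothetical, and the sketch assumes that each half of the split contains both switch and non-switch trials.

```python
import numpy as np

def switch_effect_by_baseline(learning_rate, is_switch, baseline):
    """Mean learning-rate difference (switch minus non-switch trials),
    computed separately for trials with baseline pupil diameter below the
    subject's median and for the remaining trials."""
    lr = np.asarray(learning_rate, dtype=float)
    sw = np.asarray(is_switch, dtype=bool)
    base = np.asarray(baseline, dtype=float)
    small = base < np.median(base)

    def effect(mask):
        return lr[mask & sw].mean() - lr[mask & ~sw].mean()

    return effect(small), effect(~small)
```

The two returned values for each subject correspond to the small-baseline and large-baseline effects; their difference is the quantity compared across subjects with a paired test.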

This dependence on baseline pupil diameter is suggestive of the Yerkes–Dodson “inverted U” relationship between arousal and learning. According to that idea, learning is highest for moderate levels of arousal and lowest for either overly high or overly low levels of arousal^{20}. Our subjects appeared to be consistently engaged during task performance, implying that we were probably not sampling overly low or high arousal states. Nevertheless, in a narrower range and assuming a correspondence between arousal state and baseline pupil diameter, we found that the relationships between learning behavior and our arousal manipulation were qualitatively consistent with an “inverted U.” In particular, auditory–switch trials tended to correspond to the largest increases in learning rate when baseline pupil diameter was relatively low (steepest ascent in the “inverted U”) and the largest decreases in learning rate when baseline pupil diameter was relatively high (steepest descent in the “inverted U”; , open circles).

This “inverted U” relationship was also apparent in our previous pupil measurements, in two ways. First, across subjects, those with larger average pupil diameters during outcome viewing tended to use learning rates that were less, or even negatively, predicted by fluctuations in pupil metrics relative to other subjects (). Second, subjects that had lower pupil–predicted hazard rates used learning rates that were positively correlated with pupil metrics when their baseline pupil diameter was low but negatively correlated when their baseline pupil diameter was high (, filled circles). Thus, results from both our pupil-manipulation and pupil-measurement experiments were consistent with an important role for the arousal system in the rational regulation of learning.