Despite the sparseness of the data, the fact that multiple statistical models converge on the same pattern supports the conclusion that language-mediation of oculomotor control can occur within 80–120 ms: examination of the distribution of launch times towards the man (or the girl) as the words ‘man’ or ‘girl’ unfolded shows that more signal-driven saccades were launched in this and subsequent bins than noise-driven saccades. This appears to have been the case not just when the scene was concurrent with the unfolding word (Study 1) but also when the scene had been removed prior to the spoken stimulus (Study 2, although interpretation of these data is hampered by their sparseness). Unlike prior estimates of the time-course with which language can influence oculomotor control, these studies were typical ‘visual world’ studies in which the eyes were free to fixate on any part of the scene prior to the onset of the target word. The speed with which language-mediation was observed is all the more remarkable given the nature of the look-and-listen task employed in both studies: participants were not under instructions to move their eyes as quickly as they could – indeed, they were under no explicit instruction at all. Whether more explicit task demands would change the patterns observed in these studies is an empirical question; most likely the timing with which signal is distinguished from noise would not change, but the number of saccades would.
If it takes around 200 ms to plan and launch a saccade (given the earlier estimates of saccadic latency), how could one possibly observe language-mediated oculomotor control in just 100 ms? One possibility has to do with the fact that in these studies, the critical target onset (the /m/ in ‘man’ or /g/ in ‘girl’) was preceded by a determiner (‘the’) which may convey, on its vowel, coarticulatory information about either manner or place of articulation of the following consonant. The same is true of the Allopenna et al. (1998)
study (coarticulation on ‘the’ in ‘the beaker/beetle’ would have been distinct from ‘the’ in ‘the carriage’, for example), and yet their data showed a 200 ms delay from the onset of the noun
before eye movements towards the beaker or beetle diverged from eye movements towards the baby carriage. Thus, to the extent that the studies reported here and in Allopenna et al. (1998)
had comparable coarticulation, it is unlikely that the fast effects observed here are due to that coarticulation. A subsequent study by Dahan, Magnuson, Tanenhaus, and Hogan (2001)
demonstrated that the eye movement record is
sensitive to coarticulatory information: They compared the eye movement record in response to words such as ‘net’ (where the first consonant and vowel were taken from a different articulation of this same word) and a version of ‘net’ where the first consonant and vowel were taken from an articulation of the word ‘neck’. They found that this difference manifested in the eye movement record an average of 220 ms after the offset of the initial consonant-vowel sequence (i.e. the onset of the final consonant that disambiguated the unfolding sequence as ‘net’). In Studies 1 and 2, the corresponding point would be the offset of the determiner (i.e. the onset of ‘man’ or ‘girl’), in which case the effects of coarticulatory information could not be expected to manifest, given the Dahan et al. (2001)
finding, until around 220 ms after the onset of the target word – in other words, coarticulatory effects would have little impact on the eye movement record within the 100 ms timeframe identified in Studies 1 and 2. Moreover, Dahan et al. (2001)
manipulated coarticulatory information within
the target word – it is an empirical question whether coarticulatory information on a determiner can cause eye movements towards just those objects in the scene whose names might be compatible with that coarticulatory information (but see below for further discussion of what effects coarticulation could have within the timeframe suggested by Studies 1 and 2).
Given that coarticulation is an unlikely explanation for the speed with which signal was distinguished from noise in Studies 1 and 2, a further possibility is that the data are due to express saccades
(e.g. Cavegn & d'Ydewalle, 1996; Fischer & Boch, 1983; Fischer & Ramsperger, 1984
) – visually-guided saccades with short latencies of around 100 ms. However, to our knowledge, such saccades have only been observed when the signal to move the eyes is the onset of a visual target. A more likely possibility is that the effects observed here are not due to the rapid planning and subsequent launching of eye movements, but are instead due to the cancellation
of already-planned saccades towards some object other
than the one that the unfolding language will refer to. Estimates of the time required to cancel a saccade vary between around 100 ms (Kornylo, Dill, Saenz, & Krauzlis, 2003
) and 160 ms (Curtis, Cole, Rao, & D'Esposito, 2005
), although these are averaged estimates, and do not reveal the likely distribution of cancellation times; nonetheless, the time-course of cancellation is compatible with the data from Studies 1 and 2. But even disregarding the time it takes to either plan or cancel a saccade, 100 ms is barely time to know what the unfolding word will become (depending on the initial phonemes, 100 ms may or may not include a part of the vowel following the first phoneme, although as indicated earlier, coarticulatory information conveyed by the preceding determiner – ‘the man’ – may also provide information about the identity of the first phoneme). Given the rapidity of the observed effects, and the minimal phonological information conveyed by those first 100 ms, the following is one possible account of the data.
As in Altmann and Kamide's (2007)
account of why the eyes move in the visual world paradigm, it is assumed that free inspection of the visual scene activates conceptual representations of the objects contained within the scene. A part of this conceptual knowledge is the phonological specification of the object's possible names (see Meyer, Belke, Telling, & Humphreys, 2007
, for an example of how phonological information is activated even during a visual search task for which such information is task-irrelevant). It is further assumed that eye movements towards an object that has already been identified are preceded by shifts in covert attention towards that object (e.g. Henderson, 1992; Henderson, Pollatsek, & Rayner, 1989
), and that these pre-saccadic shifts (which under the premotor account of attention constitute the saccadic plan) re-activate conceptual knowledge about those objects, including
the phonological knowledge about the objects' names (it is conceivable that aspects of such knowledge are available from parafoveal preview alone – in the studies above, the man and the girl were the only animate objects, and were thus particularly salient and highly likely to have been fixated in the earliest moments of the scene's presentation; the current data set does not allow contingent analyses on participants not
having previously fixated the target objects). Thus, when planning a saccade towards, e.g., the girl, phonological knowledge about the targeted object is available in advance of launching the saccade; in effect, we know the name of the thing we're about to launch an eye movement to. If, simultaneously, we hear the /m/ in ‘man’ (preceded, perhaps, by coarticulatory information about the upcoming /m/), we cancel the saccade because of the phonological mismatch between the unfolding word (‘man’) and the name of the currently targeted object (‘girl’). Cancelling eye movements towards the girl when we hear the earliest moments of ‘man’ (or towards the man when we hear the earliest moments of ‘girl’), or alternatively, cancelling eye movements when we can anticipate the upcoming /m/ on the basis of coarticulation on the preceding vowel, would lead to the observed pattern of data: Consider, for example, two hypothetical participants, both of whom have been fixating some location within the scene for exactly 150 ms. Participant A is planning a saccade to the man (and will execute that saccade in, say, another 100 ms); Participant B is planning a saccade to the girl (which, again, would be executed in another 100 ms). 150 ms into their respective fixations, both hear ‘man’ – Participant A continues with the planned saccade, but Participant B cancels their own planned saccade. Thus, 100 ms after the onset of ‘man’, Participant A launches towards the man, but Participant B does not
launch towards the girl – hence one more saccade at 100 ms to the man than to the girl on hearing ‘man’ (and perhaps for another pair of participants, to the girl than to the man on hearing “girl”). This hypothetical example, mirroring the behavioral data, demonstrates language-mediated discrimination between intended and unintended targets at 100 ms; not because more saccades are launched towards the intended target, but because fewer saccades are launched towards the unintended target.
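The asymmetry in this hypothetical example can be made concrete with a small Monte Carlo sketch. The simulation below is purely illustrative – the analysis window, cancellation time, and launch-time distribution are assumed values, not estimates from Studies 1 or 2 – but it shows how cancelling mismatching plans yields more launches towards the named object than towards its competitor within 100 ms of word onset, even though no new saccades are planned within that window.

```python
import random

random.seed(1)

# All parameters are illustrative assumptions, not values from the studies.
N = 10_000            # hypothetical trials, half planning towards each object
WINDOW_MS = 100       # analysis window following the onset of 'man'
CANCEL_TIME_MS = 80   # assumed time needed for a cancellation to take effect

to_target = 0       # saccades launched towards the named object ('man')
to_competitor = 0   # saccades launched towards the other object ('girl')

for _ in range(N):
    planned_goal = random.choice(["man", "girl"])  # saccade already planned
    launch_in_ms = random.uniform(0, 200)          # time left before it fires
    if launch_in_ms > WINDOW_MS:
        continue                       # would launch outside the window anyway
    if planned_goal == "man":
        to_target += 1                 # plan matches the word: saccade proceeds
    elif launch_in_ms < CANCEL_TIME_MS:
        to_competitor += 1             # fires before it can be cancelled
    # otherwise the mismatching plan is cancelled and no saccade is recorded

print(to_target, to_competitor)
```

Under these assumptions, every matching plan due within the window launches, whereas mismatching plans launch only if they fire before the cancellation can take effect – so more saccades reach the named object than its competitor, not because more were planned, but because fewer were allowed to escape.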
This is, of course, conjecture; a number of testable predictions follow, however. For example, increasing the number of phonological competitors that share the same first phoneme as the target should lessen the effects: in the example given for Study 1, eye movements planned towards the motorbike would not be cancelled on hearing the earliest moments of ‘man’ (although across the different items in that study, there were insufficient trials with such competitors to test this prediction).
There is an alternative account of why saccades might be cancelled: Prior inspection of the scene leads to the activation of conceptual representations which prime their subsequent reactivation by the partially unfolded phonetic input; having seen the girl means that the /g/ in ‘girl’ is interpreted as indicating with greater likelihood (than if heard out of this visual context) the word “girl”. In other words, the previously seen girl primes the conceptual representations that are compatible with that initial /g.../ sequence, and the representational ‘boost’ that accrues due to the overlap between the conceptual representations activated by the unfolding speech and the conceptual representations previously activated by the girl constitutes a change in attentional state that accompanies saccadic planning (see the earlier discussion of interpretation, attention, and planning, as well as Altmann & Kamide, 2007
, for discussion of this representational boost). This suggests that the mismatch that causes cancellation of the saccade towards the man could be due not simply to phonological
mismatch, but to conceptual
mismatch also; the /g/ in ‘girl’ activates the conceptual correlates of the word, given that these have been primed by the girl in the scene, and the mismatch between this conceptual representation and the conceptual representation that is concomitant with the planned eye movement towards the man causes that eye movement to be cancelled. The current dataset cannot determine whether phonological or conceptual mismatch (or both) drives the cancellation process, although the greater rapidity in the visual world paradigm with which phonological effects can be observed relative to semantic effects (e.g. Huettig & McQueen, 2007
) suggests that, at the very least, the earliest cancellations are likely to be driven by phonological mismatch.
This same account applies also to the blank screen data (Study 2, although the sparseness of the data rendered that study less conclusive): Hearing the earliest moments of /g/ activates the conceptual representation corresponding to the word ‘girl’ (because of the priming afforded by the previously seen scene, as described above), and the activation of this conceptual representation initiates a plan to move the eyes back towards the location previously occupied by the girl. The planning of this subsequent eye movement necessarily competes with the planned eye movement that is about to be launched elsewhere, resulting in the inhibition, and most likely the cancellation, of the originally planned eye movement. An account based on phonological match/mismatch is also possible if we assume that phonological information, and not just conceptual information, is associated with each location.
One substantive issue remains: How is the initially planned eye movement in fact cancelled? With respect to cancellation of saccades, the visual world paradigm resembles, at times, a double-step
paradigm. Double-step procedures in eye-movement research involve a target appearing in one location, to which an eye movement is prepared, and shortly after, a second target appearing at a new location (depending on the task, the first target may remain onscreen or may be replaced by the second). If the interval between the two steps is short (around 100 ms), participants will often inhibit the planned saccade to the first target and respond with a single saccade toward the second target – accrual of information in one channel blocks, or (in more recent formulations) competes with (and can therefore inhibit), processing in another; e.g. Becker and Jurgens (1979
, see Ray, Schall, & Murthy, 2004
, for an account of double-step saccade sequences compatible with the guided activation theory of cognitive control cited earlier, and Findlay & Walker, 1999
, for an account of saccade generation based on competitive inhibition). In the visual world paradigm, the equivalent situation occurs when an eye movement is planned towards one location (the first step), but the unfolding language suggests another location (the second step); in the cases reported here from Studies 1 and 2, the first step corresponds to the endogenously cued location towards which the eyes are initially programmed, regardless of the language, and the second corresponds to the location exogenously cued by the earliest moments of the language (this is not the only way in which the double-step concept applies to the visual world paradigm – later in the sentence, both steps could be linguistically mediated, or other exogenous or indeed endogenous processes could cause double-step-like behavior). Within this double-step interpretation of the data, cancellation of saccades is a natural behavioral phenomenon (another phenomenon associated with competition between alternative locations to which a saccade might be directed concerns saccadic curvature (e.g. Walker, McSorley, & Haggard, 2006
), although saccadic trajectories were not assessed in the current study).
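The first-step/second-step logic can be caricatured with a simple deterministic rule: an already-planned saccade passes a point of no return some fixed lag before launch, and a second step arriving before that point aborts it. The sketch below is a deliberately simplified illustration, not a model from the double-step literature; the plan time and cancellation lag are assumed values.

```python
def double_step(isi_ms, plan_time_ms=200, cancel_lag_ms=80):
    """Return which saccades occur for a given inter-step interval.

    The first saccade would launch plan_time_ms after its target appears, and
    passes its point of no return cancel_lag_ms before that launch; a second
    step arriving at or before that point aborts the first plan (assumed rule).
    """
    point_of_no_return_ms = plan_time_ms - cancel_lag_ms
    if isi_ms <= point_of_no_return_ms:
        return ["second"]          # single saccade straight to the second target
    return ["first", "second"]     # first saccade escapes; corrective follow-up

print(double_step(isi_ms=100))  # short interval: the first plan is cancelled
print(double_step(isi_ms=180))  # long interval: both saccades are executed
```

In the visual world analogue described above, the ‘second step’ is the location cued by the earliest moments of the unfolding word, and the short-interval case corresponds to the cancelled saccades posited for Studies 1 and 2.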
This raises, however, a secondary, but equally important issue: in the double-step paradigm, inhibition of the initially planned saccade is due to an alternatively cued target that triggers an alternative plan, but in the account of the present data given above, this inhibition is due to mismatch
against information associated with the initially planned saccade. On the face of it, these are two very different phenomena. However, an alternative (competition-based) perspective on the double-step phenomenon is that the accrual of information during the planning of the first saccade, but which pertains to the planning of the second step (cf. parallel programming of saccades; Walker & McSorley, 2006
), inhibits the first saccade precisely because it mismatches the information that constitutes the plan for the first saccade. To return to the data above: The mismatch that arises between /m/ and the phonological information associated with the targeted location to which the initial saccade is planned is accompanied by a match
between the /m/ and the phonological information associated with other locations in the scene (although recall that the match/mismatch could also arise at the conceptual level). The visual world paradigm relies, generally, on the fact that the objects in the scene are already known by the time the language comes along – activating conceptual representations in response to a word such as ‘girl’ causes the eyes to move towards the girl in the scene precisely because the location of that girl is already known (again, see Altmann & Kamide for a mechanistic account of this process). Thus, mismatch
in the visual world paradigm (assuming that the sentence does felicitously apply to the depicted scene) is necessarily accompanied also by match
. Hence the same competitive planning processes that, in the more typical double-step paradigm, result in cancellation or inhibition of the initially planned saccade (and that might also influence saccadic curvature) can operate in the visual world paradigm too.
The account of how it is that signal can be distinguished from noise so early in the language-mediated eye movement record also explains why the earliest data are necessarily sparse – this early mediation, through cancellation of an initially planned saccade, will only occur on those trials in which, fortuitously, the target word unfolds at just the right moment in time relative to that planned saccade (most likely that unfolding must precede the onset of the originally planned saccade – if the unfolding occurs while the eye is executing the saccade (or too late for that saccade to be cancelled), the early influence of the language may manifest differently, as a brief fixation on the target location followed by a corrective saccade whose programming preceded that brief fixation). Thus, cancelled saccades will likely be considerably rarer than executed saccades (although within the visual world paradigm this remains an empirical issue, and will depend on many different factors including, for example, whether or not the language is highly predictive of what will be referred to next; saccadic launch times, for example, are known to be dependent on the number of alternative locations to which a saccade may need to be launched and the informativeness of the signal (Carpenter & Williams, 1995; Reddi et al., 2003
) and thus it is likely that the effects observed here would change as a function of, e.g., scene complexity and the informativeness of the language – cf. the earlier prediction about the number of phonological competitors in the scene).
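The dependence of launch times on the informativeness of the signal has a standard formalization in the LATER framework of Carpenter and Williams (1995): a decision signal rises towards a fixed threshold at a rate drawn afresh on each trial, and more informative signals correspond to faster mean rates of rise. The sketch below is a toy illustration of that idea only; all numeric values are assumptions chosen for the example, not parameters fitted to any data.

```python
import random

random.seed(2)

def mean_latency_ms(mean_rate, sd_rate=0.001, threshold=1.0, n=5000):
    """Mean time for a linearly rising decision signal to reach threshold,
    with the rate of rise (threshold units per ms) varying across trials."""
    total = 0.0
    for _ in range(n):
        rate = max(random.gauss(mean_rate, sd_rate), 1e-4)  # keep rate positive
        total += threshold / rate                       # latency on this trial
    return total / n

uninformative = mean_latency_ms(mean_rate=0.004)  # weak signal: slow accrual
informative = mean_latency_ms(mean_rate=0.008)    # strong signal: fast accrual
print(informative < uninformative)
```

On this account, a more informative linguistic signal (or a scene with fewer candidate locations) shortens mean launch times by raising the mean rate of rise, which is one way the effects reported here could vary with scene complexity and the informativeness of the language.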
If the data from Studies 1 and 2 are indeed due to cancellation of saccades (cf. the double-step paradigm), rather than to saccadic planning de novo
, they have implications for the interpretation of other visual world data: First, the hypothesis that links eye movements to changes in cognitive state due to the unfolding language (Altmann & Kamide, 2007; Tanenhaus, Magnuson, Dahan, & Chambers, 2000
) has to be modified to include cancellation of pre-programmed eye movements, such that rapid effects of mismatch and match (as in the studies by Allopenna et al., 1998
, and others) could manifest in cancelled eye movements. Second, the widely-held assumption that eye movement patterns within the first 200 ms following some part of the linguistic signal could not be due to that signal needs also to be modified: While it may be true (on the basis of psychophysical studies of oculomotor control) that on average
eye movements cannot be planned de novo
and executed within much less than 200 ms, they can be cancelled
well within that 200 ms timeframe, and the fixation/saccadic record will be influenced by those cancelled saccades. The data reported here suggest that, if it is
necessary to make any assumptions at all about when the language can first influence the eye movement record, this timeframe must be halved to 100 ms. Finally, the necessary theoretical confluence of saccadic planning and interpretation of the unfolding language, together with the observed speed with which signal-driven saccades are distinguished from noise-driven saccades, suggests that the mediation of the one process (saccadic planning) by the other (language interpretation) is automatic
, in the sense that it is not under conscious volitional control (although this does not entail that it is task-independent – that remains an empirical issue); in support of this claim, most theories of saccadic control maintain that the modification of existing saccadic plans is not under volitional control (e.g. Findlay & Walker, 1999).
Further research is clearly needed: As indicated above, the claim that these data are due to cancellation of saccades is conjecture, as is the (associated) parallel that has been drawn between certain kinds of saccadic behavior in the visual world paradigm and saccadic behaviors observed in the double-step paradigms that have been used to study saccadic control. Moreover, this was an opportunistic reanalysis of earlier studies that had not been designed to address specifically the issues described here. A considerably larger dataset is required for a more in-depth analysis of what in fact is driving the effects reported here. The fact that we observe language-mediated oculomotor control within around 100 ms in a task which does not explicitly encourage rapid behavioral responses means that further data are required to explore also how this time-course might change as a function of task, and how it might change as a function of the complexity of the scene and the phonological and conceptual characteristics of the objects depicted within the scene. Many other parameters may be relevant also, such as the time to preview the scene before the onset of the critical signal, the clarity of that signal, where the eyes are fixating as that signal unfolds (or whether they are in motion), and so on. For now, it remains the case that across two separate studies with very different visual properties (the scene being concurrent or absent), and across a range of different statistical models (making different assumptions about the underlying data), there is substantial convergence on the time-course with which language can mediate oculomotor control – within 100 ms.