Interactions between spoken language and visual scene
In most visual world experiments, the relevant visual environment consists of a collection of clipart drawings on a computer screen. The visual display corresponds to a small array of objects (e.g. four objects in a grid in a setup developed by
Allopenna, Magnuson, & Tanenhaus, 1998, which has been widely used to study spoken-word recognition), or a static, schematic scene consisting of a collection of objects, including potential agents (e.g.
Altmann & Kamide, 1999, who developed this setup to study sentence-level processing). Shortly after the visual display appears, a spoken word or sentence is presented, which relates to one or more objects in the scene. Of interest is the location and timing of fixations to the objects in the visual display in response to hearing the spoken stimulus—and in particular how this pattern of fixations is affected by particular properties of the linguistic stimulus, which is carefully controlled and varies across experimental conditions.
In order to make inferences about language processing on the basis of the pattern of fixations observed in a visual world experiment, one needs to specify a linking hypothesis that describes the relationship between eye movements and linguistic processing. A standard assumption in the visual-world literature is that a shift in visual spatial attention is typically, though not necessarily, followed by a saccadic eye movement to the attended region. As a consequence, the locus of fixation provides insights into the current interpretation of the speech input. A well-specified linking hypothesis should also provide an answer to the basic question of why, when and how listeners shift their visual attention to visual information that is related to an unfolding linguistic expression. Furthermore, it should specify how eye movements reflect the mutual effects of linguistic processing and the visual processing of the contents of the visual display or scene.
In the sections that follow, we contrast a comprehension-based linking hypothesis with a goal-based linking hypothesis. These linking hypotheses differ primarily in the degree of emphasis on the importance of goal structures and the degree to which they view situated language comprehension as a separable component that can be embedded within more specific task-based goals. We begin by outlining the modal, though often implicit, linking hypothesis that we believe is assumed by most researchers who use the visual world paradigm. According to this view the goal of the participant is to understand the utterance, the scene, and how they relate to one another. We then introduce a more task-based perspective, inspired by the work on vision in natural tasks that we briefly reviewed earlier. We then consider some of the possible consequences of a goal-based linking hypothesis for the interpretation of fixation patterns in visual world experiments.
A comprehension-based linking hypothesis
Numerous studies have now demonstrated rich interactions between spoken language and the visual world. Perhaps the simplest effects occur when a listener fixates an object upon hearing its name (see
Salverda & Altmann, in revision, for a study suggesting that such effects are automatic, i.e., at least partially independent of task). In addition, linguistic material that is relevant for reference resolution but that is not inherently referential, such as verbs, prepositions, and scalar adjectives, has immediate effects on fixations to potential upcoming referents (e.g.,
Altmann & Kamide, 1999;
Chambers, Tanenhaus, Eberhard, Filip, & Carlson, 2002;
Eberhard, Spivey-Knowlton, Sedivy, & Tanenhaus, 1995). Furthermore, fixations to a blank screen as people listen to descriptions of events with spatial descriptions can systematically reflect the construction of an internal model (
Spivey & Geng, 2001) and can reflect expectations for an upcoming noun that refers to an object no longer present (e.g. looking at the location where a cake had just been presented upon hearing the verb "eat" in the sentence "The boy will eat the cake";
Altmann, 2004;
Altmann & Kamide, 2009). The earliest moments of reference resolution and syntactic-ambiguity resolution (as indexed by eye movements) are influenced by the presence of referential alternatives (
Altmann & Kamide, 2007,
2009;
Sedivy, Tanenhaus, Chambers, & Carlson, 1999;
Tanenhaus et al., 1995;
Spivey, Tanenhaus, Eberhard, & Sedivy, 2002;
Trueswell, Sekerina, Hill, & Logrip, 1999;
Novick, Thompson-Schill, & Trueswell, 2008;
Kaiser & Trueswell, 2008).
A comprehension-based linking hypothesis can account for the aforementioned language-vision interactions while preserving some degree of representational autonomy between linguistic and visual representations. Fixations can be modeled by assuming that language and vision are independent or partially independent systems, and that visual-linguistic interactions arise due to areas of representational overlap (cf.
Jackendoff, 2002). The underlying assumption that guides this approach, from the perspective of language processing, is that language comprehension is itself is a well-defined task-independent process that can be decomposed into building representations at traditional levels of linguistic representation, such as recognizing words, building syntactic structures, resolving reference, and building event representations (i.e., representations concerning the roles that entities play in the events or activities denoted by the predicates in the sentence). Most crucially, language comprehension is assumed to be fundamentally different from and independent of other task goals.
If we merge the comprehension-based view of language processing with a bottom-up approach to scene representation, then the appearance of a stimulus display in a visual world study would result in the bottom-up construction of a visual scene representation. This process would proceed independently of any expectations about the language that would subsequently be presented. We will assume that the visual scene is represented as an event-based conceptual representation.
When a spoken sentence is presented, incremental analysis of the spoken input results in the construction of linguistic representations. Because the linguistic input unfolds over time, and because speech is processed incrementally, listeners encounter temporary ambiguities at multiple levels of representation (e.g. lexical ambiguity, referential ambiguity, or syntactic ambiguity). At each levels of representation, listeners attempt to resolve these temporary ambiguities by making provisional commitments to likely alternatives, taking into account “correlated constraints” (cf.
MacDonald, Pearlmutter, & Seidenberg, 1994;
Trueswell, Tanenhaus, & Garnsey, 1994), and generate expectations about upcoming information (e.g.,
Altmann & Kamide, 1999,
2007;
Levy, 2008). The interpretation of the speech input thus changes dynamically, over time, as more information in the speech signal becomes available. At each moment in time, the linguistic representations that are (most) activated given the currently available speech input may interact with static visual representations derived from processing the visual scene, which results in fixations to objects and entities in the scene that reflect the current interpretation of the unfolding utterance.
This view, which has been developed in a series of papers by Altmann and colleagues (
Altmann & Kamide, 2007; also see
Knoeferle & Crocker, 2006;
Mayberry, Knoeferle, & Crocker, 2009 for a related proposal and implemented models) allows for the possibility that some common representational format drives eye movements to aspects of a display in a visual world experiment. According to this perspective, eye movements in visual world studies reflect interactions between linguistic and visual representations that occur at the level of conceptual representations. The linguistic representations that are constructed during the incremental processing of the spoken input include conceptual representations pertaining to the meaning of individual words and linguistic expressions. These conceptual representations include visually-based information, such as the shape of an object (see
Dahan & Tanenhaus, 2005, and
Huettig & Altmann, 2007, for evidence that visually-based information is rapidly available and used to search for the referent of a spoken noun in a visual world study). According to the linking hypothesis put forward by
Altmann and Kamide (2007), featural overlap between conceptual representations activated by the spoken language and conceptual representations activated by the visual world result in a boost in activation of those representations. This boost in activation then results in increased attention to, and therefore, an increased likelihood of fixating, those objects in the visual world whose conceptual representations match the conceptual representations activated by the concurrently presented spoken language. Thus, the overlap in linguistic and visual representations naturally and automatically leads to the identification of the referents of referring expressions (e.g., fixating a piece of candy in a visual display upon hearing "The boy looked at the candy" or "Click on the candy").
The comprehension-based view also accommodates richer interactions between visual and linguistic systems, that is, fixations to relevant objects upon hearing linguistic information that is not inherently referential (e.g., looking at a cake upon hearing "The boy will eat the …"). Relevant information in the representation of the scene can also be used as a probabilistic constraint to resolve ambiguities in the linguistic input. For example, referential context could influence parsing of ambiguous input by providing information that is more consistent with one potential analysis. Likewise, the language might resolve ambiguities in the scene, for example, the role that real or depicted entities might play in an event suggested by the scene, or the type of event represented in the scene. However, interactions between language and vision are claimed to originate from, and be constrained to, areas of representational overlap between visual and linguistic representations. According to the comprehension-based view, then, the resulting interactions should in principle be fully predictable given the precise content of visual representations on the one hand, and linguistic representations on the other hand. Thus, patterns of fixations are not predicted to vary under different task conditions.
The linking hypothesis we have described thus far does not account for some clear effects of task on fixations. For example,
Chambers et al. (2002) showed that the size of a potential container influenced whether or not it was considered as a potential goal for an instruction, such as “Pick up the cube. Now put the cube into the can.” Perhaps more strikingly, though,
Chambers, Magnuson & Tanenhaus (2004) found that non-linguistic task-based affordances affect the earliest moments of syntactic ambiguity resolution. An example display in Experiment 2 in
Chambers et al. (2004) had the following four objects: two whistles, one of which was on a clipboard and had a looped string attached to it; another clipboard; and a box. Participants heard an instruction to manipulate objects in the display (“Put the whistle on the clipboard into the box”). Because the display contained two whistles, when listeners heard “Put the whistle …”, they were expecting a phrase that would disambiguate the intended referent, that is, information that would clarify which whistle the speaker intended to refer to (see
Altmann & Steedman, 1988). As a result, they correctly parsed the immediately following prepositional phrase “on the clipboard” as an NP modifier, and therefore fixated the whistle on the clipboard. But, when participants used a hook to carry out the spoken instruction, they incorrectly parsed “on the clipboard” as the goal even though the potential instrument (i.e., the hook) was not introduced linguistically and most of the instructions thoughout the experiment required the participant to use the hook in a non-instrument role (e.g., “Put the hook in the box”). In this condition, the referential domain was constrained by the action to be performed in service of the experimental task.
Task-based effects like those observed in the Chambers et al. studies are difficult to explain using the form of the comprehension-based linking hypothesis we have just outlined. We could, however, maintain the comprehension-based linking hypothesis and augment it by adding another layer of representation to represent task goals. This approach would necessitate having a principled way of determining when task goals can interact with comprehension. Recent incremental parsers in which probabilistic representations are computed in parallel, and real world knowledge can be used to adjust the probabilities of alternatives when new information arises, are one promising direction. If we adopt this approach then real world knowledge, including task constraints, could come into play at points of potential ambiguity.
A goal-based linking hypothesis
Although the comprehension-based linking hypothesis we sketched above remains viable, the vision in natural tasks literature suggests to us that it might be fruitful to explore a more radical “top-down” oriented linking hypothesis that assigns a more central role to task goals and does not assign a privileged role to comprehension per se. According to this view, different types of representations may be involved in the mapping of spoken language onto a visual scene, depending on the actual task that the participant is performing. Most generally, as in the literature on vision in natural tasks, we believe that a richer understanding of both language processing and how it is coupled with visual search will require using paradigms where the goal structure can be modeled.
In taking preliminary steps toward a goal-based model of the integration of visual and linguistic information in the visual world paradigm, we consider a model developed by
Sprague, Ballard, and Robinson (2007) which implements some of the insights of researchers studying visual search in natural tasks. This model assumes that the basic elements of visually guided behavior are visuomotor routines. These routines are sequences of elemental operations that are involved in the selection of task-specific information from the visual world. In the case of visual search, the elemental operations include shifting the focus of processing and mentally marking specific locations for further processing or for future reference (cf.
Ullman, 1984). A routine comprises a coordinated sequence of these basic operations that is assembled and selected to complete a particular visual task, such as monitoring the rising water level when filling a cup (
Hayhoe, 2000). When confronted with a new task, new task-specific routines can be compiled and stored in memory for future use. These routines can in turn be combined to form a vast array of more complex representations and behaviors.
The
Sprague et al. (2007) model simulates the scheduling of eye movements in a virtual walking environment with three tasks: litter collection, sidewalk navigation, and collision avoidance. The model learns to associate the states of each of these tasks with a discrete set of visuomotor routines, which involve the extraction of only the information from the visual environment that is relevant for the successful execution of a particular motor sequence. Uncertainty about task-relevant properties of the environment (e.g., the exact location of a piece of litter) reduces the probability of successful task completion. Thus, eye movements are scheduled and directed in such a way as to minimize the total cost associated with the set of tasks performed at a given point in time, weighted by the reward associated with each task. To capture the effects of reward on task performance, Sprague et al. incorporated a reinforcement learning component into their model. This learning component captures key findings on incremental changes in the timing of saccades in the acquisition of complex skills (e.g.,
Land & McLeod, 2000;
Sailer, Flanagan, & Johannson, 2005). Crucially, the model predicts that the pattern of fixations and frequency of fixations to different objects changes as a participant learns.
The idea of visuomotor routines as the primitive units of visually guided behavior can be applied straightforwardly to the integration of visual and linguistic information in visual world experiments. Routines relevant within the visual world domain would include some general-purpose visual routines, such as shifting the focus of processing and mentally marking certain locations for future processing (
Ullman, 1984). Others would be more specific to visual-linguistic processing. One important routine would correspond to mapping the perceptually-based representations that become available as a spoken word unfolds onto objects in the visual world with matching affordances, resulting in fixations that are in service of identifying a potential referent of a spoken noun (for discussion see
Dahan & Tanenhaus, 2005;
Salverda & Tanenhaus, 2010).
In task-based visual world studies, in which a participant carries out spoken instructions, additional routines are necessary to enable the participant to execute the motor behaviors that constitute the task (e.g. clicking on or moving an object in a visual display on a computer screen). Thus, the eye movement record in task-based visual world studies usually consists of a succession of different patterns of eye movements, which each consistently reflect the execution of clearly distinguishable task-related routines.
Previous research has shown that goal representations in working memory include only immediately task-relevant components of the visual display (
Patsenko & Altmann, 2010); only local subtasks or routines are predicted to significantly affect patterns of fixations. This prediction explains why potential referents that do not match the affordances of a local goal attract relatively few fixations, as in the
Chambers et al. (2002,
2004) studies. It further predicts that even basic word recognition phenomena, such as looks to a cohort competitor, would be sharply reduced (even when the referent is in foveal vision) when that referent is incompatible with a local goal. The goal-based view also predicts that looks to the referent of an anaphoric expression, such as a pronoun, would be reduced when the referent is highly salient in memory or is not relevant to a local task. Thus, looks to a referent – or the absence of looks to a referent – do not provide direct evidence about when a referential expression has been interpreted.
The mapping from a goal to a set of routines is straightforward in studies in which the participant has an explicit task (e.g.,
Allopenna et al., 1998, and variants of this paradigm). When the goal is less articulated, for instance in “look-and-listen” studies in which participants are not given an explicit task, we can assume that these routines might also be activated and thus compete for control of saccades, much as high saliency areas might attract saccades in viewing a scene without a explicit task. Thus, the goal-based linking hypothesis can accommodate evidence for basic effects of visual-linguistic integration that we might consider to be “automatic” (see
Salverda & Altmann, in revision, for evidence that task-irrelevant named objects can capture attention). However, according to the goal-based view, these automatic effects arise from routines that constitute the bottom level of a hierarchically organized goal structure. The magnitude of these effects should therefore be influenced by the subgoals dominating the routines in this hierarchical structure. In look-and-listen studies, then, we might expect to see a smaller proportion of saccades to potential referents than we would see in goal-based tasks, where the visual routines that control fixations are guided by and reinforced by components of the task. Most importantly, however, the goal-based view predicts more fixations to objects that are task-relevant, and fewer fixations to objects that provide an equally good match to the linguistic input but that are not task-relevant.
Comparisons of linking hypotheses
Although few studies have directly compared goal-based linking hypotheses with comprehension-based hypotheses, there are some suggestive results in the literature that support the goal-based approach. Studies using the visual world paradigm typically use either the task-based or the look-and-listen version of the paradigm. This makes it hard to establish if, and to what degree, eye movements in visual world experiments are strongly influenced by the presence or absence of an explicitly defined task. However, the classic paper that reintroduced the look-and-listen version of the visual world paradigm (
Altmann & Kamide, 1999) actually showed clear effects of experimental task. Participants viewed visual scenes that depicted a person situated amongst several objects. For instance, a scene might show a boy sitting on the floor, surrounded by a ball, a cake, a toy train, and a toy truck. In an initial study, participants performed a sentence verification task while their eye movements were recorded. They heard a spoken sentence accompanying the visual scene, and assessed whether or not the sentence described something that could happen in the scene. To this end, participants pressed a “yes” or “no” key on a button box. On trials where the sentence plausibly described something that could happen in the scene, the target noun in the sentence was preceded by a verb with selectional restrictions that were either compatible with any of the objects in the scene (e.g., “The boy will move the cake”) or compatible with one particular object in the scene (e.g., “The boy will eat the cake”, when the cake was the only edible object in the scene). In both experimental conditions, hearing the target noun was more likely to trigger a fixation to the target object than to any other objects in the display, a result that was expected on the basis of
Cooper’s (1974) results. Importantly, if the verb preceding the noun provided semantic constraints that were consistent with only one object in the display, participants were more likely to initiate a fixation to the target object prior to hearing the target noun. Thus, when listeners process spoken language, they can use verb-based constraints to generate expectations that restrict the domain of subsequent reference to particular entities in the visual world.
Altmann and Kamide next considered the role that the sentence verification task might have played in driving the anticipatory eye movements observed during the presentation of the verb. They hypothesized that “the requirement to make such judgments may have induced processing strategies which do not reflect normal processing” (p. 255). In order to address this concern, the initial study was replicated using the same stimuli. This time, the participants did not receive a task and instructions. Instead, they were only told the sequence of events that constituted a trial. The results of this look-and-listen experiment were qualitatively similar to those of the initial experiment. However, the anticipatory eye movements occurred slightly later during the processing of the verb, i.e., with a delay of about 350 ms compared to when the effect was observed in the initial task-based experiment. Moreover, participants were almost twice as likely to fixate relevant objects in the task-based version of the experiment than in the look-and-listen version of the experiment. This aspect of Altmann and Kamide’s results, which has not received much attention in the visual world literature, is a clear demonstration that the participant’s task can have strong quantitative as well as qualitative effects on eye movements in visual world studies.
Effects of task-relevance on referential domains
The goal-based perspective predicts that the immediate relevance of visual information to goal execution is a key predictor of fixation patterns. A corollary of this claim is that factors influencing the difficulty of executing a task (e.g., the difficulty of grasping an object, the complexity of a sequence of movements or task-induced constraints on attentional or memory resources) are likely to affect the proportion of fixations to objects in a visual world study.
Support for these predictions comes from an experiment reported by
Eberhard et al. (1995). In this experiment, participants viewed a display with playing cards and followed instructions such as “Put the five of hearts that is below the eight of clubs above the three of diamonds”. On trials of interest there were two fives of hearts, one of which was the target card. The displays for three “point of disambiguation” conditions are displayed in . In all conditions, the target card (i.e., one of the fives of hearts) was directly below an eight of clubs. In the so-called early disambiguation condition only the target five was directly below another card. In the mid-disambiguation condition, the other five of hearts was below a card of a different denomination (e.g., a ten of clubs). In the late disambiguation condition, both fives were below cards with eights but the distractor card was below an eight of a different suit (e.g., an eight of spades). Importantly, there were relatively few looks to the eight of clubs in the early disambiguation condition. This finding can be interpreted in terms of the relative difficulty of evaluating information in the context of the task across conditions. Because the presence or absence of a card is an easier visual discrimination than identifying the specific features that distinguish two cards, such as the number or suit, attention was not needed to identify the eight of clubs.
Likewise, the nature of the task can influence the set of potential referents in a display that a listener is likely to look at. Examples of this type of effect of task come from a series of studies examining how listeners circumscribe referential domains, including the
Chambers et al. (2002,
2004) studies that we mentioned earlier.
Brown-Schmidt and Tanenhaus (2008) found clear effects of task-relevance in a collaborative task. Pairs of participants worked together to arrange a set of Duplo™ blocks in a matching pattern. Partners were separated by a curtain and seated in front of a board with stickers and a resource area with blocks. Boards were divided into five distinct sub-areas, with 57 stickers representing the blocks. Stickers were divided between the boards; where one partner had a sticker, the other had an empty spot. Thirty-six blocks were assorted colored squares and rectangles. The participants’ task was to replace each sticker with a matching block and instruct their partner to place a block in the same location to make their boards match. The positions of the stickers were determined by the experimenter, allowing for experimental control over the layout of the board. The shapes of the sub-areas and the initial placement of the stickers were designed to create conditions where the proximity of the blocks and the constraints of the task were likely to influence the strategies adopted by the participants.
Brown-Schmidt and Tanenhaus found that for point-of-disambiguation manipulations only blocks that were both proximal and task-relevant were taken into account by the speaker when generating referring expressions and by the listener when interpreting them. For example, speakers would use instructions such as “put the green block above the red block” in situations where there were two red blocks in a sub-area, but one of the red blocks could not be the referent because it didn’t have an empty space above it. Crucially, there was no evidence that listeners were confused by the presence of two red blocks, and no evidence that the red block with no space above it attracted more fixations than any of the unrelated blocks. Similarly, Brown-Schmidt and Tanenhaus found that a cohort competitor (e.g., a block with a picture of a cloud) did not attract fixations when the speaker use the word “clown” to refer to a nearby block with a picture of a clown, during the interactive task, presumably because the cloud was outside of the relevant local domain. These results suggest that visual search did not take into account task-irrelevant blocks. This finding bears a close resemblance to the types of findings we reviewed from visual search in natural tasks, where salient but task-irrelevant objects do not attract fixations.
Developing goal-based linking hypotheses: future challenges
If the goal-based hypothesis is correct, then accounting for patterns of fixations will require explicitly modeling the participant’s goal structure. Under relatively simple circumstances, e.g., a simple instruction and a simple display depicting a single event, comprehension-based models might approximate the goal structure. However, models that account for a high proportion of fixations are likely to require explicit tasks and explicit models of the goal structure.
We want to emphasize, however, that merely using a task is insufficient to meet the challenges associated with the goal-based approach. The goal-based view assumes that the instruction in a task-based visual world experiment triggers a basic language-vision routine, namely, e.g., mapping a word onto a potential referent for purposes of visually-guided reaching. When nearly all fixations begin on a fixation cross and the instructions are simple, as in
Allopenna et al. (1998), it should in principle be possible to account for most of the variance in the timing and location of fixations with a model that focuses on how the participant’s execution of the task mediates the mapping between spoken word recognition and visually identified referential candidates. Indeed, this was true for the models developed by
Allopenna et al. (1998),
Dahan, Magnuson, Tanenhaus, and Hogan (2001),
Dahan, Magnuson, and Tanenhaus (2001), and
Magnuson, McMurray, Tanenhaus, and Aslin (2003).
Once we move toward tasks involving slightly more complex instructions, however, developing a task model is more challenging and without such a model we lose a clear link between fixations and the unfolding language. Thus, we may encounter a weak form of an obstacle that plagued scientists who attempted to build upon Yarbus’s classic work: Whereas it is easy to demonstrate that task affects fixation patterns, interpreting the processes underlying a pattern of fixations can be extremely difficult (for a valuable discussion, see Vivianni, 1990). For example, in the visual world experiments reported in
Tanenhaus et al. (1995) and
Spivey et al. (2002), participants followed instructions such as “Put the apple on the towel into the box”. The crucial conditions involved three types of displays, each of which included an apple on a towel, another towel, and a box, as illustrated in . The data of most interest were the proportion of fixations to the empty towel. Examination of the proportion of fixations over time (Figures 4–6 in
Spivey et al., 2002) demonstrates tight time-locking of fixations to the initial referent or potential referents (a singleton apple) and the goal, with the proportion of looks reaching asymptotes of over .8. In contrast, participants made many fewer fixations to the “incorrect” or “garden-path” goal, the empty towel. Even in the condition that is interpreted as showing the largest garden-path effect, () the proportion of looks to the empty towel never rises above .2. This same pattern is observed in other studies using similar structures (e.g.,
Chambers et al., 2004;
Trueswell et al., 1999;
Novick et al., 2008).
Both of the objects which received a high proportion of fixations at a particular point in time were involved in visually guided reaching: the object to be grasped and moved (the apple) and the location to which this object was to be moved (the box). It is less clear, however, to what we can attribute the fixations to the incorrect goal. These fixations could be influenced by a number of distinct subtasks engaged by the participant. For example, the stage of execution of the motor plan is likely to influence fixations to the empty towel. For example, participants on some proportion of trials may verify the expected goal with an eye movement before reaching for the apple. On other trials, they may instead have already begun reaching for the apple, and may not be ready to make an eye movement to the goal until after they have grasped the apple. This would decrease the probability of fixating on the incorrect goal. Moreover, the subgoal structures used to accomplish a task may differ across participants, and these differences may give rise to systematic variation in fixation patterns. For example, we have informally observed that participants who have expertise in video games or in a signed language make far fewer fixations prior to initiating an action. These results are consistent with a classic line of research initiated by
Green and Bavelier (2003), demonstrating that experienced “gamers” process information more efficiently in the periphery.
Fixation patterns are also likely to change as the specific allocation of resources to immediately task-relevant information changes throughout an experiment. In most standard experiments in cognitive psychology, and in particular visual world experiments, factors such as learning, potential effects of contingencies, and individual differences in how efficiently information is accessed from a visual display are controlled for by standard practices such as counterbalancing items across conditions and averaging data from individual participants and items by condition, thereby reducing the effects of order of presentation on observed outcomes. There are, however, some potential problems associated with this approach. In particular, we are likely to observe effects of factors that have strong priors, but not important variables that might be subject to rapid perceptual learning. If, however, our goal is to model fixations under a goal-based hypothesis, then we need to consider how performance on a task typically changes as a function of experience and learning.
Learning effects can have a considerable impact on an individual’s fixation patterns over the course of an experiment. For example, studies examining the effects of fine-grained acoustic-phonetic variables, such as voice onset time (VOT), often involve repeating pictures many times throughout an experiment (e.g.,
McMurray, Tanenhaus, & Aslin, 2002;
Clayards, Tanenhaus, Aslin, & Jacobs, 2008). In these studies, we frequently observe that the proportion of non-target-related fixations decreases and the proportion of trials in which a mouse-click is not preceded by a look to the target increases.
Within a task-based framework that takes into account effects of learning, it may be possible to speculate further about effects that may be particular to task-based visual world experiments. In such experiments, participants must perform a specific action or set of actions to progress through the experiment. The execution of the visuomotor routines necessary to complete the experimentally defined task is therefore associated with some reward structure and is actively reinforced throughout the experiment. In contrast, look-and-listen experiments do not require participants to extract visual information from the environment to progress through the experiment. Thus, task-based and look-and-listen visual world experiments differ with respect to their reward structure, in that task-based experiments reinforce the use of visuomotor routines above and beyond the reinforcement associated with understanding the language with respect to the visual display. It is possible that this difference in reward structure causes eye movements associated with linguistic processing to be more frequent and more rapidly deployed in task-based experiments than in passive listening experiments. It is also possible that the disparities between two experiments differing only in the presence or absence of a task (e.g. in
Altmann & Kamide, 1999, which to our knowledge is the only study that reports such a comparison) would increase in magnitude over the course of the experiment, because effects of implicit reward structure may incrementally change participants’ behaviors over time.