Previous work examining prosodic cues in online spoken-word recognition has focused primarily on local cues to word identity. However, recent studies have suggested that utterance-level prosodic patterns can also influence the interpretation of subsequent sequences of lexically ambiguous syllables (Dilley, Mattys, & Vinke, Journal of Memory and Language, 63:274–294, 2010; Dilley & McAuley, Journal of Memory and Language, 59:294–311, 2008). To test the hypothesis that these distal prosody effects are based on expectations about the organization of upcoming material, we conducted a visual-world experiment. We examined fixations to competing alternatives such as pan and panda upon hearing the target word panda in utterances in which the acoustic properties of the preceding sentence material had been manipulated. The proportions of fixations to the monosyllabic competitor were higher beginning 200 ms after target word onset when the preceding prosody supported a prosodic constituent boundary following pan-, rather than following panda. These findings support the hypothesis that expectations based on perceived prosodic patterns in the distal context influence lexical segmentation and word recognition.
Prosody; Expectations; Spoken-word recognition; Lexical competition; Perceptual organization; Visual-world paradigm
When referring to named objects, speakers can choose either a name (mbira) or a description (that gourd-like instrument with metal strips); whether the name provides useful information depends on whether the speaker’s knowledge of the name is shared with the addressee. But, how do speakers determine what is shared? In 2 experiments a naïve participant (director) learned names for novel objects, then instructed another participant (matcher), who viewed 3 objects, to click on the target object. Directors learned novel names in 2 phases. First, the director and the matcher learned (shared) names either together or alone; second, the director learned (privileged) names alone. Directors typically used a name for items with shared names and a description for items with privileged names. When the director and matcher learned the names individually but with knowledge of what the other learned, directors were much more likely to use privileged names than when director and matcher learned shared names together. Experiment 1b separated effects of collaborative learning from partner-specific effects, showing collaborative learning experience with 1 person helps a speaker distinguish shared and privileged information with a new partner who has the same knowledge. Experiment 2 showed that partner-specific effects persisted even when semantic category was a reliable cue to which names were privileged. The results are interpreted as evidence that ordinary memory processes provide access to shared knowledge in real-time production of referring expressions and that shared experience when learning shared names provides a strong memory cue to the ground status of names.
language production; common ground; reference; memory; shared experience
The notion of common ground is important for the production of referring expressions: In order for a referring expression to be felicitous, it has to be based on shared information. But determining what information is shared and what information is privileged may require gathering information from multiple sources, and constantly coordinating and updating them, which might be computationally too intensive to affect the earliest moments of production. Previous work has found that speakers produce overinformative referring expressions, which include privileged names, violating Grice’s Maxims, and concluded that this is because they do not mark the distinction between shared and privileged information. We demonstrate that speakers are in fact quite effective in marking this distinction in the form of their utterances. Nonetheless, under certain circumstances, speakers choose to overspecify privileged names.
Common ground; Language production; Perspective taking; Referring expressions; Names
Listeners expect that a definite noun phrase with a pre-nominal scalar adjective (e.g., the big …) will refer to an entity that is part of a set of objects contrasting on the scalar dimension, e.g., size (Sedivy, Tanenhaus, Chambers & Carlson, 1999). Two visual world experiments demonstrate that uttering a referring expression with a scalar adjective makes all members of the relevant contrast set more salient in the discourse model, facilitating subsequent reference to other members of that contrast set. Moreover, this discourse effect is caused primarily by linguistic mention of a scalar adjective and not by the listener’s prior visual or perceptual experience. These experiments demonstrate that language processing is sensitive to which information was introduced by linguistic mention, and that the visual world paradigm can be used to tease apart the separate contributions of visual and linguistic information to reference resolution.
Two experiments examined the restriction of referential domains during unscripted conversation by analyzing the modification and on-line interpretation of referring expressions. Experiment 1 demonstrated that from the earliest moments of processing, addressees interpreted referring expressions with respect to referential domains constrained by the conversation. Analysis of eye movements during the conversation showed elimination of standard competition effects seen with scripted language. Results from Experiment 2 pinpointed two pragmatic factors responsible for restriction of the referential domains used by speakers to design referential expressions and demonstrated that the same factors predict whether addressees consider local competitors to be potential referents during on-line interpretation of the same expressions. These experiments demonstrate for the first time that on-line interpretation of referring expressions in conversation is facilitated by referential domains constrained by pragmatic factors which predict when addressees are likely to encounter temporary ambiguity in language processing.
Eye-tracking; conversation; alignment; on-line; language processing; referential communication; cohort; point-of-disambiguation
There is an emerging literature on visual search in natural tasks suggesting that task-relevant goals account for a remarkably high proportion of saccades, including anticipatory eye-movements. Moreover, factors such as “visual saliency” that otherwise affect fixations become less important when they are bound to objects that are not relevant to the task at hand. We briefly review this literature and discuss the implications for task-based variants of the visual world paradigm. We argue that the results and their likely interpretation may profoundly affect the “linking hypothesis” between language processing and the location and timing of fixations in task-based visual world studies. We outline a goal-based linking hypothesis and discuss some of the implications for how we conduct visual world studies, including how we interpret and analyze the data. Finally, we outline some avenues of research, including examples of some classes of experiments that might prove fruitful for evaluating the effects of goals in visual world experiments and the viability of a goal-based linking hypothesis.
Two rating studies demonstrate that English speakers willingly produce reduced relatives with internal cause verbs (e.g., Whisky fermented in oak barrels can have a woody taste), and judge their acceptability based on factors known to influence ambiguity resolution, rather than on the internal/external cause distinction. Regression analyses demonstrate that frequency of passive usage predicts reduced relative frequency in corpora, but internal/external cause status does not. The authors conclude that reduced relatives with internal cause verbs are rare because few of these verbs occur in the passive. This contrasts with the claim in McKoon and Ratcliff (McKoon, G., & Ratcliff, R. (2003). Meaning through syntax: Language comprehension and the reduced relative clause construction. Psychological Review, 110, 490–525) that reduced relatives like The horse raced past the barn fell are rare and, when they occur, incomprehensible, because the meaning of the reduced relative construction prohibits the use of a verb with an internal cause event template.
Lexical semantics; Language comprehension; Language production; Meaning through syntax (MTS)
Two experiments evaluated the time course and use of orthographic information in spoken-word recognition using a visual world eye-tracking paradigm with printed words as referents. Participants saw four words on a computer screen and listened to spoken sentences instructing them to click on one of the words (e.g., Click on the word bead). The printed words appeared 200 ms before the onset of the spoken target word. In Experiment 1, the display included the target word and a competitor with either a lower degree of phonological overlap with the target (bear) or a higher degree of phonological overlap with the target (bean). Both competitors had the same degree of orthographic overlap with the target. There were more fixations to the competitors than to unrelated distracters. Crucially, the likelihood of fixating a competitor did not vary as a function of the amount of phonological overlap between target and competitor. In Experiment 2, the display included the target word and a competitor with either a lower degree of orthographic overlap with the target (bare) or a higher degree of orthographic overlap with the target (bear). Competitors were homophonous and thus had the same degree of phonological overlap with the target. There were more fixations to higher-overlap competitors than to lower-overlap competitors, beginning during the temporal interval where initial fixations driven by the vowel are expected to occur. The authors conclude that orthographic information is rapidly activated as a spoken word unfolds and is immediately used in mapping spoken words onto potential printed referents.
spoken-word recognition; orthography; phonology; visual-world paradigm; eye movements
Scalar inferences are commonly generated when a speaker uses a weaker expression rather than a stronger alternative, e.g., John ate some of the apples implies that he did not eat them all. This article describes a visual-world study investigating how and when perceivers compute these inferences. Participants followed spoken instructions containing the scalar quantifier some directing them to interact with one of several referential targets (e.g., Click on the girl who has some of the balloons). Participants fixated on the target compatible with the implicated meaning of some and avoided a competitor compatible with the literal meaning prior to a disambiguating noun. Further, convergence on the target was as fast for some as for the non-scalar quantifiers none and all. These findings indicate that the scalar inference is computed immediately and is not delayed relative to the literal interpretation of some. It is argued that previous demonstrations that scalar inferences increase processing time are not necessarily due to delays in generating the inference itself, but rather arise because integrating the interpretation of the inference with relevant information in the context may require additional time. With sufficient contextual support, processing delays disappear.
Pragmatics; Sentence processing; Scalar implicature; Eye movements
Five experiments monitored eye movements in phoneme and lexical identification tasks to examine the effect of within-category sub-phonetic variation on the perception of stop consonants. Experiment 1 demonstrated gradient effects along VOT continua made from natural speech, replicating results with synthetic speech (McMurray, Tanenhaus & Aslin, Cognition, 2002). Experiments 2–5 used synthetic VOT continua to examine effects of response alternatives (2 vs. 4), task (lexical vs. phoneme decision), and type of token (word vs. CV). A gradient effect of VOT in at least one half of the continuum was observed in all conditions. These results suggest that during on-line spoken word recognition lexical competitors are activated in proportion to their continuous distance from a category boundary. This gradient processing may allow listeners to anticipate upcoming acoustic/phonetic information in the speech signal and dynamically compensate for acoustic variability.
Speech Perception; Categorical Perception; Word Recognition; Subphonemic Sensitivity; Visual World Paradigm
We explored how speakers and listeners use hand gestures as a source of perceptual-motor information during naturalistic communication. After solving the Tower of Hanoi task either with real objects or on a computer, speakers explained the task to listeners. Speakers' hand gestures, but not their speech, reflected properties of the particular objects and the actions that they had previously used to solve the task. Speakers who solved the problem with real objects used more grasping handshapes and produced more curved trajectories during the explanation. Listeners who observed explanations from speakers who had previously solved the problem with real objects subsequently treated computer objects more like real objects; their mouse trajectories revealed that they lifted the objects in conjunction with moving them sideways, and this behavior was related to the particular gestures that were observed. These findings demonstrate that hand gestures are a reliable source of perceptual-motor information during human communication.
We present four experiments on the interpretation of pronouns and reflexives in picture noun phrases with and without possessors (e.g. Andrew’s picture of him/himself, the picture of him/himself). The experiments (two off-line studies and two visual-world eye-tracking experiments) investigate how syntactic and semantic factors guide the interpretation of pronouns and reflexives and how different kinds of information are integrated during real-time reference resolution. The results show that the interpretation of pronouns and reflexives in picture NP constructions is sensitive not only to purely structural information, as is commonly assumed in syntactically-oriented theories of anaphor resolution, but also to semantic information (see Kuno, 1987; Tenny, 2003). Moreover, the results show that pronouns and reflexives differ in the degree of sensitivity they exhibit to different kinds of information. This finding indicates that the form-specific multiple-constraints approach (see Kaiser, 2003; Kaiser, 2005; Kaiser & Trueswell, 2008; Brown-Schmidt, Byron & Tanenhaus, 2005), which states that referential forms can exhibit asymmetrical sensitivities to the different constraints guiding reference resolution, also applies in the within-sentence domain.
Spoken word recognition shows gradient sensitivity to within-category voice onset time (VOT), as predicted by several current models of spoken word recognition, including TRACE (McClelland & Elman, Cognitive Psychology, 1986). It remains unclear, however, whether this sensitivity is short-lived or whether it persists over multiple syllables. VOT continua were synthesized for pairs of words like barricade and parakeet, which differ in the voicing of their initial phoneme, but otherwise overlap for at least four phonemes, creating an opportunity for “lexical garden-paths” when listeners encounter the phonemic information consistent with only one member of the pair. Simulations established that phoneme-level inhibition in TRACE eliminates sensitivity to VOT too rapidly to influence recovery. However, in two Visual World experiments, look-contingent and response-contingent analyses demonstrated effects of word initial VOT on lexical garden-path recovery. These results are inconsistent with inhibition at the phoneme level and support models of spoken word recognition in which sub-phonetic detail is preserved throughout the processing system.
Spoken Word Recognition; Speech Perception; Gradiency; Eye Movements; Lexical ambiguity
Listeners are exquisitely sensitive to fine-grained acoustic detail within phonetic categories for sounds and words. Here we show that this sensitivity is optimal given the probabilistic nature of speech cues. We manipulated the probability distribution of one probabilistic cue, Voice Onset Time (VOT), which differentiates word-initial labial stops in English (e.g., “beach” and “peach”). Participants categorized words from distributions of VOT with wide or narrow variances. Uncertainty about word identity was measured by four-alternative forced-choice judgments and by the probability of looks to pictures. Both measures closely reflected the posterior probability of the word given the likelihood distributions of VOT, suggesting that listeners are sensitive to these distributions.
speech perception; word recognition; ideal observer model; categorization
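The ideal-observer computation referred to in the preceding abstract (the posterior probability of a word given likelihood distributions over VOT) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the study: the Gaussian form of the likelihoods, the category means (0 ms and 50 ms), the standard deviations, the equal priors, and the function names gaussian_pdf and posterior_voiceless.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_voiceless(vot_ms, mean_b=0.0, mean_p=50.0, sd=10.0, prior_p=0.5):
    """Posterior probability of the voiceless word (e.g., "peach") given a VOT.

    Hypothetical parameters: both categories are modeled as Gaussians over
    VOT with a shared standard deviation `sd` (assumed, not from the source).
    Bayes' rule: P(p | vot) = P(vot | p) P(p) / sum over both categories.
    """
    like_b = gaussian_pdf(vot_ms, mean_b, sd)
    like_p = gaussian_pdf(vot_ms, mean_p, sd)
    numerator = like_p * prior_p
    return numerator / (numerator + like_b * (1.0 - prior_p))


if __name__ == "__main__":
    # Compare a narrow and a wide cue distribution at the same VOT values.
    for vot in (15.0, 25.0, 35.0):
        narrow = posterior_voiceless(vot, sd=8.0)
        wide = posterior_voiceless(vot, sd=14.0)
        print(f"VOT {vot:5.1f} ms  narrow={narrow:.3f}  wide={wide:.3f}")
```

Under these assumptions, a VOT away from the category midpoint yields a more extreme posterior when the distributions are narrow than when they are wide, which is the qualitative pattern an observer sensitive to the cue distributions would be expected to show.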
In many domains of cognitive processing there is strong support for bottom-up priority and delayed top-down (contextual) integration. We ask whether this applies to supra-lexical context that could potentially constrain lexical access. Previous findings of early context integration in word recognition have typically used constraints that can be linked to pair-wise conceptual relations between words. Using an artificial lexicon, we found immediate integration of syntactic expectations based on pragmatic constraints linked to syntactic categories rather than words: phonologically similar “nouns” and “adjectives” did not compete when a combination of syntactic and visual information strongly predicted form class. These results suggest that predictive context is integrated continuously, and that previous findings supporting delayed context integration stem from weak contexts rather than delayed integration.
Three eye movement studies with novel lexicons investigated the role of semantic context in spoken word recognition, contrasting three models: restrictive access, access-selection and continuous integration. Actions directed at novel shapes caused changes in motion (e.g., looming, spinning, etc.) or state (color, texture, etc.). Across the experiments, novel names for the actions and the shapes varied in frequency, cohort density, and whether the cohorts referred to actions (Experiment 1) or shapes with action-congruent or incongruent affordances (Experiments 2 and 3). Experiment 1 demonstrated effects of frequency and cohort competition from both displayed and non-displayed competitors. In Experiment 2 a biasing context induced an increase in anticipatory eye movements to congruent referents and reduced the probability of looks to incongruent cohorts, without the delay predicted by access-selection models. In Experiment 3, context did not reduce competition from non-displayed incompatible neighbors as predicted by restrictive access models. We conclude that the results are most consistent with continuous integration models.
Two experiments examined the role of common ground in the production and on-line interpretation of wh-questions such as What’s above the cow with shoes? Experiment 1 examined unscripted conversation, and found that speakers consistently use wh-questions to inquire about information known only to the addressee. Addressees were sensitive to this tendency, and quickly directed attention toward private entities when interpreting these questions. A second experiment replicated the interpretation findings in a more constrained setting. These results add to previous evidence that the common ground influences initial language processes, and suggest that the strength and polarity of common ground effects may depend on contributions of sentence type as well as the interactivity of the situation.
common ground; eye-tracking; perspective taking; conversation; referential communication; question; comprehension
The authors argue that a more complete understanding of how people produce and comprehend language will require investigating real-time spoken-language processing in natural tasks, including those that require goal-oriented unscripted conversation. One promising methodology for such studies is monitoring eye movements as speakers and listeners perform natural tasks. Three lines of research that adopt this approach are reviewed: (i) spoken word recognition in continuous speech, (ii) reference resolution in real-world contexts, and (iii) real-time language processing in interactive conversation. In each domain, results emerge that provide insights which would otherwise be difficult to obtain. These results extend and, in some cases, challenge standard assumptions about language processing.
eye movements; language comprehension; spoken word recognition; conversation; parsing; speech perception
Importance and predictability each have been argued to contribute to acoustic prominence. To investigate whether these factors are independent or two aspects of the same phenomenon, naïve participants played a verbal variant of Tic Tac Toe. Both importance and predictability contributed independently to the acoustic prominence of a word, but in different ways. Predictable game moves were shorter in duration and had less pitch excursion than less predictable game moves, whereas intensity was higher for important game moves. These data also suggest that acoustic prominence is affected by both speaker-centered processes (speaker effort) and listener-centered processes (intent to signal important information to the listener).
Speech perception requires listeners to integrate multiple cues that each contribute to judgments about a phonetic category. Classic studies of trading relations assessed the weights attached to each cue, but did not explore the time-course of cue-integration. Here we provide the first direct evidence that asynchronous cues to both voicing (b/p) and manner (b/w) contrasts become available to the listener at different times during spoken word recognition. Using the Visual World paradigm, we show that the probabilities of eye movements to pictures of target and competitor objects diverge at different points in time after the onset of the target word. These points of divergence correspond to the availability of early (voice-onset-time or formant transition slope) and late (vowel length) cues to voicing and manner contrasts. These results support a model of cue-integration in which phonetic cues are used for lexical access as soon as they are available.
Spoken word recognition; Speech perception; trading relations; time course; cue integration
Eye movements were monitored as participants followed spoken instructions to manipulate one of four objects pictured on a computer screen. Target words occurred in utterance-medial (e.g., Put the cap next to the square) or utterance-final position (e.g., Now click on the cap). Displays consisted of the target picture (e.g., a cap), a monosyllabic competitor picture (e.g., a cat), a polysyllabic competitor picture (e.g., a captain) and a distractor (e.g., a beaker). The relative proportion of fixations to the two types of competitor pictures changed as a function of the position of the target word in the utterance, demonstrating that lexical competition is modulated by prosodically-conditioned phonetic variation.
Adult knowledge of a language involves correctly balancing lexically-based and more language-general patterns. For example, verb-argument structures may sometimes readily generalize to new verbs, yet with particular verbs may resist generalization. From the perspective of acquisition, this creates significant learnability problems (Baker, 1979), with some researchers claiming a crucial role for verb semantics in the determination of when generalization may and may not occur (Pinker, 1989). Similarly, there has been debate regarding how verb-specific and more generalized constraints interact in sentence processing (Trueswell et al., 1993; Mitchell, 1987) and on the role of semantics in this process (Hare et al., 2003). The current work explores these issues using artificial language learning. In three experiments using languages without semantic cues to verb distribution, we demonstrate that learners can acquire both verb-specific and verb-general patterns, based on distributional information in the linguistic input regarding each of the verbs as well as across the language as a whole. As with natural languages, these factors are shown to affect production, judgments and real-time processing. We demonstrate that learners apply a rational procedure in determining their usage of these different input-statistics and conclude by suggesting that a Bayesian perspective on statistical learning may be an appropriate framework for capturing our findings.
Language Acquisition; Sentence Processing; Verb Argument Structures; Eye-tracking; Artificial Language Learning