We have presented three eye-tracking experiments examining the role of pitch accent in discourse comprehension during a relatively complex real-world visual search task. Participants followed pre-recorded instructions for a tree decoration task that necessitated a search through 40 to 52 small ornaments in an organized grid display. Each instruction was followed by a visual search and the hanging of the selected ornament (for some, a relatively delicate manual process). This naturalistic task captured participants’ attention and, combined with head-mounted eye movement monitoring, allowed us to measure implicitly the time course of listeners’ use of intonational patterns to anticipate and recognize discourse referents. Even with the inclusion of intonationally infelicitous trials, post-experiment debriefing indicated that participants uniformly did not consider accents on words a possible focus of the study, and few attended to intonation at all. Many participants thought that the experiment was testing their responses to color distribution or their memory of the objects’ locations (ornament boards vs. trees).
Experiment 1 compared felicitous and infelicitous use of L+H*. When L+H* felicitously marked contrast on a color adjective modifying a repeated noun (“First hang the green drum.” → “Next, hang the BLUE drum.”), fixation proportions to target cells increased more quickly than when L+H* infelicitously marked the immediately repeated noun (“Hang the green drum.” → “Next, hang the blue DRUM.”). In contrast, felicitous L+H* on the noun (“Hang the blue onion.” → “Then, hang the blue DRUM.”) did not lead to an early increase in fixations to the target as compared to infelicitous L+H* on the immediately repeated color adjective (“First, hang the blue onion.” → “Then, hang the BLUE drum.”). The effect of felicitous L+H* on the noun appeared after the noun’s segmental information was fully available. The results of Experiment 1 suggested that listeners may have ‘tuned’ to tonal cues that were relevant to the task environment, where no two objects within a cell had the same color and different objects of the same color were distributed across cells. Within this visual context, a contrastive accent on the non-repeated prenominal adjective was useful to predict the upcoming object type, and participants were able to rapidly integrate pitch accent information to guide visual search. In contrast, the pitch accent pattern on repeated adjectives did not lead to an immediate difference in eye movements. While these results demonstrated clear effects of pitch accent type and location on establishing a contrast set from which to choose the upcoming referent during discourse comprehension, they could not establish whether the effects were due to a processing advantage conferred by felicitous use of L+H*, or instead to disruption from infelicitous use.
Experiment 2 confirmed that the presence of L+H* (as opposed to the more neutral H*) on a non-repeated adjective provided a processing advantage, triggering the selection of a candidate for the following noun. The anticipatory nature of the effect was demonstrated by ‘garden-path’ eye movements when the pitch accent pattern on a non-repeated adjective was misleading. Upon hearing a contrastive L+H* accent on the color (red angel → GREEN drum), participants immediately fixated the incorrect, previously mentioned object cell. The effect was clearly due to the accented adjective, because it began before segmental information for the noun became available. Importantly, the proportion of fixations to the incorrect object cell continued to increase from before noun onset until 300 ms into the noun. That is, listeners’ eyes continued to be drawn to the anticipated object even as they listened to conflicting segmental information. Such incorrect initial fixations were not observed in the absence of L+H* on the adjective modifying the non-contrastive referent, ruling out the possibility that participants had simply developed a strategy of looking back to the previous cell.
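To make the dependent measure concrete, the following is a minimal sketch of how a fixation proportion to a given cell could be computed in successive time bins relative to noun onset. It is not the analysis code used in these experiments: the function name `fixation_proportions`, the 100 ms bin width, the region labels, and the toy trials are all assumptions introduced purely for illustration.

```python
def fixation_proportions(trials, region, bin_ms=100, window=(-200, 400)):
    """Proportion of trials with a fixation on `region` in each time bin.

    Each trial is a list of (start_ms, end_ms, region) fixations, with times
    expressed relative to noun onset (negative values precede the noun).
    Returns a dict mapping each bin's start time to a proportion.
    """
    start, stop = window
    proportions = {}
    for bin_start in range(start, stop, bin_ms):
        bin_end = bin_start + bin_ms
        hits = 0
        for fixations in trials:
            # A trial counts once if any fixation on `region` overlaps the bin.
            if any(r == region and s < bin_end and e > bin_start
                   for s, e, r in fixations):
                hits += 1
        proportions[bin_start] = hits / len(trials)
    return proportions


# Toy data, invented for illustration only (not data from the study).
# "previous" stands for the previously mentioned (incorrect) object cell.
toy_trials = [
    [(-200, 150, "previous"), (150, 400, "target")],
    [(-200, -50, "other"), (-50, 300, "previous"), (300, 400, "target")],
    [(-200, 100, "other"), (100, 400, "target")],
    [(-200, 0, "other"), (0, 250, "previous"), (250, 400, "target")],
]

props = fixation_proportions(toy_trials, "previous")
for bin_start, p in props.items():
    print(f"{bin_start:>5} ms: {p:.2f}")
```

In this toy example the proportion of looks to the previously mentioned cell rises before noun onset and stays high into the noun, which is the qualitative shape of a garden-path pattern; the actual time course is the one reported for Experiment 2.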
In Experiment 3, L+H* was placed on discourse markers (DMs; e.g., And NEXT) that preceded the repetitive command “… hang the …” to test whether participants had developed a selective interpretation of the presence of an L+H* as ‘contrast-on-color’ as a strategy to perform the visual search task. No sign of specific anticipation due to the contrastive accent on a DM was observed in the eye fixation patterns before the target noun phrase. Instead, when an L+H* on the DM prompted an upcoming contrast on the color, the attention shift (presumably to the tree) seemed to be speeded. In addition, L+H* on the DM may have generally drawn attention to the target noun phrase, as indicated by early, more frequent fixations to the non-contrastive target.
Our results are consistent with those from previous eye movement studies that have demonstrated very early use of prosodic information during real-time processing, with anticipatory looks to targets made on the basis of intonational cues and before confirming lexical information (e.g., Snedeker & Trueswell, 2003; Dahan et al., 2002). The effects support a model of spoken language processing that assumes immediate, parallel processing of segmental and suprasegmental information such as pitch accent, despite the latter’s distribution across multiple phonemes in the speech stream. Our participants showed immediate sensitivity to the presence and type of pitch accent, integrating it with information from the discourse representation, and using it to speed ongoing identification of the object of visual search. Taken together, the results of the three experiments presented here also make two novel contributions to our understanding of how listeners use pitch accent information to establish referential domains during discourse comprehension. Specifically, the work demonstrates how the immediate integration of pitch accent information into the discourse representation can generate an expected referent, and also how grammatical roles of words and referential context constrain the domain of accent-based referential resolution.
First, an L+H* on a prenominal adjective immediately evoked a contrast between the accented discourse entity (i.e., the accented color) and the most salient entity that shared the same grammatical role in the discourse background (i.e., the just-mentioned color). Simultaneously, this contrastive link between the two prenominal modifiers evoked a mapping between the two modified nouns, projecting a specific candidate in the discourse foreground. The data indicate that eye movements to incorrect targets were planned based on the accentual information of the prenominal modifier, and executed even in the presence of conflicting segmental information. Because saccades are ballistic motor movements, they cannot be re-programmed after they are initiated. Thus they may not reflect the exact time course of processing of conflicting speech signals. However, the finding that the misleading intonational cue delayed fixations to the object even while its phonemic information was available strongly suggests that pragmatic information from contrastive accent is processed immediately, on par with other acoustic information in the speech stream. Not only did pitch accent produce an incremental update of the informational status of the currently processed word, but it also initiated predictive lexical access.
Second, our results show that the effect of accentual cues seems to be constrained by the discourse/grammatical role of the word conveying the accent. Although we found a robust anticipatory effect with the contrastive accent on the adjective, the same prominent accent did not produce an equivalent effect in the discourse marker position. Speakers often use contrastive accent on DMs to draw the listener’s attention to a specific part (or perhaps the whole) of the upcoming message. However, because DMs are largely independent of the syntactic structure of the following utterance, listeners have insufficient information to generate specific hypotheses about upcoming referent possibilities. In contrast, a determiner-adjective sequence provides enough information for the listener to project a head noun. Presumably, the accentual cues can be integrated faster when they accompany words whose grammatical roles constrain the upcoming informational structure. At the same time, the accentual cues associated with particular grammatical roles may be constrained by referential context, as demonstrated by the weaker effect of accentual cues on the repeated adjective in the present study’s search task environment. Thus, our present results suggest that both referential context and grammatical structure may define the domain and the strength of accentual effects. Further research is needed to explore both the scope of referential constraint and the scope of syntactic constraint on the effect of accentual cues during speech comprehension.
The present study demonstrated robust, pervasive effects of contrastive accent on the processing of discourse referents during visual search. The current experiments stand in contrast to previous work by Sedivy et al. (1999), which failed to demonstrate any effect of intonationally marked contrast with substantially similar materials. In their Experiment 1B, Sedivy et al. (1999) monitored participants’ eye movements while they followed auditory commands to touch one of four objects: a minimal contrast pair with a prenominal modifier (e.g., a pink comb and a yellow comb), a competitor that shared the contrast property (e.g., a yellow bowl), and a distracter that did not share the contrast (e.g., a metal knife). For each critical trial, participants first heard an instruction that mentioned one of the minimal-pair objects (e.g., “Touch the pink comb.”). The following instruction mentioned either the counterpart of the minimal pair (e.g., the yellow comb) or the competitor (e.g., the yellow bowl), and modifiers were produced either with L+H* or H*. Results showed that modifiers were immediately interpreted as contrastive, with fast eye movements to the minimal-pair counterparts. However, no effect of the H* vs. L+H* accentual difference was observed.
We note three major differences between Sedivy et al.’s work and that presented here, which may have led to the difference in findings across the two studies: display complexity, informativeness of the adjective modifier in the discourse context, and consistency of phonological information in the spoken materials. First, display complexity differed substantially across the studies, with four objects in Sedivy et al., and more than ten times that number in the current experiments. In a set of four, the relationship between the minimal pair and competitor objects would be salient to participants, who observed each display change for about 20 seconds before the initial instruction on each trial. Thus the preview and relatively simple display may have allowed listeners to establish a double contrast (pink/yellow comb and yellow comb/bowl). We suspect that this display-oriented overt contrast, rather than the general ‘contrastive interpretation’ of the adjectival modifier (Sedivy et al., 1999, p. 127), led to a ceiling effect (mean fixation latencies ranged from 270 to 281 ms). In the present experiments, ornaments were sorted into 11 cells, and each cell contained three to five ornaments. This visual complexity engaged participants in visual search rather than simple object selection from a known field. In addition, in the current study there was no display-oriented referential bias, as there were multiple competitors of the same color across cells. In other words, participants had no way of guessing the next target ornament based on non-linguistic contrast within the display. In order to make a direct comparison of the timing of effects across the two studies, we calculated the average latency of first fixations for each condition in Experiments 1 and 2. The average first fixation latencies for the felicitous L+H* conditions in Experiments 1 and 2 were 407 ms and 388 ms, respectively. We argue that the approximately 120 ms difference in fixation latency across the two studies was driven by the inequality in display complexity and in non-linguistic contextual bias. The relatively slow increases in fixations in the present experiments may also be due to the visual complexity of the experimental setup. (For a detailed discussion of the relation between visual complexity, preview time, and fixations in scene perception, see Henderson & Ferreira, 2004.)
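The size of the cross-study latency difference can be checked directly from the values reported in the text. The sketch below is only a worked version of that arithmetic; how the individual conditions were paired to arrive at the approximate figure is not stated, so the three comparisons shown (smallest gap, largest gap, and gap between means) are assumptions for illustration.

```python
from statistics import mean

# Values reported in the text, in milliseconds.
sedivy_range = (270, 281)  # Sedivy et al. (1999): range of mean fixation latencies
present = {"Experiment 1": 407, "Experiment 2": 388}  # felicitous L+H* conditions

# Assumed pairings; the text reports only an approximate overall difference.
smallest_gap = min(present.values()) - max(sedivy_range)    # 388 - 281
largest_gap = max(present.values()) - min(sedivy_range)     # 407 - 270
gap_of_means = mean(present.values()) - mean(sedivy_range)  # 397.5 - 275.5

print(smallest_gap, largest_gap, gap_of_means)  # prints: 107 137 122.0
```

Under any of these pairings the gap falls between roughly 105 and 140 ms, with the difference of means close to the 120 ms figure discussed in the text.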
A second difference between the studies is in the informativeness of the prenominal modifiers used. Sedivy et al. (1999) posit that their adjectives were interpreted contrastively regardless of their accents because the wide range of modifiers used increased their informativeness (objects were described with size, color and material terms, and also presented unmodified). They argue further that the presence of a modifier in the initial instruction (pink comb) may have drawn extra attention to the contrasting modified object (yellow comb) during the initial instruction. Thus excessive informativeness of modifiers across the experiment led to very fast eye movements to the minimal pair object (yellow comb) as compared to the competitor (yellow bowl), a ceiling effect that obscured the effect of felicitous contrastive accent. We instead attribute the ceiling effect to the varied informativeness of the modifier within a display: while the modifier was informative for minimal pair members (pink/yellow comb), it was also used in a less informative manner to describe the single object that shared color with a member of the minimal pair (yellow bowl). When the modifier conveyed unnecessary or confusing information about the target, participants may have interpreted it as the modifier of the contrastive object (e.g., the yellow comb), which required the modification in order to be distinguished from the other member of the pair. We suggest that the overt visual contrast led to frequent looks to the contrastive object regardless of the accentual property of the modifier. In the present study, the informativeness of the color modifier was consistent. There were multiple ornaments of multiple colors, and no ornament could be singled out without a color modifier. Therefore, the color modifier in each instruction was equally informative, allowing the examination of the effect of accent uninfluenced by differences in informativeness.
Finally, the phonetic consistency of instructions is crucial for examining prosodic effects. Sedivy et al. (1999) gave the oral instructions by reading the script aloud, pronouncing L+H* accents in the ‘Stress’ condition and H* in the ‘No stress’ condition. Since no prosodic transcriptions or acoustic analyses of the instructions were provided, it is difficult to ascertain the phonetic distinction between the two conditions. Pitch range and speech rate are highly variable within speakers, and it is not easy to consistently pronounce equivalent tunes across utterances. It is possible that some L+H* instructions of Sedivy et al. (1999) were produced with less (or more) prominent accents than our speech materials. The present study employed careful ToBI analysis for screening the speech stimuli. Although the debate remains open over the phonological distinction between L+H* and H* (Ladd and Morton, 1997; Ladd and Schepman, 2003), we ensured that the accentual values of our speech stimuli were phonetically distinct across conditions. Although we do not wish to devalue the importance of investigating the online processing of non-cardinal accents produced in spontaneous speech, we feel it is critical to provide phonetic and phonological analysis of experimental materials used in research on processing and intonation. Current work in our laboratory examines the production and detection of categorical boundaries across different accent types in online spontaneous dialogue comprehension.
The research presented here adds to a growing body of work that employs natural tasks to structure the attention and intentions of interlocutors during discourse production and comprehension. Here, we have successfully used a visual search task and head-mounted eye tracking to examine the time course of contrastive pitch accent use. We feel that the combination of such measures and tasks with careful phonetic and phonological analyses of spoken materials will lead to an accurate characterization of the use of intonation in language processing.