We introduce a novel paradigm for studying the cognitive processes used by listeners within interactive settings. This paradigm places the talker and the listener in the same physical space, creating opportunities for investigations of attention and comprehension processes taking place during interactive discourse. An experiment was conducted to compare results from previous research using videotaped stimuli with those obtained within the live face-to-face task paradigm. A headworn apparatus briefly displayed LEDs at four locations on the talker’s face as the talker communicated with the participant. In addition to the primary task of comprehending the speeches, participants performed a secondary light detection task. In the present experiment, the talker gave non-emotionally-expressive speeches that had been used in past research with videotaped stimuli. Signal detection analysis was employed to determine which areas of the face received the greatest focus of attention. The results replicate previous findings obtained with videotaped methods.
Previous investigations of attention during language comprehension have utilized many different methodologies and have produced many interesting findings. For example, research going back to the 1960s using dichotic listening tasks has demonstrated a consistent right ear advantage (REA) in reporting words. That is, when stimuli are presented in the right ear, they are processed faster and more accurately than when stimuli are presented in the left ear. This effect is observed in experiments using auditory-alone (e.g., Broadbent & Gregory, 1965; Grimshaw, Kwasny, Covell, & Johnson, 2003; Hiscock, Inch, & Kinsbourne, 1999) and auditory-visual (Thompson, Garcia, & Malloy, 2007; Thompson & Guzman, 1999) stimuli. These results are interpreted as evidence of left hemisphere specialization for linguistic processing. When auditory speech understanding includes a greater emphasis on the interpretation of emotional prosody, however, dichotic listening tasks almost universally show a left ear advantage (LEA) for comprehension of emotional prosody (e.g., Bryden & MacRae, 1989; Erhan, Borod, Tenke, & Bruder, 1998; Grimshaw, 1998; Grimshaw et al., 2003; Stirling, Cavill, & Wilkinson, 2000), reflecting right hemisphere specialization for interpreting emotional prosody.
Supporting the results of the behavioral measure are results from brain imaging and event-related brain potential (ERP) measures, which find activation for emotional prosody primarily in right-hemisphere brain regions (Buchanan, Lutz, Mirasazade, Specht et al., 2000; George, Parekh, Rosinsky, Ketter, et al., 1996; Erhan et al., 1998; Shapiro & Danly, 1985; Shipley-Brown, Dingwall, & Berlin, 1988). Brain imaging and ERP studies have also shown primary activation in the left hemisphere during silent lipreading (Calvert, et al. 1997) and visible speech processing (Campbell, De Gelder, & De Haan, 1996). Using tachistoscopic presentations of stimuli in left and right regions of the visual field, Smeele, Massaro, Cohen, and Sittig (1998) also found a right visual field advantage in visible speech syllable and nonspeech (nonemotional) mouth movement identification, indicating primary involvement of the left hemisphere during the encoding of both types of mouth movements. Finally, Prodan, Orbelo, Testa, and Ross (2001) found that the right hemisphere was more highly activated when processing the upper region of the face while the left hemisphere was more highly activated while processing the lower region of the face. Differential activation of the left hemisphere could imply greater attention to the mouth for visible speech encoding; likewise, differential activation of the right hemisphere could imply greater pickup of cues from the eye region for emotion interpretation.
Another paradigm used to investigate cognitive processes of multimodal language understanding is the eye movement paradigm. By tracking eye movements, it has been found, for example, that under more adverse listening conditions young adults tend to focus more on the mouth area of a (videotaped) face than other areas of the face (Vatikiotis-Bateson, Eigsti, Yano, & Munhall, 1998). However, it has been known for three decades that human attention shifts prior to the initiation of a saccadic movement of the eyes (Remington, 1978, cited in Posner, 1980). Furthermore, whereas the eyes may be directed at a given region in space (overt attention), attention may be inwardly directed, to thought processes (covert attention). As Posner (1980) aptly stated, “On a very general level it seems that evolution has selected similar principles of movement for the hand, the eye and covert visual attention. [However] attention is not intrinsically tied to the foveal structure of the visual system nor slaved to the overt movements of the eye” (p. 22). In other words, it cannot be presumed that, just because the eyes are aimed at a particular region of space, attention is directed at that specific location, because there is not a 1:1 correspondence between attention and eye fixations.
We recently introduced a behavioral methodology referred to as the dot detection paradigm for investigating the distribution of visuospatial attention across a talker’s face during the comprehension of lengthy speech passages using video presentations (Thompson, Malmberg, Goodell, & Boring, 2004). In our past research, dots were superimposed on the talker’s face for either 16 msec (Thompson et al., 2004; Thompson & Malloy, 2004) or 8 msec (Thompson, Malloy, & Le Blanc, 2009). Participants completed a comprehension task (the primary task) and the secondary task of dot detection. Dot detection performance for different areas of the face (right, left, above eyes, below mouth) indicated the degree of visuospatial attention on those areas. Our results using speeches containing neutral prosody showed greater attention was distributed on the right side of the talking face image (right visual field), compared to the left side. In addition to the right-side-of-face-image effect, we found that greater attention was paid to the mouth than the eye area (Thompson & Malloy, 2004; Thompson et al., 2004; Thompson et al., 2009).
In contrast, using taped speeches with the same semantic content but with a high degree of emotional prosody, attention was greater for the left side of the face image and at the eye region (Thompson et al., 2009). We argued that these effects are predicted by a model of attention that proposes that human attention is directed to: (a) regions of the face contralateral to the activated hemisphere, (b) the mouth region for the pickup of visible speech cues originating in left hemisphere activation, and (c) the eye region to facilitate emotion processing originating from right hemisphere activation.
While the results of studies of attention employing scripted videotaped methods might be logically predicted from the brain imaging and dichotic listening findings, they have limited power to predict what happens to attention during face-to-face discourse processing, in which both the talker and the perceiver take a far more active and social role in communication. Whereas discourse understanding via scripted videotaped methods involves no social interaction between the talker and perceiver, discourse understanding in the presence of a live speaker introduces many additional facets of social communication that must affect how attention functions. Human social communication carries a great degree of unpredictability, forcing listeners to monitor when to take their turn in the conversation, gauge how to pace the length of the interaction, and attend to contextual cues that may aid in understanding the speaker’s meaning or in formulating something to say. In addition, during live interactions a talker must engage in a myriad of cognitive processes that occur when translating an internally represented meaning into spoken utterances.
Unscripted discourse processing also involves dynamically-made adjustments to all levels of language representation and production, adjustments in intention, planning, and execution. Some adjustments are made to promote better understanding, such as when a teacher tries to explain something to a student while assessing the student’s understanding. If the student or listener appears not to understand, the teacher may repeat or recast what was just said, or query to find out what the misunderstanding was. The adjustments play another very important role, which is to inform the student of their role as a communicative partner in a social exchange of information. The social aspect of language is crucial for Grey parrots’ learning of the referential use of labels. That is, when Grey parrots view videos of other language learners modeling the process of learning a label for an object, the birds do not learn from the input; a live human must make the adjustments in order for referential learning to occur (Pepperberg & Wilkes, 2004). Fortunately for humans, complete failures of understanding are not the typical result of a virtual communication presentation. Recent advances in video satellite conferencing demonstrate that we are obviously not as limited as the Grey parrot. Still, our scientific understanding of the workings of the human cognitive system during multimodal communication is almost entirely based on data collected during very passive, non-interactive settings.
Moreover, even when the listener is not expected to speak, inculcated “rules of engagement” still apply. For example, in Western European cultures, persons engaged in active listening during face-to-face interaction usually maintain eye contact, presumably to indicate interest and involvement in the dialogue (e.g., Kendon, 1990) and perhaps also to maximize the receipt of multimodal communication. Listeners trying to understand the auditory-visual speech of a videotaped talker may still use the same eye fixation strategies they would normally use during live interaction, either out of habit or because it is more effective, but the social pressure to do so is absent.
Gullberg and Holmqvist (2006) compared eye fixation patterns and durations across three formats of stimulus presentation. The first was face-to-face presentation, with one such session recorded and subsequently presented to subjects either as a life-sized projection or on a 28” television. Time spent viewing the face did not differ significantly across conditions, leading them to conclude that: “Neither social setting nor size seems to influence the tendency to fixate the face” (p. 68). However, a significant difference in the number of gestures fixated was noted. More gestures were fixated (7.4%) in the face-to-face condition than in the television presentation (3%), with the decline in gesture fixations offset by an increase in fixations on other body areas, overt attention to which is generally inhibited in the live presence of the person being viewed. The reduction of gestural fixations during television presentation was attributed to the difference in image size, not social setting, since fixation patterns between the face-to-face and projected-image conditions (which differ in social influences but not size) did not differ significantly.
In contrast, two studies of eye fixations to gestures presented by an anthropomorphic agent, shown via video, found that the majority of gestures (70–75%) were fixated (Nobe, Hayamizu, Hasegawa, & Takahashi, 1998, 2000). This large difference in fixation rates may not be rooted in presentation modality, per se, but rather in inherent differences between the representational systems humans build when interpreting the actions of actual human agents versus simulated human agents (Tsai & Brass, 2007). Even within the literature involving videos of actual human gesture displays, there is controversy surrounding whether or not gestures inform the listener. On the one hand, iconic, or representational, gestures have been found to improve young adults’ comprehension of discourse (Thompson, 1995); on the other hand, other evidence suggests that gestures contribute little or nothing to the listener’s understanding beyond that contributed by speech (Krauss, Morrel-Samuels, & Colasante, 1991; Williams, 1977).
The aforementioned differences of opinion and findings across laboratories, and the relative dearth of knowledge about the nature of cognitive processes in true interaction settings motivated us to create a new lab setting and paradigm for face-to-face interaction research. The present study is the first investigative step towards understanding human attention during interactive multimodal discourse understanding within a controlled laboratory environment. Most central to the present study is the question: Is the pattern of human attention to different face regions exhibited in passive, videotaped research paradigms also found in more active, truly face-to-face discourse settings?
Currently, our knowledge of human attention processes in communication has been created almost solely from experimental paradigms that bear little resemblance to the real world. When less artificial conditions are used, sometimes surprising results are obtained, prompting some to argue that we must break out of the old mold of studying attention in impoverished settings and begin to study attention in more real-world settings (Kingstone, Smilek, Ristic, Friesen, & Eastwood, 2003). However, to date, despite J. J. Gibson’s admonition “The laboratory must be like life!” (1986, p. 3), no studies of attention during face-to-face discourse understanding have been conducted with both talker and listener sitting directly across from each other. We developed a novel methodology, the face-to-face light detection paradigm, to begin investigating how attention is distributed across the talker’s face in shared physical space. Since our most recently published investigation employing video presentations also included noise distraction embedded in speech (Thompson et al., 2009), we incorporated noise distraction as a variable in the present design, to explore how attention might vary across face regions when speech understanding is challenged by noisy conditions.
Thirty college students (17–29 years old, M = 21 years; 15 female) participated. Participants sat across from a young adult female (the “talker”) in comfortable chairs. The design was a 2 (Noise Distraction: No, High) × 4 (Light Position: Right, Mouth, Left, Eyes) mixed design, with noise distraction between-subjects and light position within-subjects. The semantic content of the speeches used in the present study was exactly the same as that previously used in a non-interactive video task paradigm (Thompson et al., 2009); however, a different female talker gave the speeches. Across studies, talkers were young adult females of Hispanic ethnic origin. Each of four speeches was divided into thirty-two 20–25 sec segments. The talker conveyed little emotion, to simulate the Neutral condition presentations from our past research (Thompson & Malloy, 2004; Thompson et al., 2004; Thompson et al., 2009). She memorized each speech, but also occasionally glanced at the text of the speech scrolling in large font on a television screen situated behind the participant.
Face-light stimuli were 5-msec flashes of white light-emitting diodes (LEDs) modified to provide nearly constant luminance over the range of viewing angles expected in the experiment. The LEDs were positioned in front of the talker’s left ear (the “right” position), in the center of the chin area (the “mouth” position), in front of the talker’s right ear (the “left” position), and in the center of the forehead above the eyes (the “eyes” position) by the apparatus shown in Figure 1. The talker’s speech was fed through a small microphone located behind her mouth light position into small earphones worn by the participant. A 74 dBA recording of Multi-Talker Babble 20 (Auditec of St. Louis) was mixed with the talker’s speech for the high noise distraction condition.
Prior to every 8th speech segment participants were cued by computer to the location of the possible light stimulus. Half of the trials contained a light. The timing of the light presentations was controlled by a software program to occur within a few seconds prior to the end of each speech segment. Immediately after each segment, participants responded whether they perceived a light, and then made a multiple-choice comprehension response referring to information just mentioned. For example,
“Professor Woodrow Wilson said that every college graduate should be a man of his nation and a man of his __? (a) pride (b) time (c) honor (d) family.”
Instructions included the statements: “Your primary task is to pay as much attention to the speeches as you can while ignoring the background noise because you will be quizzed in detail on the specifics of the speech. Near the end of each segment you may or may not see a light flash on the person’s face. The lights will appear on the face at four different locations for only a very short period of time. The secondary task is to correctly identify whether or not you saw the light flash on the talker’s face. This task is not as important as comprehending the speech.” Position cues were presented and responses were recorded using a tablet PC located on the right armrest.
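The trial structure described above — a position cue, a speech segment, a light on half the trials timed shortly before the segment ends, then detection and comprehension responses — can be sketched as a schedule generator. This is an illustrative reconstruction only: the function name, parameters, and exact timing window are our assumptions, not the authors’ actual control software.

```python
import random

POSITIONS = ["right", "mouth", "left", "eyes"]

def build_trial_schedule(n_segments=32, segment_dur=(20.0, 25.0), seed=0):
    """Sketch of a per-segment trial schedule (hypothetical parameters).

    Each segment is cued with one of four face positions; a light
    actually appears on exactly half the trials, timed within the
    final few seconds of the segment.
    """
    rng = random.Random(seed)
    # Equal numbers of each cued position, order shuffled
    positions = POSITIONS * (n_segments // len(POSITIONS))
    rng.shuffle(positions)
    # Light present on exactly half the trials
    present = [True] * (n_segments // 2) + [False] * (n_segments // 2)
    rng.shuffle(present)
    schedule = []
    for pos, has_light in zip(positions, present):
        dur = rng.uniform(*segment_dur)
        # Flash occurs a few seconds before the segment ends (assumed 1-4 s)
        onset = dur - rng.uniform(1.0, 4.0) if has_light else None
        schedule.append({"position": pos, "duration": dur,
                         "light": has_light, "light_onset": onset})
    return schedule
```

Generating the schedule in advance, rather than deciding per trial at runtime, guarantees the 50% light rate and equal sampling of the four positions that the signal detection analysis requires.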
We chose to recreate, as closely as possible, the original laboratory context that our studies were conducted in, by projecting video images of this context onto the four surrounding walls of the new laboratory. We did this by mounting four cameras onto a single tripod, aimed in all four directions, simultaneously filming the audio and video images. The videos were then replayed from a video server computer located in an adjacent control room, and projected from ceiling-mounted video projectors into the experiment room. Image sizes were slightly less than 2 m high by about 2.5 m wide.
To conduct a Signal Detection Theory (Green & Swets, 1966) analysis, hits and false alarms for secondary task light detection responses were converted to d prime (d′; discriminability) and beta (β; criterion) scores, which were submitted to separate analyses of variance (ANOVAs). There was a significant main effect of light position (F(3, 84) = 6.25, MSE = 5.43, p < .001), but no effect of distraction noise (p = .91) and no interaction (p = .94). Discriminability scores were higher at the right position (d′ = 4.07) than at the left position (d′ = 2.20), t(29) = 3.30, p = .003, d = .77, and at the mouth position (d′ = 3.93) than at the eyes position (d′ = 2.11), t(29) = 2.54, p = .017, d = .68. Criterion was set more conservatively at the right position than at the left position, t(29) = 4.05, p = .0001, d = 1.05, but mouth and eyes criterion scores were not significantly different. Interpretation of the discriminability measure was not complicated by criterion score considerations. These results replicate our other studies using videotaped presentations of speeches and non-emotional displays (Thompson & Malloy, 2004; Thompson et al., 2004; Thompson et al., 2009). The data, averaged across distraction condition, appear in Table 1.
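The conversion from detection responses to d′ and β referenced above follows the standard signal detection formulas d′ = z(H) − z(FA) and ln β = [z(FA)² − z(H)²]/2, where z is the inverse of the standard normal CDF, H is the hit rate, and FA is the false alarm rate. A minimal sketch in Python follows; the function name is ours, and the log-linear correction for extreme rates is an assumption (the paper does not specify how rates of 0 or 1 were handled).

```python
from math import exp
from statistics import NormalDist

def dprime_beta(hits, misses, false_alarms, correct_rejections):
    """Convert raw detection counts to d' (discriminability) and
    beta (criterion).

    Applies a log-linear correction (add 0.5 to each cell) so that
    hit/false-alarm rates of exactly 0 or 1 do not yield infinite
    z-scores; this correction is an assumption, not the authors' stated
    procedure.
    """
    h = (hits + 0.5) / (hits + misses + 1.0)
    fa = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    z_h, z_fa = z(h), z(fa)
    d_prime = z_h - z_fa                    # d' = z(H) - z(FA)
    beta = exp((z_fa**2 - z_h**2) / 2.0)    # ln(beta) = [z(FA)^2 - z(H)^2] / 2
    return d_prime, beta
```

For example, a participant with 15 hits, 1 miss, 1 false alarm, and 15 correct rejections has symmetric hit and false alarm rates, giving β ≈ 1 (an unbiased criterion) and a large positive d′.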
An ANOVA on primary task comprehension scores (percentage of answers correct) revealed non-significant effects of light position (p = .11) and noise distraction (p = .16), and a significant interaction (F(3, 84), p = .02). One-tailed t-tests were conducted on comprehension scores for questions preceded by lights at each position, separately, to test the hypothesis that comprehension scores would be higher in the absence of noise. The results of these tests showed that, compared to participants in the no distraction condition, participants in the noise distraction condition exhibited lower comprehension when lights had just appeared in the right position, t(28) = 1.71, p = .049, d = .62, and in the eyes position, t(28) = 2.05, p = .025, d = .75. Our previous comprehension results using the same speeches and videotape methodology showed no interaction between face position and distraction level. The interaction obtained in the present study may have resulted from participants’ greater focus on visible speech and facial cues to meaning in conditions of high noise distraction, in order to compensate for reduced speech information. The preceding light stimulus at the mouth and eye regions of the face may have promoted this compensatory strategy more so than when the light stimuli were in the right and left positions.
The question of which areas of a talker’s face receive greater attention during face-to-face discourse understanding has been studied with different videotaped methods. Nicholls, Searle, and Bradshaw (2004) covered up the right (in the viewer’s left hemi-space) or left (viewer’s right hemi-space) side of a talker’s face, dubbed auditory speech syllables (“ba”) onto both normal-orientation and mirror-reversed video images of mouth movements (“ga”), and presented listeners with the bimodal syllables for identification. McGurk illusions of “da” (McGurk & MacDonald, 1976) were higher when the left, as compared to the right, side of the talker’s face was covered, in both the normal and mirror-reversed conditions. Because covering the left side exposes the side of the face that naturally exhibits greater amplitude and velocity of motion (Wolf & Goodale, 1987), this result suggests that listeners’ attention is biased toward the talker’s right side (viewer’s left hemi-space). This is the opposite of the pattern found in the present study, as well as in our past research. We attribute the difference in results across laboratories to the varying attention strategies utilized in syllable classification tasks employing McGurk-illusion stimuli versus in lengthier discourse processing tasks such as those employed in our studies (cf. Thompson et al., 2009).
The face-light dual-task paradigm results from the present study show the same pattern of attention distribution across face regions as the results from our recent research using videotaped speech stimuli, when the speeches were spoken with neutral affect. It is not known if our previous results using the emotionally expressive versions of the speeches (Thompson et al., 2009) would also be replicated using the methodology from the present study. In our previous study, the opposite effects were obtained in the emotional prosody condition, specifically, greater attention in the left versus the right hemifield and at the eyes, compared to the mouth region. Future research will address how facial and prosodic cues to emotion conveyed by the talker affect the listener’s attention distribution.
The face-light dual-task paradigm may also be utilized in other creative ways to study many interesting facets of attention during human communication. Because the researcher is not constrained to using videotaped stimuli, the present methodology can be implemented in studying attention during much more spontaneous and interactive forms of discourse between two or more conversational partners. Two individuals, sitting directly opposite each other in comfortable chairs, can each wear the face-light apparatus, thus presenting the face-light stimuli to each other while actively engaged in any type of dialogue that the researcher wishes to study. Currently, we are conducting research in our laboratory addressing how listeners’ trait anxiety affects attention during discourse understanding. Similarly, this method provides a new structure for studying the attentional patterns of those with autism or Asperger’s syndrome, whose deficiencies in communication and affective understanding (Frith, 2001) may lead them to respond differentially to face-to-face stimuli. Furthermore, the capabilities of the technology can be vastly extended into a variety of new research domains by controlling the nature of the digital video images being projected onto the surrounding walls. However, while our new paradigm provides opportunities for investigating attention during communication within a broad array of contexts, researchers must be mindful of the potential loss of experimental control over some variables. As just one example, if the speaker’s eyes shifted direction, prompting the listener to shift attention in the direction of the speaker’s gaze or elsewhere in the environment, the present version of our experimental paradigm would have no way of tracking this type of variability.
Nevertheless, we believe that the capability our methodology affords for investigating cognitive processing during dynamic, face-to-face interaction between individuals within a shared physical context facilitates progress toward Gibson’s goal of making laboratory research more life-like.
The research was funded by NSF grant 0421502, NIH NIGMS grant (#3S06 GM008136), and the NIH MBRS RISE grant GM61222. We thank Irma Diaz and Elizabeth Prather for their assistance with the study.
Laura A. Thompson, Ph.D. (1987), University of California, Santa Cruz. She is a Professor of Psychology at New Mexico State University. Daniel M. Malloy, Ph.D. (2006), John M. Cone, and David L. Hendrickson were all graduate students in the NMSU Psychology Department when the study was conducted.