This paper provides the first demonstration that the content of what a talker says is sufficient to imbue the acoustics of his voice with affective meaning. In two studies, participants listened to male talkers utter positive, negative, or neutral words. Next, participants completed a sequential evaluative priming task where a neutral word spoken by one of the same talkers was presented before each target word to be evaluated. We predicted, and found, that voices served as evaluative primes that influenced the speed with which participants evaluated the target words. These two experiments demonstrate that the human voice can take on affective meaning merely based on the positive or negative value of the words uttered by that voice. Implications for affective processing, the pragmatics of communication, and person-perception are discussed.
Imagine picking up the phone and hearing the warm, soothing tone of a loved one's voice. The mere sound can make you feel pleasant and happy. Now imagine hearing the voice of a disliked political figure on the radio. The sound alone can make you feel so unpleasant that you turn off the radio, even if you want to be informed about the latest political news. Despite the fact that humans are born with the innate capacity to have these sorts of pleasant and unpleasant reactions to objects and people in our environments, we must, for the most part, learn whether another person is friend or foe. In other words, people's voices, like their faces and other defining personal attributes, must acquire affective value in order to have the capacity to influence our internal states. Once people's voices have acquired affective value, simply hearing those voices is likely enough to influence our perception of and interaction with people. Understanding how learning about others proceeds is critical to understanding the mechanisms of person perception. In the present report we provide the first evidence that human vocal acoustics can acquire affective value based on the affectively valenced content of a talker's speech.
People can quickly identify others by the sounds of their voices. Human voices are richly endowed with acoustic cues to identity, along with other markers for biological sex and age (e.g., Bachorowski & Owren, 1995; Ladefoged & Broadbent, 1957; Owren & Bachorowski, 2003; Owren & Rendall, 1997), and much of this information is carried by vowel (or vowel-like) sounds (Bachorowski & Owren, 1999; Owren & Cardillo, 2006; also see Dudley, 1939; Traunmüller, 1994). According to the affect-conditioning (Owren & Rendall, 1997) or affect-induction model (Owren, Rendall, & Bachorowski, 2005), identity-specific vocal signals acquire meaning for the person hearing the signal (i.e., the perceiver) by influencing the perceiver's affective state. In non-human primates, a distress call from a conspecific does not intrinsically signal or represent threat, but takes on negative meaning (e.g., the presence of a threat) because the acoustics of the call induce a negative affective state in the perceiver. A similar process is presumed to be at work in humans: a human laugh is pleasant because the sound of laughter induces a positive affective state in the perceiver. Similarly, a human shout or a yell derives its value as threatening, at least in part, because the acoustics of the vocalization induce a negative affective state in the perceiver. In humans, then, it should be possible for a talker's voice to take on negative meaning when it is paired with a negative affective state in the perceiver (a listener).
In this report, we present the first evidence that human voices can acquire affective meaning based on the affective state created by the content spoken by those voices. Despite the large literature on affective learning in humans (for reviews, see De Houwer, Thomas, & Baeyens, 2001, and Delgado, Olsson, & Phelps, 2006), never before have talkers' unique vocal acoustics served as conditioned stimuli, while the content of their speech (i.e., the words spoken) served as unconditioned stimuli of known value. In the present studies, participants listened to male talkers utter positive, negative, or neutral words in a neutral tone. We reasoned that listening to a talker utter a series of valenced words should be sufficient to induce a similarly valenced affective state in the listener (i.e., hearing a positive word, such as “lucky”, would be enough to induce a mild positive change in the affective state of the perceiver). As the talker utters many valenced words, the acoustics of his or her voice will become associated with that affective state. Over a number of repetitions, the voice should come to acquire positive (or negative) value, such that subsequent vocal signals from that talker will induce a positive (or negative) affective state in the listener. In this way, the talker's vocal acoustics come to acquire affective value, such that his or her voice will be experienced as pleasant (or unpleasant) on future occasions. That person's voice will have affective value on future occasions even if the semantic content uttered by the person is devoid of affective meaning (i.e., neutral words).
To assess affective learning for voices, we examined the extent to which talkers' voices served as affective primes in a sequential evaluative priming paradigm. In this procedure, participants categorize a series of target stimuli (e.g., words) as being positive or negative. Prior to the presentation of each target word, another stimulus (the prime) is quickly presented. Valenced, as opposed to neutral, primes influence the speed with which targets are evaluated. Two patterns of response latencies are typically observed: assimilation and reverse priming effects. Most typically, primes that match the target in valence (e.g., positive prime, positive target) serve to speed the target judgment, while primes that do not match the target in valence (e.g., positive prime, negative target) slow the target judgment (e.g., Fazio, Sanbonmatsu, Powell, & Kardes, 1986; Bargh, Chaiken, Govender, & Pratto, 1992). These are called assimilation priming effects. In some cases, however, reverse priming effects emerge such that primes that match the target in valence (e.g., positive prime, positive target) serve to slow the target judgment, whereas primes that do not match the target in valence (e.g., positive prime, negative target) serve to speed the target judgment (for a review and discussion, see Klauer, Teige-Mocigemba, & Spruyt, 2009; Glaser, 2003). Reverse priming effects are thought to emerge when the influence of the prime falls outside the window during which the target is evaluated, but in close enough temporal proximity that it is able to engender an evaluative stance (Klauer, Teige-Mocigemba, & Spruyt, 2009). In other words, when the time between the onset of the prime and the onset of the target is not short, reverse priming is likely to be observed. Given that the time between prime and target onset was necessarily long in this experiment, we thought it possible that we might observe reverse priming rather than assimilation priming effects.
Whether assimilation or reverse priming effects emerge, the degree to which the prime stimulus serves to influence the speed of target judgments essentially indexes the prime's affective value (as positive or negative). Using this logic, we reasoned that if talkers' voices acquired affective value because they spoke evocative words at an earlier point in time, then the mere sound of their voices (when later uttering neutral words) would serve to change the speed of evaluation during a subsequent evaluative priming task.
We present two experiments to test the hypothesis that human voices can acquire affective meaning based on the content of the words spoken by those voices. Target talkers uttered positive, negative, or neutral words during an initial learning phase. Following learning, the vocal acoustics present in neutral words uttered by the same target talkers modulated the speed with which participants evaluated the valence of positive and negative words during a sequential evaluative priming procedure. In essence, the acoustics of a human voice acquired affective potency by virtue of the words it uttered.
Participants were 39 Boston College students (19 male, 20 female) who completed the study for either course credit or $15. Two additional participants were excluded because they did not comply with the task instructions insofar as they responded randomly during the evaluative priming test phase (e.g., pressing one key for most or all of the test phase).
Participants completed the affective learning task, which was implemented in E-Prime Version 1 on a Dell Pentium VI PC using a 17” CRT monitor. Participants wore Sennheiser headphones for the duration of the task and the volume of the auditory stimuli was maintained at an equal level for all participants. During the learning phase, participants heard a total of nine European-American male talkers. Each talker uttered only positive, negative, or neutral nouns and adjectives (i.e., three talkers uttered only positive words, three talkers uttered only negative words, and three talkers uttered only neutral words; see Appendix, Table 1 for words). Spoken words were selected from the Affective Norms for English Words database (ANEW; Bradley & Lang, 1999) based on their valence and were matched on their degree of arousal. Each talker uttered ten words. All talkers were native speakers of American English. The stimuli were recorded individually in a double-walled sound-deadened room using a Special Projects headworn microphone, and a Tascam DA-P1 digital audio recorder. An experimenter was present during each recording session, coaching the talkers as needed in speaking clearly at an appropriate amplitude level. Talkers were instructed to speak in a neutral tone and the tone of the recorded words was assessed to ensure neutrality by one of the co-authors (MO) who has expertise in assessing vocal acoustics. Recordings were re-digitized at 44.1 kHz and 16-bit wordwidth using Praat acoustics software (Boersma, 2001), amplitude-scaled to the full representational range, and edited so as to include 100-ms silence at the beginning and end of each file. Talker identity and word valence pairings were counterbalanced across participants. Each spoken word was approximately 1000-ms long with a 1000-ms inter-trial interval between words. Each word was played four times and words were played in a random order. 
Participants were instructed to listen carefully to the words that were spoken as later in the experiment they would be asked to select the words they had heard from a list that also included words they had not heard.
Immediately following the learning phase, participants completed a sequential evaluative priming phase, with the instructions that they would be judging words presented on the computer screen as being positive or negative. They were told to use two keys (i.e., the `p' and `q' on a standard QWERTY keyboard) which had been marked with `+' and `−' signs to indicate whether the target words were positive or negative. Response keys were counterbalanced across participants such that for half of participants the `p' key was used to indicate positive targets and for half of participants the `p' key was used to indicate negative targets. Participants were also told to keep their fingers on the response keys at all times and to make the most accurate judgments possible, while responding as quickly as possible. Finally, participants were instructed that they would hear a distracter word immediately before a target word, and to ignore the distracter while focusing on evaluating and categorizing the target word.
The primes were neutral words selected from the ANEW database (Bradley & Lang, 1999) and spoken by the same nine target talkers included in the learning phase (see Appendix for words). Each prime was played in its entirety (up to 1000 ms), followed by an inter-stimulus interval of 100 ms during which a fixation cross was displayed. Following the inter-stimulus interval, the target word appeared on the computer screen. Targets were selected from the word list in Bargh, Chaiken, Govender, and Pratto (1992) and were moderately positive and negative target words for which there was a high degree of consensus (60% or more of respondents agreed on the valence). These words appeared on the screen until the participant responded, or for a maximum of 3000 ms. If the participant did not register a response within the 3000-ms window, the trial was counted as an error and the program moved on to the next trial. The inter-trial interval was 4000 ms.
The evaluative priming procedure consisted of a total of 180 trials, of which 60 were potential congruent trials where a talker's voice was paired with target words that matched the learning words in valence (e.g., a talker who uttered the word “excitement” in the learning phase now uttered a neutral word “seat” which served as the potential prime to a valence-congruent word “kitten”). There were also 60 potential incongruent trials, in which a talker's voice was paired with target words that were opposite in valence to the words spoken in learning (e.g., “taxes”). Finally, there were 60 baseline trials where talkers who uttered neutral words during learning again uttered neutral words as potential primes for positive and negative target words.
Judgments and response latencies were recorded on each trial. Trials with reaction times less than 300 ms were removed because such reaction times are thought to occur largely as a result of anticipatory error (Bargh, Chaiken, Raymond, & Hymes, 1996). Reaction times for correct trials with reaction times greater than 300 ms (95% of all trials) were aggregated by trial type.
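The screening rule described above can be sketched as follows. This is a minimal illustration with made-up trial records; the report does not describe the original analysis software, so the data layout and function name are assumptions.

```python
# Hypothetical trial records: reaction time, accuracy, and trial type.
trials = [
    {"rt_ms": 250, "correct": True,  "trial_type": "congruent"},    # anticipatory
    {"rt_ms": 640, "correct": True,  "trial_type": "incongruent"},
    {"rt_ms": 710, "correct": False, "trial_type": "baseline"},     # error
    {"rt_ms": 820, "correct": True,  "trial_type": "incongruent"},
]

def aggregate_rts(trials, min_rt_ms=300):
    """Drop anticipatory (< 300 ms) and error trials, then average
    the remaining reaction times by trial type."""
    sums, counts = {}, {}
    for t in trials:
        if t["rt_ms"] < min_rt_ms or not t["correct"]:
            continue
        sums[t["trial_type"]] = sums.get(t["trial_type"], 0) + t["rt_ms"]
        counts[t["trial_type"]] = counts.get(t["trial_type"], 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

print(aggregate_rts(trials))
# → {'incongruent': 730.0}
```

In this toy example, the anticipatory trial and the error trial are excluded, leaving only the two valid incongruent trials to be averaged.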
Immediately following the evaluative priming phase, participants completed an explicit evaluation phase. Participants heard each of the nine talkers from the learning phase utter a sequence of twenty neutral words (the ten neutral words from the learning phase, plus the ten neutral prime words). Words were played in a random order. At the end of the sequence of words, participants were asked to rate the speaker's voice on a scale of 1–5 where 1 equaled very unpleasant, 2 equaled unpleasant, 3 equaled neutral, 4 equaled pleasant and 5 equaled very pleasant. Once a rating had been keyed the program advanced to the next speaker's word series.
Our a priori hypothesis, that voices acquired affective value after speaking valenced words, corresponded to a test of the interaction between prime and target valence. We conducted a set of doubly centered contrasts that allowed us to examine the specific effect of the interaction, independent of main effects (see Abelson & Prentice, 1997). A doubly centered interaction contrast removes the contribution of main effects and the grand mean from the cell means to specifically examine the pattern of the interaction. This is comparable to creating an interaction term from centered variables in multiple regression (Aiken & West, 1991). We conducted doubly centered contrasts on the reaction time data that compared value-congruent prime-target trials (i.e., voices paired with positive words during learning as primes for positive targets; voices paired with negative words during learning as primes for negative targets) to value-incongruent prime-target trials (i.e., voices paired with positive words during learning as primes for negative targets; voices paired with negative words during learning as primes for positive targets). To assess positive and negative affective learning specifically, we ran two similar sets of doubly centered contrasts (one set for positive affective learning and one set for negative affective learning) that compared all three valenced prime types (positive, negative, neutral) for a given target type.
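The double-centering step can be sketched numerically. In this minimal example a 2 × 2 table of mean reaction times (prime valence × target valence) is centered by removing row means, column means, and the grand mean; the cell values are illustrative, not the paper's data.

```python
# Illustrative mean RTs (ms): rows = prime valence, columns = target valence.
cells = [
    [650.0, 700.0],  # primes from positive-word talkers: pos, neg targets
    [690.0, 660.0],  # primes from negative-word talkers: pos, neg targets
]

def doubly_center(cells):
    """Subtract each row mean and column mean and add back the grand mean,
    leaving only the interaction pattern (main effects removed)."""
    n_rows, n_cols = len(cells), len(cells[0])
    grand = sum(sum(row) for row in cells) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in cells]
    col_means = [sum(cells[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    return [[cells[i][j] - row_means[i] - col_means[j] + grand
             for j in range(n_cols)] for i in range(n_rows)]

print(doubly_center(cells))
# → [[-20.0, 20.0], [20.0, -20.0]]
```

Every row and every column of the centered table sums to zero, so any remaining pattern is pure interaction; the contrast then tests that residual pattern (here, a crossover consistent with reverse priming) against zero.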
Raw aggregated reaction times, doubly centered mean reaction times, and standard errors from the evaluative priming procedure are presented in Table 1.
The pattern of reaction times indicated that reverse priming occurred. Participants were significantly faster to evaluate target words after hearing a voice speaking neutral words when that voice had spoken oppositely valenced words during the learning phase, F(1, 76) = 18.13, p < .01. For example, for a voice that spoke the word “excitement” during learning, hearing it utter the neutral word “seat” slowed the judgment of “kitten” as “positive” relative to the judgment of “taxes” as “negative.”
This reverse priming effect was driven primarily by negative affective learning. Voices that previously uttered negative words during learning slowed the judgment of negative targets when compared to the influence of voices that previously spoke neutral or positive words, and significantly sped the evaluation of positive targets compared to those other voices, F(1, 76) = 33.20, p < .01. Voices that spoke positive words during learning did not, however, differentially influence the judgment of positive targets when compared to the effect of voices that spoke neutral or negative words during learning, F(1, 76) = 2.60, ns, suggesting that talkers' vocal acoustics did not acquire positive affective value by speaking a series of positive words.
There were no differences in the explicit valence ratings of voices, F(2, 76) = .14, p = .87. In fact, voices that spoke positive words during learning (M = 2.98, SD = .65), voices that spoke negative words during learning (M = 3.05, SD = .59), and voices that spoke neutral words during learning (M = 3.01, SD = .57) were all rated as being neutral (i.e., not different from “3” using one-sample t-tests); all ts < .5, ps > .6. These findings suggest that the evaluative priming findings were not the result of strategic, explicit responding.
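The midpoint comparison above can be sketched with a hand-rolled one-sample t statistic. The ratings below are hypothetical, and the paper does not specify its analysis software; this simply shows the test of whether mean ratings differ from the scale midpoint of 3 ("neutral").

```python
import math
import statistics

def one_sample_t(ratings, mu=3.0):
    """t statistic testing whether the mean rating differs from mu
    (here mu = 3, the 'neutral' midpoint of the 1-5 pleasantness scale)."""
    n = len(ratings)
    m = statistics.mean(ratings)
    sd = statistics.stdev(ratings)  # sample SD (n - 1 denominator)
    return (m - mu) / (sd / math.sqrt(n))

# Hypothetical ratings hovering around the neutral midpoint:
ratings = [3, 3, 2, 4, 3, 3]
print(one_sample_t(ratings))
# → 0.0
```

A t near zero (as here, where the mean is exactly 3) is consistent with the reported finding that all voice types were rated as neutral.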
These findings demonstrate that talkers' vocal acoustics acquired affective value based on the semantic content of their speech. This learning occurred despite the fact that pre-testing showed that listeners could not distinguish among the talkers with reliability and all voices were judged to be neutral during the explicit evaluation phase. Yet in this study, the acoustic properties of talkers' voices (when speaking neutral words) were sufficient to influence the evaluation of target words.
We further hypothesized that we observed stronger negative, as compared to positive, affective learning because the negative words spoken during the learning phase were more likely to be experienced as self-relevant by the university students who were our participants (e.g., hangover, smoking, cavities, divorce, taxes) and therefore subserved robust negative affective learning; in contrast, the positive words were less self-relevant (e.g., kitten, butterfly, strawberries, money, silk). In Experiment 2, we therefore recorded talkers uttering positive, negative, and neutral adjectives that were all likely to be experienced as self-relevant by the listeners because they can be used to describe characteristics of people. In addition, we increased the number of words each talker uttered from 10 to 20 in order to increase the likelihood that at least some of the words spoken in each valence type would be perceived as highly self-relevant by participants. We also increased the number of times each word was spoken during learning (from 1 to 4) to increase the amount of experience the listeners had with each talker's voice.
Participants were 29 Boston College students (10 male, 19 female) who completed the experiment for either course credit or $15. Two additional participants were excluded because of technical problems and one additional participant was excluded because that person did not comply with the task instructions.
The materials and procedure for Experiment 2 were nearly identical to Experiment 1 with exceptions as indicated. During the learning phase, participants heard a total of six European-American male talkers. Each talker uttered only positive, negative, or neutral adjectives referring to personal characteristics (i.e., two talkers uttered only positive descriptive adjectives, two talkers uttered only negative descriptive adjectives and two talkers uttered only neutral descriptive adjectives; see Appendix for words). Each talker uttered twenty words. Immediately following the acquisition phase, participants completed a sequential evaluative priming phase which was identical to the sequential evaluative priming procedure from Experiment 1, with one exception. The prime stimuli were neutral words spoken by the same six talkers who uttered either positive, negative, or neutral words during the affective learning task (see Appendix, Table 2 for words). As in Experiment 1, reaction times for correct trials with reaction times greater than 300 ms (93% of all trials) were aggregated by trial type. No explicit evaluations of the voices were collected.
As in Experiment 1, we used a series of doubly centered contrasts to evaluate the reaction-time data. Raw aggregated reaction times, doubly centered mean reaction times, and standard errors are presented in Table 2. As in Experiment 1, we observed reverse priming effects. Participants were significantly slower to evaluate target words after hearing a voice that had spoken similarly valenced words during the learning phase when compared to hearing a voice that had spoken oppositely valenced words during learning, F(1, 56) = 191.65, p < .01, indicating that voices acquired affective value based on the valence of the words spoken during the learning phase.
Further contrasts indicated that both negative and positive affective learning occurred. Voices that previously spoke negative words during learning slowed the judgment of negative targets when compared to voices that previously spoke neutral or positive words, and significantly sped the evaluation of positive targets compared to those other voices, F(1, 56) = 109.90, p < .01. Voices that spoke positive words during learning slowed the judgment of positive targets when compared to voices that spoke neutral or negative words during learning, and significantly sped the evaluation of negative targets compared to those other voices, F(1, 56) = 182.11, p < .01. Experiment 2 thus demonstrated that hearing talkers utter positive and negative adjectives was sufficient to imbue their voices (specifically, their utterances of neutral words) with positive and negative meaning.1
In two experiments, we demonstrated that a human voice becomes affectively potent and acquires the capacity to influence behavior simply based on the content of the words spoken by that voice on a previous occasion. These studies demonstrate, for the first time, that a perceiver's (or listener's) responses to a particular talker can be influenced by the literal sound of that talker's voice based on the affective connotations of the words that talker produced in the past.
One possibility is that learning was enhanced by subtle affective cues present in the talkers' pronunciation of words during the learning phase. In this view, when talkers spoke positive words their vocal acoustics were positive in affective tone simply because they were speaking words with positive connotations; these tonal attributes served to enhance the degree of positivity of the spoken words. We cannot rule out this possibility because we did not collect affective ratings of the talkers' vocal acoustics prior to learning, so as not to bias perception of the voices during learning or test. Nevertheless, even if learning was enhanced by subtle affective cues present in the voices during learning, the words spoken during test were neutral in content and tone. As a result, we can conclude that the priming effects were based on affective value acquired by the identity-linked vocal acoustics of each talker.
Interestingly, the pattern of reaction times was reversed from those traditionally observed in evaluative priming. There are two possible explanations for this pattern of results: that reverse learning occurred (i.e., hearing a voice speak positive words led to that voice acquiring negative value) or that reverse priming occurred (i.e., voices that had acquired positive value facilitated the judgment of negative targets). A secondary index of the voices' acquired affective value would be required to rule out the possibility of reverse learning. In Experiment 1, we assessed participants' explicit evaluations of all of the voices, but all voices were judged to be neutral. Rather than allowing us to rule out the possibility of reverse learning, this finding suggests that the evaluative priming effects were not the result of strategic responding based on explicitly recalled valence information about the voices. In future studies, additional measures of voice valence should be included in order to rule out the possibility of reverse learning.
Reverse priming, or contrast, effects have been documented in a diverse set of evaluative priming experiments (for reviews, see Klauer, Teige-Mocigemba, & Spruyt, 2009; Glaser, 2003). A new theoretical approach proposed by Klauer and colleagues (2009) suggests that reverse priming effects occur when prime stimuli fall outside of the time window during which target stimuli are evaluated, but in close temporal proximity to that window. In other words, there is a critical, early evaluation window during which traditional priming effects (called assimilation effects) emerge, wherein the prime occurs in the window during which the target is evaluated. Immediately following the assimilation priming evaluation window is a temporally later window during which reversed priming is observed. In this window, the prime has already activated an evaluative stance but is too temporally distant to exert direct influence on the evaluation of the target. Finally, after the reverse priming evaluation window, the effect of the prime eventually dissipates altogether and no priming effects are observed. The specific timing of the different evaluation windows depends on the properties of the primes and targets and their relative timing (the stimulus onset asynchrony, or SOA: the time between the onset of the prime and the onset of the target).
One way to vary the position of evaluation windows is by varying the SOA between the prime and the target. Typical SOAs in evaluative priming experiments that result in traditional assimilation priming effects are around 300 ms (e.g., Bargh et al., 1992). A good deal of research has demonstrated that priming effects are quite sensitive to the SOA, although the pattern of results is somewhat unclear. In some experiments priming effects appeared at a 300-ms but not a 1000-ms SOA (e.g., Hermans, De Houwer, & Eelen, 1994), while in other cases priming effects emerged at 0- and 100-ms SOAs and reverse priming was documented at a 1200-ms SOA (e.g., Klauer, Robnagel, & Musch, 1997). In the present experiment the SOA was necessarily long because our prime stimuli were full spoken words (around 1 second in length), which required a longer time to play than is typical for visually presented prime stimuli; additionally, a fixation cross was presented for 100 ms between the prime and target to cue participants that the target was about to be presented.2 Given this long SOA and the fact that the target evaluation window was likely cued by the inter-trial-interval fixation cross, it is possible, even probable, that the prime stimulus fell outside of the target's evaluative window, resulting in reverse priming effects. Regardless of the mechanism underlying the effects, the fact that the voices speaking neutral words served to modulate the speed with which target words were evaluated clearly indicates that the acoustics of those voices acquired affective value.
The present findings have important implications for understanding the mechanisms of person perception. Little empirical research has addressed how vocal cues influence complex social judgments made about others or how vocal cues influence social interactions. One recent study suggests that cues from voices may have a large impact on social perception-- women whose voices were judged to be more feminine were associated to a greater degree with highly stereotypic female descriptions (Ko, Judd, & Blair, 2006). We know that much information can be extracted from the human voice--not only can people extract information about a speaker's sex and age (e.g., Bachorowski & Owren, 1995; Ladefoged & Broadbent, 1957; Owren & Bachorowski, 2003; Owren & Rendall, 1997), but also information about a speaker's personality (Scherer, 1979), emotional state (see Russell, Bachorowski, & Fernandez-Dols, 2003, for a review), attractiveness (Zuckerman & Miyake, 1993; Berry, 1992), maturity (Berry, 1992), and even probable occupation (Yamada, Hakoda, Yuda, & Kusuhara, 2000). Despite evidence that such information can be extracted from human voices, how vocal acoustics come to have such meaning has not been addressed. While it is possible that some types of human vocal communication are innately pleasant (e.g., the sound of laughter; Owren & Bachorowski, 2003), it is likely that the pairings between particular vocal acoustics and complex social constructs such as occupation are learned through experience. The present findings suggest that such learning is not only possible, but occurs through relatively mild and limited experience (hearing a disembodied voice speak 10 or 20 valenced words) and is highly specific (individual speakers' voices could not be explicitly distinguished, yet still acquired specific affective value). How such learning influences subsequent interpersonal interactions is a fruitful avenue for future research.
The extent to which the affect-inducing properties of a voice become context-independent, change over time, and explicitly influence person perception is a potentially fruitful avenue of future research. For example, if a given talker routinely induces negative affect in others by predominantly discussing topics that are negative or unpleasant, listeners may become more likely to attribute negative intentions to that individual even on occasions when the linguistic content of the speech is neutral or even positive. Conversely, if a talker's voice quality elicits generally positive affective responses through associations with positive content, then listeners will be more likely to make positive attributions. These pragmatic considerations are important in everyday linguistic interactions, where a listener's understanding of the significance of a communicative interaction is highly dependent on perceptions of a talker's communicative intentions. Such pragmatic considerations may become even more important in long-term relationships that are marked by repeated, socially significant interactions between two parties.
Preparation of this manuscript was supported by an NSF grant (BCS 0527440), an NIMH Independent Scientist Research Award (K02 MH001981), and an NIH Director's Pioneer Award (DP1OD003312) to Lisa Feldman Barrett; NIMH Prime Award 1 R01 MH65317-01A2 and Subaward 8402-15235-X to Michael Owren; and by the Center for Behavioral Neuroscience, STC Program of the National Science Foundation under Agreement No. IBN-9876754. The authors wish to thank Ray Becker for his help with the audio recordings, Lauren Brennan and William Shirer for their help with data collection, and Karl Christoph Klauer and Jan De Houwer for comments on an earlier version of this manuscript.
Appendix Table 1 (Experiment 1)
Acquisition Phase Words
Sequential Evaluative Priming Phase Words: Spoken Primes | Visually Presented Targets
Appendix Table 2 (Experiment 2)
Acquisition Phase Words
Sequential Evaluative Priming Phase Words: Spoken Primes | Visually Presented Targets
1To assess whether there was variation in the extent to which the individual voices acquired affective value, we considered talker identity as a within-subjects factor in a series of subsequent ANOVAs and follow-up t-tests. There were no significant effects of talker identity on response time to judge target words.
2We added this fixation cross after a pilot during which response latencies were long because participants were looking away from the screen when the target appeared.