When we communicate vocally, it is often not just what we say – but how we say it – that matters. For example, in expressing joy our voices become increasingly melodic, whereas sadness is more often voiced flatly and monotonically. Such prosodic aspects of speech precede formal language acquisition, reflecting the evolutionary importance of communicating emotion (Fernald, 1989).
Vocal communication of emotion results from gestural changes of the vocal apparatus that, in turn, cause collinear alterations in multiple features of the speech signal such as pitch, intensity, and voice quality. Relatively distinct patterns of these acoustic cues differentiate specific emotions (Banse and Scherer, 1996; Cowie et al., 2001; Juslin and Laukka, 2003). For example, anger, happiness, and fear are typically characterized by high mean pitch and voice intensity, whereas sadness expressions are associated with low mean pitch and intensity. Anger and happiness expressions also typically have large pitch variability, whereas fear and sadness expressions have small pitch variability. Regarding voice quality, anger expressions typically carry a large proportion of high-frequency energy in the spectrum, whereas sadness carries less (as the proportion of high-frequency energy increases, the voice sounds sharper and less soft). We present the first study to experimentally examine the neural correlates of these acoustic cue-dependent perceptual changes.
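To make these cue definitions concrete, below is a minimal sketch (not the authors' analysis code) of how the cues in question – mean pitch, pitch variability, mean intensity, and the proportion of high-frequency spectral energy – could be measured for a single recording; the file name, parameter values, and use of the librosa library are illustrative assumptions.

```python
# Illustrative only: measuring the acoustic cues described above for one stimulus.
# Assumes a hypothetical file "stimulus.wav" and the librosa library.
import numpy as np
import librosa

y, sr = librosa.load("stimulus.wav", sr=None)      # keep the native sampling rate

# Pitch cues: fundamental frequency (F0) tracked with the pYIN algorithm
f0, voiced_flag, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)
f0 = f0[~np.isnan(f0)]                              # voiced frames only
mean_f0, f0_sd = f0.mean(), f0.std()                # mean pitch and pitch variability (F0SD)

# Intensity cue: mean RMS energy across frames
mean_intensity = librosa.feature.rms(y=y).mean()

# Voice-quality cue: spectral energy above vs. below 500 Hz
power = np.abs(librosa.stft(y)) ** 2                # power spectrogram (default n_fft = 2048)
freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
hf_ratio = power[freqs >= 500].sum() / power[freqs < 500].sum()

print(f"mean F0 = {mean_f0:.1f} Hz, F0SD = {f0_sd:.1f} Hz, "
      f"intensity = {mean_intensity:.4f}, HF/LF ratio = {hf_ratio:.2f}")
```

Dedicated phonetics software such as Praat is equally common for these measurements; the point is simply that each cue reduces to a summary statistic of the speech signal.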
We employed a parametric design, using emotional vocal stimuli with varying degrees of acoustic cue salience to create graded levels of stimulus-driven prosodic ambiguity. A vocal stimulus with high cue salience has high levels of the acoustic cues typically associated with the vocal expression of a particular emotion and presents an acoustic signal rich in affective information, whereas a vocal stimulus with low cue salience has low levels of the relevant acoustic cues and is more ambiguous. We constructed a four-choice vocal emotion identification task (anger, fear, happiness, and no expression) to examine how acoustic cue level affects affective prosodic comprehension. As our independent variable, we used the acoustic cue that best correlated with performance on the emotion identification task – this cue served as a proxy for “cue salience”. For happiness and fear, we used pitch variability – the standard deviation of the fundamental frequency (F0SD) – as the cue salience proxy, and for anger we used the proportion of high-frequency spectral energy (i.e., the ratio of energy above vs. below 500 Hz). These cues are important predictors of recognition of the respective emotions (Banse and Scherer, 1996; Juslin and Laukka, 2001; Leitman et al., 2008), and pitch variability and spectral energy ratios are important for emotion categorization (Ladd et al., 1985; Juslin and Laukka, 2001; Leitman et al., 2008).
For each emotion, our stimulus set contained items exhibiting a wide range of the emotion-relevant cue. We then examined behavioral performance and brain activation parametrically, within each emotion, as a function of this cue-level change across items. We hypothesized that variation in cue salience would be reflected in activation levels within a reciprocal temporo-frontal neural circuit, as proposed by Schirmer and Kotz (2006) and others (Ethofer et al., 2006). Using pitch variability (F0SD) as the proxy for cue salience in fear and happiness allowed a further differentiation: salience-related performance increases are expected to correlate positively with F0SD for happy stimuli but negatively with F0SD for fear stimuli. Therefore, a similar activation pattern for increasing cue salience in both happiness and fear would suggest that the observed activation relates to emotional salience, as predicted, rather than to pitch variation alone.
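The logic of this proxy selection can be illustrated with a short, purely hypothetical sketch (simulated values, not study data): for each emotion, per-stimulus cue levels are correlated with group identification accuracy, and the sign of the correlation is expected to differ between happiness and fear.

```python
# Hypothetical illustration of the cue-salience proxy logic; all values are simulated.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def cue_accuracy_correlation(cue_levels, accuracy):
    """Pearson correlation between per-stimulus cue level and identification accuracy."""
    return pearsonr(cue_levels, accuracy)

# Simulated F0SD values (Hz) and identification accuracy for 20 happy and 20 fear items
f0sd_happy = rng.uniform(10, 80, 20)
acc_happy = 0.50 + 0.005 * f0sd_happy + rng.normal(0, 0.05, 20)   # accuracy rises with F0SD
f0sd_fear = rng.uniform(10, 80, 20)
acc_fear = 0.90 - 0.005 * f0sd_fear + rng.normal(0, 0.05, 20)     # accuracy falls with F0SD

print("happy:", cue_accuracy_correlation(f0sd_happy, acc_happy))  # expected r > 0
print("fear: ", cue_accuracy_correlation(f0sd_fear, acc_fear))    # expected r < 0
```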
The proposed temporo-frontal network that we expect to be modulated by changes in cue salience is grounded in prior lesion and neuroimaging research. Initial lesion studies (Ross et al., 1988; Van Lancker and Sidtis, 1993; Borod et al., 1998) linked affective prosodic processing broadly to right-hemisphere function (Hornak et al., 1996; Ross and Monnot, 2008). More recent neuroimaging studies (Morris et al., 1999; Adolphs et al., 2001; Wildgruber et al., 2005; Ethofer et al., 2006; Wiethoff et al., 2008) related prosodic processing to a distributed network including posterior aspects of the superior and middle temporal gyri (pSTG, pMTG), the inferior frontal (IFG) and orbitofrontal (OFC) gyri, and subcortical regions such as the basal ganglia and amygdala. In current models (Ethofer et al., 2006; Schirmer and Kotz, 2006), affective prosodic comprehension is parsed into multiple stages: (1) elementary sensory processing, (2) temporo-spectral processing to extract salient acoustic features, (3) integration of these features into an emotional acoustic object, and (4) evaluation of the object for meaning and goal relevance. Together these processing stages comprise a circuit with reciprocal connections between nodes.
Prior neuroimaging studies compared prosodic vs. non-prosodic tasks (e.g., Mitchell et al., 2003) or prosodic identification of emotional vs. neutral stimuli (e.g., Wiethoff et al., 2008), and thereby identified a set of brain regions likely involved in affective prosody. Based on knowledge of the functional roles of temporal cortex and the IFG (‘reverse inference’; Poldrack, 2006; Van Horn and Poldrack, 2009), it was assumed that temporal cortex mediates sensory-integrative functions while the IFG plays an evaluative role (Ethofer et al., 2006; Schirmer and Kotz, 2006). However, these binary ‘cognitive subtraction’ designs did not permit a direct demonstration of the distinct roles of temporal cortex versus the IFG.
Our parametric design, using stimuli varying in cue salience to create varying levels of stimulus-driven prosodic ambiguity, has two major advantages over these prior designs. First, analysis across varying levels of an experimental manipulation allows more robust and interpretable links between activation and the manipulated variable than a binary comparison does. Second, the parametric manipulation of cue salience should produce a dissociation in how sensory vs. evaluative regions relate to the manipulated cue level. This allows direct evaluation of the hypothesis that the IFG plays an evaluative role distinct from the sensory-integrative role of temporal cortex.
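As a concrete illustration of what such a parametric analysis involves (a hypothetical sketch assuming the nilearn library, with invented onsets and salience values, not the authors' pipeline), each stimulus event can enter the fMRI design matrix twice: once with unit amplitude, capturing the mere presence of a stimulus, and once weighted by its mean-centred cue-salience value, capturing activation that scales with cue level.

```python
# Hypothetical sketch of a parametric (cue-salience-modulated) fMRI regressor using nilearn.
import numpy as np
import pandas as pd
from nilearn.glm.first_level import make_first_level_design_matrix

tr, n_scans = 2.0, 200
frame_times = np.arange(n_scans) * tr               # acquisition times in seconds

# Invented stimulus onsets and per-stimulus cue-salience values (e.g., F0SD)
onsets = np.arange(10.0, 380.0, 20.0)
salience = np.random.default_rng(0).uniform(10, 80, len(onsets))

main = pd.DataFrame({"trial_type": "happy", "onset": onsets,
                     "duration": 1.5, "modulation": 1.0})
parametric = pd.DataFrame({"trial_type": "happy_x_salience", "onset": onsets,
                           "duration": 1.5,
                           "modulation": salience - salience.mean()})  # mean-centred modulator

events = pd.concat([main, parametric], ignore_index=True)
design = make_first_level_design_matrix(frame_times, events, hrf_model="glover")
print(design.columns.tolist())  # constant-amplitude regressor, parametric regressor, drift terms
```

A binary subtraction design would stop at the constant-amplitude regressor; the parametric regressor is what allows activation to be tied directly to graded cue salience.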
We hypothesized that during a simple emotion identification task, high levels of affectively salient cues within the acoustic signal would facilitate the extraction and integration of those cues into a percept, and that this facilitation would be reflected in increased temporal cortex activation. We also hypothesized that increased cue salience would correlate with amygdala activation: amygdala activation correlates with perceived intensity in non-verbal vocalizations (Fecteau et al., 2007; Bach et al., 2008b), and such activity may reflect automatic affective tagging of the stimulus intensity level (Bach et al., 2008a). Conversely, we predicted that decreasing cue salience would be associated with increasing IFG activation, reflecting increased evaluation of the stimuli for meaning (Adams and Janata, 2002) and greater difficulty in selecting the appropriate emotion (Thompson-Schill et al., 1997). We thus expected increased activation in this evaluation and response-selection region (IFG) to be directly associated with decreased activity in feature extraction and integration regions (pSTG and pMTG). In sum, our parametric design aimed to characterize a reciprocal temporo-frontal network underlying prosodic comprehension and to examine how activity within this network changes as a function of cue salience.