Individuals with schizophrenia show reliable deficits in the ability to recognize emotions from vocal expressions. Here, we examined emotion recognition ability in 23 schizophrenia patients relative to 17 healthy controls using a stimulus battery with well-characterized acoustic features. We further evaluated performance deficits relative to ancillary assessments of underlying pitch perception abilities. As predicted, patients showed reduced emotion recognition ability across a range of emotions, which correlated with impaired basic tone matching abilities. Emotion identification deficits were strongly related to pitch-based acoustic cues such as mean and variability of fundamental frequency. Whereas healthy subjects’ performance varied as a function of the relative presence or absence of these cues, with higher cue levels leading to enhanced performance, schizophrenia patients showed significantly less variation in performance as a function of cue level. In contrast to pitch-based cues, both groups showed equivalent variation in performance as a function of intensity-based cues. Finally, patients were less able than controls to differentiate between expressions with high and low emotion intensity, and this deficit was also correlated with impaired tone matching ability. Both emotion identification and intensity rating deficits were unrelated to valence of intended emotions. Deficits in both auditory emotion identification and more basic perceptual abilities correlated with impaired functional outcome. Overall, these findings support the concept that auditory emotion identification deficits in schizophrenia reflect, at least in part, a relative inability to process critical acoustic characteristics of prosodic stimuli and that such deficits contribute to poor global outcome.
Individuals with schizophrenia show well-documented disturbances in the ability to identify emotions based upon tone of voice (prosody). These deficits—along with parallel deficits in visual emotion identification—have been associated with poor functional outcome.1,2 Yet, the nature and etiopathology of these social communication deficits remain obscure. Some argue that such failures may reflect specific emotional disabilities3,4 attributable to dysfunction of limbic circuitry5,6 and/or global right hemisphere dysfunction.7,8 Still others,9,10 beginning with Chapman and Chapman,11 have advocated a more generalized cognitive and perceptual deficit. It may also be that social communicatory deficits stem from an inability to recognize the physical/acoustical cues that characterize emotional distinctions. This would be akin to not being able to distinguish a smile from a frown and therefore not being able to discern happiness from anger. Within emotional prosody, such distinctions are represented by the configuration of a variety of acoustical cues including pitch, voice intensity, voice quality, and temporal cues.12 Prior studies indicate that prosodic deficits in schizophrenia correlate consistently with deficits in basic tonal processing ability. This suggests that patient dysprosodia could stem in part from elemental pitch processing deficits2,13–15 and that patients may simply not “pick up” on pitch changes which signal affective intent. However, to date, this hypothesis has not been directly tested. In order to examine whether specific cue processing impairments may drive emotional prosodic deficits in schizophrenia, the present study utilized a stimulus battery consisting of vocal expressions that have been extensively characterized in normative populations and for which acoustic features have been well described.16 This is the first study to apply such an acoustically characterized prosodic dataset to schizophrenia research.
Previous research has shown that there are relatively distinct patterns of acoustic cues that differentiate between specific emotions,17 as summarized in table 1. For example, anger, happiness, and fear are typically characterized by high mean pitch and mean voice intensity, whereas sadness is characterized by low mean pitch and mean voice intensity. Also, anger and happiness expressions typically have large pitch variability, whereas fear and sadness expressions instead have small pitch variability. Regarding voice quality, anger and happiness expressions typically have a large proportion of high-frequency energy (HF500) in the spectrum, whereas sadness has less high-frequency energy (as the proportion of high-frequency energy increases, the voice sounds sharper and less soft). Finally, happiness and fear expressions typically have fast speech rate, whereas sadness expressions are characterized by a slow speech rate.
Although individuals with schizophrenia show significant deficits in their ability to identify auditory emotion based upon prosodic cues, their ability to utilize specific auditory cues has not previously been analyzed. Based upon our prior demonstration of tone matching deficits in schizophrenia, we hypothesized that patients would show impairments at least in ability to utilize pitch-based cues (eg, mean pitch [F0M], pitch variability [F0SD]) and that these impairments would correlate with independent estimates of basic deficits in pitch processing. Given that some features are important to more than one emotion (eg, high pitch contributes to the percepts of anger, fear, and happiness, while low pitch contributes to the percept of sadness), failure to utilize specific cues would lead not only to an overall reduction in accuracy of identification but also potentially to a characteristic pattern of misidentification across emotions. To date, no study of auditory emotion processing in schizophrenia has analyzed such misidentification patterns. Based upon prior demonstrations of impaired pitch processing in schizophrenia, we hypothesized that patients with schizophrenia would show deficits in detection of emotional stimuli that relied primarily on pitch-based discriminations. We further hypothesized that such discrimination deficits may underlie differences in the misidentification patterns observed between healthy controls and schizophrenia patients.
Finally, while many studies have assessed the ability of patients to recognize emotions from vocal expressions, as well as identification accuracy for strongly posed vs weakly posed expressions,10 no studies to date have examined subjects’ ratings of the emotional intensity of such expressions. Emotional intensity assessment, like emotion identification itself, requires ability to detect and interpret alterations in the basic perceptual components of speech. Further, assessment of the intensity of another person's emotional feelings may be as critical for social communication as assessment of his/her specific emotional state. In the battery utilized for this study, actors were asked to portray the same emotion both weakly and strongly, and subjects’ ability to both recognize emotions and to judge the emotional intensity of the expressions was assessed. Our study therefore uniquely assessed the ability of patients with schizophrenia to differentiate among intended strong and weak emotion portrayals relative to controls. As with pitch, we hypothesized that pitch-based contributions to emotion intensity detection would be impaired.
Twenty-three chronically ill (illness duration=18.2 ± 9.2 years) patients (4 females, age=39.4 ± 10.9 years) meeting Diagnostic and Statistical Manual of Mental Disorders (DSM), Fourth Edition, criteria for either schizophrenia (N=20) or schizoaffective disorder (N=3) as determined by Structured Clinical Interview for DSM diagnosis and 16 healthy volunteers (3 female, age=34.3 ± 11.5 years) with no history of mental illness participated in the study. Patients were drawn from inpatient and outpatient units associated with the Nathan S. Kline Institute, and they were all receiving typical or atypical antipsychotics at the time of testing (mean dose=1265 ± 830 CPZ equivalents18,19). Volunteers were staff or responded to local advertisement. Groups did not differ on mean parental socioeconomic status (SES)20 (schizophrenic [scz]: 54.2 ± 28.4, control [ctl]: 49.4 ± 18.7, t=0.67, P=.6), although, as expected, patients had lower individual SES (scz: 25.8 ± 10.6, ctl: 47.3 ± 14.6, t=5.24, P < .001) and verbal IQ (scz: 95.9 ± 28.4, ctl: 109.5 ± 11.2, t=3.99, P < .001) as measured by the Ammons and Ammons Quick Test21 than did controls.
Clinical assessments included ratings on the Positive and Negative Syndrome Scale (PANSS)22 and the Independent Living Scale-Problem Solving Scale (ILS-PB),23 which predicts ability of individuals with schizophrenia to live independently.24 Mean PANSS scores were 11.1 ± 3.6, 12.5 ± 5.6, and 10.9 ± 2.9 for positive, negative, and cognitive factors, respectively. Mean ILS-PB scale score was 34.8 ± 12.1.
All procedures received approval from the Institutional Review Board at the Nathan S. Kline Institute. The procedure was explained verbally to all subjects before they gave written informed consent. All subjects save 2 (one with schizophrenia) were right handed.
Recognition of emotional prosody was assessed using stimuli from the prosody task of Juslin and Laukka.16 The stimuli consisted of audio recordings of 2 male and 2 female actors portraying 5 emotions—anger, disgust, fear, happiness, and sadness—with 2 levels of emotion intensity (weak and strong) as well as utterances with no emotional expressions (“no expression”). The sentences spoken were semantically neutral and consisted of both statements and questions (ie, “It is eleven o'clock”, “Is it eleven o'clock?”). These parameters yielded 88 stimuli: 4 speakers × 2 intensity levels × 5 emotions × 2 forms (statements or questions) = 80 emotional stimuli, plus 8 no-expression stimuli. All speakers were native British English speakers. For this stimulus set, measurement of acoustic characteristics was conducted in PRAAT25 speech analysis software as described previously16 (see table 1 for a description of the acoustic features included in the present study). All vocal stimuli were presented in random order in the recognition experiments, and the subjects were asked first to identify the emotional expression of each utterance by choosing 1 of 6 alternatives (anger, disgust, fear, happiness, sadness, or no expression) and then to rate the utterance's emotional intensity on a scale from 1 (very low intensity) to 10 (very high intensity).
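The factorial structure of the battery can be enumerated in a short sketch to confirm the stimulus count; the speaker labels here are hypothetical placeholders, not identifiers from the original recordings.

```python
from itertools import product

# Hypothetical speaker labels; the design is 4 speakers x 5 emotions x
# 2 intensity levels x 2 sentence forms = 80 emotional stimuli, plus
# 4 speakers x 2 forms = 8 no-expression stimuli.
speakers = ["M1", "M2", "F1", "F2"]  # 2 male, 2 female actors
emotions = ["anger", "disgust", "fear", "happiness", "sadness"]
intensities = ["weak", "strong"]
forms = ["statement", "question"]

emotional = list(product(speakers, emotions, intensities, forms))
neutral = [(s, "no expression", None, f) for s, f in product(speakers, forms)]
stimuli = emotional + neutral

print(len(emotional), len(neutral), len(stimuli))  # 80 8 88
```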
Ancillary assessment of pitch processing was measured with a version of the Distorted Tunes Task (DTT)26 and a simple tone matching task. The DTT consists of 26 familiar tunes adapted for the United States ranging in length from 12 to 26 notes. Seventeen of the tunes are rendered melodically incorrect by changing the pitch of 2–9 notes within the tune. Subjects respond “yes” or “no” as to whether the melody is correct. Subject scores are calculated based on the percentage of correctly categorized melodies. The tone matching task consists of pairs of 100-ms tones in series, with a 500-ms intertone interval. Within each pair, tones are either identical or differ in frequency by a specified amount in each block (2.5%, 5%, 10%, 20%, or 50%). In each block, half the pairs are identical and half are dissimilar. Subjects respond by pressing 1 of 2 keys to indicate whether the pitch was the same or different. Tones are derived from 3 reference base frequencies (500, 1000, and 2000 Hz) to avoid learning effects. In all, the test consists of 5 sets of 26 pairs of tones.
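The block structure of the tone matching task can be sketched as follows. This is a minimal illustration assuming the structure described above; the function name and the choice of whether the deviant tone is higher or lower than the reference are assumptions, not details from the original task.

```python
import random

def make_tone_block(base_hz, pct_diff, n_pairs=26, seed=0):
    """Sketch of one tone matching block: n_pairs pairs of 100-ms
    tones, half identical and half differing by pct_diff percent.
    The higher/lower direction of the deviant is chosen at random
    here (an assumption, not a detail from the study)."""
    rng = random.Random(seed)
    pairs = [(base_hz, base_hz, "same")] * (n_pairs // 2)
    for _ in range(n_pairs - n_pairs // 2):
        sign = rng.choice([-1, 1])
        deviant = base_hz * (1 + sign * pct_diff / 100.0)
        pairs.append((base_hz, deviant, "different"))
    rng.shuffle(pairs)
    return pairs

# One block at the hardest level: 500-Hz base, 2.5% frequency difference
block = make_tone_block(500, 2.5)
```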
All auditory tasks were presented on a CD player at a sound level that was comfortable for each listener in a sound-attenuated room.
Between-groups effects for both identification and intensity ratings were assessed using multivariate analysis of variance (MANOVA) using a within-subject factor of emotion (6 levels) and a between-subject factor of diagnostic group.
A second stage of analysis examined contributions of specific acoustic properties to between-group performance. For this analysis, cues were selected if they differentiated a particular emotion from the mean of all others. For example, as shown in table 2, mean pitch variability (F0SD) was significantly higher for intended happy and anger stimuli and significantly lower for intended fear stimuli than for the remaining emotions. Similarly, mean pitch of the stimuli (F0M) was significantly higher for happy stimuli and lower for disgust and neutral (no expression) stimuli than for other emotions. Cues chosen for this analysis are shown in table 2, and values that are differentiated for each cue are bolded within the table. Other cues that have been related to emotion perception, such as the temporal cues of speech rate and pause proportion (table 1), did not differentiate significantly between intended emotions and so were not included in this analysis.
In order to examine the contributions of these cues to performance between groups, stimuli were then divided into low, medium, and high levels based upon parameter strength relative to mean scores for each emotion. In general, divisions were made according to the following rule: for each feature, the mean value was calculated for the emotion of interest, as well as for all emotions with the emotion of interest excluded. These mean values were used as cut points, with stimuli falling either outside of these mean values (high or low, respectively, depending upon emotion) or between them, reflecting medium values. For features relevant to multiple emotions, additional cut points gave rise to division into additional levels designated med1 and med2, although for each emotion parameters varied across only 3 levels. For example, for F0SD, 4 levels were used (low, med1, med2, high). However, because happy stimuli are associated with increased F0SD vs others, all happy stimuli were contained entirely within the 3 levels of med1, med2, and high. Similarly, because fear stimuli are associated with decreased F0SD, all fear stimuli were contained entirely within the levels of low, med1, and med2 (figure 2). Mean values for each feature and emotion are provided in table 2. Cutoff values are provided in the legends of figures 2 and 3.
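The basic cut-point rule can be sketched in a few lines. This is a simplified illustration of the three-level case only (the med1/med2 subdivision for multi-emotion features is omitted), and the function name and toy values are illustrative, not study data.

```python
from statistics import mean

def make_level_classifier(values_by_emotion, target):
    """Sketch of the binning rule described above: the mean cue value
    for the target emotion and the mean over all other emotions serve
    as the two cut points, yielding low/medium/high levels."""
    target_mean = mean(values_by_emotion[target])
    other = [v for emo, vals in values_by_emotion.items()
             if emo != target for v in vals]
    lo, hi = sorted([target_mean, mean(other)])

    def level(x):
        if x < lo:
            return "low"
        if x > hi:
            return "high"
        return "medium"

    return level

# Toy F0SD-like values (illustrative only): happy stimuli sit high,
# so the two cut points are the happy mean and the mean of the rest.
level = make_level_classifier(
    {"happiness": [3.0, 3.2], "fear": [1.0, 1.2], "sadness": [1.4, 1.6]},
    target="happiness",
)
print(level(0.9), level(2.0), level(3.5))  # low medium high
```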
For selected emotion/cue pairings, between-group analyses were conducted using MANOVA with a within-subject factor for cue level and a between-subject factor of diagnostic group. Potential valence effects were assessed by between-group MANOVA as well, using positive (ie, happy) vs negative (ie, anger, disgust, fear, and sadness) valence as the between-group factor.
Misattribution patterns for emotion identification were examined using the ALSCAL scaling algorithm (SPSS 15.0) to scale and calculate Euclidean distances between the dependent variables. R2 and stress parameters were used to assess model fit. Features contributing to the misattribution scaling values were assessed using Pearson correlations across emotions. The relationship between emotion identification and pitch perception (tone matching and DTT) was assessed using linear regression across subjects.
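The scaling step can be illustrated with a numpy-only classical (Torgerson) MDS, standing in for the SPSS procedure, applied to a toy confusion matrix. The matrix values are illustrative, not data from the study, and the symmetrization of confusions into dissimilarities is one common choice, not necessarily the exact one used.

```python
import numpy as np

def classical_mds(dissim, n_components=2):
    """Classical (Torgerson) MDS: double-center the squared distance
    matrix and keep the top eigenvectors. A numpy stand-in for the
    SPSS scaling procedure."""
    d2 = np.asarray(dissim, dtype=float) ** 2
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ d2 @ J                 # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Toy 3-emotion confusion matrix (rows = intended, cols = chosen);
# emotions that are frequently confused end up close together.
conf = np.array([[0.70, 0.20, 0.10],
                 [0.25, 0.60, 0.15],
                 [0.05, 0.15, 0.80]])
sim = (conf + conf.T) / 2.0
dissim = 1.0 - sim
np.fill_diagonal(dissim, 0.0)
coords = classical_mds(dissim)   # one 2-D point per emotion
```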
All statistical tests were 2-tailed, with α < .05, and computed using either JMP (SAS Institute Inc, Cary, NC, 2005) or SPSS 15.0 (Chicago, IL, 2006).
As in prior studies of acoustic emotion identification, patients were substantially less accurate than controls in identifying intended emotion (ctl: 44.8 ± 7.0; scz: 34.2 ± 10.0; F1,37 = 24.7, P < .0001, d=1.2). Performance in both groups was well above chance (16.7%). Both groups showed significant variation in identification accuracy across intended emotions (F5,33 = 13.4, P < .001) with greatest accuracy for sadness and neutral and least for disgust (figure 1), suggesting that some emotions may be easier to portray and/or identify than others. The group × emotion interaction was not significant (F5,33 = 0.6, P > .7).
In order to evaluate the degree to which deficits in utilization of specific cues might contribute to impaired ability to identify intended emotion, cue types were identified that differentiated most strongly between emotions, as shown in table 2. Four pitch-based measures (F0SD, F0M, F0contour, F0floor), 2 intensity-based measures (voiceintM, voiceintSD), and 1 voice quality measure (HF500) were identified as having mean cue levels that significantly differentiated at least one emotion from all others. For F0SD and F0M, log values were used to normalize values. Each measure contributed to one or more emotion identification types (table 2). For each measure, stimuli were grouped into levels depending upon mean values for specific emotions of interest.
Emotion identification performance as a function of pitch-based measures is shown in figure 2. Pitch variability (F0SD) differed significantly across intended emotions, with intended happy stimuli showing on average greater levels of F0SD than those intended to portray other emotions and intended fear stimuli showing, on average, lower levels (table 2). F0SD was thus analyzed relative to accuracy in decoding happiness and fear. Both patients (F2,21 = 9.4, P < .001) and controls (F2,14 = 82.2, P < .001) showed greater accuracy in correctly identifying intended emotion for stimuli characterized by high pitch variability (high F0SD) vs those with lower levels. However, patients showed less utilization of this cue than controls, leading to a significant group × level interaction (F2,36 = 4.7, P < .05), and highly significant (P < .001) between-group performance differences for correct identification of intended happy stimuli incorporating high, but not moderate or low, levels of F0SD (figure 2A).
Similarly, patients showed less variation in their fear identification performance as a function of F0SD than controls, leading also to a significant group × level interaction (F2,36 = 3.3, P < .05) (figure 2B). Further, in this case, significant variation in performance as a function of F0SD was seen only for controls (F2,14 = 4.7, P < .05), whereas no significant variation was seen within the patient group (F2,21 = 0.07, P > .9). As a result, between-group differences in accuracy of identifying fear were observed only at the lowest (most fearful) levels of F0SD.
F0SD also differentiated anger from other emotions as a group, with levels for intended anger stimuli being similar to those for happy stimuli and higher than those for other emotions. However, neither group showed significant variation in accuracy of response for anger based upon level of F0SD, suggesting that both groups used other anger-associated cues (eg, intensity cues) as primary cues for anger determination.
The mean pitch of each stimulus (F0M) differentiated stimuli intended to portray either disgust or happiness from those portraying other emotions, with disgust being associated with low levels of F0M and happiness with high levels (table 2). For controls, accuracy of performance varied significantly by F0M cue level in identifying disgust (F2,14 = 4.2, P < .05), whereas for patients performance did not vary significantly (F2,21 = 0.19, P > .8) (figure 2B). Despite association of high F0M levels with happiness, neither group showed significant variation in accuracy across levels of F0M.
The slope or trajectory of the pitch contour across the duration of the stimulus (F0contour) differentiated sadness and fear from other emotions, with intended sadness being associated with F0contour levels below 1 (ie, downward sloping pitch trajectory) and intended fear associated with values above 1 (ie, upward sloping trajectory) (table 2). For controls, accuracy of performance varied significantly as a function of F0contour for both sadness (F2,14 = 4.5, P < .05) and fear (F2,14 = 5.2, P < .05), whereas for patients it did not (sadness: F2,21 = 0.57, P > .5; fear: F2,21 = 0.44, P > .6) (figure 2C). The minimum pitch levels (F0floor) associated with each stimulus also differentiated intended disgust from all other emotions (table 2) and thus served as a potential cue. However, neither group showed significant variation in accuracy of performance as a function of this measure.
In contrast to pitch-based measures, which were particularly relevant to perception of happiness, sadness, and disgust, intensity-based measures significantly differentiated intended anger stimuli from all others. Two measures were particularly important: voiceintSD, which reflects the degree of variability of intensity over the course of a stimulus, and voiceintM, which reflects overall stimulus energy (table 2). Both patients (F2,21 = 17.78, P < .001) and controls (F2,14 = 21.6, P < .05) showed increased accuracy in identifying intended emotion as voiceintSD increased, with no significant difference between groups in overall accuracy (F2,36 = 2.32, P=.14) and no group × feature level interaction (F2,36 = 0.22, P=.81) (figure 3A). Similarly, both patients (F2,21 = 20.6, P=.0001) and controls (F2,14 = 12.1, P=.001) showed increased accuracy in identifying intended emotion as voiceintM increased (figure 3B). In this case, patients showed greater variation in performance across levels of voiceintM than did controls, leading to a significant group × feature interaction (F2,36 = 4.11, P=.025). Further, patients showed decreased accuracy compared with controls only at the lowest (least angry) levels of voiceintM, suggesting that patients utilized low voiceintM levels to exclude anger as an intended emotion to a greater extent than did controls. Anger exemplars with low voiceintM levels (all of which were intended as weak portrayals of anger) nevertheless had F0SD values (1.63 ± 0.19) significantly above the expected range for neutral stimuli (P < .03), suggesting that controls may have been able to utilize F0SD as a secondary cue for recognition of intended anger emotion for these stimuli whereas patients were not.
Of potential voice quality cues, only HF500—reflecting vocal timbre—distinguished between intended emotions, with angry stimuli showing greater mean levels of HF500 than others (table 2). As with intensity cues, patients and controls showed similar variation in performance across level of HF500. Thus, both patients and controls showed a similar tendency to identify intended angry stimuli correctly as levels of HF500 increased, with significant effects of HF500 level within both the patient (F2,21 = 29.01, P < .001) and control (F2,14 = 7.52, P < .05) groups and no group × level interaction (F2,36 = 0.69, P > .5) (figure 3C).
In order to further identify the basis for impaired emotion identification in schizophrenia, the pattern of misidentifications was also analyzed. In this analysis, both groups showed a nonhomogeneous pattern of misattributions, with some distinctions (eg, sad/fear) yielding higher rates of misidentification of intended emotion than others (eg, sad/disgust). Overall, patients showed a similar pattern of errors to controls (table 4). Nevertheless, patients were more likely than controls to misidentify happy stimuli as being neutral (P=.021), fearful stimuli as expressing either anger (P=.007) or disgust (P=.03), disgust stimuli as expressing sadness (P=.006), and no expression stimuli as expressing fear (P=.001). Some, but not all, of these distinctions were also difficult for controls, suggesting that the basis for emotional distinctions may differ between patients and controls.
Multidimensional scaling (MDS) was used to map apparent distances between emotions based upon the overall misattribution confusion matrix. MDS analysis has previously been used to map subjective affective representations in schizophrenia27 but has not been previously applied to analysis of physical determinants of prosodic identification.
Two-dimensional models accounted well for the observed attribution pattern with good parametric properties for both schizophrenia patients (stress=0.11, R2=0.93) and controls (stress=0.09, R2=0.95) (figure 4A). For both controls and patients, scaling values along the first dimension of the MDS correlated most strongly with the voice quality cue F1BW, with sadness and anger occupying opposite ends of the scale for controls and sadness and no emotion occupying opposite ends of the scale for patients (figure 4B). The strength of correlation, however, was higher for controls (r=0.96, P=.003) than for patients (r=0.72, P=.11). For controls, spacing along the second dimension correlated more strongly with the pitch cues F0M (r=0.91, P=.03) and F0SD (r=0.79, P=.06) than with the voice intensity cue voiceintSD (r=0.68, P=.17) (figure 4C). In contrast, patients showed a significant correlation between spacing along the second dimension and voiceintSD (r=0.88, P=.02) but no significant correlation with either F0M (r=0.37, P=.48) or F0SD (r=0.48, P=.34), suggesting different usage of pitch vs intensity cues in differentiating among emotional percepts. For controls, happiness and no emotion occupied opposite ends of the second MDS dimension, consistent with primary use of pitch variables. In contrast, for patients, anger and no emotion occupied opposite ends of this dimension, consistent with primary differentiation of emotions by intensity.
A final approach for assessment of the relationship between emotion processing deficits and basic perceptual abilities utilized linear regression analyses between the 2 sets of measures. Basic perceptual abilities were assessed using both tone matching, across a range of levels, and DTT score (table 3). A linear regression of emotion identification accuracy vs perceptual performance showed a significant overall relationship (adjusted R2=0.51, P=.03) with significant correlations between performance and tone matching accuracy at 2.5% (P=.02), 5% (P=.02), and 50% (P=.005) and DTT score (P=.012). Including generalized cognitive measures, such as digit symbol, in the regression did not significantly improve overall fit (adjusted R2=0.44, P=.12) vs fits obtained with perceptual measures alone. No similar relationship was observed for controls (adjusted R2=0.26, P=.45).
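The adjusted-R² regression described above can be sketched with ordinary least squares in numpy. The predictor set and all numeric values below are illustrative stand-ins, not the study's measurements.

```python
import numpy as np

def adjusted_r2(y, X):
    """Sketch of the regression reported above: OLS of emotion
    identification accuracy on perceptual predictors (eg, tone
    matching accuracy at several difficulty levels, DTT score),
    summarized as adjusted R^2."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Toy data: 23 "subjects", 2 perceptual predictors (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(23, 2))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=23)
print(round(adjusted_r2(y, X), 2))
```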
Finally, performance on emotion identification across emotions correlated significantly with the cognitive subscale of PANSS (r=−0.49, P=.021) and with ILS-PB score (r=0.59, P < .01) but not with either positive (r=−0.3, P=.18) or negative (r=−0.05, P=.8) factor scores or medication dosage (CPZ) (r=0.04, P=.83).
In addition to portraying specific emotions, actors for this stimulus set had been asked to provide both weak and strong exemplars of each emotion. Sensitivity of subjects to the intended intensity of emotions, irrespective of identity, was assessed using MANOVA with within-subject factors of intended strength of portrayal (weak/strong) and emotion and a between-subject factor of group. As expected, there was a significant main effect of intended strength of portrayal on rated intensity (F1,37 = 82.2, P < .0001), suggesting that across groups, subjects were able to successfully differentiate weak and strong portrayals (figure 5). Nevertheless, patients differentiated weak and strong portrayals to a significantly lesser degree than controls, as reflected in a highly significant group × intensity interaction (F1,37 = 7.6, P < .009). Patients also showed a significant tendency to rate all portrayals as more intense than controls did (F1,37 = 6.2, P < .02), with the differences being more profound for intended weak (F1,37 = 8.58, P=.006) than for intended strong (F1,37 = 2.88, P=.1) portrayals. As expected, there was no significant main effect of emotion (F4,33 = 1.8, P=.15) or group × emotion interaction (F5,33 = 1.1, P > .37). As with emotion identification, overall rated intensity (r=0.51, P=.007) and the difference in perceived intensity for intended weak vs intended strong portrayals (r=0.59, P=.003) correlated significantly with deficits in tone matching performance, such that individuals with the worst tone matching performance tended to rate intensity most strongly regardless of intended intensity and showed the least difference in rated intensity between intended weak and strong portrayals.
During normal conversation, emotion is conveyed largely by modulation of tone and rhythm of voice. Decoding of such information in real time, therefore, is critical for normal social interaction. Rules for decoding of emotion are either learned implicitly28 or may reflect an innate ability17 because individuals are never taught exact rules for emotion identification.28 As shown in this and previous studies,16,29 identification of intended emotions by voice alone is an inexact science, with most individuals correctly identifying intended emotion only 50%–60% of the time. Given the complexity of the task and its dependence upon basic perceptual abilities, it is unsurprising that deficits in basic sensory abilities in schizophrenia, such as the ability to detect tonal patterns over time, would produce deficits in acoustic emotion identification ability. Nevertheless, the sensitivity of patients with schizophrenia to the specific features of speech that are used to convey emotional intent has not previously been evaluated.
In the present study, performance of patients was compared with that of controls as a function of a range of cues, including those involving pitch (F0M, F0SD, F0contour), intensity (voiceintSD, voiceintM), and voice quality (HF500). Analyses took advantage of the fact that some stimuli within the battery were relatively good exemplars of the intended emotion, as reflected in higher levels of correct identification, whereas others were less easily identified. Thus, for each emotion, it was possible to analyze responses across a range of acoustic values. As expected,16 pitch cues were critical for detecting happiness, sadness, fear, and disgust in controls, with different pitch measures contributing differentially. In contrast, intensity and voice quality cues were critical for detection of anger. These results are consistent with prior findings12,16 showing similar cue-emotion interactions.
This study demonstrates for the first time that, when compared with controls, patients’ performance did not vary to the same degree as a function of the relative presence or absence of pitch cues but that such variability was roughly equivalent for intensity and spectral cues. This finding suggests that patients were less able than their healthy counterparts to utilize pitch-based cues to identify emotion, whereas their ability to utilize intensity cues was relatively intact. For controls, stimuli with highest levels of pitch variability were 4-fold more identifiable than those with more moderate levels. For patients, the difference was only 2-fold, leading to a large effect size (d=1.34) deficit in identifying happiness only for those stimuli with highest (ie, most happy) F0SD levels.
Similarly, low levels of F0SD serve as a primary cue for fear. For controls, stimuli with low F0SD levels are 5-fold more identifiable than those with more moderate levels. Here, too, patients did not show this variation in the accuracy of their performance in identifying intended fear as a function of F0SD, suggesting an inability to take advantage of low, as well as high, levels of this cue when relevant. Patients also did not show variability in their responses to sadness or fear based upon pitch slope (F0contour) or disgust based upon mean pitch (F0M), suggesting an inability to utilize these pitch-based cues as well. In the MDS analysis, distances between emotions were significantly influenced by pitch-based measures for controls but not for patients (figure 4C), suggesting that the differential pattern of misidentification seen in patients vs controls also relates to relative inability to utilize pitch cues in discriminating emotions.
In contrast to their inability to utilize pitch cues, patients did appear to utilize intensity cues, such as voiceintSD and voiceintM, to detect anger equivalently to controls. In this case, patients showed increased, rather than decreased, variation in accuracy of detection of anger responses as compared with controls, leading patients to incorrectly reject anger as the intended emotion for those exemplars with the lowest levels of voiceintM. Further, for patients, MDS distances correlated significantly with the intensity measures voiceintSD and ATTACK, whereas such significant correlations were absent in controls, also suggesting that patients overutilize voice intensity and other secondary cues in discriminating emotion, perhaps compensating for a fundamental deficit in ability to utilize pitch-based cues.
Finally, patients showed some ability to modulate response based upon voice quality cues such as HF500 and F1BW, although the exact pattern of use differed somewhat between groups. In the MDS analysis, both controls and schizophrenia patients appeared to use F1BW as a principal cue in differentiating emotions. F1BW, although not strongly predictive of any single emotion, nevertheless is thought to convey mood/attitude information that is superordinate to emotion, such as the degree to which an individual is relaxed vs stressed.30 For controls, the relaxed/stressed distinction is the voice feature that is most readily perceived.30 Although patients were able to utilize F1BW to differentiate emotions associated with high power (eg, anger, disgust) from those associated with low power (eg, sadness), they showed reduced spacing of emotions along this dimension relative to controls.
As in prior studies of emotion identification, deficits in performance correlated significantly both with more basic deficits in pitch perception, such as tone matching and DTT score, and with deficits in global outcome, as reflected by the ILS-PB. Correlations with tone matching and DTT underscore the importance of treatment strategies aimed at reversing social communicatory disturbance, as well as the importance of correcting underlying deficits in pitch processing. In contrast, correlations with ILS-PB, as noted previously,2,31 underscore the relationship between poor acoustic emotion identification ability and poor functional outcome. Personality researchers have suggested that individuals with higher degrees of empathic ability and greater “social connectedness” perceive vocal emotional cues better than individuals with lower degrees of these traits.32 The present findings suggest that the inverse may also be true and that one's basic perceptual abilities may determine in large part one's social experience.
At present, relatively little is known about the development of pitch perceptual abilities over the course of schizophrenia. Pitch detection deficits have been demonstrated consistently in chronic patients with schizophrenia, along with impaired generation of early auditory event–related potentials such as mismatch negativity (MMN), which reflects preattentive detection of stimulus deviance. Pitch processing was also studied in one sample of 15 first-episode (FE) patients, of whom 9 were judged to be stabilized, while 6 were persistently symptomatic. As a group, FE patients showed a moderate effect-size (d = 0.58) deficit in tone matching. Further, approximately one-third of FE patients showed tone-matching thresholds outside the control range (>10% difference in pitch), leading to a significant difference vs age-matched controls (P < .02).33
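Effect sizes such as the d = 0.58 reported above are conventionally computed as the between-group mean difference divided by the pooled standard deviation (Cohen's d). A minimal sketch, using invented tone-matching accuracy scores purely for illustration:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Hypothetical tone-matching scores (proportion correct); not study data.
controls = [0.95, 0.92, 0.97, 0.90, 0.94]
patients = [0.85, 0.88, 0.80, 0.91, 0.83]
print(round(cohens_d(controls, patients), 2))  # → 2.29
```

By the usual conventions, d ≈ 0.5 is a moderate effect and d ≈ 0.8 or above a large one, which is why the d = 1.34 happiness deficit reported earlier is described as large.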
Deficits in MMN generation have also been reported early in the course of schizophrenia, with onset of deficit within 1.5 years of first hospitalization.34–36 In contrast, such deficits have not been observed in patients during the first hospitalization overall,35–37 although the number of subjects studied remains small. Further, even among patients studied within the first hospitalization, a significant correlation has been observed between MMN amplitude and the education item of the Premorbid Adjustment Scale, such that patients who failed to complete high school showed significantly reduced MMN at first onset relative to both age-matched controls and FE patients with college education.35
To our knowledge, prosodic detection has to date been evaluated in only one study of FE patients. In that study, patients were evaluated within 6 months of discharge following initial hospitalization. Notably, mean years of education for this group was 10.8, with fewer than 30% of patients having any college education. FE psychotic patients as a group showed a moderate (d = 0.69) decrement in affective prosodic detection relative to age-matched controls.
As in the present study, there was no significant difference across emotions. Patients nevertheless showed apparently greater deficits in detection of fearful and sad vs angry stimuli, similar to the pattern observed in the present study. Thus, while comparison across studies is difficult because of differing cohort composition and differing definitions of the term “FE,” both tone-matching and prosodic detection deficits appear to be present early in the course of the illness and may be enriched among subjects with relatively low educational attainment. Effect sizes for both early auditory dysfunction and affective prosodic deficits appear to increase as a function of illness chronicity, although whether this reflects true degeneration within individual subjects, as opposed to distillation of poor-outcome subjects within chronic patient cohorts, remains to be determined. Overall, more study of both basic auditory processing ability and affective prosodic ability over the course of schizophrenia is required.
In the present study, patients also showed significantly reduced sensitivity to differential strength of emotional portrayals, in particular tending to overestimate intensity of intended weak emotions. As with categorization deficits, the failure to discriminate emotional intensity reflected an inability to utilize tonal, rather than intensity, information and correlated with reduced perceptual sensitivity. This is the first study to evaluate the ability of patients to discriminate auditory emotional intensity, along with emotion identity. As with deficits in emotion identification, the tendency to overestimate strength of weak emotional portrayals may also lead to significant misinterpretations in social communicatory situations.
The present study represents an initial attempt to go beyond simple correlational analyses of emotion processing accuracy vs pitch measures (eg, tone matching) and to develop instead a taxonomy of affective dysprosodia in schizophrenia. Development of such a taxonomy is crucial not only for achieving greater understanding of the neurophysiological bases of acoustic prosodic dysfunction in schizophrenia but also for the development of appropriate remediation or compensation techniques. This ecological approach is akin to attempts in both autism38,39 and schizophrenia40 to link abnormal gaze toward facial features to deficits in facial affect recognition. Nevertheless, because it is the first study of this type, certain limitations must be acknowledged. First and foremost, because multivariate featural batteries of this type have not previously been used for between-group analyses, the statistical approach for such analyses was developed post hoc. Results must therefore be considered exploratory and confirmed with additional stimulus batteries, additional patient samples, or both. Nevertheless, the observation that controls, but not patients, show strong variation in ability to identify intended emotion as a function of pitch-based measures provides strong validation of the analytic approach used.
Second, the study took advantage of an existing, naturalistic stimulus set rather than using synthetically constructed utterances with predetermined characteristics. As a result, numerous stimulus features were strongly intercorrelated, limiting the extent to which any single feature could be isolated. This battery has the advantage that its physical features have been previously published. Nevertheless, future synthetic stimulus development is needed in which specific parameters are modulated independently across a continuum of levels. The stimulus set further consisted of posed portrayals of emotional speech rather than prosody evoked in natural discourse. Though it can reasonably be argued that posed expressions must be relatively similar to naturally occurring expressions in order for communication to be successful, posed expressions may nevertheless be exaggerated and more intense than authentic expressions, and the acoustic properties of posed and authentic stimuli may differ.12 Similarly, the ability to generalize performance from tasks involving posed portrayals to normal discourse may be limited. Nevertheless, because speakers may experience complex emotions during normal discourse, obtaining objective validation of intended emotion may be impossible under naturalistic circumstances. Finally, all patients were receiving antipsychotic medication at the time of testing, raising the possibility of a medication effect; however, no correlation was observed between performance and medication dosage.
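Such parametric synthesis is straightforward in principle: a pitch contour with a prescribed mean (F0M) and variability (F0SD) can be generated and converted to a waveform by integrating instantaneous frequency into phase, leaving intensity-based cues untouched. A minimal sketch, with all parameter values hypothetical and a slow sinusoidal modulation standing in for a natural pitch contour:

```python
import numpy as np

def synth_tone(f0_mean, f0_sd, dur=1.0, sr=16000, mod_hz=4.0):
    """Synthesize a tone whose instantaneous pitch has a chosen mean
    (F0M, Hz) and standard deviation (F0SD, Hz), independently of
    intensity; the contour is a slow sinusoid at mod_hz cycles/s."""
    t = np.arange(int(dur * sr)) / sr
    # a sinusoid with standard deviation s has amplitude s*sqrt(2)
    f0 = f0_mean + f0_sd * np.sqrt(2) * np.sin(2 * np.pi * mod_hz * t)
    phase = 2 * np.pi * np.cumsum(f0) / sr  # integrate frequency -> phase
    return f0, np.sin(phase)

f0, wave = synth_tone(f0_mean=200.0, f0_sd=30.0)
print(round(f0.mean(), 1), round(f0.std(), 1))  # 200.0 30.0
```

Stepping f0_sd across a continuum of levels while holding amplitude constant would yield the kind of stimulus series in which pitch-cue utilization could be measured without confounding from correlated intensity cues.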
In conclusion, disturbed emotion identification ability represents a key determinant of social cognition and functional outcome in schizophrenia. This is the first study to evaluate contribution of specific underlying acoustic cues to emotion identification dysfunction, as well as to assess the perception of emotion intensity. Primary findings are that patients show intact ability to utilize intensity-based cues but reduced ability to identify emotions based upon critical pitch modulations. Such deficits contribute to disturbances not only in identification of intended emotions but also in discrimination between intended weak and strong emotional portrayals. Further, as in prior studies, deficits correlated highly with more basic deficits in auditory processing, as well as with global outcome measures. These findings indicate the need for cue-, as well as emotion-, based assessment of prosodic dysfunction in schizophrenia and for ecological approaches to conceptualization and remediation of social communicatory impairments in schizophrenia. Furthermore, a cue-based approach may provide a method to discriminate schizophrenia dysprosodia from dysprosodias found in other illnesses such as Parkinsonism and autism.
National Institute of Mental Health (NRSA F1-MH067339 to D.I.L., K02 MH01439 to D.C.J., R01 MH49334 to D.C.J.).