This study aimed to investigate the early differentiation of vocal emotions in semantically neutral expressions. Using behavioral tasks and ERPs to examine the recognition of neutral, angry, and happy prosody, we demonstrated that normal-hearing subjects performed significantly better on unsimulated than on CI-simulated prosody recognition. Similarly, performance was better with PACE than with ACE.
For post-offset RTs, participants identified happy and angry prosody faster than neutral prosody. These findings parallel the prosody-processing literature, which has consistently shown faster recognition of emotional than of neutral stimuli […]. The aforementioned studies have attributed this rapid detection of vocal emotions to the salience and survival value of emotions over neutral prosody. Moreover, an emotional judgment of prosody might be made faster because unambiguous emotional associations are readily available. In contrast, neutral stimuli may evoke positive or negative associations that the stimulus itself does not carry, which makes the judgment less clear-cut. Thus, the reaction times may simply reflect a longer decision time for neutral than for emotional sentences.
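Post-offset RT is taken here to mean the response time measured from the end of the sentence rather than from its onset, so that differences in sentence duration do not inflate the RTs. The sketch below illustrates that computation and a per-emotion average. The trial records, their field layout, and the helper names `post_offset_rt` and `mean_rt_by_emotion` are hypothetical, and the numbers are invented for illustration rather than taken from this study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial records: (emotion, sentence_duration_ms, rt_from_onset_ms).
# Invented numbers for illustration only; not data from this study.
trials = [
    ("happy", 1500, 2100),
    ("happy", 1400, 1950),
    ("angry", 1600, 2250),
    ("angry", 1550, 2150),
    ("neutral", 1500, 2300),
    ("neutral", 1450, 2200),
]


def post_offset_rt(duration_ms, rt_from_onset_ms):
    """RT measured from sentence offset instead of sentence onset."""
    return rt_from_onset_ms - duration_ms


def mean_rt_by_emotion(records):
    """Average post-offset RT for each emotion category."""
    by_emotion = defaultdict(list)
    for emotion, duration_ms, rt_ms in records:
        by_emotion[emotion].append(post_offset_rt(duration_ms, rt_ms))
    return {emotion: mean(rts) for emotion, rts in by_emotion.items()}


print(mean_rt_by_emotion(trials))
```

Subtracting the sentence duration is what makes RTs comparable across sentences of different lengths; any inferential comparison between emotions would be run on these per-trial values.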
For the accuracy analysis, near-perfect scores (97% correct) were obtained when participants heard the original unsimulated sentences. These scores are higher than those (90 to 95%) reported in previous studies […]. This confirms that the speaker used in the current study accurately conveyed the three target emotions. Thus, the stimulus bank used in the present experiment appears appropriate for conveying the prosodic features needed to compare CI strategies in terms of emotion recognition.
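To judge whether an accuracy rate such as 97%, or a lower score in a simulated condition, exceeds chance, one option is a one-sided exact binomial test. The sketch below assumes a three-alternative identification task (neutral, angry, happy), so chance is 1/3. The trial count of 60 and the helper name `p_above_chance` are hypothetical, and this is not necessarily the statistical procedure used in the study.

```python
from math import comb


def p_above_chance(n_correct, n_trials, chance=1 / 3):
    """One-sided exact binomial test.

    Returns the probability of getting n_correct or more trials right out of
    n_trials by guessing alone, with per-trial success probability `chance`
    (1/3 assumes three equally likely response options).
    """
    return sum(
        comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


# Hypothetical example: 58 of 60 trials correct, roughly the 97% level.
n_trials = 60
n_correct = 58
print(f"accuracy = {n_correct / n_trials:.1%}, "
      f"p = {p_above_chance(n_correct, n_trials):.2e}")
```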
The ERPs recorded in all participants during prosody perception showed differential electrophysiological responses in the sensory-perceptual components for emotional relative to neutral prosody. The auditory N100 component reflects the processing of physical stimulus characteristics, such as temporal pitch extraction […]. Evidence in the literature supports the N100 as the first stage of emotional prosody processing […]. In the current study, N100 amplitude was more negative with the ACE strategy, suggesting that early stages of prosody recognition might be adversely affected by stimulus characteristics. However, the N100 is modulated by numerous factors, including attention, motivation, arousal, fatigue, stimulus complexity, and recording methods […]. Thus, the reasons for the observed N100 modulation cannot be delineated, as the contribution of these factors to the results cannot be ruled out. The next stage of auditory ERP processing is the P200 component.
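Component amplitudes such as the N100 are commonly read off the trial-averaged, baseline-corrected waveform as the extreme value within a latency window, negative for the N100 and positive for the P200. The following sketch shows that procedure on a synthetic waveform. The 80 to 150 ms (N100) and 150 to 250 ms (P200) windows, the 10 ms sampling grid, and all helper names are illustrative assumptions, not the parameters used in this study.

```python
from statistics import mean


def average_epochs(epochs):
    """Sample-by-sample average of equal-length single-trial epochs."""
    return [mean(samples) for samples in zip(*epochs)]


def baseline_correct(erp, times_ms):
    """Subtract the mean of the pre-stimulus (t < 0 ms) samples."""
    baseline = mean(v for t, v in zip(times_ms, erp) if t < 0)
    return [v - baseline for v in erp]


def peak_in_window(erp, times_ms, start_ms, end_ms, polarity):
    """Return (latency_ms, amplitude) of the extreme sample in [start_ms, end_ms].

    polarity=-1 finds the most negative point (e.g. N100);
    polarity=+1 finds the most positive point (e.g. P200).
    """
    window = [(t, v) for t, v in zip(times_ms, erp) if start_ms <= t <= end_ms]
    return max(window, key=lambda tv: polarity * tv[1])


# Synthetic example: two epochs on a 10 ms grid from -100 to 390 ms, each with a
# dip at 100 ms and a bump at 200 ms plus a different constant offset.
times_ms = list(range(-100, 400, 10))
shape = [-5 if t == 100 else 4 if t == 200 else 0 for t in times_ms]
epochs = [[s + 1 for s in shape], [s + 3 for s in shape]]

erp = baseline_correct(average_epochs(epochs), times_ms)
n100 = peak_in_window(erp, times_ms, 80, 150, polarity=-1)  # hypothetical window
p200 = peak_in_window(erp, times_ms, 150, 250, polarity=+1)  # hypothetical window
print(n100, p200)
```

Baseline correction removes the constant offsets that differ between epochs, so the peak values reflect the stimulus-locked deflections only.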
The auditory P200 component has been suggested to index stimulus classification […], but the P200 peak is also sensitive to acoustic features such as pitch […] and duration. For instance, in studies of timbre processing, P200 peak amplitudes increased with the number of frequency components present in instrumental tones […]. Emotional prosody processing around 200 ms reflects the integration of acoustic cues, which help listeners deduce emotional significance from auditory stimuli […]. A series of experiments […] has shown that the P200 component is modulated by spectral characteristics and affective lexical information.
In the present study, the P200 peak amplitude was largest for the happy prosody. These results are in line with previous reports […] in which ERPs were recorded while participants judged prosodies; there, too, the P200 peak amplitude was more positive for the happy prosody, suggesting enhanced processing of positive valence. In an imaging study, researchers found that activation in the right anterior and posterior middle temporal gyrus, and in the inferior frontal gyrus, was larger for happy than for angry intonations […]. This enhanced activation was interpreted as highlighting the role of happy intonations as socially salient cues involved in the perception and generation of emotional responses when individuals attend to voices. In an ERP study, Spreckelmeyer and colleagues reported a larger P200 amplitude for happy than for sad voice tones […]. They attributed these results to the spectral complexity of happy tones, including F0 variation, as well as to their sharp attack time. In our study, acoustic analysis of the stimuli likewise revealed a higher mean F0 and a wider range of F0 variation for the happy prosody than for the angry and neutral prosodies. These F0-related parameters of the acoustic signal may thus serve as early cues to emotional significance and accordingly facilitate task-specific early sensory processing. These results agree with earlier work […] identifying pitch cues as the most important acoustic dimension in emotion recognition. The fact that happy prosody elicited a larger P200 peak amplitude even in the simulated conditions indicates that F0 parameters are robust and remain well preserved even after speech degradation. There is also ERP evidence that negative stimuli are less expected and take more effort to process than positive stimuli […]. Thus, the larger F0 variation and lower intensity variation early in the happy utterances, together with their social salience, could have resulted in improved happy prosody recognition.
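The F0-related measures discussed above (mean F0 and F0 range) can be illustrated with a simple autocorrelation pitch estimator. This is a minimal sketch: the estimator, its 75 to 500 Hz search range, expressing the range in semitones, and the helper names are assumptions. Acoustic analyses of speech are usually done with dedicated tools such as Praat, which add voicing decisions and octave-error handling that are omitted here.

```python
import math


def estimate_f0(frame, fs, f0_min=75.0, f0_max=500.0):
    """Estimate F0 (Hz) of one voiced frame from the autocorrelation peak.

    The lag search is restricted to periods between 1/f0_max and 1/f0_min.
    No voicing decision or octave-error correction is attempted.
    """
    min_lag = int(fs / f0_max)
    max_lag = min(int(fs / f0_min), len(frame) - 1)

    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return fs / best_lag


def f0_summary(f0_values):
    """Mean F0 in Hz and F0 range in semitones (12 * log2(max / min))."""
    mean_f0 = sum(f0_values) / len(f0_values)
    range_st = 12 * math.log2(max(f0_values) / min(f0_values))
    return mean_f0, range_st


# Synthetic example: two 50 ms frames of pure tones at 200 Hz and 250 Hz.
fs = 8000
frames = [
    [math.sin(2 * math.pi * f * n / fs) for n in range(400)]
    for f in (200, 250)
]
f0_track = [estimate_f0(frame, fs) for frame in frames]
print(f0_track, f0_summary(f0_track))
```

Expressing the range in semitones makes F0 variation comparable across speakers or prosodies with different baseline pitch.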
In addition to comparing affective prosody recognition for unsimulated vs. simulated sentences, the study aimed to shed light on differences between two CI strategies. Irrespective of the strategy simulated, all subjects performed above chance level on the simulations. However, performance for simulations was poorer than for unsimulated sentences for all emotions. This could be attributed to the very limited dynamic range maintained while creating the simulations, so as to mimic real implants as closely as possible. Secondly, the algorithms used to create the simulations degrade the spectral and temporal characteristics of the original signal. As a result, several F0 cues essential for emotion differentiation are not available to the same extent as in the unsimulated condition […]. Although the vocoders used to create the simulations degrade the stimuli, they remain the closest analogue to imperfect real-life listening conditions such as perception through cochlear implants.
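To make concrete how a vocoder simulation removes fine-structure F0 cues, the sketch below implements a basic noise-excited channel vocoder: the signal is split into bands, each band's temporal envelope is extracted and low-pass filtered, and the envelopes modulate band-limited noise. The number of channels, the log-spaced band edges, brick-wall FFT filtering, the 50 Hz envelope cutoff, and the function name `noise_vocoder` are all hypothetical choices. This is not the simulation software used in the study, and it does not model ACE or PACE channel selection.

```python
import numpy as np


def noise_vocoder(signal, fs, n_channels=8, f_lo=200.0, f_hi=7000.0,
                  env_cutoff=50.0, seed=0):
    """Minimal noise-excited channel vocoder (illustrative only).

    Assumptions: log-spaced band edges between f_lo and f_hi, brick-wall
    filtering in the FFT domain, and envelopes obtained by full-wave
    rectification followed by an FFT low-pass at env_cutoff Hz.
    Bands above the Nyquist frequency (fs / 2) would simply be empty.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    spectrum = np.fft.rfft(signal)
    noise_spectrum = np.fft.rfft(rng.standard_normal(n))

    out = np.zeros(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_mask = (freqs >= lo) & (freqs < hi)
        # Analysis band of the input signal
        band = np.fft.irfft(spectrum * band_mask, n)
        # Temporal envelope: rectify, then remove modulations above env_cutoff
        env_spectrum = np.fft.rfft(np.abs(band))
        env_spectrum[freqs > env_cutoff] = 0
        envelope = np.clip(np.fft.irfft(env_spectrum, n), 0, None)
        # Band-limited noise carrier replaces the original fine structure
        carrier = np.fft.irfft(noise_spectrum * band_mask, n)
        out += envelope * carrier

    # Match the overall RMS level of the input
    rms_in = np.sqrt(np.mean(signal ** 2))
    rms_out = np.sqrt(np.mean(out ** 2))
    return out * (rms_in / rms_out) if rms_out > 0 else out


# Harmonic complex with F0 = 150 Hz as a stand-in for one second of voiced speech
fs = 16000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * 150 * h * t) / h for h in range(1, 11))
y = noise_vocoder(x, fs)
```

Because the envelope cutoff here lies below typical voice F0 values, periodicity at F0 is removed from the envelopes and the noise carriers carry no harmonic structure, which is one way such simulations reduce access to F0 cues.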
The final aim of this study was to compare the speech-coding strategies and determine which is better for prosody recognition. The comparison of prosody perception with the two simulated strategies, PACE and ACE, indicated clear advantages of PACE over the currently popular ACE strategy, most evident for the happy emotion. The P200 effect for happy prosody was larger for PACE than for ACE simulations. This larger amplitude for PACE may be attributed to its coding principle, which results in greater dispersion and less clustering of the stimulated channels. Past experiments reported better speech perception for subjects using PACE than for those using ACE. Similarly, previous work […] predicted that PACE might have an advantage over ACE in music perception. Although both ACE and PACE are N-of-M strategies, channel selection in PACE is based on a psychoacoustic masking model. The bands selected by this model are based on the physiology of the normal-hearing cochlea. The model extracts the most meaningful components of the audio signal and discards components that are masked by other components and are therefore inaudible to normal-hearing listeners. Consequently, the stimulation patterns inside the cochlea are more natural with PACE […], meaning that the presented stimuli sound more natural and less stochastic. Because ACE lacks such a model, it cannot create a stimulation pattern similar to that of a normal-hearing cochlea, resulting in unnatural perception due to undesirable masking effects in the inner ear. This explains the poorer performance, in both behavior and ERPs, when ACE simulations were heard. An additional reason for the advantage of PACE could be that, unlike in ACE, the bands selected by the masking model are widely distributed across the frequency range. This reduces electric field interaction between channels, improving speech intelligibility by preserving important pitch cues. Thus, in PACE only the most perceptually salient components of the stimulus, rather than the largest ones, are delivered to the implant. This preserves finer acoustic features that would otherwise have been masked, leading to improved spectral and temporal resolution and thereby enhancing verbal identification and differentiation compared with ACE.
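The difference in channel selection can be sketched as follows. ACE keeps the N bands with the largest levels, whereas a masking-model selection repeatedly picks the band that lies furthest above the current masking threshold and then raises the threshold around that band. The spreading offset and slopes, the use of a simple maximum to combine thresholds, the example levels, and both function names are hypothetical simplifications for illustration, not the psychoacoustic model actually used in PACE.

```python
def select_ace(levels_db, n):
    """ACE-style N-of-M selection: indices of the n bands with the largest levels."""
    ranked = sorted(range(len(levels_db)), key=lambda i: levels_db[i], reverse=True)
    return sorted(ranked[:n])


def select_masking_model(levels_db, n, offset_db=10.0, slope_below_db=20.0,
                         slope_above_db=10.0, absolute_threshold_db=0.0):
    """Toy masking-model selection (hypothetical parameters, not PACE itself).

    Each step picks the unselected band whose level lies furthest above the
    current masking threshold, then raises the threshold of every band by the
    masking spread from the chosen band. Masking spreads further toward higher
    bands (shallower slope_above_db), mimicking upward spread of masking.
    """
    m = len(levels_db)
    threshold = [absolute_threshold_db] * m
    selected = []
    for _ in range(n):
        candidates = [i for i in range(m) if i not in selected]
        best = max(candidates, key=lambda i: levels_db[i] - threshold[i])
        selected.append(best)
        for j in range(m):
            distance = j - best
            slope = slope_above_db if distance > 0 else slope_below_db
            spread = levels_db[best] - offset_db - slope * abs(distance)
            # Combine maskers by taking the maximum (a simplification)
            threshold[j] = max(threshold[j], spread)
    return sorted(selected)


# Hypothetical band levels (dB) for M = 10 bands, with energy clustered low.
levels = [60, 58, 57, 56, 30, 28, 40, 25, 35, 20]
ace = select_ace(levels, 4)
masked = select_masking_model(levels, 4)
print("ACE:", ace, "masking model:", masked)
```

With these invented levels, ACE selects the four adjacent bands 0 to 3, while the masking-model selection returns bands 0, 2, 3, and 6, spreading stimulation further across the array, in the direction of the reduced clustering described above.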