Humans are remarkably adept at identifying individuals by the sound of their voice, a behavior supported by the nervous system’s ability to integrate information from voice and speech perception. Talker-identification abilities are significantly impaired when listeners are unfamiliar with the language being spoken. Recent behavioral studies describing this language-familiarity effect implicate functionally integrated neural systems for speech and voice perception, yet specific neuroscientific evidence demonstrating the basis for such integration has not yet been reported. Listeners in the present study learned to identify voices speaking a familiar (native) or unfamiliar (foreign) language. The talker-identification performance of neural circuitry in each cerebral hemisphere was assessed using dichotic listening. To determine the relative contribution of circuitry in each hemisphere to ecological (binaural) talker-identification abilities, we compared the predictive capacity of dichotic performance on binaural performance across languages. We found listeners’ right-ear (left-hemisphere) performance to be a better predictor of overall accuracy in their native language than in a foreign one. The enhanced predictive capacity of the classically language-dominant left hemisphere on overall talker-identification accuracy demonstrates functionally integrated neural systems for speech and voice perception during natural talker identification.
Human beings’ complex social environment has evolved hand-in-hand with two other distinct capacities: our ability to identify other people as individuals, and our ability to convey complex ideas to them via spoken language. Our person perception abilities allow us to distinguish and uniquely identify numerous other individual human beings (e.g. Bahrick, Bahrick, & Wittlinger, 1975) – an ability that is realized in the auditory modality as voice recognition or talker identification. Despite the social importance of talker identification, the biological bases of this ability remain poorly understood. A small but growing neuroscience literature has begun to identify the brain bases of voice perception abilities; however, neural studies of voice perception have generally not attempted to address a functional integration between speech- and voice-perception systems (cf. Knösche, Lattner, Maess, Schauer, & Friederici, 2002; Kaganovich, Francis, & Melara, 2006). Our social auditory experiences are monopolized by listening to speech, predominately that which we can understand. In this regard, recent behavioral research has demonstrated bidirectional interactions between the cognitive abilities of speech and voice perception: Differences in linguistic proficiency affect talker identification abilities both across and within languages (Goggin, Thompson, Strube, & Simental, 1991; Perrachione & Wong, 2007; Perrachione, Chiao, & Wong, 2008), and differential exposure to voices impacts speech perception abilities across talkers (e.g. Nygaard & Pisoni, 1998). Such integration across domains is the hallmark of complex neural information processing mechanisms. However, little research has examined the mechanisms by which the auditory system integrates information across cognitive domains.
Talker identification abilities provide a unique opportunity to understand these integration mechanisms, as neuroscientific evidence suggests speech and voice perception abilities are predominately realized in the left and right cerebral hemispheres, respectively (e.g. Hickok & Poeppel, 2000; von Kriegstein, Eger, Kleinschmidt, & Giraud, 2003). In the following study, we use dichotic listening to investigate how biological integration between the neural systems responsible for speech and voice perception may underlie the documented cognitive integration of these two abilities during talker identification. In particular, we hypothesize that, compared to a foreign language, native-language talker identification tasks will draw increasingly on linguistic processes supported by neural circuits in the left hemisphere.
The biological systems responsible for spoken language have been extensively researched. Since Carl Wernicke’s seminal post-mortem examinations of aphasic patients in the 1800s, we have known neural circuits in the posterior superior temporal lobe of the left hemisphere to be principally implicated in speech perception. After a century of neurolinguistic research, including the advent of functional neuroimaging, circuits in the left hemisphere remain consistently implicated as the neural basis for spoken language perception (e.g. Galuske, Schlote, Bratzke, & Singer, 2000). Contemporary work on primate evolution has furthermore pointed at the potential phylogenetic significance and underpinning of functional lateralization of the human language faculty (e.g. MacNeilage, 1990; 1998). However, little work has investigated how the neural circuits for speech perception might themselves be functionally integrated with regions or networks subserving other auditory and cognitive abilities. Recent research has begun to elucidate how, for example, music and speech processing may rely on overlapping neural circuits (Price, Thierry, & Griffiths, 2005; Wong, Skoe, Russo, Dees, & Kraus, 2007), further raising the possibility that other auditory abilities, such as voice perception, may have an intrinsic association with speech processing in the brain.
The neural systems responsible for talker identification are comparatively less well understood, with widespread investigation of this topic beginning after the discovery of voice-selective regions bilaterally in the human superior temporal lobe (Belin, Zatorre, Lafaille, Ahad, & Pike, 2000). Subsequent neuroimaging work has elaborated a right-hemisphere network for voice perception, predominately implicating areas in the right superior temporal sulcus (Belin, Zatorre, & Ahad, 2002; Warren, Scott, Price, & Griffiths, 2006; Koeda et al., 2006; Lattner, Meyer, & Friederici, 2005). This region has been shown to respond preferentially to human vocalizations versus those of other animals (Fecteau, Armony, Joanette, & Belin, 2004), as well as to represent familiar and unfamiliar voices differentially (von Kriegstein & Giraud, 2004). These contemporary neuroimaging studies agree with the results of earlier neuropsychological studies implicating the right hemisphere as the primary locus of voice perception. Van Lancker and Canter (1982) found that right-hemisphere lesions more frequently led to impairment in voice recognition than did left-hemisphere lesions. Similarly, Van Lancker and Kreiman (1987) showed that patients with right-hemisphere lesions were more impaired in the recognition of familiar voices (an impairment known as “phonagnosia”) than were those with left-hemisphere lesions or healthy controls. Although studies of disorders resulting from neuropsychological conditions or brain injury can reveal the system components necessary for talker identification, they are inherently limited in their ability to delineate how these systems function normally in the intact brain. Moreover, studies of brain injury do not rule out a contribution from left-hemisphere systems to talker identification: Van Lancker and Kreiman (1987) further found that both left- and right-hemisphere injuries resulted in impaired discrimination of unfamiliar voices.
Although previous functional neuroimaging studies have been successful in identifying regions that preferentially process the human voice, the specific methods they have employed have generally precluded any discovery, even incidentally, of the role of speech perception mechanisms during talker identification. Many neuroimaging studies of talker identification explicitly contrast patterns of neural activation elicited by attending to either the identity of a talker or the content of the speech within the same auditory stimuli (e.g. Stevens, 2004; von Kriegstein et al., 2003). Such statistical contrasts only show differences in neural activity between speech and voice perception, and are unable to demonstrate the ways in which speech- and voice-perception systems might be functionally integrated during natural talker identification behaviors. The other method used to identify voice-selective regions contrasts vocal with nonvocal auditory stimuli (e.g. Belin et al., 2000), which confounds speech and voice into a single stimulus category. Because speech and voice information are conveyed in the same acoustic signal, it is difficult to examine the unique cognitive and perceptual effects of each, much less how they might be functionally integrated. There are, however, early indications from some neuroimaging studies hinting at how speech and voice information might be integrated. Belin et al. (2002) showed increased activation in voice-selective regions when listening to speech versus non-speech vocal sounds. Using a functional connectivity analysis, von Kriegstein and Giraud (2004) found correlated activity between the right posterior temporal lobe and two regions in the left temporal lobe during a talker identification task. At least two electrophysiology studies have also directly confirmed that speech and voice information are integrated even in preattentive auditory processing (Knösche et al., 2002; Kaganovich, Francis, & Melara, 2006).
Besides these recent discoveries in neuroimaging, a substantial behavioral literature has consistently shown that the cognitive abilities of speech and voice perception bidirectionally integrate information between the two systems. First, there is growing behavioral evidence that in addition to structural information about a talker’s voice (e.g. vocal tract length, oral cavity volume, and dynamic range of the fundamental frequency), talker identification also relies on the idiosyncratic nuances of his or her speech (phonetics). Listeners who are familiar with the language being spoken demonstrate superior talker-identification abilities relative to listeners unfamiliar with that language – a phenomenon known as the language-familiarity effect in talker identification (Perrachione & Wong, 2007; Goggin et al., 1991). Listeners with no familiarity with a foreign language appear significantly impaired in achieving native-like accuracy at identifying voices speaking that language, even after substantial training (Perrachione & Wong, 2007), which stands in contrast to other auditory abilities, such as consonant or tone identification, which can reach native-like levels after short-term training (Jamieson & Morosan, 1989; Wong & Perrachione, 2007). This suggests that the targets of natural talker identification are not only the structural voice cues independent of language, but also the phonetic idiosyncrasies of individual talkers. Phonetic perception is likely not a one-way street from acoustic stimulus to auditory percept; experience and expectation have also been shown to play a role in the perception of speech. Although very young infants can discriminate between nearly all human speech sounds, this ability is lost as infants become familiar with the subset of sounds relevant to their native language (Werker & Tees, 1984), and increased exposure facilitates the perception of certain phonetic distinctions by native speakers (Polka, Colantonio, & Sundara, 2002).
Moreover, top-down influences on perception, such as lexical frequency and the phonotactics of one’s native language also add bias to phonetic perception (Samuel, 1997; Davis & Johnsrude, 2007). When listeners are identifying voices speaking a familiar language, they are able to take advantage of deviations from their stored auditory representations to help distinguish individual talkers (i.e. intertalker phonetic variability). When the language is unfamiliar, however, there are no stored representations against which to meaningfully compute variability, and therefore any phonetic cues to talker identity are either absent or unreliable. Successful integration of information from the speech-perception system has been proposed as the cognitive basis for the language-familiarity effect in talker identification (Perrachione & Wong, 2007).
A sizable speech perception literature has demonstrated that the converse is also true – variability due to voice can affect the perception of speech. Bradlow, Nygaard, and Pisoni (1999) showed that listeners were faster and more accurate at recognizing words spoken by familiar versus unfamiliar voices. Mullennix and Pisoni (1990) showed that as variability in an unattended feature (e.g. voices) increased, the speed of identification of an attended feature (e.g. phonetic information) decreased, suggesting significant processing effort is expended to account for talker variability during speech perception. In a compelling example of the role of stored representation in auditory perception, Johnson (1990) showed that two identical acoustic stimuli would be perceived as different vowels when listeners had prior expectations about the identity of the speaker (see also Broadbent, Ladefoged & Lawrence, 1956). Similarly, Nygaard, Sommers, and Pisoni (1994) showed that after training to recognize the voices of individual talkers, listeners were subsequently more accurate at recognizing untrained speech tokens from those talkers. Taken together, these results support talker-contingency in speech perception mechanisms, and hint at the possibility of the converse: speech-contingent mechanisms in voice recognition. Despite decades of such behavioral work showing functionally integrated systems for speech and voice perception, current models of talker identification based only on neural data fail to capture or explain this online bidirectional exchange of information (Belin, Fecteau, & Bédard, 2004; Campanella & Belin, 2007).
Here we employ dichotic listening as a measure of the underlying functional neurological configuration. The dichotic listening paradigm is predicated on the notion that stimuli presented to the left ear are primarily processed by the right hemisphere and vice versa. Dichotic listening also involves the simultaneous presentation of a distracter or masking stimulus to the unattended ear to “overwhelm” the contralateral auditory system and decrease the incidence of interhemispheric transfer effects, distinguishing this paradigm from monaural listening (Kimura, 1967). Superior behavioral performance while attending one ear is called an “ear advantage,” reflecting the advantage enjoyed by the contralateral auditory system for processing a stimulus. Dichotic listening paradigms have been used extensively to assess language laterality, and researchers have consistently found right-ear (left-hemisphere) advantages for speech sounds (see Tervaniemi & Hugdahl, 2003, for a review). Recent work in neuroscience has confirmed the underlying tenets of this paradigm. Dichotic listening by patients following cerebral hemispherectomy reveals virtually complete suppression of ipsilateral auditory input (de Bode, Sininger, Healy, Mathern, & Zaidel, 2007). In healthy control subjects, sounds presented to either ear elicit stronger activation in the contralateral auditory network, including inferior colliculus, medial geniculate body, and auditory cortex, as measured by fMRI (Schönwiesner, Krumbholz, Rübsamen, Fink, & von Cramon, 2007; cf. Devlin et al., 2003). Similarly, behavioral performance on speech sound discrimination as measured by dichotic listening is significantly predicted by differences in the latency of the left and right N1 components of auditory evoked potentials – an electrophysiological correlate of early cortical auditory processing (Eichele, Nordby, Rimol, & Hugdahl, 2005).
In this study, we test the hypothesis that neural circuits in the classically “language-dominant” left hemisphere play a greater role in talker identification in listeners’ native language than a less familiar language. A greater role of the left hemisphere in native-language talker identification provides a neuroscientific explanation for the language-familiarity effect: Speech perception abilities supported by left-hemisphere neural circuits are integrated with voice perception processes from circuits in the right hemisphere, resulting in more extensive information about talker identity and therefore superior native-language performance. If, as we predict, successful integration of information from speech perception mechanisms is the underlying factor of the language-familiarity effect, then right-ear dichotic performance should better predict overall talker identification accuracy in a native language than a non-native one.
Two groups of listeners participated in this study whose native language (L1) was either American English or Mandarin Chinese. The English L1 group consisted of 14 individuals (12 females) age 18 to 29 years (M = 22.2). None of the participants in the English L1 group had any familiarity with the Mandarin language. The Mandarin L1 group consisted of 13 individuals (9 females) age 18 to 31 years (M = 23.6). At the time of the experiment, the Mandarin L1 participants were living in the United States and had functional English language skills, reporting between 0 and 17 years (median = 7) since their first exposure to English. All Mandarin participants reported speaking predominately Mandarin growing up. (Previous studies have shown that second-language learners still exhibit a language-familiarity effect in talker identification, although they can overcome it with specific training (Perrachione & Wong, 2007).) Participants were all right-handed (Oldfield, 1971) and reported no auditory or neurologic deficits. Participants gave informed written consent overseen by the Northwestern University Institutional Review Board and were compensated with a small cash payment. An additional nine participants (four Mandarin L1, five English L1) were recruited for the study, but were excluded from the subsequent analysis because they performed at or below chance (~20%) on one of the six conditions described below.
Stimuli consisted of recordings of ten sentences in each language condition (Mandarin or English) (Open Speech Repository, 2005), which had been used in our prior talker identification behavioral study (Perrachione & Wong, 2007). These sentences are reproduced here in Appendix A. The English sentences were read by five male native speakers of American English (age 19–26, M = 21.6), and the Mandarin sentences were read by five male native speakers of Mandarin Chinese (age 21–26, M = 22.6). No talker read sentences in both languages, and no one who produced stimuli took part in the listening experiment. Talkers were recorded 1–3 years prior to the sample of listeners participating in this study, lessening the likelihood they came from acquainted peer groups, and no listener reported prior familiarity with any of the voices in the study. Talkers were asked to read the sentences naturally, as though they were having a conversation with a friend. Recordings were made in a sound-attenuated chamber via a SHURE SM58 microphone using a Creative USB Sound Blaster Audigy 2 NX sound card onto a Pentium IV PC. Recordings were digitally sampled at 22.05 kHz using in-house software (Wavax for Windows v2.3) and normalized for RMS amplitude to 70 dB SPL. (Note that amplitude normalization likely removed overall loudness as a cue to talker identity, although individual patterns of amplitude variation are preserved.) Five sentences in each language were arbitrarily designated as “training sentences” and the remaining five as “untrained sentences.”
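The RMS normalization step can be sketched as follows. This is an illustrative Python reimplementation, not the original Wavax routine; the `target_rms` value is an arbitrary placeholder, since the mapping from digital amplitude to 70 dB SPL depends on playback calibration.

```python
import numpy as np

def rms_normalize(signal: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Scale a waveform so its root-mean-square amplitude equals target_rms.

    A single uniform scale factor equates overall level across recordings
    (removing loudness as a cue to talker identity) while preserving each
    talker's relative pattern of amplitude variation within the sentence.
    """
    x = signal.astype(float)
    rms = np.sqrt(np.mean(x ** 2))
    return x * (target_rms / rms)
```

Because the scaling is uniform over the whole recording, within-sentence amplitude dynamics (e.g. stress patterns) are unchanged; only the global level is equated.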
It bears noting that in the present experiment we intentionally used stimuli recordings of full sentences. Much of the behavioral work on the language-familiarity effect has been conducted with sentence-length stimuli or longer (Thompson, 1987; Goggin et al., 1991; Perrachione & Wong, 2007; cf. Winters, Levi, & Pisoni, 2008), whereas many of the previous behavioral studies investigating the role of voice variation on speech perception opted to use recordings of isolated words (e.g. Bradlow, Nygaard, & Pisoni, 1999). The use of sentence-length stimuli provided not only a richer phonetic environment from which to compute talker identity compared to isolated words, but also sentence-level linguistic information absent from isolated words, such as patterns of intonation, stress, and coarticulation. Additionally, work by Nygaard and Pisoni (1998) indicated listeners learned talker identity faster from sentences than single words.
The experimental design was derived from our previous study, and has previously been shown to effectively measure talker identification ability (Perrachione & Wong, 2007). Participants performed the task in both of two language conditions: English and Mandarin, completing the task in one language before undertaking the other. The order of conditions was counterbalanced across participants. Each condition consisted of a practice phase and a test phase, as illustrated schematically in Fig. 1.
During the practice phase, participants were familiarized with the voices to be recognized. Auditory stimuli were presented binaurally over headphones while participants directed visual attention to instructions on a computer monitor. Each voice read one of the five training sentences while a number designating that voice (1, 2, 3, 4, or 5) appeared on the computer monitor. After participants had heard each voice read the sentence twice, they practiced identifying the voices and received feedback. One of the five voices would read the training sentence, and the participant would indicate which voice they believed was speaking by pushing an appropriate button on a computer keyboard. The computer automatically informed participants whether they had answered correctly and, if they answered incorrectly, also displayed the correct answer. Participants practiced identifying the voices ten times from each training sentence. Then the participants listened to the five voices reading the next sentence, and practiced identifying them from that sentence in the same way as before. This was repeated until all five voices had read all five training sentences, resulting in approximately 10 minutes of training altogether.
After practicing, participants were tested on their ability to identify the voices from the untrained sentences. Novel utterances were used to ensure participants had learned to recognize the unique features of each talker, and were not relying on more general auditory memory for the practice stimuli.
Participants first identified the voices from dichotic presentation. As shown in Fig. 1, they were directed to attend to one ear (the “target voice”) while ignoring the other ear (the “mask voice”) for blocks of 25 stimuli. The target ear was always indicated on the computer monitor during each trial. For each trial, the same sentence was read by two different voices, and the target and mask voices were presented separately to each ear. For each ear, each voice served as the target an equal number of times, and each voice served as the auditory mask in the opposite ear an equal number of times. This resulted in 200 stimulus presentations in the dichotic test (5 target voices × 4 possible mask voices × 5 sentences × 2 ears = 200 trials). The ear participants were directed to attend first was counterbalanced across participants.
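The trial structure above can be verified with a short enumeration; the voice and sentence labels here are illustrative placeholders, not the actual stimuli.

```python
from itertools import product

voices = [1, 2, 3, 4, 5]                     # the five talkers in one language condition
sentences = ["s1", "s2", "s3", "s4", "s5"]   # the five untrained test sentences
ears = ["left", "right"]

# A trial pairs a target voice with a *different* mask voice reading the
# same sentence, with attention directed to one ear.
trials = [
    (target, mask, sentence, ear)
    for target, mask, sentence, ear in product(voices, voices, sentences, ears)
    if target != mask
]

assert len(trials) == 200  # 5 targets x 4 masks x 5 sentences x 2 ears
# Each voice is the target equally often (4 masks x 5 sentences x 2 ears = 40):
assert all(sum(t[0] == v for t in trials) == 40 for v in voices)
```

The same count holds symmetrically for the mask role, so no voice is over-represented in either position.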
Participants concluded each language condition with a binaural test, which served as a measure of overall (baseline) talker identification accuracy. This test was similar to a typical listening experiment and was designed to resemble natural listening conditions, as well as be comparable to prior talker-identification studies. In our design, we opted for the binaural test to always follow the dichotic tests, rather than fully counter-balance their order (Fig. 1). This design was chosen to avoid the situation where some participants had relatively more experience with the voices before beginning the dichotic condition (see e.g. Karpicke & Roediger III, 2008, on the impact of explicit recall on learning), which may have complicated subsequent analysis and interpretation of the results. We consider performance on the binaural test to be the closest laboratory approximation of natural talker-identification ability. The same stimulus was played simultaneously to both ears while participants identified the voice. Each voice read each of the test sentences again during the binaural test, for a total of 25 trials (5 voices × 5 sentences = 25 trials). Participants did not receive feedback in either the dichotic or binaural portions of the test phase. After completing one language condition, participants were offered a short rest before repeating the task in the other condition. In total, the experiment lasted about 50 minutes.
Participants’ performance was assessed by accuracy (defined as the number of correct trials out of the total number of trials to which participants responded) and was measured separately for each ear during the dichotic test, and overall in the binaural test. To understand how neural circuitry in each hemisphere differentially contributes to ecological talker identification, we compared the predictive capacity of left-ear and right-ear accuracy on binaural performance between language conditions using correlation tests. We also submitted participants’ scores from the dichotic listening test to a repeated measures analysis of variance, with Ear (left vs. right) and Condition (English vs. Mandarin) as within-subject factors, and Group (English L1 vs. Mandarin L1) as a between-subjects factor. Using the ANOVA allowed us additionally to determine whether the dichotic task replicated the language-familiarity effect (Perrachione & Wong, 2007), as well as whether participants exhibited any overall ear-advantages for the talker identification task.
The hypothesis we advance in this paper is that neural circuitry in the left hemisphere contributes differentially to talker-identification abilities in a familiar language compared to an unfamiliar one. This effect is independent of any overall hemispheric advantage for the task in general, a consideration we will return to below. Determining whether left-hemisphere neural circuits contribute differentially to native versus foreign language talker identification requires an explicit test between the predictive capacity of the left hemisphere on overall performance in one language condition versus its predictive capacity in the other condition. We implemented this test by first determining the correlation coefficients of performance in the left and right ears with overall performance. The statistic used in these tests was Spearman’s rho (ρ) because the data were bounded proportions (percentages) and sometimes deviated from a normal distribution. Subsequently, we tested the relative strength of the correlation coefficients of each hemisphere across language conditions (Table 1), as well as across hemispheres within a language condition (Table 2). This analysis technique allowed us to assess the relative contribution of each hemisphere to the natural (binaural) task independent of either any neurologic left-right processing bias (i.e. the ear-advantages or hemispheric asymmetries) or overall better performance in one language than the other (i.e. the language-familiarity effect), which will be considered later. (One Mandarin L1 participant was excluded from the correlation analysis in the English condition because a technical fault during the binaural test caused no data to be collected there; another was excluded from the analysis in the Mandarin condition as an outlier whose performance fell more than 2.5 standard deviations below that group’s mean.)
To assess whether accuracy in either ear was a better predictor of overall accuracy, the difference (z) between each pair of normalized correlation coefficients was computed to determine whether the two correlations had the same strength. Correlation coefficients were normalized using the Fisher-z transformation, and the difference between normalized correlation coefficients was converted to a z-score (Fisher, 1921). These tests investigated whether one ear was a better predictor of overall accuracy in one language versus the other (Table 1), and whether, within a language condition, performance from one ear was a better predictor of overall accuracy than the other ear (Table 2).
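The comparison of correlation coefficients described above can be sketched in a few lines. This is a generic implementation of the independent-samples Fisher z test; the correlation values and sample sizes in the usage note below are hypothetical placeholders, not the study's data.

```python
import math

def fisher_z(r: float) -> float:
    """Fisher z-transformation: maps r in (-1, 1) to an approximately
    normally distributed statistic, z = atanh(r)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
    """z-score for the difference between two correlation coefficients
    estimated from independent samples of sizes n1 and n2.

    The standard error of the difference between the transformed
    coefficients is sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    """
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se_diff
```

For example, `compare_correlations(0.85, 14, 0.40, 14)` would test whether a right-ear/binaural correlation of .85 in one language condition is reliably stronger than one of .40 in the other for groups of 14 listeners; a resulting z beyond 1.645 reaches significance at p < .05 one-tailed.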
English L1 participants demonstrated a significant difference in their correlation coefficients for right-ear accuracy in the English versus Mandarin conditions [z = 1.916, p < 0.03]. For English L1 participants, accuracy in their right ear (left-hemisphere neural circuits) was a better predictor of overall accuracy when listening to their native language versus a foreign one. The same test for the left ear was not significant. Mandarin L1 participants, meanwhile, also demonstrated a significant difference in their correlation coefficients for right-ear accuracy in the English versus Mandarin conditions [z = −1.646, p < 0.05]. For Mandarin L1 participants, accuracy in their right ear (left-hemisphere neural circuits) was also a better predictor of overall accuracy when listening to their native language than a non-native one. Again, the same test for the left ear was not significant. Thus, for both participant groups, accuracy in their right ear (classically language-dominant left hemisphere) was a better predictor of overall accuracy in their native language than in the non-native one. This was not the case for the left ear (right hemisphere with putative voice/indexical processing circuits). Fig. 2 shows the correlations between right-ear accuracy and overall accuracy for both participant groups in each language condition. As evident from the graphs, points representing performance in either group’s native language adhere much more closely to the regression model than those representing the non-native language. There were no reliable differences between the correlation coefficients for either participant group within language condition (Table 2).
Participants’ performance during the dichotic test was also submitted to a repeated-measures ANOVA (see details above). Similar to prior behavioral studies of talker identification (Goggin et al., 1991; Perrachione & Wong, 2007), we found a significant Group × Condition interaction [F(1,25) = 46.640, p < 3.7 × 10⁻⁷], indicating English L1 participants were more accurate when identifying English voices, and Mandarin L1 participants were more accurate on Mandarin voices (Fig. 3). The magnitude of this effect was also similar to that of previous studies. There was no main effect of Condition, confirming neither set of voices was overall easier to identify. There was also a marginal effect of Group [F(1,25) = 4.072, p = 0.054], likely owing to slightly higher performance by the Mandarin L1 participants overall.
The ANOVA revealed a significant Condition × Ear interaction [F(1,25) = 5.068, p < 0.034], which represents a significant left-ear (right cerebral hemisphere) advantage for both participant groups when identifying voices speaking English (Fig. 3, left columns). There was no reliable ear-advantage for either group when listening to Mandarin. Correspondingly, there was neither a Group × Ear interaction, nor a three-way Condition × Group × Ear interaction. The main effect of Ear was also significant [F(1,25) = 4.413, p < 0.046] and was likely driven by superior left-ear performance of both groups in the English condition. We consider possible mechanisms underlying this disparity between languages below.
These results provide evidence for the functional integration of (classically language-dominant) left-hemisphere neural circuitry in a talker identification task, and more generally evidence for the effects of within-modality neural integration of auditory information on human behavior. Overall talker identification accuracy was better predicted only by participants’ right-ear (left-hemisphere) performance in their native language versus a non-native one. The fact that right-ear performance differs in its predictive capacity on overall performance between languages, whereas left-ear performance does not, suggests the difference in accuracy between native- and foreign-language talker identification (the language familiarity effect) is a product of processes relying on left-hemisphere neural circuitry – most likely that of processing phonetic information related to spoken language perception.
This neuroscientific evidence for the functional integration of voice and speech in the brain during ecological talker identification is indicative of larger principles of neural organization and function. The brain has likely encountered selective pressures to develop mechanisms that maximize information about the environment, which are realized in the tendency to integrate and use available information from multiple modalities or domains whenever possible. For example, the identification of talkers by voice is enhanced when participants have both auditory and visual information available during learning (Sheffert & Olson, 2004; von Kriegstein & Giraud, 2006). Changes in auditory information can similarly affect perceptual decisions about motion, even when visual information is held constant (Ecker & Heller, 2004). In fact, numerous phenomena, both cognitive and clinical, arise when information across modalities is not integrated normally, including tinnitus (Kaltenbach, 2006), Capgras’ syndrome (Hirstein & Ramachandran, 1997), and the well-known McGurk effect (McGurk & MacDonald, 1976). Here we see that even within the auditory modality, the brain integrates information from different domains (speech and voice) to optimize performance for identifying talkers, which we observe behaviorally as a language-familiarity effect.
Although carefully controlled experiments can reveal specific brain regions that are primarily dedicated to distinct functions such as speech or voice perception (e.g. von Kriegstein et al., 2003), the brain is likely to have developed to maximally integrate information from many sources to form enriched percepts. (How exactly the brain manages this task, often called the “binding problem”, remains one of the most important questions in neuroscience today). The very complexity of the human brain is defined by how these two principles, modularization and integration, have governed its development phylogenetically and ontogenetically. The language-familiarity effect, with its neural basis in the integration of speech and voice information, is one such example of how selective pressures for integrating information in general may have shaped the cognitive and perceptual systems underlying human behavior, in this case social auditory perception.
In addition to the evidence for a biological integration between speech and voice perception systems reported here, the rest of our results both replicate and extend our current understanding of the neural systems responsible for talker identification. Here we unambiguously replicated the language-familiarity effect seen in other studies of interlingual talker identification (Perrachione & Wong, 2007; Goggin et al., 1991; Thompson, 1987). It is interesting to consider the robustness of the language-familiarity effect across participant groups in light of the fact that the English L1 listeners had no familiarity with Mandarin, whereas the Mandarin L1 listeners had functional English language skills (as they were living in the United States at the time of the experiment, generally as students, as university researchers, or as family and friends of such individuals). Both groups exhibited superior talker-identification skills in their native language, despite the familiarity of English to the Mandarin subjects. These results are consistent with our previous study (Perrachione & Wong, 2007), which used similar subject populations. In that study, Mandarin L1 subjects with some English proficiency could be trained to overcome the language-familiarity effect, whereas English L1 subjects with no Mandarin proficiency could not. The manifestation of the language-familiarity effect in these Mandarin-English bilinguals may be understood by analogy to the spatial reasoning abilities of infants. Infants may show an understanding of a concept (e.g. height, color) and the ability to use it in some situations (e.g. visual occlusion events), but are unable to make use of the same concept in another situation (e.g. a containment event) (Hespos & Baillargeon, 2007). Similarly, Mandarin-English bilinguals may be able to use their knowledge of English phonology in one situation (speech perception) without its being immediately available in a second situation (talker identification).
Explicit training on English talker identification facilitates Mandarin-English bilinguals’ bringing phonological information to bear in this task (Perrachione & Wong, 2007), whereas it may not be fully available without such training (present results).
Our results also suggest a general left-ear advantage for identifying voices speaking English, indicating right-hemisphere neural circuitry is optimized for this task. This result is consistent with the neuroimaging literature, which has begun to converge on regions of the right temporal lobe, especially the superior temporal sulcus, as the primary locus of the human voice-perception system (e.g. Belin et al., 2002; von Kriegstein et al., 2003). By demonstrating a left-ear advantage for a talker identification task, we show that dichotic listening paradigms are effective at eliciting a behavioral correlate of the functional lateralization of voice perception abilities. Early dichotic listening studies were largely equivocal in identifying hemispheric specialization for voice perception abilities (Doehring & Bartholomeus, 1971; Tartter, 1984; Kreiman & Van Lancker, 1988; Landis, Buttet, Assal, & Graves, 1982). This may be due in part to the challenge of working with single-word stimuli, which are phonetically impoverished compared to full sentences. However, recent work, including that of Francis and Driscoll (2007), González and McLennan (2007), and the present study, has been more consistent in revealing a rightward specialization for talker-specific processing, thus providing converging evidence with results from neuroimaging and neuropsychology for the specialization of right-hemisphere neural circuitry for such tasks. The converging results from these studies may also lend empirical support to models that attribute person identification generally to neural systems situated in the right hemisphere. A growing literature exists in which right-hemisphere correlates for person identification are described in multiple domains, including Van Lancker and Canter (1982), Le Grand et al. (2003), von Kriegstein et al. (2005), Neuner and Schweinberger (2000), and Lewis et al. (2001).
The presence of an overall left-ear advantage when listening to English voices is independent of and complementary to the primary result of our study, that differential contributions of the left hemisphere between language conditions are responsible for giving rise to the language-familiarity effect. Even though the right hemisphere may be functionally optimized for talker identification tasks, our results show that differences in the availability of information provided by left-hemisphere neural circuits appear to underlie the accuracy difference listeners exhibit when identifying talkers speaking a familiar versus unfamiliar language. In particular, these results suggest that native-language talker identification relies on extensive integration between the information processing mechanisms for speech and voice perception, situated primarily in the left and right hemispheres, respectively. When integration between these mechanisms is impeded, such as when listening to an unfamiliar language, the reduced informativeness of the signal manifests behaviorally as the decreased performance observed in the language-familiarity effect.
It is additionally interesting that we did not see a similar left-ear advantage for either listener group identifying the voices speaking Mandarin. As in English, prior research has shown a right-ear advantage for identifying Mandarin speech sounds (Ip & Hoosain, 1993; Ke, 1992), despite Mandarin being a tone language. In fact, because of the linguistic relevance of lexical tones, native Mandarin speakers exhibit a right-ear advantage for identifying lexical tones (Wang, Jongman, & Sereno, 2001; Wong, Behne, Jongman, & Sereno, 2004), a result consistent with the neuroimaging literature (Wong, Parsons, Martinez, & Diehl, 2004; Gandour et al., 2004). It is worth considering briefly why an equivalent left-ear advantage for the identification of Mandarin-speaking voices did not appear. In addition to functional differences between the left and right temporal lobes for speech and voice processing, a substantial literature has suggested these neural circuits are also sensitive to physical differences in auditory stimuli (Zatorre & Belin, 2001). The left superior temporal lobe appears to exhibit a preference for processing rapidly-varying temporal features of sounds, whereas slowly-varying temporal features are preferentially processed on the right (Boemio, Fromm, Braun, & Poeppel, 2005). Unlike English, which uses slowly-varying intonation contours across an entire sentence, Mandarin is a tone language with different pitch contours on every syllable (Eady, 1982). It may be the case that a leftward asymmetry for processing the rapidly varying temporal envelope of Mandarin obscured the detection of any rightward asymmetry for voice perception in the current dichotic listening paradigm. Because dichotic listening is a behavioral proxy for underlying neurologic organization, it is inherently less sensitive than physiologic measures such as ERP or fMRI, and thus more likely to result in Type II error (e.g. Bethmann, Tempelmann, De Bleser, Scheich, & Brechmann, 2007).
For example, the magnitude of the right-ear advantage for speech can be significantly attenuated under unfavorable manipulations of attention (Mondor & Bryden, 1991). Additionally, the spatial resolution of dichotic listening paradigms is severely limited – whereas sensitive neuroimaging techniques can localize activity within millimeters, dichotic listening paradigms can only indicate left- or right-hemisphere advantages. However, such an explanation for the difference in ear-advantages for identifying English and Mandarin voices is merely speculative, and further research will be necessary to address the interaction between functional and acoustic asymmetries on dichotic listening in a hypothesis-driven manner. For example, a return to the use of isolated words as opposed to sentence-length stimuli may facilitate demonstration of such an effect, as single words show substantially smaller differences in pitch contour dynamics between languages.
Regardless of any potential differences in processing asymmetries or the ability of dichotic listening to detect them, these results provide compelling evidence for a biological integration between neural circuitry in the left and right cerebral hemispheres for the ecological identification of talkers. Listeners who understand the language being spoken exhibit an advantage for identifying individual talkers over listeners for whom it is unfamiliar. The language-familiarity effect for talker identification is a prominent example of how evolution has tuned the human nervous system to integrate information across domains to produce an optimal representation of the environment. The advanced social structure and complex linguistic capacity for representing and communicating the world are realized in communities where speakers and listeners predominately share the same language. As such, natural talker identification is able to draw on not only the ability to distinguish the auditory features of vocal mechanism structure, but also the ability to perceive the dynamic, phonetic idiosyncrasies of individual talkers’ speech.
The authors wish to thank Geshri Gunasekera and Tasha Dees for their assistance conducting this research, Ajith Kumar Uppunda and Matt Goldrick for insightful discussions of the project, and Ann Bradlow for discussions about both the project and manuscript. We also thank the Editor and two anonymous reviewers for helpful comments and discussion. Portions of this work were presented at the International Congress of Phonetic Sciences 2007 (Saarbrücken, Germany). This work is supported by Northwestern University, the National Institutes of Health (U.S.A.) grants HD051827 & DC007468 awarded to P.W., and a grant from the James S. McDonnell Foundation to J.P.
Stimulus sentences were taken from those made available on the Open Speech Repository (2005) database online.
Publisher's Disclaimer: The following manuscript is the final accepted manuscript. It has not been subjected to the final copyediting, fact-checking, and proofreading required for formal publication. It is not the definitive, publisher-authenticated version. The American Psychological Association and its Council of Editors disclaim any responsibility or liabilities for errors or omissions of this manuscript version, any version derived from this manuscript by NIH, or other third parties. The published version is available at www.apa.org/journals/xhp.