J Exp Psychol Hum Percept Perform. Author manuscript; available in PMC 2012 October 1.
Published in final edited form as:
PMCID: PMC3179795

Perceptual Adaptation to Sinewave-vocoded Speech Across Languages


Listeners rapidly adapt to many forms of degraded speech. What level of information drives this adaptation, however, remains unresolved. The current study exposed listeners to sinewave-vocoded speech in one of three languages (German, Mandarin, or English), thereby manipulating the type of information shared between the training language and the testing language (English), in either an audio-visual (AV) modality or an audio plus still frames modality (A+Stills). Three control groups were included to assess procedural learning effects. After training, listeners’ perception of novel sinewave-vocoded English sentences was tested. Listeners exposed to German-AV materials performed equivalently to listeners exposed to English-AV or English-A+Stills materials and significantly better than two control groups. The Mandarin groups and the German-A+Stills group showed an intermediate level of performance. These results suggest that full lexical access is not absolutely necessary for adaptation to degraded speech, but that AV training in a language phonetically similar to the testing language can facilitate adaptation.

Keywords: perceptual adaptation, vocoded speech, cross-language, degraded speech, speech perception

Listeners are able to understand speech despite extensive across- and within-talker variability. Understanding how our perceptual systems adapt to such variability is a fundamental question in speech perception research. Across-talker variability stems from variation in the indexical characteristics of the talker such as gender, age, native language, and dialect (e.g., Abercrombie, 1967). Within-talker variability arises due to a talker’s emotional and physiological state, characteristics of the interlocutor (such as age or hearing status), and social context (e.g., Andruski & Kuhl, 1996; Picheny, Durlach, & Braida, 1985). In addition to across- and within-talker variability, differences in listening environments also introduce significant variability to the speech signal. Characteristics of the room such as size and amount of reverberation and competing sounds such as other talkers or noise also affect the perception of the speech signal (e.g., Assmann & Summerfield, 2004; Bradley, Sato, & Picard, 2003).

Despite the presence of such variability in the acoustic signal, speech perception is extremely robust over a wide range of listening conditions. Listeners are able to rapidly adapt to novel talkers and listening environments and can perceive speech with high degrees of accuracy with little effort. The process of adaptation to speech signals occurs in several ways. First, listener experience with a particular talker increases their perceptual accuracy for that talker. As listeners become more familiar with a talker’s voice, their word recognition accuracy for that talker increases (Bradlow & Bent, 2008; Nygaard, Sommers & Pisoni, 1994; Nygaard & Pisoni, 1998). Second, listeners adapt to particular groups of talkers who share certain speech production patterns, such as an accent or regional dialect. A beneficial effect of experience on speech intelligibility has been shown for listeners with experience listening to foreign-accented speech (Bradlow & Bent, 2008; Clarke & Garrett, 2004; Sidaras, Alexander & Nygaard, 2009; Weil, 2001), novel native accents (Maye, Aslin & Tanenhaus, 2008), and speech produced by talkers with hearing impairments (McGarr, 1983). Lastly, listeners can adapt to various artificial manipulations of the speech signal, such as speech synthesized by rule (Schwab, Nusbaum & Pisoni, 1985; Greenspan, Nusbaum & Pisoni, 1988), speech that is time compressed (Dupoux & Green, 1997; Mehler et al., 1993; Pallier et al., 1998; Sebastián-Gallés, Dupoux, Costa & Mehler, 2000; Voor & Miller, 1965), noise-vocoded speech (Davis, Johnsrude, Hervais-Adelman, Taylor & McGettigan, 2005; Hervais-Adelman, Davis, Johnsrude, & Carlyon, 2008), sinewave-vocoded speech (Bent, Buchwald & Pisoni, 2009; Loebach & Pisoni, 2008; Loebach, Bent & Pisoni, 2008), and speech embedded in multi-talker babble (Bent et al., 2009).
Importantly, these benefits have been reported to extend to new talkers and to new speech signals created using the same types of signal manipulation (Bradlow & Bent, 2008; Dupoux & Green, 1997; Francis, Nusbaum & Fenn, 2007; Greenspan, Nusbaum & Pisoni, 1988; McGarr, 1983).

In addition to demonstrating that listeners are able to adapt to variability in the speech signal, a critical issue remains in determining what type of information is required for such perceptual adaptation to occur. At one extreme, perceptual adaptation could stem primarily from bottom-up acoustic information and higher-level linguistic information would not be necessary to adapt to novel speech signals. Under this view, when adapting to degraded speech signals, exposure to any type of complex acoustic signal (e.g., speech in any language or even complex non-speech sounds) would be adequate for adaptation to occur. At the other extreme, adaptation may rely heavily on top-down neurocognitive processes so that listeners would require access to all levels of language structure including syntactic, lexical, and phonological information. Under this view, listeners would show the most robust adaptation when exposed to meaningful speech in their native language. Indeed, most previous studies of perceptual adaptation to degraded speech have used real words or meaningful sentences in the listener’s native language. Since these materials provide listeners with access to information from all levels of language structure including early low-level acoustic/phonetic information as well as more abstract high-level lexical/semantic information, they cannot be used to determine what types of information are required for perceptual adaptation to occur.

Several studies have manipulated listeners’ access to different levels of language structure to assess their reliance on various types of information during adaptation. Studies using synthetic manipulations of specific phoneme contrasts suggest that listeners adjust their phonemic category boundaries on the fly as needed for specific talkers (e.g., Eisner & McQueen, 2005; Norris, McQueen & Cutler, 2003) and, in certain instances, for groups of talkers from specific native dialects (Kraljic & Samuel, 2006, 2007). In these studies, ambiguous sounds were embedded in words that did not participate in minimal pairs with the target sounds. For example, when listeners were presented with an ambiguous sound between /f/ and /s/, they would change their category boundaries depending on the words in which it was presented: exposure to words such as “sheri_” led to a broadening of the /f/ category, whereas exposure to words such as “Pari_” led to a broadening of the /s/ category. Therefore, through the use of lexical knowledge, listeners shifted their category boundaries as needed for particular talkers or groups of talkers. These studies suggest that adaptation relies on information from the lexical level, a relatively late level of processing (Eisner & McQueen, 2005; Norris, McQueen & Cutler, 2003).

Adaptation to noise-vocoded speech signals may also be most robust when listeners have access to lexical information. Davis and colleagues (2005) tested perceptual learning of noise-vocoded speech using meaningful sentences, anomalous sentences (in which all words were real English words but the sentence was semantically anomalous), Jabberwocky sentences (in which content words were replaced by nonsense words but real function words remained), and non-word sentences (in which both content and function words were non-words). Listeners showed the greatest benefit from training with meaningful sentences and anomalous sentences, an intermediate level of performance with Jabberwocky sentences, and the lowest level of performance with either no training or with nonword sentences. Similarly, Stacey and Summerfield (2008) found that training with words or sentences improved listeners’ ability to perceive words in sentences more than training that focused specifically on phoneme contrasts. These results suggest that having access to lexical information, even if the sentence as a whole is anomalous, produces robust perceptual learning. However, learning must also be occurring at a sub-lexical level as training generalized to novel lexical items.

In contrast to the findings suggesting that lexical access is necessary for perceptual learning, Hervais-Adelman et al. (2008) found no differences in perceptual adaptation to noise-vocoded words when listeners were trained with word versus non-word stimuli. As in the studies described above, listeners generalized their learning to novel words in testing, demonstrating again that perceptual learning of noise-vocoded speech involves the learning of sub-lexical units. Hervais-Adelman et al. (2008) explained the discrepancy between their results and Davis et al.’s (2005) earlier findings by appealing to short-term memory limitations that may occur when listening to entire non-word sentences.

Two studies of time-compressed speech by Pallier et al. (1998) and Sebastián-Gallés et al. (2000) also suggest that lexical information is not required for adaptation to degraded speech signals. In Pallier et al. (1998), listeners were exposed to time-compressed speech in one language and then tested on their perception of another language (either their native language for monolingual speakers or one of the native languages for bilingual speakers). Exposure in some languages, but not all, aided perception of sentences in the testing language. For example, exposure to time-compressed speech in Catalan improved perception of time-compressed Spanish materials for Catalan/Spanish bilingual and monolingual Spanish listeners. However, exposure to time-compressed speech in French did not facilitate perception of English time-compressed speech for French/English bilingual listeners or monolingual English listeners. The authors suggest that training with a language that shares similar prelexical phonological representations with the testing language will assist the listeners in processing time-compressed sentences. As such, generalization is observed between Catalan and Spanish because they share the same type of prelexical representation (i.e., they are both syllable-timed languages and have stress on the penultimate syllable). In contrast, French and English have different prelexical representations and, therefore, adaptation does not generalize between these two languages.

Sebastián-Gallés et al. (2000) employed a similar methodology with time-compressed speech but tested adaptation by Spanish listeners with training materials in Spanish, Italian, French, English, Japanese, and Greek. In their study, perceptual adaptation was found between languages that shared prelexical/rhythmic characteristics (i.e., Spanish listeners adapted when exposed to Spanish, Italian, or Greek) in the absence of lexical knowledge. However, exposure to French did not result in equivalent levels of performance, a finding that the authors hypothesized was related to differences in the vowel inventories and/or stress patterns of Spanish and French. In sum, with time-compressed speech, perceptual adaptation is observed without the influence of lexical information, but other phonological similarities between training and testing languages may be necessary to facilitate adaptation.

In the present study, we utilized the cross-linguistic approach described above to assess how exposure to one of three languages influences perceptual adaptation to sinewave-vocoded speech in English. This type of signal degradation was selected because previous studies have demonstrated that listeners are able to adapt to this form of degradation using semantic (meaningful words and sentences), lexical (anomalous sentences), and non-speech (environmental sounds) information (Loebach & Pisoni, 2008). Moreover, listeners are neither at floor nor ceiling levels of performance with this particular form of degraded speech.

Two hundred monolingual English-speaking participants were assigned to one of six training conditions or one of three control conditions. Training was presented either in a fully audio-visual (AV) modality or as audio signals accompanied by still pictures extracted from the videos (A+Stills). The participants in the AV experimental groups watched videos in one of three languages (English, German, or Mandarin Chinese) in which the audio signal had been processed using a sinewave vocoder. The participants in the A+Stills conditions heard the same audio signals as the participants in the AV conditions but were presented only with a series of static frames taken from the videos. Participants in the first control group were not exposed to any training materials and completed only the post-test phase of the experiment. Participants in the second control group watched the same videos as the English conditions, but the audio signal was spectrally rotated (Blesser, 1972) rather than sinewave-vocoded. This condition assessed procedural learning: the training materials were the same, but the signal was degraded with a different manipulation than the sinewave vocoder. Participants in the third control condition were presented with only the video from the English condition, with no audio signal. This condition was included to control for any incidental procedural learning in which participants were required to maintain attention to the training materials but were not exposed to any audio signals that could lead to auditory perceptual learning (Amitay, Irwin, & Moore, 2006). We chose to expose participants to video clips or still frames from the video clips, rather than audio alone, to ensure that participants maintained active attention, particularly in conditions in which they were listening to a language they did not know.
To ensure that participants were paying attention during the training phase particularly for conditions in which the materials were in a language that they did not understand, we asked them to answer multiple-choice questions about the thematic content of the materials after each video clip or series of still frames.

All participants completed a post-test in which they transcribed sinewave-vocoded English sentences and identified environmental sounds. We chose not to include a pre-test due to previous findings demonstrating significant rapid adaptation to noise-vocoded speech even in the absence of feedback (Davis et al., 2005). We wanted to assess the effects of exposure to speech in different languages independent of any prior experience listening to the same signal processing conditions in English. The inclusion of a pre-test could confound the effects of exposure to foreign languages in this study.

Although the three training languages selected (English, German, and Mandarin) share prelexical representations because they are stress-timed languages (Komatsu, Arai, & Sugawara, 2004; Rouas, Farinas, Pellegrino, & André-Obrecht, 2005), German and Mandarin differ substantially in their relation to English. German is closely related to English historically and there are many similarities in the phonemic and phonological structure of the two languages. For example, both English and German have similar consonant inventories and have relatively permissive syllable structures (i.e., they allow consonant clusters and a variety of consonants in coda position). In contrast, Mandarin and English are from distinct language families and have very different phonemic and phonological structures. Moreover, Mandarin has a quite restricted syllable structure compared to English and is a tonal language. Therefore, the comparison of these three training languages in the post-test measures will provide insight into the types of information that are necessary for listeners’ adaptation to sinewave-vocoded speech. If listeners trained with the English stimuli perform more accurately on the post-test than listeners trained with German or Mandarin stimuli, then this result would support the claim by Davis and colleagues (2005) that lexical information “drives perceptual learning of distorted speech” (p. 222). Listeners in the English condition would be expected to perform better than those in the German or Mandarin conditions because they have rapid access to lexical information, which helps map the distorted auditory signals to stored exemplars of lexical items in long-term memory. Another alternative is that listeners in the English, German, and Mandarin conditions will all perform similarly at post-test.
This possibility would indicate that shared prelexical representations between the language of training and the language of testing will determine generalization of learning as has been observed for time-compressed speech. An alternate interpretation is that the adaptation takes place at the acoustic level. That is, hearing complex non-speech sounds that have been processed with the vocoder will allow for some adaptation.

Recent findings have shown that perception of vocoded speech can be improved by training with non-speech sinewave-vocoded sounds. Loebach and Pisoni (2008) reported that training with feedback using sinewave-vocoded environmental sounds led to improved perception of sinewave-vocoded speech, although training on sinewave-vocoded speech did not generalize to the recognition of environmental sounds. The authors suggested that explicit training on environmental sounds, which share important information in the spectral and temporal domains with speech, aids speech perception because the training increases listeners’ attention and sensitivity to the spectrotemporal characteristics of acoustic signals shared by both environmental and speech stimuli. In the current experiment, feedback may be present for environmental sounds through the links between visual events and the distorted auditory signals.

A final possibility is that listeners exposed to the English and German conditions will perform more accurately on the post-test than the Mandarin trained listeners. This result would suggest that exposure to a training language that has similar phonemic and phonological structure to the testing language will result in more robust generalization than a training language that has different phonemic and phonological structure from the testing language. Therefore, the necessary critical component for generalization to the testing language is exposure to similar phonemic and phonological structure during training, which would facilitate the learning of sublexical units shared by both the training and testing languages. This outcome would provide support for the earlier findings of Sebastian-Gallés et al. (2000) who demonstrated that adaptation to time-compressed speech was most robust when the training and testing languages shared both rhythmic and phonological characteristics.

Training conditions with identical audio signals but different availability of visual articulatory information (the AV vs. the A+Stills conditions) were included to assess whether visual information, in the absence of lexical information, aids perceptual learning. It is possible that visual articulatory information could serve as a source of feedback regarding the mapping between the distorted audio signal and sublexical units. Training in the audio-visual modality has resulted in greater learning of non-native phoneme contrasts than audio-only training (Hardison, 2003, 2005; Hirata & Kelly, 2010). If visual speech information benefits adaptation to degraded stimuli, listeners in the AV conditions should perform better than those in the A+Stills conditions within a training language.



A total of 234 listeners were recruited from the Indiana University community to serve as participants. Listeners who were not monolingual speakers of English or who reported a current or previous speech, hearing, or language impairment were excluded from the final data analysis. Seventeen listeners were excluded because they did not meet the language requirement and three listeners were excluded because they reported a prior speech impairment (i.e., stuttering, lisp, articulation disorder). Additionally, 14 participants were excluded from analyses because they did not reach criterion (80% correct or better) on the comprehension questions following the video clips or still frames, suggesting that they may not have adequately attended to the training materials. Thus, data from a total of 200 listeners were included in the final data analyses. Listeners in the German and Mandarin training groups were required to have never studied the language they were exposed to in training. Listeners were either paid ten dollars or received credit in a Psychology course as compensation for their participation.

Stimulus Materials

The training materials included selections from commercially available language learning videos: Fokus Deutsch (Delia, Dosch Fritz, Finger, Newton, Daves-Schneider, Schneider, & DiDonato, 2000), Chinese in Action (Liu, 2004), and English for All (Pedroza, 2005). From these videos, short clips (between 46 seconds and 3 minutes 12 seconds in duration) were segmented from the original files using Final Cut Pro. Video clips were selected in which the actors spoke continuously during most of the clip and the series of actions was coherent and transparent from the video alone. The Mandarin training materials included 7 video clips; the English training materials included 9 video clips; the German training materials included 10 video clips. The total duration of all video clips was similar across training conditions (German = 16 minutes 13 seconds; Mandarin = 16 minutes 43 seconds; English = 15 minutes 49 seconds).

For the A+Stills conditions, several still frames were selected from the video clips that represented the major action of the scene. These selections allowed participants to identify the basic events that occurred during the video without being exposed to the dynamic visual articulatory information of the talkers. Key frames were selected based on the overall action of the scene in conjunction with the multiple-choice questions presented during the training session. Each frame was cut from the original movie clip, lengthened in time, and concatenated onto the previous frame such that each clip appeared as a series of several still frames that transitioned smoothly over the duration of the original video clip (see Figure 1). For example, one German training video focuses on a young woman packing a suitcase in preparation for an upcoming trip. As the scene unfolds, the mother and daughter alternate in placing items in the suitcase, with the mother hiding a box of chocolates in the suitcase while the daughter is not looking. Still frames representing the action were selected such that one frame consisted of the daughter placing an item in the suitcase, followed by the daughter with her back turned, the mother placing an item in the suitcase, a close-up showing that it is the box of chocolates, and a still of the daughter returning to the suitcase with another item of clothing. In a similar manner, conversations were conveyed by varying the frames between talkers to show turn-taking. Because the amount of action differed across the videos, the number of frames per video clip varied between 8 and 24, with an average of 15.02 frames (Mandarin = 13.00 frames, English = 19.78 frames, German = 12.20 frames).

Figure 1
Example video clip from the German A+Stills condition. Six still frames (taken from the original motion video) are presented over a span of 35 seconds corresponding to the action in the scene. Each still image is displayed while the auditory track continues ...

After selecting the video clips, the audio channel was extracted from the videos so that it could be manipulated. For the German, Mandarin, and English sinewave-vocoded training conditions, each audio wav file was processed through an 8-channel sinewave vocoder using the cochlear-implant simulator TigerCIS. Stimulus processing involved two phases: an analysis phase, which used band pass filters to divide the signal into eight nonlinearly spaced channels (between 200 and 7000 Hz, 24 dB/octave slope) and a low pass filter to extract the amplitude envelope from each channel (400 Hz, 24 dB/octave slope); and a synthesis phase, which replaced the frequency content of each channel with a sinusoid that was modulated with its matched amplitude envelope. For the spectral rotation condition, the audio tracks from the English videos were processed in Praat (Boersma, 2004) by spectrally rotating the speech around the middle frequency of 4000 Hz (Blesser, 1972). This transformation inverts the power spectrum, making the high frequencies low and the low frequencies high. Spectral rotation has been shown to significantly reduce speech intelligibility (Blesser, 1972; Scott, Blank, Rosen & Wise, 2000), even though several aspects of the speech signal remain intact (including intonation, rhythm, and periodicity). After processing either with the sinewave vocoder or spectral rotation, the audio and video streams or the audio and video stills were recombined for presentation. For the English-V-only condition (henceforth referred to as the V-only condition), the same video was presented as in the English-AV condition but participants did not receive an audio signal.
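To make the two signal manipulations concrete, the analysis/synthesis phases of the vocoder and the spectral-rotation transform can be sketched as below. This is a minimal illustration in Python (NumPy/SciPy), not the actual TigerCIS or Praat implementation; the logarithmic channel spacing, 4th-order Butterworth filters (approximately 24 dB/octave), geometric-mean centre frequencies, and the 16 kHz sampling-rate assumption for the rotation are all choices made for the sketch.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def sinewave_vocode(x, fs, n_channels=8, lo=200.0, hi=7000.0, env_cut=400.0):
    """Sketch of 8-channel sinewave vocoding (illustrative, not TigerCIS)."""
    edges = np.geomspace(lo, hi, n_channels + 1)   # nonlinearly (log) spaced channels
    t = np.arange(len(x)) / fs
    out = np.zeros(len(x))
    for k in range(n_channels):
        # Analysis: band-pass filter into one channel (~24 dB/octave slope)
        b, a = butter(4, [edges[k], edges[k + 1]], btype="band", fs=fs)
        band = filtfilt(b, a, x)
        # Envelope extraction: rectify, then low-pass at env_cut Hz
        be, ae = butter(4, env_cut, btype="low", fs=fs)
        env = filtfilt(be, ae, np.abs(band))
        # Synthesis: sinusoid at the channel centre, modulated by the envelope
        fc = np.sqrt(edges[k] * edges[k + 1])      # geometric-mean centre frequency
        out += env * np.sin(2 * np.pi * fc * t)
    return out

def spectral_rotate(x):
    """Sketch of spectral rotation about 4000 Hz, assuming fs = 16 kHz.

    With a Nyquist frequency of 8 kHz, reversing the rFFT bins maps each
    frequency f to 8000 - f, i.e. inversion of the power spectrum about 4 kHz.
    """
    X = np.fft.rfft(x)
    return np.fft.irfft(X[::-1], n=len(x))
```

Both functions return a signal of the same length as the input; in the vocoded output only the channel envelopes survive, while the rotated output preserves rhythm and periodicity but swaps high- and low-frequency content.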

The post-test assessment materials included sets of meaningful sentences, anomalous sentences and environmental sounds, all of which were processed using the sinewave-vocoder as described above. The meaningful sentences were taken from the Indiana Multi-talker Sentence Database (Karl & Pisoni, 1994) that includes recordings of 100 Harvard sentences (IEEE, 1969). From this database, 40 sentences from four talkers (2 male and 2 female) were selected, with each talker providing 10 different meaningful sentences. The anomalous Harvard sentences were selected from a corpus developed by Herman and Pisoni (2000). From this database, 20 sentences from two talkers (1 male and 1 female) were selected, with each talker providing 10 different anomalous sentences. The anomalous sentences obey English syntax but the content words have been manipulated so that they are semantically anomalous (e.g., A cone is no whole sheep on a sun). The environmental sounds were taken from an environmental sound database (Marcell, Borella, Greene, Kerr & Rogers, 2000). These sounds included 60 biologically significant sounds that are commonly heard in daily life including human non-speech sounds (e.g., snoring), animal sounds (e.g., cat meowing), nature sounds (e.g., rain) and musical sounds (e.g., bongos).

Environmental sounds were included in the post-test as a control to ensure that differences across groups on the post-test were not due to overall group differences unrelated to training. If one group of listeners performed more accurately on all post-tests including speech and environmental sounds, it is possible that some unknown factor(s) about the listeners in the group, unrelated to training, caused the increase in their performance. The order of the post-tests was kept consistent for all participants: meaningful sentences were presented first, followed by anomalous sentences, and environmental sounds presented last. As we expected some degree of perceptual learning during the post-test (as stated above), this ordering is a limitation of the design because differences among training groups may be attenuated in the latter parts of the post-test. However, considering the large number of conditions already included in the study, adding another variable (i.e., the order of the post-tests) was beyond the scope of this project, and should be addressed in future research.


Stimulus presentation and data collection used a custom script written for Psyscript and implemented on four Macintosh Power Mac G4 computers. During testing, each participant listened to auditory signals over Beyer Dynamic DT-100 headphones while sitting in front of a 17 inch Sony LCD computer screen. Sound output was fixed at 72 dB SPL within the presentation program as verified with a voltmeter.

During training, each video clip or series of stills was presented followed by a series of two or three multiple-choice questions about the content of the video. For example, after a video clip about a woman going to the doctor, included in the Mandarin training materials, participants were asked, “How does the doctor examine the woman?” The possible responses were as follows: (A) He takes her blood pressure; (B) He looks down her throat; (C) He looks in her ears; (D) He takes her pulse. The multiple-choice questions were included in the training procedure to ensure that all participants attended to the videos or the series of stills, particularly those for which they did not know the language of training. The questions could be answered accurately by closely watching the actions in the video or the series of still frames. An understanding of what the actors were saying was not necessary to answer the comprehension questions accurately.

After training was completed, all participants were tested with the same post-test procedures. Participants heard meaningful sentences, followed by anomalous sentences, and then environmental sounds. For the post-test, each sentence or environmental sound was played once over headphones, after which a dialogue box appeared on the screen prompting the listener to type what s/he heard. The presentation of the sentences and environmental sounds was self-paced, and no feedback was provided after a participant entered their response. Participants did not have the option of replaying any of the stimuli during the experiment, and all listeners heard the stimuli in the same pre-randomized order. The Control group, who received no training, only completed the post-test procedures.


Meaningful and anomalous sentences were scored by the number of keywords correctly transcribed. Each sentence had five keywords (e.g., Rice is often served in round bowls), which consisted of content words (nouns, verbs, adjectives and adverbs) as has been described previously (IEEE, 1969). Keywords with added or deleted morphemes were counted as incorrect but homophones and obvious spelling errors were counted as correct. Each listener, therefore, received a keyword accuracy score for the meaningful sentences (based on 40 sentences, for a total of 200 keywords) and the anomalous sentences (based on 20 sentences, for a total of 100 keywords).
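The keyword scoring rule described above can be sketched as follows. This is a hypothetical illustration: in practice the accepted homophones and obvious misspellings would come from a hand-built lookup table, and the function name and `equivalents` parameter are inventions of the sketch.

```python
def score_keywords(response, keywords, equivalents=None):
    """Count keywords correctly transcribed in a typed response.

    `equivalents` maps a keyword to accepted alternates (homophones or
    obvious spelling errors). Keywords with added or deleted morphemes
    are not accepted: an exact (or listed) match is required.
    """
    equivalents = equivalents or {}
    words = set(response.lower().split())
    hits = 0
    for kw in keywords:
        accepted = {kw.lower()} | {v.lower() for v in equivalents.get(kw, [])}
        if words & accepted:   # keyword (or an accepted variant) was transcribed
            hits += 1
    return hits
```

Under this rule, a response of "rice is often served in a round bowl" would earn four of the five keywords for the example sentence, because the morpheme-changed "bowl" does not match the keyword "bowls".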

The 60 environmental sounds were scored based on correct identification of the agent (e.g., cow), the sound (e.g., moo), or both (e.g., cow mooing). For musical sounds, listeners had to correctly identify the instrument itself (e.g., bongos); a response of just “music” was counted as incorrect.


All participants included in the final data analysis scored at least 80 percent correct on the multiple-choice comprehension questions following the video clips or video stills, which were included to assure that all participants were actively attending to the training materials. The accuracy rates on the comprehension questions for the eight experimental groups were as follows: (1) English-AV = 97%; (2) English-A+Stills = 97%; (3) German-AV = 94%; (4) German-A+stills = 87%; (5) Mandarin-AV = 90%; (6) Mandarin-A+Stills = 87%; (7) V-only = 96%; (8) Rotated = 93%. The Control group did not undergo training and therefore did not provide a score on the comprehension questions.

General Effects

Figure 2 displays the proportion of keywords accurately transcribed on the post-test for the meaningful sentences (left) and anomalous sentences (middle), and the proportion of environmental sounds correctly identified (right), for listeners in each of the nine conditions. A repeated-measures ANOVA was conducted with training condition (Mandarin-AV, Mandarin-A+Stills, German-AV, German-A+Stills, English-AV, English-A+Stills, V-only, Rotated, and Control) as the between-subjects variable and stimulus material (meaningful sentences, anomalous sentences, and environmental sounds) as the within-subjects factor. Significant main effects of training condition [F(8, 191) = 5.38, p < 0.001, ηp2 = 0.184] and stimulus material [F(2, 382) = 985.40, p < 0.001, ηp2 = 0.838] were observed, as was a significant interaction of training condition and stimulus material [F(16, 382) = 3.20, p < 0.001, ηp2 = 0.118].

Figure 2
Results from the three post-tests – meaningful sentences, anomalous sentences and environmental sounds – are shown for the nine conditions. Scores for the meaningful and anomalous sentences represent the percent of keywords correctly transcribed ...

Because of the significant interaction observed between training condition and stimulus material, univariate ANOVAs were conducted for each set of stimulus materials separately. Significant effects of training group were observed for the meaningful sentences (Figure 2, left panel) [F (8,191) = 7.40, p < 0.001, ηp2 = 0.236], for the anomalous sentences (Figure 2, middle panel) [F (8,191) = 2.87, p = 0.005, ηp2 = 0.107], and for the environmental sounds (Figure 2, right panel) [F (8, 191) = 2.81, p = 0.006, ηp2 = 0.105]. These results indicate that training condition had a significant influence on post-test performance. Post-hoc tests, with Bonferroni corrections for multiple comparisons, were therefore conducted to assess differences between groups for each set of stimulus materials. The results below describe these findings as relevant for the two central issues addressed in this study: the influence of the language of exposure and the influence of visual information.
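As an illustration of the post-hoc procedure described above, the following sketch runs pairwise independent-samples t-tests with a Bonferroni correction on simulated per-listener accuracy scores. The group names and data here are invented for illustration only; the original analysis was conducted on the actual post-test scores, and the authors' software and exact comparison set may have differed.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind

def bonferroni_pairwise(groups):
    """Pairwise independent-samples t-tests with Bonferroni correction.

    `groups` maps condition name -> array of per-listener scores.
    Returns {(a, b): corrected_p}, where each raw p-value is multiplied
    by the number of comparisons and capped at 1.0.
    """
    pairs = list(combinations(sorted(groups), 2))
    m = len(pairs)  # number of comparisons
    out = {}
    for a, b in pairs:
        _, p = ttest_ind(groups[a], groups[b])
        out[(a, b)] = min(1.0, p * m)  # Bonferroni correction
    return out

# Simulated keyword-accuracy scores for three of the nine conditions
rng = np.random.default_rng(0)
groups = {
    "English-AV": rng.normal(0.70, 0.05, 24),
    "Control":    rng.normal(0.40, 0.05, 24),
    "V-only":     rng.normal(0.45, 0.05, 24),
}
results = bonferroni_pairwise(groups)
```

With the large simulated group differences above, the English-AV versus Control comparison survives correction, mirroring the pattern of results reported below.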

Language of Exposure

Overall, the results support a view of perceptual adaptation in which exposure during training to the testing language itself (e.g., English to English), or to a training language with a phonological inventory similar to that of the testing language (e.g., German to English), yields greater benefit than exposure to a language whose phonological structure is distinct from that of the testing language (e.g., Mandarin to English).

Post-hoc tests indicated that listeners trained with English or German stimuli performed most accurately on the post-test tasks that included speech. For the meaningful sentences, listeners in the English-AV, English-A+Stills, and German-AV conditions performed significantly better than listeners in the Control and V-only training conditions (all p < 0.01). Additionally, listeners in the English-A+Stills condition performed significantly better than listeners in the Rotated condition (p = 0.01). For the anomalous sentences post-test, the only significant difference was between the English-A+Stills and Rotated conditions (p = 0.03); comparisons among the other training conditions did not reach statistical significance (all p > 0.05).

Training with Mandarin materials provided less benefit on the post-tests that included speech. Listeners exposed to the Mandarin training materials (either AV or A+Stills) performed similarly to listeners in the three control conditions on the speech post-tests (all p > 0.05), with one exception: the Mandarin-A+Stills group performed significantly better than the Control group (p = 0.03), suggesting that exposure to Mandarin training materials provided some benefit over no training at all. However, this benefit was likely related to incidental procedural learning (i.e., the Mandarin groups did not differ from the V-only group).

For the environmental sounds post-test, the only difference observed was for the Mandarin-AV group, who performed significantly better than the V-only group (p = 0.015). None of the other training and control conditions differed statistically (all p > 0.05).

Influence of Visual Information

The addition of visual information did not appear to improve performance for either the English or Mandarin training materials. In fact, within each of these two training languages, listeners in the A+Stills groups outperformed listeners in the AV groups, a reversal of the expected trend. For the participants exposed to German, however, the visual information appeared to be beneficial: for the meaningful sentences, although both the German-AV and German-A+Stills groups performed significantly better than the no-training Control group, only the German-AV group performed significantly better than participants in the V-only condition (p < 0.01), showing an influence of training above any incidental procedural learning effects.

In sum, for the meaningful sentences post-test, training on sinewave-vocoded speech in any language using either an AV or A+Stills presentation modality provided a reliable benefit over no training at all (i.e., the Control condition). However, only training on English (in either the AV or A+Stills modality) or German-AV provided a significant benefit over training in the V-only condition. The English-AV and German-AV trained groups showed equivalent levels of performance on all post-tests.


The results of this study support a view of perceptual adaptation to sinewave-vocoded speech in which listeners benefit most when the training and testing languages are the same or have similar phonological structure. Exposure during training to an unfamiliar language that differs from the testing language is also beneficial, but only when the materials are presented in an audio-visual modality. Further, these results suggest that listeners learn how the acoustic degradation affects speech at a sublexical level, because training transferred to novel words. We hypothesize that adaptation occurs at the phoneme level through two different learning mechanisms. Under the first mechanism, listeners access lexical items from the degraded input and map these degraded forms onto stored, un-degraded lexical representations. During this mapping between the natural and degraded representations of words, listeners can decompose the degraded lexical item into its component phonemes and store information about how the phonemes sound under degraded conditions. This information about how phonemes sound when processed by the vocoder can then assist the perception of novel degraded words. This is the learning mechanism proposed for the English conditions, and it results in the most robust perceptual learning. Such a mechanism can explain the results of Davis et al. (2005), in which listeners showed the greatest amount of learning for vocoded speech when lexical information was available during training.

A second learning mechanism is proposed for participants exposed to an unknown language. In this case, listeners do not have access to top-down, knowledge-based information to assist in mapping between the degraded and natural acoustic forms of phonemes. However, visual information can serve as another source of information that participants can use to map the degraded audio input onto stored phonemic representations, at least for phonemes for which visual information is available. This process of mapping degraded phonemes onto stored, naturally produced phonemic representations may be more beneficial when the phonemic inventories and phonotactic structures of the training and testing languages are similar. When the phoneme inventories of the two languages are more similar (as for German and English), listeners gain more information about how phonemes in the native language are likely to be realized under vocoded conditions. Further, because English and German have similar syllabic structures, participants in the German training condition were exposed to degraded versions of the phonemes in several syllabic positions. The acoustic properties of phonemes differ depending on their syllabic position; thus, exposure during training to the degraded phonemes in all the positions in which they appear in the testing materials facilitates mapping degraded input onto stored phoneme categories, and the availability of visual articulatory information seems to facilitate this mapping process. In comparison, the phonemic inventory of Mandarin is more distinct from that of English, and nearly all Mandarin consonants occur only as singletons in syllable-initial position. The degraded Mandarin input therefore does not give listeners enough information about how English phonemes sound when vocoded, especially in post-vocalic position or in consonant clusters. Without this context-dependent information about how to interpret a variety of English phonemes in various syllabic positions, listeners trained with Mandarin materials have more difficulty accessing English lexical items in the post-test.

These results parallel the earlier findings of Pallier et al. (1998) and Sebastian-Gallés et al. (2000), who demonstrated that training on time-compressed speech generalized from one language to another when both shared common phonological structure. For time-compressed speech, learning was facilitated by exposure to languages that shared prelexical structure and/or similar vowel inventories and stress patterns. It should be noted, however, that these earlier studies, which employed the cross-linguistic approach used here, involved a different form of signal degradation, and the mechanisms that underlie perceptual adaptation may differ depending on the type of degradation. Speech can be degraded in a number of ways: by masking information in the signal through the addition of noise, by distorting the acoustic information in a consistent way (e.g., time compression or sinewave vocoding), or by changing the phonetic or phonological realization of surface linguistic forms (e.g., foreign-accented speech). Perceptual learning and adaptation have been observed for all of these forms of degradation (e.g., Bent et al., 2009; Davis et al., 2005; Pallier et al., 1998; Weil, 2001). In this scheme, time compression and sinewave vocoding are similar in that both distort the acoustic information in the speech signal. However, time compression largely preserves the spectral information in the signal but distorts the temporal dimension and may restructure informative temporal relationships among phonological segments; similarity between the training and testing languages in timing and rhythmic structure may therefore be necessary for robust adaptation. In contrast, sinewave vocoding primarily reduces information in the spectral domain: spectral broadening or smearing may distort or restructure the remaining spectral information that is important for distinguishing speech sounds. Therefore, similarity between the phonemic inventories and phonotactics of the two languages is required to facilitate adaptation.
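For readers unfamiliar with the manipulation, the following is a minimal sketch of the general kind of channel vocoding at issue here: the signal is split into frequency bands, each band's amplitude envelope is extracted, and the envelope modulates a sinewave at the band's center frequency, discarding the fine spectral detail within each band. This is not the processing used to create the study's materials; the number of bands, filter design, and envelope cutoff are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def sinewave_vocode(x, fs, n_bands=4, f_lo=100.0, f_hi=3500.0):
    """Replace each frequency band of x with a sinewave carrier at the
    band's center frequency, modulated by that band's amplitude envelope.
    All parameters are illustrative, not the study's actual settings."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges
    t = np.arange(len(x)) / fs
    # 2nd-order low-pass used to smooth the rectified envelope (30 Hz cutoff)
    env_lp = butter(2, 30.0, btype='low', fs=fs, output='sos')
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        bp = butter(4, [lo, hi], btype='band', fs=fs, output='sos')
        band = sosfiltfilt(bp, x)                   # isolate the band
        env = sosfiltfilt(env_lp, np.abs(band))     # rectify + smooth -> envelope
        env = np.clip(env, 0.0, None)
        fc = np.sqrt(lo * hi)                       # geometric-mean center frequency
        out += env * np.sin(2.0 * np.pi * fc * t)   # sinewave carrier
    return out
```

Note that the output preserves the slow amplitude modulation of each band (the temporal envelope) while collapsing each band's spectral content onto a single sinewave, which is why this degradation is characterized above as primarily spectral.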

Studies employing the same training methodology for these two forms of degradation -- time compression and sinewave vocoding -- have not yet been performed. However, comparisons across studies suggest that the time course of adaptation may differ between them. Studies of adaptation to time-compressed speech tend to use a relatively short adaptation period of around 10 sentences, which is sufficient to induce asymptotic performance (Dupoux and Green, 1997). In contrast, asymptotic performance is not reached for sinewave-vocoded speech until after exposure to 60 sentences (Bent et al., 2009). Therefore, although time compression and sinewave vocoding affect different aspects of the acoustic signal and have different time courses of perceptual learning, adaptation to both forms of degradation can occur without lexical access if certain exposure and learning conditions are met.

One possible limitation of this study is that German, as a training language, might have provided some lexical access to participants even though they had not studied German. German and English are closely related historically, and many words are both phonetically and semantically similar in the two languages. The inclusion of these cognates may have allowed listeners to access lexical information despite being unfamiliar with German. To address this concern, the third author (LP), who is fluent in both English and German, counted the cognates in the training materials, including all words that were both phonetically and semantically similar in the two languages. The training materials contained 28 unique words meeting these criteria; with repetitions, there were 52 cognate tokens in total. Some of these words are phonetically very similar in the two languages (e.g., German “Mann” and English “man”) while others are less so (e.g., German “kalt” and English “cold”). Given average speaking rates of approximately 150 words per minute, cognates accounted for a very small percentage of the training materials, and some of them (such as “kalt”) may not have resulted in successful lexical access. Nonetheless, further investigations with stimulus materials that exclude cognates, or with other languages, would allow a more precise determination of the role of cognates in adaptation to degraded speech. For example, Sebastian-Gallés et al. (2000) found that adaptation to time-compressed Greek transferred to Spanish for Spanish-speaking listeners; these two languages are unrelated and have little lexical overlap, but they share similar rhythmic-temporal characteristics and vowel inventories. Assessing adaptation to sinewave-vocoded speech with languages that differ on these factors would provide more detailed information on the necessary and sufficient conditions for perceptual adaptation.

In contrast to the findings for the German training conditions, when the training stimuli were presented in English, the visual articulatory information did not result in better performance on the post-test. In fact, listeners in the English-A+Stills condition performed better than listeners in the English-AV condition: the English-A+Stills listeners were the only group to perform significantly more accurately than the Rotated condition on both the meaningful and anomalous sentences post-tests. The A+Stills presentation format may have forced listeners to attend more closely to the auditory signal than the AV modality did; the availability of other forms of top-down information in the audio-visual condition may thus have produced a change in attentional weights.

A methodological difference between the current study and previous studies of perceptual adaptation to noise-vocoded, sinewave-vocoded, and time-compressed speech was the inclusion of multiple talkers here versus a single talker in previous studies. The use of multiple talkers has been shown to produce robust learning of difficult non-native phoneme contrasts (e.g., Lively, Logan, and Pisoni, 1993) and foreign accents (Bradlow and Bent, 2008). Further, the current study demonstrated talker-independent learning, since the talkers in the post-test differed from those in the training materials. In contrast, previous studies with vocoded speech have demonstrated only talker-specific adaptation, with a single talker used in both training and testing materials (see Davis et al., 2005). Further study is needed to determine how variability from multiple talkers interacts with the availability of top-down and bottom-up sources of information during adaptation to degraded speech signals.

The differences among the nine training conditions were significantly attenuated in the anomalous sentences post-test. The ordering of conditions in terms of accuracy remained similar to the meaningful sentences post-test, but there were fewer significant differences among conditions. Although listeners in the German-AV, English-AV, and English-A+Stills training conditions were more accurate than listeners in the Control (no training) and V-only conditions on the meaningful sentences post-test, these five groups did not differ significantly from one another on the anomalous sentences test. The difference between the English-A+Stills and Rotated conditions remained in the anomalous sentences post-test. The attenuation of differences among the training conditions for the anomalous sentences may have resulted from the order of the post-tests rather than from features of the stimuli: the anomalous sentences post-test always occurred after listeners had transcribed 40 sentences during the meaningful sentences portion of the post-test. An analysis of performance during the meaningful sentences post-test demonstrated that listeners continued to adapt, which decreased the differences among groups. These gains were probably due both to procedural learning and to additional perceptual learning of the acoustic transformation of the post-test materials. Procedural learning likely occurred even for the trained groups because the training and testing tasks were different. Gains made during training therefore appeared to be maintained to some extent but were attenuated as all groups continued to learn and to restructure their perceptual analysis of vocoded speech signals. Learning during this portion of the experiment would be expected because both Davis et al. (2005) and Bent et al. (2009) demonstrated significant perceptual learning for meaningful vocoded sentences without any feedback. In future studies, it would be of interest to counterbalance the order of the post-tests to assess performance on anomalous sentences and/or environmental sounds without prior presentation of meaningful sentences.

On the environmental sounds post-test, listeners in the Mandarin-AV condition performed significantly better than listeners in the V-only condition. This advantage may have been due to the presence in the Mandarin training videos of several environmental sounds that also appeared in the post-test, including a coin dropping, a squeaky door, wind blowing, a door knocking, and screaming. In contrast, the only post-test sound present in the German videos was birds, and the only one in the English videos was a door knocking. Other environmental sounds occurred in the videos, but none that were presented in the environmental sounds post-test.


The current experiment demonstrated that perceptual adaptation to sinewave-vocoded speech can occur without lexical access under some circumstances. Listeners trained with German materials in an audio-visual modality were as accurate on the meaningful sentences post-test as listeners trained with English materials (both AV and A+Stills) and significantly more accurate than listeners who received no training or received visual-only training without sound. The visual articulatory information in the German-AV condition may have served as a form of feedback that allowed listeners to map the degraded signals onto sublexical units for use in the generalization post-tests. However, not all training languages produced equivalent levels of performance. Listeners trained with Mandarin materials showed a level of performance intermediate between the English-AV, English-A+Stills, and German-AV conditions and the control conditions: on the meaningful sentences post-test, listeners in the Mandarin conditions were significantly better than listeners who received no training but did not differ significantly from those trained with visual-only materials. These findings suggest that lexical access is not a necessary prerequisite for perceptual adaptation to novel sinewave-vocoded materials. Demonstrating adaptation without lexical access, however, requires similar sound structure between the training and testing languages as well as training in an audio-visual modality.


This work was supported by grants from the National Institutes of Health to Indiana University (NIH-NIDCD grants T32 DC-00012, R01 DC-000111, and R21 DC-010027). We thank Luis Hernandez for technical assistance.



Contributor Information

Tessa Bent, Department of Speech and Hearing Sciences, Indiana University, 200 S. Jordan Ave., Bloomington IN 47405.

Jeremy L. Loebach, Department of Psychology, Neuroscience Program, St Olaf College, 1520 St Olaf Avenue, Northfield, MN 55057.

Lawrence Phillips, Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th Street, Bloomington IN 47405.

David B. Pisoni, Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th Street, Bloomington IN 47405.


  • Abercrombie D. Elements of General Phonetics. Edinburgh: Edinburgh University; 1967.
  • Amitay S, Irwin A, Moore DR. Discrimination learning induced by training with identical stimuli. Nature Neuroscience. 2006;9(11):1446–1448. [PubMed]
  • Andruski JE, Kuhl PK. The acoustic structure of vowels in mothers' speech to infants and adults. ICSLP-1996. 1996:1545–1548.
  • Assman PF, Summerfield AQ. The Perception of Speech Under Adverse Conditions. In: Greenberg S, Ainsworth WA, Popper AN, Fay R, editors. Speech Processing in the Auditory System. New York: Springer-Verlag; 2004.
  • Bent T, Buchwald A, Pisoni DB. Perceptual Adaptation and Intelligibility of Multiple Talkers for Two Types of Degraded Speech. Journal of the Acoustical Society of America. 2009;126(5):2660–2669. [PubMed]
  • Blesser B. Speech perception under conditions of spectral transformation: I. Phonetic characteristics. Journal of Speech and Hearing Research. 1972;15:5–41. [PubMed]
  • Boersma P, Weenink D. Praat: doing phonetics by computer. [Computer program] 2004 Retrieved from
  • Bradley JS, Sato H, Picard M. On the Importance of Early Reflections for Speech in Rooms. Journal of the Acoustical Society of America. 2003;113(6):3233–3244. [PubMed]
  • Bradlow AR, Bent T. Perceptual adaptation to non-native speech. Cognition. 2008;106:707–729. [PMC free article] [PubMed]
  • Clarke CM, Garrett MF. Rapid adaptation to foreign-accented English. Journal of the Acoustical Society of America. 2004;116:3647–3658. [PubMed]
  • Davis MH, Johnsrude IS, Hervais-Ademan A, Taylor K, McGettigan C. Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noise-vocoded sentences. Journal of Experimental Psychology: General. 2005;134:222–241. [PubMed]
  • Delia R, Dosch Fritz D, Finger A, Newton SL, Daves-Schneider L, Schneider K, DiDonato R. Fokus Deutsch. [Video series] Boston, MA: McGraw-Hill Humanities/Social Sciences/Languages; 2000.
  • Dupoux E, Green KP. Perceptual adjustment to highly compressed speech: Effects of talker and rate changes. Journal of Experimental Psychology: Human Perception and Performance. 1997;23:914–927. [PubMed]
  • Eisner F, McQueen JM. The specificity of perceptual learning in speech processing. Perception and Psychophysics. 2005;67:224–238. [PubMed]
  • Francis AL, Nusbaum HC, Fenn K. Effects of training on the acoustic phonetic representation of synthetic speech. Journal of Speech, Language and Hearing Research. 2007;50:1445–1465. [PubMed]
  • Greenspan SL, Nusbaum HC, Pisoni DB. Perceptual learning of synthetic speech produced by rule. Journal of Experimental Psychology: Learning, Memory and Cognition. 1988;14:421–433. [PMC free article] [PubMed]
  • Hardison DM. Acquisition of second-language speech: Effects of visual cues, context, and talker variability. Applied Psycholinguistics. 2003;24(4):495–522.
  • Hardison DM. Second-language spoken word identification: Effects of perceptual training, visual cues, and phonetic environment. Applied Psycholinguistics. 2005;26(4):579–596.
  • Herman R, Pisoni DB. Research on Spoken Language Processing Progress Report No. 24. Bloomington IN: Speech Research Laboratory, Indiana University; 2000. Perception of ‘elliptical speech’ by an adult hearing-impaired listener with a cochlear-implant: some preliminary findings on coarse-coding in speech perception; pp. 87–112.
  • Hervais-Adelman A, Davis MH, Johnsrude IS, Carlyon RP. Perceptual learning of noise vocoded words: Effects of feedback and lexicality. Journal of Experimental Psychology: Human Perception and Performance. 2008;34:460–474. [PubMed]
  • Hirata Y, Kelly SD. Effects of Lips and Hands on Auditory Learning of Second-Language Speech Sounds. Journal of Speech Language and Hearing Research. 2010;53(2):298–310. [PubMed]
  • IEEE. IEEE recommended practices for speech quality measurements. IEEE Transactions on Audio and Electroacoustics. 1969;17:227–246.
  • Karl J, Pisoni DB. The role of talker-specific information in memory for spoken sentences. Journal of the Acoustical Society of America. 1994;95:2873.
  • Komatsu M, Arai T, Sugawara T. Perceptual discrimination of prosodic types. Proc. Speech Prosody 2004; Nara, Japan; 2004. pp. 725–728.
  • Kraljic T, Samuel AG. How general is perceptual learning for speech? Psychonomic Bulletin and Review. 2006;13:262–268. [PubMed]
  • Kraljic T, Samuel AG. Perceptual adjustments to multiple speakers. Journal of Memory and Language. 2007;56:1–15.
  • Lively SE, Logan JS, Pisoni DB. Training Japanese listeners to identify English /r/ and /l/: The role of phonetic environment and talker variability in learning new perceptual categories. Journal of the Acoustical Society of America. 1993;94(3):1242–1255. [PMC free article] [PubMed]
  • Loebach JL, Pisoni DB. Perceptual learning of spectrally degraded speech and environmental sounds. Journal of the Acoustical Society of America. 2008;123:1126–1139. [PMC free article] [PubMed]
  • Loebach JL, Bent T, Pisoni DB. Multiple routes to the perceptual learning of speech. Journal of the Acoustical Society of America. 2008;124(1):552–561. [PubMed]
  • Liu JL-C. Chinese in Action. [Video series] Bloomington, IN: Indiana University Press; 2005.
  • Marcell MM, Borella D, Greene M, Kerr E, Rogers S. Confrontation naming of environmental sounds. Journal of Clinical Experimental Neuropsychology. 2000;22:830–864. [PubMed]
  • Maye J, Aslin RN, Tanenhaus MK. The weckud wetch of the wast: Lexical adaptation to a novel accent. Cognitive Science. 2008;32:543–562. [PubMed]
  • McGarr NS. The intelligibility of deaf speech to experienced and inexperienced listeners. Journal of Speech and Hearing Research. 1983;26:451–458. [PubMed]
  • Mehler J, Sebastian-Gallés N, Altmann G, Dupoux E, Christophe A, Pallier C. Understanding compressed sentences: The role of rhythm and meaning. In: Tallal P, Llinas RR, von Euler C, editors. Temporal information processing in the nervous system: Special reference to dyslexia and dysphasia (Annals of the New York Academy of Sciences, Vol. 682, pp. 272–282) New York: New York Academy of Sciences; 1993.
  • Norris D, McQueen JM, Cutler A. Perceptual learning in speech. Cognitive Psychology. 2003;47(2):204–238. [PubMed]
  • Nygaard LC, Pisoni DB. Talker-specific learning in speech perception. Perception and Psychophysics. 1998;60:335–376. [PubMed]
  • Nygaard LC, Sommers MS, Pisoni DB. Speech perception as a talker-contingent process. Psychological Science. 1994;5:42–46. [PMC free article] [PubMed]
  • Pallier C, Sebastian-Gallés N, Dupoux E, Christophe A, Mehler J. Perceptual adjustment to time-compressed speech: A cross-linguistic study. Memory & Cognition. 1998;26:844–851. [PubMed]
  • Pedroza HA. English for All. [Video series] Los Angeles, CA: Division of Adult Career Education (DACE) of the Los Angeles Unified School District; 2005.
  • Picheny MA, Durlach NI, Braida LD. Speaking clearly for the hard of hearing I: Intelligibility differences between clear and conversational speech. Journal of Speech and Hearing Research. 1985;28:96–103. [PubMed]
  • Rouas J-L, Farinas J, Pellegrino F, André-Obrecht R. Rhythmic unit extraction and modeling for automatic language identification. Speech Communication. 2005;47(4):436–456.
  • Schwab EC, Nusbaum HC, Pisoni DB. Some effects of training on the perception of synthetic speech. Human Factors. 1985;27:395–408. [PMC free article] [PubMed]
  • Sebastian-Gallés N, Dupoux E, Costa A, Mehler J. Adaptation to time-compressed speech: Phonological determinants. Perception & Psychophysics. 2000;62(4):834–842. [PubMed]
  • Scott SK, Blank CC, Rosen S, Wise RJ. Identification of a pathway for intelligible speech in the left temporal lobe. Brain. 2000;123:2400–2406. [PubMed]
  • Sidaras SK, Alexander JED, Nygaard LC. Perceptual learning of systematic variation in Spanish-accented speech. The Journal of the Acoustical Society of America. 2009;125(5):3306–3316. [PubMed]
  • Stacey PC, Summerfield AQ. Comparison of word-, sentence-, and phoneme-based training strategies in improving the perception of spectrally distorted speech. Journal of Speech Language and Hearing Research. 2008;51(2):526–538. [PubMed]
  • Voor JB, Miller JM. The effect of practice on the comprehension of worded speech. Speech Monographs. 1965;32:452–455.
  • Weil SA. Foreign-accented speech: Encoding and generalization. Journal of the Acoustical Society of America. 2001;109:2473. (A)