Listeners are able to understand speech despite extensive across- and within-talker variability. Understanding how our perceptual systems adapt to such variability is a fundamental question in speech perception research. Across-talker variability stems from variation in the indexical characteristics of the talker such as gender, age, native language, and dialect (e.g., Abercrombie, 1967
). Within-talker variability arises due to a talker’s emotional and physiological state, characteristics of the interlocutor (such as age or hearing status), and social context (e.g., Andruski & Kuhl, 1996
; Picheny, Durlach, & Braida, 1985
In addition to across- and within-talker variability, differences in listening environments also introduce significant variability to the speech signal. Characteristics of the room, such as size and amount of reverberation, and competing sounds, such as other talkers or noise, also affect the perception of the speech signal (e.g., Assmann & Summerfield, 2004
; Bradley, Sato, & Picard, 2003).
Despite the presence of such variability in the acoustic signal, speech perception is extremely robust over a wide range of listening conditions. Listeners are able to adapt rapidly to novel talkers and listening environments and can perceive speech with a high degree of accuracy and little effort. Adaptation to speech signals occurs in several ways. First, a listener’s experience with a particular talker increases perceptual accuracy for that talker: as listeners become more familiar with a talker’s voice, their word recognition accuracy for that talker increases (Bradlow & Bent, 2008
; Nygaard, Sommers & Pisoni, 1994
; Nygaard & Pisoni, 1998
). Second, listeners adapt to particular groups of talkers who share certain speech production patterns, such as an accent or regional dialect. A beneficial effect of experience on speech intelligibility has been shown for listeners with experience listening to foreign-accented speech (Bradlow & Bent, 2008
; Clarke & Garrett, 2004
; Sidaras, Alexander & Nygaard, 2009
), novel native accents (Maye, Aslin & Tanenhaus, 2008
), and speech produced by talkers with hearing impairments (McGarr, 1983
). Lastly, listeners can adapt to various artificial manipulations of the speech signal, such as speech synthesized by rule (Schwab, Nusbaum & Pisoni, 1985
; Greenspan, Nusbaum & Pisoni, 1988
), speech that is time compressed (Dupoux & Green, 1997
; Mehler et al., 1993
; Pallier et al., 1998
; Sebastián-Gallés, Dupoux, Costa & Mehler, 2000; Voor & Miller, 1965
), noise-vocoded speech (Davis, Johnsrude, Hervais-Adelman, Taylor & McGettigan, 2005
; Hervais-Adelman, Davis, Johnsrude, & Carlyon, 2008
), sinewave-vocoded speech (Bent, Buchwald & Pisoni, 2009
; Loebach & Pisoni, 2008
; Loebach, Bent & Pisoni, 2008
), and speech embedded in multi-talker babble (Bent et al., 2009
). Importantly, these benefits have been reported to extend to new talkers and to new speech signals created using the same types of signal manipulation (Bradlow & Bent, 2008
; Dupoux & Green, 1997
; Francis, Nusbaum & Fenn, 2007
; Greenspan, Nusbaum & Pisoni, 1988
; McGarr, 1983).
In addition to demonstrating that listeners are able to adapt to variability in the speech signal, a critical issue remains in determining what type of information is required for such perceptual adaptation to occur. At one extreme, perceptual adaptation could stem primarily from bottom-up acoustic information and higher-level linguistic information would not be necessary to adapt to novel speech signals. Under this view, when adapting to degraded speech signals, exposure to any type of complex acoustic signal (e.g., speech in any language or even complex non-speech sounds) would be adequate for adaptation to occur. At the other extreme, adaptation may rely heavily on top-down neurocognitive processes so that listeners would require access to all levels of language structure including syntactic, lexical, and phonological information. Under this view, listeners would show the most robust adaptation when exposed to meaningful speech in their native language. Indeed, most previous studies of perceptual adaptation to degraded speech have used real words or meaningful sentences in the listener’s native language. Since these materials provide listeners with access to information from all levels of language structure including early low-level acoustic/phonetic information as well as more abstract high-level lexical/semantic information, they cannot be used to determine what types of information are required for perceptual adaptation to occur.
Several studies have manipulated listeners’ access to different levels of language structure to assess their reliance on various types of information during adaptation. Studies using synthetic manipulations of specific phoneme contrasts suggest that listeners adjust their phonemic category boundaries on the fly as needed for specific talkers (e.g., Eisner & McQueen, 2005
; Norris, McQueen & Cutler, 2003
) and in certain instances for groups of talkers from specific native dialects (Kraljic & Samuel, 2006
). In these studies, ambiguous sounds were embedded in words that did not form minimal pairs with the target sounds. For example, when listeners were presented with a sound ambiguous between /f/ and /s/, they shifted their category boundaries depending on the words in which it was presented: exposure to words such as “sheri_” led to a broadening of the /f/ category, whereas exposure to words such as “Pari_” led to a broadening of the /s/ category. Therefore, through the use of lexical knowledge, listeners shifted their category boundaries as needed for particular talkers or groups of talkers. These studies suggest that adaptation relies on information from the lexical level, a relatively late level of processing (Eisner & McQueen, 2005
; Norris, McQueen & Cutler, 2003).
Adaptation to noise-vocoded speech signals may also be most robust when listeners have access to lexical information. Davis and colleagues (2005)
tested perceptual learning of noise-vocoded speech using meaningful sentences, anomalous sentences (in which all words were real English words but the sentence was semantically anomalous), Jabberwocky sentences (in which content words were replaced by nonsense words but real function words remained), and non-word sentences (in which both content and function words were non-words). Listeners showed the greatest benefit from training with meaningful sentences and anomalous sentences, an intermediate level of performance with Jabberwocky sentences, and the lowest level of performance with either no training or training with non-word sentences. Similarly, Stacey and Summerfield (2008)
found that training with words or sentences improved listeners’ ability to perceive words in sentences more than training that focused specifically on phoneme contrasts. These results suggest that having access to lexical information, even if the sentence as a whole is anomalous, produces robust perceptual learning. However, learning must also be occurring at a sub-lexical level as training generalized to novel lexical items.
In contrast to the findings suggesting that lexical access is necessary for perceptual learning, Hervais-Adelman et al. (2008)
found no differences in perceptual adaptation to noise-vocoded words when listeners were trained with word versus non-word stimuli. As in the studies described above, listeners generalized their learning to novel words at test, demonstrating again that perceptual learning of noise-vocoded speech involves the learning of sub-lexical units. Hervais-Adelman et al. (2008)
explained the discrepancy between their results and Davis et al.’s (2005)
earlier findings by appealing to short-term memory limitations that may occur when listening to entire non-word sentences.
Two studies of time-compressed speech by Pallier et al. (1998)
and Sebastián-Gallés et al. (2000)
also suggest that lexical information is not required for the adaptation to degraded speech signals. In Pallier et al. (1998)
, listeners were exposed to time-compressed speech in one language and then tested on their perception of another language (either their native language for monolingual speakers or one of their native languages for bilingual speakers). Exposure to some languages, but not all, aided perception of sentences in the testing language. For example, exposure to time-compressed speech in Catalan improved perception of time-compressed Spanish materials for Catalan/Spanish bilingual and monolingual Spanish listeners. However, exposure to time-compressed speech in French did not facilitate perception of English time-compressed speech for French/English bilingual listeners or monolingual English listeners. The authors suggested that training with a language that shares prelexical phonological representations with the testing language assists listeners in processing time-compressed sentences. As such, generalization is observed between Catalan and Spanish because they share the same type of prelexical representation (i.e., they are both syllable-timed languages with stress on the penultimate syllable). In contrast, French and English have different prelexical representations, and therefore adaptation does not generalize between these two languages.
Sebastián-Gallés et al. (2000)
employed a similar methodology with time-compressed speech but tested adaptation by Spanish listeners with training materials in Spanish, Italian, French, English, Japanese, and Greek. In their study, perceptual adaptation was found between languages that shared prelexical/rhythmic characteristics (i.e., Spanish listeners adapted when exposed to Spanish, Italian, or Greek) in the absence of lexical knowledge. However, exposure to French did not result in equivalent levels of performance, a finding that the authors hypothesized was related to differences in vowel inventories and/or stress patterns between Spanish and French. In sum, with time-compressed speech, perceptual adaptation is observed without the influence of lexical information, but other phonological similarities between training and testing languages may be necessary to facilitate adaptation.
In the present study, we utilized the cross-linguistic approach described above to assess how exposure to one of three languages influences perceptual adaptation to sinewave-vocoded speech in English. This type of signal degradation was selected because previous studies have demonstrated that listeners are able to adapt to this form of degradation using semantic (meaningful words and sentences), lexical (anomalous sentences), and non-speech (environmental sounds) information (Loebach & Pisoni, 2008
). Moreover, listeners are neither at floor nor at ceiling levels of performance with this particular form of degraded speech.
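For readers unfamiliar with this form of degradation, the sketch below illustrates one common way a sinewave (tone) vocoder can be implemented: the signal is divided into a small number of frequency bands, the amplitude envelope of each band is extracted, and each envelope modulates a sine carrier at that band's center frequency. The channel count, band edges, and filter settings shown are illustrative assumptions and are not the parameters used to create the present stimuli.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def sinewave_vocode(x, fs, n_channels=8, f_lo=100.0, f_hi=5000.0):
    """Illustrative sinewave vocoder: band-pass analysis, envelope
    extraction, and re-synthesis with sine carriers at the channel
    center frequencies. Parameter values are assumptions, not the
    settings used for the experimental stimuli."""
    # Logarithmically spaced band edges spanning the analysis range
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    t = np.arange(len(x)) / fs
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Analysis filter for this channel
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        # Amplitude envelope of the channel (Hilbert magnitude)
        env = np.abs(hilbert(band))
        # Replace the fine structure with a tone at the channel center
        out += env * np.sin(2 * np.pi * np.sqrt(lo * hi) * t)
    # Match the overall level of the input
    return out * np.sqrt(np.mean(x ** 2) / np.mean(out ** 2))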
Two hundred monolingual English-speaking participants were assigned to one of six training conditions or to one of three control conditions. The training conditions were presented in one of two modalities: fully audio-visual (AV) or audio with still pictures extracted from the videos (A+Stills). Participants in the audio-visual experimental groups watched videos in one of three languages (English, German, or Mandarin Chinese) in which the audio signal had been processed using a sinewave vocoder. Participants in the A+Stills conditions heard the same audio signals as the participants in the AV conditions but were presented only with a series of static frames taken from the videos. Participants in one control group were not exposed to any training materials at all and only completed the post-test phase of the experiment. Participants in the second control group watched the same videos as the English conditions, but the audio signal was spectrally rotated (Blesser, 1972
) rather than sinewave-vocoded. This condition assessed procedural learning: the training materials were the same, but the signal was degraded with a manipulation other than sinewave vocoding. Participants in the third control condition were presented with only the video from the English condition, with no audio signal. This condition was included to control for incidental procedural learning: participants were required to maintain attention to the training materials but were not exposed to any audio signal that could lead to auditory perceptual learning (Amitay, Irwin, & Moore, 2006
). We chose to present video clips or still frames from the video clips, rather than audio alone, to help maintain participants’ active attention, particularly in conditions in which they were listening to a language they did not know. To verify that participants were attending to the training materials, we asked them to answer multiple-choice questions about the thematic content of the materials after each video clip or series of still frames.
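The spectral rotation applied in the second control condition can be sketched in a similar way. One classic implementation, following the logic described by Blesser (1972), low-pass filters the signal, multiplies it by a sinusoid at the cutoff frequency (which mirrors the band as the difference-frequency component), and low-pass filters again to remove the upper image, so that content originally at frequency f ends up at f_max − f. The cutoff frequency and filter order below are illustrative assumptions, not the parameters used for the present stimuli.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def spectrally_rotate(x, fs, f_max=4000.0):
    """Illustrative spectral rotation about f_max / 2: the spectrum below
    f_max is flipped, leaving the signal largely unintelligible while
    preserving its overall temporal dynamics."""
    sos = butter(8, f_max, btype="lowpass", fs=fs, output="sos")
    t = np.arange(len(x)) / fs
    band = sosfiltfilt(sos, x)                       # keep only 0 .. f_max
    mirrored = band * np.sin(2 * np.pi * f_max * t)  # images at f_max - f and f_max + f
    return sosfiltfilt(sos, mirrored)                # remove the upper image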
All participants completed a post-test in which they transcribed sinewave-vocoded English sentences and identified environmental sounds. We chose not to include a pre-test because previous findings demonstrate significant, rapid adaptation to noise-vocoded speech even in the absence of feedback (Davis et al., 2005
). We wanted to assess the effects of exposure to speech in different languages independent of any prior experience listening to the same signal processing conditions in English. The inclusion of a pre-test could confound the effects of exposure to foreign languages in this study.
Although the three training languages selected (English, German, and Mandarin) share prelexical representations because they are stress-timed languages (Komatsu, Arai, & Sugawara, 2004
; Rouas, Farinas, Pellegrino, & André-Obrecht, 2005
), German and Mandarin differ substantially in their relation to English. German is closely related to English historically and there are many similarities in the phonemic and phonological structure of the two languages. For example, both English and German have similar consonant inventories and have relatively permissive syllable structures (i.e., they allow consonant clusters and a variety of consonants in coda position). In contrast, Mandarin and English are from distinct language families and have very different phonemic and phonological structures. Moreover, Mandarin has a quite restricted syllable structure compared to English and is a tonal language. Therefore, the comparison of these three training languages in the post-test measures will provide insight about the types of information that are necessary for listeners’ adaptation to sinewave-vocoded speech. If listeners trained with the English stimuli perform more accurately on the post-test than listeners trained with German or Mandarin stimuli, then this result would support the claim by Davis and colleagues (2005)
that lexical information “drives perceptual learning of distorted speech” (p. 222). Listeners in the English condition would be expected to perform better than those in the German or Mandarin conditions because they have rapid access to lexical information, which helps map the distorted auditory signals to stored exemplars of lexical items in long-term memory. Another possibility is that listeners in the English, German, and Mandarin conditions will all perform similarly at post-test. This outcome would indicate that shared prelexical representations between the training and testing languages determine generalization of learning, as has been observed for time-compressed speech. A third possibility is that adaptation takes place at the acoustic level; that is, hearing complex non-speech sounds that have been processed with the vocoder would allow for some adaptation.
Recent findings have shown that perception of vocoded speech can be improved by training with non-speech sinewave-vocoded sounds. Loebach and Pisoni (2008)
reported that training with feedback using sinewave-vocoded environmental sounds led to improved perception of sinewave-vocoded speech, although training on sinewave-vocoded speech did not generalize to the recognition of environmental sounds. The authors suggested that explicit training on environmental sounds, which share important spectral and temporal information with speech, aids speech perception because the training increases listeners’ attention and sensitivity to the spectrotemporal characteristics shared by environmental and speech stimuli. In the current experiment, feedback may be present for environmental sounds through the links between visual events and the distorted auditory signals.
A final possibility is that listeners in the English and German conditions will perform more accurately on the post-test than the Mandarin-trained listeners. This result would suggest that exposure to a training language with phonemic and phonological structure similar to that of the testing language yields more robust generalization than exposure to a training language with a different phonemic and phonological structure. On this account, the critical component for generalization to the testing language is exposure during training to similar phonemic and phonological structure, which would facilitate the learning of sublexical units shared by the training and testing languages. This outcome would provide support for the earlier findings of Sebastián-Gallés et al. (2000)
who demonstrated that adaptation to time-compressed speech was most robust when the training and testing languages shared both rhythmic and phonological characteristics.
Training conditions that had identical audio signals but differed in the availability of visual articulatory information (the AV vs. the A+Stills conditions) were included to assess whether visual information, in the absence of lexical information, aids perceptual learning. Visual articulatory information could serve as a source of feedback regarding the mapping between the distorted audio signal and sublexical units. Training in the audio-visual modality has resulted in greater learning of non-native phoneme contrasts than audio-only training (Hardison, 2003
; Hirata & Kelly, 2010
). If visual speech information benefits adaptation to degraded stimuli, listeners in the AV conditions should perform better than those in the A+Stills conditions within a training language.