Speech remains intelligible despite the elimination of canonical acoustic correlates of phonemes from the spectrum. A portion of this perceptual flexibility can be attributed to modulation sensitivity in the auditory-to-phonetic projection, though signal-independent properties of lexical neighborhoods also affect intelligibility in utterances composed of words. Three tests were conducted to estimate the effects of exposure to natural and sine-wave samples of speech in this kind of perceptual versatility. First, sine-wave versions of the easy/hard word sets were created, modeled on the speech samples of a single talker. The performance difference in recognition of easy and hard words was used to index the perceptual reliance on signal-independent properties of lexical contrasts. Second, several kinds of exposure produced familiarity with an aspect of sine-wave speech: 1) sine-wave sentences modeled on the same talker; 2) sine-wave sentences modeled on a different talker, to create familiarity with a sine-wave carrier; and 3) natural sentences spoken by the same talker, to create familiarity with the idiolect expressed in the sine-wave words. Recognition performance with both easy and hard sine-wave words improved after exposure only to sine-wave sentences modeled on the same talker. Third, a control test showed that signal-independent uncertainty is a plausible cause of differences in recognition of easy and hard sine-wave words. The conditions of beneficial exposure reveal the specificity of attention underlying versatility in speech perception.
Speech is often said to be perceptually robust. Intelligibility survives extremes of acoustic distortion and variation, including unpredictable variation. Traditional analyses have tended to explain this remarkable feature of speech perception by appeal to redundancy in the distribution of acoustic correlates of linguistic features (e.g., Miller & Chomsky, 1960). Contemporary analyses have sharpened this description. One source of robustness is the perceptual sensitivity to phonologically governed modulation of an acoustic spectrum (Elliott & Theunissen, 2009; Remez, Rubin, Berns, Pardo & Lang, 1994). Because dynamic variation can be less susceptible to distortion than the elemental acoustic components, sensitivity to spectrotemporal variation can be responsible for a portion of distortion tolerance when natural acoustic correlates of phoneme contrasts are altered or obscured. Moreover, the effectiveness of acoustic patterns independent of the elements that compose them is evident in the intelligibility of synthetic signals that dispense with natural acoustic correlates altogether: sine-wave speech (Remez, Rubin, Pisoni & Carrel, 1981), noise-band vocoded speech (Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995), and acoustic chimeras of speech (Smith, Delgutte, & Oxenham, 2002). Sensitivity to aggregate patterns of acoustic variation confers robustness when acoustic details are blurred or masked, or when these vary unpredictably.
Sensory resolution of spectrotemporal modulation is one source of robustness, and in the widely shared conceptualization of Lindblom (1990) it is a signal-dependent aspect of intelligibility, in contrast with the class of signal-independent influences on intelligibility. The distinction is readily drawn. The recognition of messages depends on apprehending the linguistic properties of speech, which mark contrasts between phonemes in many and various ways. For instance, the signal properties of speech vary with changes in articulatory rate (Sommers, Nygaard & Pisoni, 1994), articulatory precision (Bradlow & Bent, 2002), and the level of accompanying noise (Miller & Nicely, 1955). Each of these signal properties affects intelligibility, and each can be understood as a kind of variation in the acoustic expression of phoneme contrasts available to an attentive listener. To compare these with signal-independent influences on intelligibility, consider that a listener’s specific or general knowledge can modulate the demand for sensory resolution of contrastive attributes at the phonetic grain of the signal. Indeed, a word is resolved by distinguishing it from others with which it is similar, and characteristics of the lexical neighborhood of a spoken word can determine the ease or difficulty of recognition (Pisoni, Nusbaum, Luce & Slowiaczek, 1985). A spoken word can be easier or harder to recognize because the lexical neighbors from which it must be distinguished are fewer or greater in number, or more or less common, or both.
In the specific challenge to versatility posed by sine-wave speech, none of the familiar acoustic products of vocal sound is present in the spectrum. In consequence, a listener is obliged to resolve phonetic properties largely from the spectrotemporal pattern of an anomalous carrier. The role of a listener’s knowledge of language in the perception of sine-wave speech has been a topic of speculation (Davis, Johnsrude, Hervais-Adelman, Taylor & McGettigan, 2005; Hervais-Adelman, Davis, Johnsrude & Carlyon, 2008) without direct empirical evaluation. Plausibly, the peculiarity of the acoustic carrier might drive a perceiver to rely substantially on signal-independent properties in understanding sine-wave speech (Cutler, 2008). The present project aims to apply methods distinguishing signal-dependent and signal-independent aspects of recognition to the case of sine-wave speech.
A test of signal-dependent and signal-independent aspects of the perception of sine-wave speech can offer a functional analysis of an extreme challenge to versatility. Yet, it is difficult to predict the outcome of such a test with sine-wave speech from precedent or principle because it combines two challenges, a weirdly unnatural vocal carrier and a set of talker-specific idiolectal properties. To explain, such signals replicate the coarse-grain spectrotemporal properties of speech in three or four time-varying sinusoids. Because sine-wave speech lacks broadband resonances, harmonic structure and short-term aperiodicities that are typical of vocally produced sound (Stevens & Blumstein, 1981), a perceiver must attend to the dynamic spectrotemporal variation of an unnatural carrier to identify phonetic and superordinate linguistic properties. Despite the likely role of modulation sensitivity in listening to natural speech (Remez, 2005), sine-wave speech does not initially engage this function, and the default perceptual organization splits each tone into a separate source of sound (Remez et al., 1981), consistent with a conventional description of auditory perceptual organization (for example, Bregman, 1990). However, an instruction to listen to the tone pattern as a kind of synthetic speech triggers concurrent bistable perceptual organization, in which the tones are both bound as if issuing from a single vocal source and split into separate streams as if each component came from a different source (Remez, Pardo, Piorkowski & Rubin, 2001). This has been observed with CV nonsense syllables (Remez, 2008), isolated words (Liebenthal, Binder, Piorkowski & Remez, 2003; Remez et al., 2001), and sentences (for example, Remez et al., 1994).
Surviving the absence of natural vocal pitch and an irreducible impression of unnatural timbre, a sine-wave utterance is also perceived to originate from a distinct talker of atypical vocal characteristics (Brungart, Iyer & Simpson, 2006; Remez, Nagel & Fellowes, 2007; Remez, Rubin, Nygaard & Howell, 1987) whether sine-wave speech is modeled on utterances of a familiar talker (Remez, Fellowes & Rubin, 1997) or an unfamiliar talker (Fellowes, Remez & Rubin, 1997; Sheffert, Pisoni, Fellowes & Remez, 2002). In perceiving linguistic properties of sine-wave speech, the paralinguistic characteristics of its production, and some of the indexical properties of the impossible talker who said it, listeners are remarkably versatile in adapting to the absence of canonical speech cues, the elementary acoustic details on which speech perception has been claimed to depend (Diehl, Lotto & Holt, 2004; Liberman & Cooper, 1972; Raphael, 2005).
The experiments reported here aimed to describe the perceptual changes that occur when a listener is first exposed to sine-wave speech and is able to understand the spoken message. In this project, the measures sought to distinguish two broad classes of cognitive function involved in intelligibility: 1) the signal-dependent ability to resolve fine phonetic details preserved in a sine-wave replica of an utterance; and, 2) the signal-independent ability to distinguish among spoken words in neighborhoods of differing density and likelihood. Conceivably, each of these functions improves as a listener gains facility in perceiving sine-wave speech. On one hand, a listener who adapts to the absence of familiar acoustic products of vocalization is relying largely on the time-varying attributes of non-speech carriers to find the phonetic features used to identify words. On the other hand, a competent sine-wave listener has met an unprecedented challenge to exploit the stable inhomogeneities across lexical neighborhoods to identify spoken words from whichever features are available in the sine-wave patterns. In order to assay the relative roles of each of these influences on intelligibility, our tests adopted a method devised by Bradlow and Pisoni (1999). Their technique measured the relative contribution of the phonetic details resolved in an utterance, the signal-dependent properties, and reliance on lexical knowledge, the signal-independent knowledge that moderates the need for fine- or coarse-grain resolution of the signal in service of the identification of words.
Among other attributes of spoken word recognition, Bradlow and Pisoni noted the perceptual gains associated with learning the idiosyncrasies of a specific talker. The method used two sets of words, one nominally easy to recognize, the other nominally hard to recognize. Apart from its segmental properties, an easy word was more common according to an estimate of its frequency of occurrence, had relatively few lexical neighbors, and its neighbors were all less frequent than the word itself. A hard word, in contrast, was less frequent, had relatively more neighbors, and its neighbors were all more frequent than the word itself. Two groups listened to spoken English words, one composed of native English speakers and the other a diverse group of speakers of English as a second language including native speakers of Bengali, Chinese, Dani, Japanese, Korean, Nepali, Russian, and Spanish. Each group of listeners exhibited improvements in isolated word recognition after exposure to speech of the talker who had produced the target words, but second language listeners only improved with easy words. Second language listeners were familiar with all of the words, but were poorer than native English listeners in resolving the fine phonetic distinctions required for improved recognition of the hard words. By contrasting the performance on easy and hard word sets, Bradlow and Pisoni were able to characterize the differing ability of each listener group to exploit signal-dependent and signal-independent aspects of the recognition of spoken words.
Based on their example, the present project included three experiments with sine-wave speech. Overall, the aim was to determine the relative importance of perceptual resources that project from sensory samples to phonetic types, a signal-dependent aspect of recognition, and the cognitive deployment of lexical knowledge, a signal-independent influence on intelligibility. Together, the three experiments calibrated the effect of exposure to sine-wave speech using the empirical model derived from Bradlow & Pisoni. In Experiment 1, sine-wave versions of the easy and hard word sets were created, to determine whether a performance difference occurred when sine-wave words were spoken by a single talker. In Experiment 2, the perceptual effects of exposure to the acoustic carrier, to the idiolect of the talker, and to both aspects of sine-wave speech were estimated, using the performance level difference in easy and hard word identification as an index of the effect of exposure. In Experiment 3, a control test was created to determine whether the difference in easy and hard word recognition performance could be ascribed in this instance to differential uncertainty at the point of contact between an as yet unidentified phonetic string and a known word. Taken together, the results showed that exposure to a sine-wave talker promotes the perceptual resolution of signal-dependent fine phonetic detail preserved in the tone patterns.
This initial procedure aimed to determine a baseline difference in recognition performance between easy and hard words using test items created by modeling sine-wave synthesis on natural samples spoken by a single talker. Although an empirical precedent using noise-band vocoded test items suggested that the phenomenon would well survive the absence of natural acoustic vocal products (Trout, 2005), that test had also used several talkers as sources of spoken words, inherently imposing a cognitive load at the point of recognition in addition to the perceptual challenge caused by the vocoded variant of speech (Mullennix, Pisoni & Martin, 1989; Sommers et al., 1994). Using natural speech, Bradlow and Pisoni (1999) reported performance level differences when listeners heard easy and hard words spoken by a single talker, indicating that the effect would be observed here to the extent that sine-wave speech was treated akin to natural speech. The goal, then, was to determine the range of recognition performance observed under conditions that imposed an extreme low-level challenge in sensory encoding and projection to phonetic types, combined with the use of test materials spoken by a single talker — admittedly, one who produced physically impossible spectral details.
Following the method of Bradlow and Pisoni (1999), two sets of 74 words were produced, an easy set and a hard set. The items in each set differed in three characteristics: mean frequency of occurrence (310 versus 12 instances per million words, according to the norms of Kucera & Francis, 1967), mean neighborhood density (14 versus 27 neighbors, estimated using the technique of Luce & Pisoni, 1998) and the mean frequency of occurrence of the lexical neighbors (38 versus 282 occurrences per million words). With word sets of this composition, an easy word stands out in a sparser neighborhood of less common neighbors, while a hard word is overmatched by a denser neighborhood of more common words. Nonetheless, all of the words in the sets had been judged as highly familiar with an average of 6.25 on a 7-point familiarity scale (Nusbaum, Pisoni & Davis, 1984). Additional descriptive details appear in Bradlow and Pisoni (1999); a list of the words appears in Appendix A.
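As a rough illustration of how neighborhood statistics of this kind can be computed, the sketch below counts as neighbors those lexicon entries that differ from a target by the substitution, addition, or deletion of a single phoneme, a commonly used operational definition that is assumed here. The miniature lexicon, its transcriptions, and its frequency counts are invented for the example and are not the norms used in this study; the function names are likewise hypothetical.

```python
def is_neighbor(a, b):
    """True if phoneme tuples a and b differ by exactly one
    substitution, addition, or deletion of a phoneme."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # Deleting one phoneme from the longer form must yield the shorter form.
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def neighborhood_stats(target, lexicon):
    """Return (density, mean neighbor frequency) for a target word.
    lexicon maps word -> (phoneme tuple, frequency per million)."""
    target_phones = lexicon[target][0]
    neighbors = [w for w, (phones, _) in lexicon.items()
                 if is_neighbor(target_phones, phones)]
    density = len(neighbors)
    mean_freq = (sum(lexicon[w][1] for w in neighbors) / density
                 if density else 0.0)
    return density, mean_freq


# Invented toy lexicon: transcriptions and frequencies are illustrative only.
LEXICON = {
    "cat":  (("k", "ae", "t"), 20),
    "bat":  (("b", "ae", "t"), 15),
    "cut":  (("k", "ah", "t"), 60),
    "at":   (("ae", "t"), 500),
    "cast": (("k", "ae", "s", "t"), 10),
    "dog":  (("d", "ao", "g"), 70),
}

density, mean_freq = neighborhood_stats("cat", LEXICON)
print(density, mean_freq)
```

In these terms, an easy word combines a high target frequency with a low density and a low mean neighbor frequency, and a hard word shows the reverse pattern.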
The items in the word sets were spoken in citation form by a male talker (R.E.R.) seated in a sound-attenuating chamber wearing a head-mounted microphone. The speech was sampled to disk at a rate of 44.1 kHz. The samples were edited to word-size files and equated in energy. Frequency and amplitude of acoustic resonances, bursts, frictions, and murmurs were estimated by tracing acoustic features of the natural samples presented in a spectrographic display, and the selected values were used to compile a table of synthesis parameters for each word. The synthesis parameters represented frequency and amplitude values of four time-varying sinusoids at a temporal grain of 10 ms throughout each utterance. The sinusoidal complexes were converted to digital waveforms calculated at a rate of 44.1 kHz with 16-bit amplitude resolution and were stored in sampled-data format (Rubin, 1980). A pair of comparison spectrograms of a natural word and its sine-wave replica are shown in Figure 1. In the recognition tests, the items were transferred to compact disc, and were presented at a nominal level of 68 dB SPL via Beyerdynamic DT770 headphones to listeners seated in a sound-attenuating chamber.
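The conversion from a parameter table to a waveform can be sketched as follows. This is a minimal illustration rather than the synthesis software used in the study, whose internals are not described here. It assumes linear interpolation of frequency and amplitude between successive 10 ms frames, a running phase for each tone, and peak normalization before 16-bit quantization; the three example frames contain invented values.

```python
import math

SAMPLE_RATE = 44100   # Hz, the rate given in the text
FRAME_S = 0.010       # 10 ms temporal grain of the parameter table


def synthesize(frames):
    """Sum time-varying sinusoids described by a parameter table.

    frames: list of frames; each frame is a list of (freq_hz, amplitude)
    pairs, one pair per tone. Returns 16-bit integer samples.
    Interpolation and normalization choices are assumptions of this sketch.
    """
    samples_per_frame = int(round(SAMPLE_RATE * FRAME_S))
    n_tones = len(frames[0])
    phases = [0.0] * n_tones          # running phase of each tone, radians
    out = []
    for cur, nxt in zip(frames, frames[1:]):
        for n in range(samples_per_frame):
            frac = n / samples_per_frame
            sample = 0.0
            for k in range(n_tones):
                # Linear interpolation between adjacent parameter frames.
                freq = cur[k][0] + frac * (nxt[k][0] - cur[k][0])
                amp = cur[k][1] + frac * (nxt[k][1] - cur[k][1])
                sample += amp * math.sin(phases[k])
                # Advance phase by the instantaneous frequency.
                phases[k] += 2.0 * math.pi * freq / SAMPLE_RATE
            out.append(sample)
    peak = max((abs(x) for x in out), default=0.0) or 1.0
    # Scale to the 16-bit signed range.
    return [int(round(32767 * x / peak)) for x in out]


# Invented parameter values for four tones over three frames (20 ms of signal).
frames = [
    [(500.0, 1.0), (1500.0, 0.5), (2500.0, 0.25), (3500.0, 0.1)],
    [(550.0, 1.0), (1400.0, 0.5), (2500.0, 0.25), (3500.0, 0.1)],
    [(600.0, 0.8), (1300.0, 0.4), (2600.0, 0.2), (3500.0, 0.1)],
]
pcm = synthesize(frames)
print(len(pcm))
```

Accumulating phase sample by sample keeps each tone continuous as its frequency changes, which avoids the clicks that would result from recomputing the sine argument independently within each frame.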
Twelve volunteers recruited from the undergraduate population of Barnard College and Columbia University performed the word recognition test. Each was a native speaker of English who reported no history of a speech or hearing problem. Listeners were naïve with respect to studies of sine-wave and synthetic speech, and none had already participated in an experiment using such acoustic items.
Listeners were instructed that a series of common English words would be spoken by a speech synthesizer, and were asked to identify the word on each trial. They were encouraged to guess. On each of the 148 trials of the test, a word was selected at random without replacement and presented twice in the clear, with 1 s of silence between the two presentations. A 3 s interval of silence occurred at the end of every trial, during which a participant who recognized the word wrote it in a specially prepared booklet. There were 6 s of silence at the end of every tenth trial to help the listeners track the sequence during the progress of the test.
A participant contributed two scores to the analysis, the proportion of the items identified correctly in the easy word set and in the hard word set. Every subject identified easy words better than hard words, with an average performance of 0.42 of the easy words identified and 0.25 of the hard words identified. A t-test performed on the paired scores revealed that the performance difference on the two tests could not be attributed to chance [t = 9.705, df = 11, p < .001]. A graphic representation of the group data is shown in the left panel of the histogram in Figure 2.
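For readers who wish to see the form of this comparison, a paired t-test on per-listener proportions can be sketched as below. The twelve pairs of scores are invented for illustration and are not the data of Experiment 1, so the statistic they yield is not the value reported above; only the degrees of freedom, n - 1 = 11, correspond to the design.

```python
import math
import statistics


def paired_t(easy, hard):
    """Paired t statistic and degrees of freedom for matched scores."""
    diffs = [e - h for e, h in zip(easy, hard)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1


# Invented proportions correct for twelve hypothetical listeners.
easy = [0.45, 0.38, 0.41, 0.50, 0.36, 0.44, 0.39, 0.47, 0.40, 0.43, 0.42, 0.37]
hard = [0.27, 0.22, 0.26, 0.30, 0.20, 0.28, 0.21, 0.29, 0.25, 0.24, 0.26, 0.22]

t, df = paired_t(easy, hard)
print(f"t = {t:.3f}, df = {df}")
```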
The replication attempted in Experiment 1 showed that the finding of differential levels of recognition of easy and hard words persisted in a novel combination of conditions. All of the speech samples presented in this test derived from the same talker, and natural acoustic vocal products were eliminated in the creation of sine-wave replicas. However, differences between this test and precedents in materials and procedure made it difficult to predict the outcome that was observed here. Presenting natural speech samples in the clear, Bradlow and Pisoni (1999) had observed a similar performance difference when all of the test items were spoken by a single individual. Yet, listeners in their project were not challenged to accommodate an absence of natural acoustic elements, nor to rely on time-varying properties of the carrier, nor to tolerate an anomalous vocal timbre. In the study reported by Trout (2005), test items were stripped of natural acoustic elements by vocoding the speech spectrum in noise bands, but the items on each trial varied in the issuing talker, adding signal-dependent uncertainty across the test block. The observation here of a difference in recognition performance with easy and hard words established the potential of this measure as an assay of the cause of improved performance following exposure to sine-wave speech, as Experiment 2 undertook to determine.
If versatility in speech perception depends on flexibility in the use of acoustic properties for phonetic purposes (Remez, 2005), and is sustained by the stability of the lexicon (Cutler, 2008), fast retuning might be evident after brief exposure to some aspects of sine-wave speech. Indeed, recent studies of perceptual tuning to idiosyncrasies of individual talkers suggest a rapid and supple accommodation of idiolectal variation, at least within the native language (Allen & Miller, 2004; Clarke & Garrett, 2004; Kraljic, Brennan & Samuel, 2008). In Experiment 2, a procedure compared exposure to the acoustic form of the contrasts and the idiolectal characteristics of a specific talker. Three kinds of exposure were provided, each to a different group of listeners, preliminary to the word recognition test: 1) sine-wave sentences based on speech of the same talker whose samples were used as models for the easy and hard words; 2) natural sentences of the talker whose utterances were used as models for the sine-wave words, to provide familiarity with the idiolect of the sine-wave words without also creating familiarity with sine-wave timbre; and, 3) sine-wave sentences based on natural samples of a different talker, to familiarize listeners with the timbre of sine-wave speech without also producing experience of the idiolect of the talker who produced the models for the sine-wave words. The difference in the recognition of easy and hard sine-wave words evident in Experiment 1 served as a null-exposure baseline. When exposure was effective, the recognition performance of easy and hard sine-wave words changed, which served as an index of the relative contribution of sensitivity to signal-dependent phonetic detail in the tone patterns and of cognitive use of the signal-independent characteristics of lexical neighborhood structure.
Although the outcome again was difficult to predict, two thematic predictions were considered. First, exposure that fostered the resolution of phonetic attributes of sine-wave speech would be observed as enhanced recognition of easy and hard words alike. If acuity improved in apprehending features of segmental contrast, this would hypothetically promote discrimination between any target word and its lexical neighbors. Second, if preliminary experience facilitated the use of signal-independent functions alone, perhaps by reducing competition for limited cognitive resources in general, this might be seen as favoring recognition of the cognitively less demanding lexical class, specifically, the easy words, over the more demanding lexical class, the hard words. This interpretation is encouraged by the precedent of Bradlow and Pisoni, who observed such differential benefit in performance by second language listeners.
Two kinds of test material were used in this experiment, sentences used in an exposure interval, and easy and hard sine-wave words used in a spoken word identification test. There were three kinds of exposure item. One, Same Talker Natural, was a set of seventeen natural utterances produced by one of the authors (R. E. R., an adult male), the same talker whose speech served as the model for the easy and hard sine-wave words. They were semantically ordinary items taken from lists of sentences designed for intelligibility tests. Four words in the easy test set and none of the words in the hard test set occurred within these sentences. They were recorded in the manner described for natural speech in Experiment 1, and were stored in sampled data format with 16-bit amplitude resolution at a sampling rate of 44.1 kHz.
A second set of exposure items, Same Talker SW, was a set of seventeen sine-wave sentences. They were produced by interactive tracing of the spectral prominences of natural utterances in the manner described for sine-wave words in Experiment 1, yielding synthesis parameters for each sentence for four tones using a temporal grain of 10 ms. The natural models for these sine-wave sentences were different utterances than those that were used in the Same Talker Natural condition. The waveforms of the sine-wave sentences were calculated at 16-bit amplitude resolution and a sampling rate of 44.1 kHz and stored in sampled data format.
A third set of exposure items, Different Talker SW, was a set of seventeen sine-wave sentences modeled on natural utterances spoken by one of the authors (J. S. P., an adult female). Seven of the words in the easy test set and none of the words in the hard test set occurred within the sentences. The natural speech was recorded in a sound-shielded enclosure with a stand-microphone, sampled at 44.1 kHz, and interactive tracing of the spectral prominences produced sine-wave synthesis parameters at a temporal grain of 10 ms. The resulting sine-wave waveforms were calculated at 16-bit amplitude resolution at a sampling rate of 44.1 kHz and were stored in sampled data format.
Both sets of sine-wave sentences were composed to produce high levels of intelligibility. For a brief warm-up sequence to use in each of the two conditions of sine-wave exposure, three additional sentences were created from natural utterances spoken by the two talkers.
The identification test of easy and hard sine-wave words used the items prepared for Experiment 1.
In the experimental procedures, the acoustic test materials were transferred to compact disc, and were presented at a nominal level of 68 dB SPL via Beyerdynamic DT770 headphones to listeners seated in a sound-attenuating chamber.
Thirty-six volunteers were recruited from the undergraduate population of Barnard College and Columbia University. Participants were randomly assigned to one of the three exposure conditions, composing three groups of twelve listeners. Each was a native speaker of English who reported no history of a speech or hearing problem. Listeners were naïve with respect to studies of sine-wave and synthetic speech; none had already participated in an experiment using such acoustic items, nor was any listener familiar with the natural speech of either talker whose utterances were used in this procedure.
A test session included two parts, an exposure interval and an identification test of easy and hard sine-wave words. Participants were instructed at the beginning of the session that there were two parts, the first composed of sentences and the second composed of words. In the two exposure intervals that used sine-wave sentences, listeners were informed that the test items were a type of synthetic speech, and a three-sentence sample was provided of sine-wave speech of the same kind as the exposure items, Same Talker SW or Different Talker SW. An exposure interval consisted of seventeen trials, one for each of the seventeen sentences. On a trial a sentence was presented five times with 1 s of silence between repetitions and 3 s of silence between trials. A listener was instructed to transcribe each sentence in a specially prepared test booklet, and was permitted to write during the presentation. Following the block of sentences and a brief intermission, the identification test of easy and hard words occurred. The method of presentation was identical to the procedure of Experiment 1, in which each word occurred twice on a trial, and each listener wrote the recognized word in a test booklet.
Sentence transcriptions were scored and performance was uniformly good, with natural sentences transcribed nearly error-free and sine-wave sentences identified at a high performance level despite a difference between the talkers (Same Talker SW = 93% correct, Different Talker SW = 78% correct, t = 5.057, df = 32, p < .0001). Each of the thirty-four sine-wave sentences was identified correctly by several listeners. This incidental result encourages the conclusion that the exposure interval of each group provided experience of both auditory and linguistic properties of sine-wave and natural speech.
For the critical dependent measures, a participant contributed two scores to the analysis, each a proportion of the items identified correctly in the easy word set and in the hard word set. A two-way mixed-design analysis of variance was performed on the between subjects factor of Exposure (Same Talker SW, Same Talker Natural, Different Talker SW, and the no-exposure baseline of Experiment 1) and the within subjects factor of Word Set (Easy or Hard). This analysis revealed main effects of each treatment and no interaction (Exposure: F (3, 44) = 4.037, p < .01; Word Set: F (1, 44) = 162.7, p < .001; the interaction of Exposure and Word Set: F (3, 44) = 2.114, p > .1). A post hoc means test (Scheffé, α = .05) revealed that performance in the two conditions of Experiment 1 did not differ from performance in two of the exposure conditions tested here: Same Talker Natural and Different Talker SW. To summarize the results, easy words were recognized better than hard words in every condition; exposure to natural sentences of the talker whose utterances were used as models for the sine-wave words did not differ from no exposure, nor did it differ from exposure to sine-wave speech of a different talker. Moreover, recognition improved for easy and for hard words alike after exposure to sine-wave speech produced by the talker who spoke the natural models for the sine-wave words. A graphic representation of the group data is shown in the right panel of the histogram in Figure 2.
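The degrees of freedom reported for this analysis can be checked with a short calculation. The sketch below assumes a balanced mixed design and, as an inference from the reported values rather than a statement in the text, treats the twelve listeners of Experiment 1 as a fourth, no-exposure group alongside the three exposure groups. The function name is hypothetical.

```python
def mixed_design_df(n_groups, n_per_group, n_within_levels):
    """Degrees of freedom for a balanced two-way mixed-design ANOVA
    with one between-subjects and one within-subjects factor."""
    n_subjects = n_groups * n_per_group
    between = n_groups - 1
    between_error = n_subjects - n_groups      # subjects within groups
    within = n_within_levels - 1
    interaction = between * within
    within_error = between_error * within
    return {
        "between": (between, between_error),
        "within": (within, within_error),
        "interaction": (interaction, within_error),
    }


# Four groups of twelve listeners, two word sets (Easy, Hard).
dfs = mixed_design_df(n_groups=4, n_per_group=12, n_within_levels=2)
print(dfs)
```

With four groups of twelve, the calculation yields (3, 44) for Exposure, (1, 44) for Word Set, and (3, 44) for the interaction, matching the values reported; three groups alone would give (2, 33), (1, 33), and (2, 33).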
Because of the slightly poorer intelligibility of the Different Talker SW sentences, it seemed useful to consider the possibility of predicting performance on the Easy and Hard word sets from the intelligibility of this preliminary block of sentences. The literature is arguably clear in finding that perceptual projection into the lexicon is a general condition on learning the characteristics of a specific talker (reviewed by Cutler, 2008; however, see Bertelson, Vroomen & de Gelder, 2003). In Experiment 2, listeners transcribed the Different Talker SW sentences at 78% correct, suggesting that poorer intelligibility of these sentences might have been responsible for the absence of an effect of exposure observed on the test of easy and hard words. However, the sentence and word recognition performance of subjects exposed to Different Talker SW sentences exhibited no clear pattern of dispersion, precluding a statistical finding of a relation between sentence intelligibility and word recognition.
In this study, the different aspects of exposure in the recognition of sine-wave words permit two clear conclusions. First, when the same talker produced the utterances used as models for sine-wave sentences and words, there was parallel improvement in easy and hard word recognition. This shows that the effect of exposure enhanced the resolution of fine phonetic detail from which to distinguish words regardless of the characteristics of the lexical neighborhood. By improving phonetic acuity preliminary to the identification of words, subtler distinctions were promoted between words at the point of lexical identification, and this kind of signal-dependent improvement is projectable into both kinds of lexical neighborhood. Alternatively, a hypothetical improvement in the use of lexical knowledge would have favored identification of easy words more than hard words, due to the differential advantages inherent in this signal-independent aspect of spoken word identification.
Second, there was no evidence of a change in spoken word identification of any kind either from exposure to sine-wave timbre or to the idiolect in which the sine-wave words were spoken. This is slightly surprising, considering that subphonemic details of the kind that distinguish an idiolect phonetically are likely contact points available for lexical access, especially for words of lesser frequency of occurrence (Goldinger, 1998; Luce, Goldinger, Auer & Vitevitch, 2000). Similarly, an experience of the auditory quality of intelligible sine-wave sentences might have created a readiness on the listener’s part to encode the details of any talker conveyed this way. Instead, the exposure that produced benefit was limited to the combined experience of the sine-wave carrier and the subphonemic phonetic assortment specific to the sine-wave words in the recognition test.
This discussion of the recognition of easy and hard sine-wave words started with a hypothetical description of the signal-dependent and signal-independent influences on intelligibility, continued with a rationale for construction of the two sets of test words, and proceeded to report differences in recognition performance in this sine-wave variant of the technique. Throughout, the explanation of the performance has appealed to the differing demands for resolving a common word amid rarer and fewer neighbors and for resolving a rarer word among common and more numerous neighbors. There is clearly greater uncertainty surrounding the recognition of a hard word than an easy word in this test, though this conclusion lacks one critical proof with the test items used here. All methods of estimation and synthesis are error prone, and if specific phonetic properties happened to be more or less difficult to produce synthetically, then differential distributions of segmental constituents might have affected intelligibility across the test sets. Experiment 3 was a test to estimate residual effects on recognition attributable to the inherent properties of the synthetic test items themselves by imposing conditions that eliminated the contributions of signal-independent uncertainty (Sommers, Kirk & Pisoni, 1997).
In fact, the segmental characteristics of Easy and Hard words do differ, a consequence of accidental properties of English given the neighborhood requirements for composing these two sets. Specifically, Easy words vary more in initial consonant than Hard words do, and vary far more in final consonant. Moreover, the consonants differ in distribution across the two sets. Approximately half of the Easy words begin, in declining incidence, with: /F L D P G W/; in contrast, half of the Hard words begin with: /M H B W/. Approximately half of the Easy words end in: /W L Θ Z K/; in contrast, half of the Hard words end in: /T N D/. By using a procedure to eliminate signal-independent effects on identification due to uncertainty, this test exposed any residual signal-dependent differences in word recognition caused by errors in estimating spectrotemporal properties when creating the sine-wave synthesis parameters.
Each word in the test set was presented for recognition in a three-alternative forced-choice task, equating uncertainty of identification for easy and hard words. Two tests were used, one in which the alternatives included the item itself and two single-feature variants in the onset consonant, and a second in which the alternatives included the item itself and two single-feature variants in the coda consonant. This format had the effect of assessing the quality of the sine-wave synthesis as well as imposing the same uncertainty for easy and hard words.
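As a rough illustration of what equating uncertainty means in a closed-set task, the following sketch computes the guessing rate and the uncertainty in bits for a fixed number of alternatives. It assumes that the printed alternatives are equally likely a priori; the function names are illustrative and are not part of the reported method.

```python
import math

def chance_level(n_alternatives: int) -> float:
    """Probability of a correct response when guessing among equally likely alternatives."""
    return 1.0 / n_alternatives

def uncertainty_bits(n_alternatives: int) -> float:
    """Shannon uncertainty, in bits, of a uniform choice among n alternatives."""
    return math.log2(n_alternatives)

# Assumption: each of the three printed alternatives is equally likely a priori,
# so every word, easy or hard, carries the same nominal uncertainty.
N_ALTERNATIVES = 3
print(f"chance = {chance_level(N_ALTERNATIVES):.3f}")
print(f"uncertainty = {uncertainty_bits(N_ALTERNATIVES):.3f} bits")
```

Under this assumption the guessing rate is one in three for every item, regardless of the size or frequency composition of its natural lexical neighborhood.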
The easy and hard sine-wave words prepared for Experiment 1 were used in two tests differing only in the printed alternatives from which listeners identified the target words. Both tests used a three-alternative forced-choice procedure, one with foils differing from the target in the Onset consonant and one with foils differing in the Coda consonant. Alternatives were constructed to differ in a single phonemic feature from the target word, obliging listeners to resolve a single-feature contrast in order to distinguish a target from its alternatives. No English words could be found to compose a pair of single-feature alternatives for both Onset and Coda conditions for seven of the words in the easy set and one word in the hard set; these items were therefore excluded from the test.
In the experimental procedures, the acoustic test materials were transferred to compact disc, and were presented at a nominal level of 68 dB SPL via Beyerdynamic DT770 headphones to listeners seated in a sound-attenuating chamber, who indicated their responses in specially prepared test booklets.
Twenty-four volunteers were recruited from the undergraduate population of Barnard College and Columbia University. A participant was randomly assigned to one of the two conditions of response alternatives, Onset or Coda, composing two groups of twelve listeners. Each was a native speaker of English who reported no history of a speech or hearing problem. Listeners were naïve with respect to studies of sine-wave and synthetic speech, and none had already participated in an experiment using such acoustic items.
Listeners were instructed that they would hear a series of common English words produced by a speech synthesizer, and were asked to identify the word on each trial by choosing the alternative that best matched it. They were encouraged to guess. On each of the 140 trials of the test, a word was selected at random without replacement and presented twice in the clear, with 1 s of silence between the two presentations. A 3 s interval of silence occurred at the end of every trial, during which a participant indicated the chosen alternative in a test booklet. There were 6 s of silence at the end of every tenth trial to help the listeners track the sequence during the progress of the test.
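The trial structure described above can be sketched as a simple schedule. This is a minimal illustration, not the software used in the study: the word labels are placeholders, the dictionary fields are hypothetical, and the sketch assumes that the 6 s interval replaces, rather than adds to, the 3 s interval on every tenth trial. Word durations are not reported here, so only the silent intervals are tallied.

```python
import random

N_TRIALS = 140            # number of trials, as reported
GAP_BETWEEN_REPEATS = 1   # s of silence between the two presentations of a word
RESPONSE_INTERVAL = 3     # s of silence at the end of an ordinary trial
LONG_INTERVAL = 6         # s of silence at the end of every tenth trial

def build_schedule(words, seed=None):
    """Return one trial record per word, in random order without replacement."""
    rng = random.Random(seed)
    order = rng.sample(words, len(words))  # each word is drawn exactly once
    schedule = []
    for trial_number, word in enumerate(order, start=1):
        # Assumption: the 6 s interval replaces the 3 s interval on every tenth trial.
        end_silence = LONG_INTERVAL if trial_number % 10 == 0 else RESPONSE_INTERVAL
        schedule.append({
            "trial": trial_number,
            "word": word,
            "presentations": 2,
            "gap_s": GAP_BETWEEN_REPEATS,
            "end_silence_s": end_silence,
        })
    return schedule

def total_silence(schedule):
    """Sum of all silent intervals in the session, in seconds."""
    return sum(t["gap_s"] + t["end_silence_s"] for t in schedule)

# Placeholder labels stand in for the actual test words.
words = [f"word_{i:03d}" for i in range(1, N_TRIALS + 1)]
schedule = build_schedule(words, seed=1)
print(total_silence(schedule))
```

Under these assumptions the silent intervals alone sum to 602 s, so a session would last somewhat more than ten minutes once the durations of the 280 word presentations are added.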
A participant contributed two scores to the analysis: the proportion of items identified correctly in the easy word set and the proportion identified correctly in the hard word set. The scores were submitted to a two-way mixed-design analysis of variance with the between-group factor Alternative (Onset or Coda) and the within-group factor Word Set (Easy or Hard). The analysis showed no performance difference between easy and hard words, while identification performance overall was very good, approximately 88% correct. The group data are shown in a histogram in Figure 3.
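The two scores that each participant contributed can be illustrated with a short scoring sketch. The response records below are hypothetical placeholders, not data from the study, and the function name and record layout are assumptions made for illustration. The analysis of variance itself is not implemented here.

```python
from collections import defaultdict

def proportion_correct(responses):
    """Compute the proportion of correct identifications per word set.

    responses: iterable of (word_set, target, chosen) tuples for one participant.
    Returns a dict mapping each word set to its proportion correct.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for word_set, target, chosen in responses:
        total[word_set] += 1
        if chosen == target:
            correct[word_set] += 1
    return {word_set: correct[word_set] / total[word_set] for word_set in total}

# Hypothetical responses for one participant (placeholder labels, not study data).
toy_responses = [
    ("easy", "A", "A"), ("easy", "B", "B"), ("easy", "C", "D"), ("easy", "E", "E"),
    ("hard", "F", "F"), ("hard", "G", "H"), ("hard", "I", "I"), ("hard", "J", "J"),
]
scores = proportion_correct(toy_responses)
print(scores)
```

In a design like the one described, the easy and hard proportions for each listener would form the within-group observations, with the Onset or Coda assignment as the between-group factor.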
In this procedure, the uncertainty in identification of each word was equated in the number of alternatives, the feature contrast between the target and alternatives, and the degree of phonemic difference between the target and the alternatives. With this constraint imposed, every word presented to listeners for identification had a provisional and small lexical neighborhood of identical structure and, therefore, uniform uncertainty at the point of identification. If this procedure is adequately sensitive to inherent differences in identifiability among words, neutralizing the performance difference due to natural lexical neighborhood structure should have revealed any residual difference in intelligibility that could not be attributed to signal-independent characteristics. Validating the methods of Experiments 1 and 2, the findings of Experiment 3 included very high rates of identification, far greater than the performance observed in Experiments 1 and 2 with open-set identification, and neither the phonotactic manipulation nor the long-established properties of the natural lexical neighborhoods affected performance. For now, the evidence favors the conclusion of Experiment 2: Beneficial short-term exposure must include both auditory and phonetic characteristics of the target words.
The present project sought to measure the changes in perceptual function that accompany exposure to sine-wave speech. This kind of signal, like noise-band vocoded speech and acoustic chimeras of speech, remains highly intelligible despite an absence of natural acoustic products of vocalization from the acoustic stream. A listener who has accommodated this extreme perturbation on the perception of speech expresses the epitome of versatility, and the three tests reported here aimed to calibrate the components of this perceptual feat by assessing signal-dependent and signal-independent functions. Despite the plausibility of the hypothesis that perceivers would benefit from exposure to samples of idiolect or carrier characteristics alone, an effect of exposure was observed only when these two aspects of sine-wave speech occurred together.
The finding that exposure to sine-wave examples improved attention at a narrow phonetic grain is roughly consistent with some precedents, though there are a few discrepancies to explain. To note a consistent outcome, exposure to sentences synthesized by rule improved the intelligibility of novel isolated words also synthesized by rule (Greenspan, Nusbaum & Pisoni, 1988). However, this project trained listeners over days, and it is not evident that the brief span between exposure and testing in Experiment 2 here implicates the same perceptual function. In another precedent, Eisner & McQueen (2006) observed long-lasting talker-specific effects on allophonic resolution tested with isolated VC syllables induced by exposure to fluent speech. The adaptation endured at least 12 hours, despite exposure to other speech. It will be useful for new studies to reconcile the functions of rapid adaptation (e.g., Clarke & Garrett, 2004) with learning brought about by long-term training (e.g., Allen & Miller, 2004) as well as ordinary exposure (Remez et al., 1997).
Similarly, a conflicting pair of findings is also pertinent to the relation between exposure to speech and subsequent perception of spoken words. In this case, attention during exposure was drawn to the characteristics of individual talkers rather than the linguistic properties of speech as such (Nygaard & Pisoni, 1998; Nygaard et al., 1994). Specifically, exposure to talkers was provided in a training regimen that taught listeners to identify unfamiliar individuals within a random sequence of ten talkers. A set of utterances originally spoken in the clear, half of which had been produced by the talkers in the training set, were presented in varying noise loads to assess the effect of this training on intelligibility. In this method, there were clear benefits of learning to identify talkers from isolated words when the test of intelligibility was also composed of isolated words, and control measures showed that mere exposure without learning to classify the talkers was insufficient to produce substantial improvements in word recognition. However, in a condition specifically relevant to the present studies, exposure in the form of spoken sentences did not influence identification performance with spoken words at all. From a perspective of perceptual learning (Gibson, 1969) the performance enhancement observed in those studies was contingent on diligent exercise in categorization during the exposure phase, hypothetically driving the listener’s attention to an exact set of distinctive attributes underlying talker and word identification alike. In this regard, it is useful to note the revolutionizing effect of such findings on our understanding of the interchange between linguistic and indexical perception (reviewed by Pardo & Remez, 2006; Pisoni & Levi, 2007). Yet, the match between training and test materials required to calibrate an effect of exposure does not describe the critical conditions reported here.
The method of the present project found that exposure to sine-wave sentences improved sine-wave word recognition despite the mismatch in linguistic unit. An effect of exposure was only observed when acoustic materials matched both in talker and sine-wave carrier. A difference in procedure could be responsible for this difference in outcome from the talker training projects that saw the transfer of exposure depend on a match in the linguistic unit between exposure and subsequent assay. Perhaps the scalable effect observed over such a brief period occurred because the test materials used during the exposure interval in Experiment 2 came from a single talker, in contrast to the mix of ten talkers who contributed utterances in the exposure phase of the prior tests. In addition, the use of sine-wave speech might have drawn attention to a more abstract level of the auditory-to-phonetic projection than is likely to occur in a training task using natural speech of several talkers (McLennan, Luce & Charles-Luce, 2003; Sheffert et al., 2002). Amid the variation in vocal timbre due to glottal characteristics, to a habitual articulatory posture (Nolan, 1983) that arguably affects vocal timbre, and to idiosyncratic acoustic details that affect the phenomenality of speech, initial attention is naturally captured by these qualitative properties. Because these are absent in sine-wave speech, the perceiver is left with no less abstract property of the tones than the aggregate variation in the spectrotemporal pattern, apprehended as a linguistically governed syllable train with onsets and codas (Remez & Rubin, 1990). Potentially, this property has the capability to evoke perceptual versatility in an exposed listener in a way that natural spectrotemporal details of a talker do not, even were these readily committed to memory. The similarities and contrasts of the present findings with precedents can only be reconciled by new studies that pursue these exact premises.
To conclude, the methods applied in our project revealed that:
Neither exposure to the attributes of the carrier alone nor exposure to the attributes of the talker alone is sufficient to promote sine-wave intelligibility.
Improved performance required exposure to both the anomalous sinusoidal carrier and talker-specific phonetic attributes.
Effective exposure improved performance for easy and hard words alike. This implicates an auditory-to-phonetic process, rather than an increase in reliance on lexical structure in the perception of sine-wave speech.
The authors are grateful for the wise counsel of Ann Bradlow and J. D. Trout as we launched this project, and for the careful attention of the anonymous reviewers and Editor, Mitchell Sommers, as we sought to improve our report. This research was supported by a grant from the National Institute on Deafness and Other Communication Disorders (DC00308).
Publisher's Disclaimer: The following manuscript is the final accepted manuscript. It has not been subjected to the final copyediting, fact-checking, and proofreading required for formal publication. It is not the definitive, publisher-authenticated version. The American Psychological Association and its Council of Editors disclaim any responsibility or liabilities for errors or omissions of this manuscript version, any version derived from this manuscript by NIH, or other third parties. The published version is available at www.apa.org/pubs/journals/xhp