This study examined the redundancy of spectral and temporal information in everyday sentences, which were reduced to 16 rectangular spectral bands having center frequencies ranging from 250 to 8000 Hz, spaced at 1/3-octave intervals. High-order filtering eliminated contributions from transition bands, and the widths of the resulting effectively rectangular speech bands were varied from 4% down to 0.5% of center frequency. Intelligibility of these sub-critical bandwidth stimuli ranged from nearly perfect in the 4% bandwidth conditions, down to nearly zero in the 0.5% bandwidth conditions. However, a large intelligibility increase was obtained under the narrower filtering conditions when the speech bands were used to vocode broader noise bands that approximated critical bandwidths (ERBn) at the 16 center frequencies. For example, the 0.5%- and 1%-bandwidth speech stimuli were only about 1% and 20% intelligible, respectively, whereas scores of about 26% and 60%, respectively, were obtained for the ERBn-wide noise bands modulated by the speech bands. These large intelligibility increases occurred despite elimination of spectral fine structure and the addition of stochastic fluctuations to the speech-envelope cues. Results from additional experiments indicate that optimal temporal processing requires that envelope cues stimulate a majority of the fibers comprising an ERBn.
Previous studies have shown that speech is highly redundant in the spectral domain, remaining intelligible despite substantial reductions in bandwidth (e.g., Warren et al., 1995) and/or the removal of spectral fine structure (e.g., Shannon, 1995). The present study examines the lower limits of this redundancy by using high-order FIR filtering to reduce “everyday” sentences to 16 rectangular bands having sub-critical bandwidths ranging from 4% down to 0.5% of center frequency. The generation of such narrow bands of speech requires extremely steep, effectively vertical filter slopes: Warren, Bashford, and Lenz (2004) found that transition bands having a 1200 dB/octave roll-off contributed over 30% of the intelligibility of a nominal 1/3-octave passband, and they found that intelligibility of the speech band did not reach its minimum until roll-off was increased to 4800 dB/octave. In the present study that roll-off was maintained across the spectrum through frequency-dependent adjustments in the order of FIR filtering.
In addition to examining the limiting bandwidth for the processing of natural speech, containing both fine structure and envelope cues, the present study focused on the spectral limits of temporal-envelope processing using noise-vocoded stimuli. Of particular interest was determining whether the level of resolution involved in temporal processing corresponds to the widely used estimate of the auditory filter bandwidth, the ERBn (Glasberg and Moore, 1990). This question was addressed in the present study by extracting speech envelopes from extremely narrow, sub-critical bands (2% bandwidth, with a minimum roll-off of 4800 dB/octave) and using those envelopes to modulate rectangular noise bands that varied in bandwidth from 5% to 300% of an ERBn.
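For readers who want to relate the 2% speech bands to the ERBn scale, the following minimal Python sketch evaluates the Glasberg and Moore (1990) formula, ERBn = 24.7(4.37F + 1) with F in kHz, and derives noise-band edges for a given fraction of an ERBn. The function names are illustrative, and centering each band linearly (in Hz) on the center frequency is an assumption; the text does not state how the band edges were positioned.

```python
def erb_n(fc_hz):
    """ERBn in Hz at center frequency fc_hz (Glasberg and Moore, 1990)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def noise_band_edges(fc_hz, erb_fraction):
    """Lower and upper edges (Hz) of a rectangular band that is
    erb_fraction * ERBn wide.

    Assumption: the band is centered linearly on fc_hz.
    """
    half_width = 0.5 * erb_fraction * erb_n(fc_hz)
    return fc_hz - half_width, fc_hz + half_width


if __name__ == "__main__":
    for fc in (250.0, 1000.0, 8000.0):
        speech_bw = 0.02 * fc  # width of a 2% speech band in Hz
        ratio = speech_bw / erb_n(fc)
        print(f"CF {fc:6.0f} Hz: ERBn = {erb_n(fc):7.1f} Hz, "
              f"2% band = {speech_bw:6.1f} Hz ({100 * ratio:.0f}% of ERBn)")
    # Edges for the narrowest and widest vocoder bands at 1000 Hz
    print(noise_band_edges(1000.0, 0.05))
    print(noise_band_edges(1000.0, 3.00))
```

At 1000 Hz, for instance, the formula gives an ERBn of about 133 Hz, so a 2% (20 Hz) speech band covers roughly 15% of one ERBn.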
The 120 listeners in this study (6 groups of 20) were undergraduate students at the University of Wisconsin-Milwaukee who were paid for their participation. All listeners were native monolingual English speakers who reported having no hearing problems and had normal bilateral hearing, as measured by pure tone thresholds of 20 dB HL or better at octave frequencies from 250 to 8,000 Hz.
The speech stimuli were the 100 (10 lists of 10) CID "everyday" sentences that contain 500 keywords used for scoring (Silverman and Hirsh, 1955). They were derived from a broadband digital recording (44.1 kHz sampling, 16-bit quantization) used previously by Warren et al. (1995) in a preliminary study of spectral redundancy. The recording was produced by a male speaker having no evident regional accent and an average voicing frequency of 100 Hz.
Fig. 1 shows the long-term averaged spectra of the original broadband sentences (in green), along with the spectrum of sixteen rectangular bands (shown in yellow) extracted from the broadband stimulus through FIR filtering. The bands shown have a fixed bandwidth of 2%, and center frequencies ranging from 250 Hz to 8000 Hz, spaced at 1/3-octave intervals. Fig. 2 shows the long-term spectrum of the 2% speech bands as they were presented in experiment 1, combined with their spectrum levels equated.
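The band layout described above can be sketched in a few lines of Python. The sketch assumes exact 1/3-octave spacing starting at 250 Hz (so that the sixteenth band falls at 8000 Hz) and band edges placed symmetrically in Hz around each center frequency; the function names are illustrative, and the study may have used nominal 1/3-octave frequencies or a different edge convention.

```python
def center_frequencies(f_low=250.0, n_bands=16):
    """Center frequencies spaced at exact 1/3-octave intervals from f_low."""
    return [f_low * 2.0 ** (k / 3.0) for k in range(n_bands)]


def band_edges(fc_hz, percent_bw):
    """Edges of a rectangular band whose width is percent_bw % of fc_hz.

    Assumption: edges are symmetric in Hz about the center frequency.
    """
    half_width = 0.5 * (percent_bw / 100.0) * fc_hz
    return fc_hz - half_width, fc_hz + half_width


if __name__ == "__main__":
    for fc in center_frequencies():
        lo, hi = band_edges(fc, 2.0)
        print(f"CF {fc:8.1f} Hz: 2% band from {lo:8.1f} to {hi:8.1f} Hz")
```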
In FIR filtering, a given number of taps will produce a roll-off that is fixed in dB/Hz. Hence, in order to produce a roll-off fixed in dB/octave, it is necessary to employ a proportionally greater number of taps as the cutoff frequency decreases. In order to produce the 4800 dB/octave roll-off required for this study, the number of filter taps ranged from a low of 1404, required for the higher frequency skirt of the 8000 Hz CF band, to a high of 382,000, required for the lower frequency skirt of the 250 Hz CF band. Filtering of the sentences was accomplished using the fir1 function in MATLAB 220.127.116.111 running on a Mac Pro with a 2 × 2.93 GHz quad-core Intel Xeon processor.
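The scaling argument can be illustrated with a short Python sketch. It converts a slope in dB/octave into a transition width in Hz at a given cutoff, and uses the general property of windowed FIR designs that the number of taps is inversely proportional to the transition width in Hz. The 60 dB attenuation depth, the 2% band edges, and the function names are assumptions made for illustration. The sketch shows only the direction and rough size of the scaling; it does not attempt to reproduce the tap counts reported above, which depend on design details not given in the text.

```python
def transition_width_hz(cutoff_hz, slope_db_per_octave, depth_db, upper_skirt):
    """Width in Hz of a skirt that falls depth_db at the given slope.

    depth_db / slope gives the skirt width in octaves; the width in Hz
    then scales with the cutoff frequency.
    """
    octaves = depth_db / slope_db_per_octave
    if upper_skirt:
        return cutoff_hz * (2.0 ** octaves - 1.0)
    return cutoff_hz * (1.0 - 2.0 ** (-octaves))


SLOPE = 4800.0  # dB/octave, as in the study
DEPTH = 60.0    # dB, assumed attenuation depth (not stated in the text)

# Assumed cutoffs: edges of 2% bands at the extreme center frequencies
high_skirt = transition_width_hz(8080.0, SLOPE, DEPTH, upper_skirt=True)
low_skirt = transition_width_hz(247.5, SLOPE, DEPTH, upper_skirt=False)

# For a fixed sampling rate, taps scale as 1 / transition width (in Hz),
# so the ratio of widths is the ratio of required tap counts.
ratio = high_skirt / low_skirt

if __name__ == "__main__":
    print(f"Upper skirt at 8080 Hz: {high_skirt:.2f} Hz wide")
    print(f"Lower skirt at 247.5 Hz: {low_skirt:.3f} Hz wide")
    print(f"Relative tap count (low-frequency skirt vs. high): {ratio:.1f}x")
```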
Because the present study employs exceedingly high orders of filtering to produce sub-critical bandwidths, antiphasic cancellation was used to examine the stimuli for possible artifacts or signal distortions. Using the same filtering procedure, sixteen contiguous 1/3-octave speech bands were extracted from the broadband speech signal, and were then phase inverted and added back to the broadband signal. Figure 3 shows the spectrum of the original signal (in green) along with a tracing of spectral components that remained in the range of the added antiphasic passbands after cancellation (yellow tracing). It can be seen that cancellation within the passbands was approximately 60 dB, indicating negligible distortion of the speech signal. Moreover, the ringing components at the cutoff frequencies (yellow spikes), produced by FIR filtering, were more than 30 dB down for each of the speech bands.
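To give a sense of what a 60 dB cancellation implies, the following Python sketch measures cancellation depth for a simplified, single-component case: a tone stands in for the in-band portion of the original signal, and a copy with a 0.1% gain error stands in for the extracted band. Both signals and the function names are hypothetical; the study's measurement was made on the speech spectrum itself.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def cancellation_db(original, extracted):
    """Residual level, in dB re the original, after the extracted band is
    phase inverted and added back (equivalent to subtracting it)."""
    residual = [o - e for o, e in zip(original, extracted)]
    return 20.0 * math.log10(rms(residual) / rms(original))


FS = 44100  # sampling rate used for the recordings in the study
# Hypothetical in-band component: a 1000 Hz tone, 0.1 s long
tone = [math.sin(2 * math.pi * 1000 * n / FS) for n in range(4410)]
# Hypothetical filter output with a 0.1% gain error
extracted = [1.001 * v for v in tone]

depth = cancellation_db(tone, extracted)

if __name__ == "__main__":
    print(f"Cancellation depth: {depth:.1f} dB")
```

Under these assumptions, a residual 60 dB down corresponds to an amplitude error of about one part in a thousand within the passband.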
One group of listeners in this study was presented with 16-band speech stimuli varying in bandwidth from 4% to 0.5%. The remaining groups of listeners were presented with 16-band stimuli consisting of rectangular noise bands that were modulated by the temporal envelopes of corresponding speech bands extracted by full-wave rectification. In experiment 1, the vocoded noise bands were filtered either to match the bandwidth of their parent speech bands, or to exceed that bandwidth, with expansion to either 1/3-octave or 1 ERBn. Figure 4 shows the long-term spectrum of an ERBn band stimulus. Two additional experiments were conducted with vocoded stimuli. In experiment 2, the vocoded noise bands ranged from 5% to 80% of an ERBn, and in experiment 3 they ranged from 100% to 300% of an ERBn.
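The single-band vocoding step can be sketched in Python as follows. Only the full-wave rectification comes from the text; the moving-average smoothing, the construction of the noise band as a sum of random-phase sinusoids confined between the band edges, the synthetic stand-in for a narrow speech band, and all function names and parameter values are assumptions made for illustration.

```python
import math
import random


def full_wave_envelope(band, fs, smooth_ms=10.0):
    """Full-wave rectify, then smooth with a running mean.

    The smoothing step and its 10 ms window are assumptions; the text
    specifies only full-wave rectification.
    """
    rect = [abs(v) for v in band]
    w = max(1, int(fs * smooth_ms / 1000.0))
    env, acc = [], 0.0
    for i, v in enumerate(rect):
        acc += v
        if i >= w:
            acc -= rect[i - w]  # drop the sample leaving the window
        env.append(acc / min(i + 1, w))
    return env


def rectangular_noise(lo_hz, hi_hz, fs, n, n_components=32, seed=0):
    """Noise-like carrier with no energy outside [lo_hz, hi_hz], built
    from random-phase sinusoids (a hypothetical stand-in for filtered noise)."""
    rng = random.Random(seed)
    freqs = [rng.uniform(lo_hz, hi_hz) for _ in range(n_components)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    scale = 1.0 / math.sqrt(n_components)
    return [scale * sum(math.sin(2.0 * math.pi * f * t / fs + p)
                        for f, p in zip(freqs, phases))
            for t in range(n)]


FS = 16000
N = 4000  # 0.25 s
# Hypothetical narrow speech band: 1000 Hz carrier with a slow 8 Hz envelope
speech_band = [0.5 * (1.0 - math.cos(2.0 * math.pi * 8 * t / FS))
               * math.sin(2.0 * math.pi * 1000 * t / FS) for t in range(N)]

env = full_wave_envelope(speech_band, FS)
# Carrier roughly one ERBn wide at 1000 Hz (about 133 Hz)
carrier = rectangular_noise(934.0, 1066.0, FS, N)
vocoded = [e * c for e, c in zip(env, carrier)]
```

Note that multiplying by the envelope spreads some energy beyond the carrier's edges, which is presumably why the noise bands in the study were filtered to their final bandwidths.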
Testing was performed in a sound-attenuating chamber, with the stimuli delivered diotically through Sennheiser HD 250 Linear II Headphones at a slow-rms peak level of 60 dBA SPL. Listeners were instructed to call out what the voice was saying as best they could, and were encouraged to guess when unsure.
The four groups of listeners who participated in experiment 1 received five blocks of 20 sentences. The first two blocks of sentences presented were derived from 4% speech bands, and the third, fourth, and fifth blocks of sentences were derived from 2%, 1%, and 0.5% bands, respectively. Stimuli in this experiment were presented in order of decreasing speech bandwidth to allow listeners to adapt to the effects of filtering.
One group of listeners received the 16-band speech stimuli, while the remaining three groups received 16-band vocoded stimuli whose component noise bands were modulated by the temporal envelopes extracted from the speech bands having corresponding center frequencies. For one of the latter groups, the vocoded noise bands matched the bandwidths of the parent speech bands. For the remaining two groups the noise bands were expanded beyond the spectral limits of the speech bands, either to the width of an ERBn or to 1/3-octave.
Figure 5 presents the mean percent intelligibility scores obtained for the four groups of listeners, plotted as a function of speech bandwidth. It can be seen that performance for listeners receiving the filtered speech bands was near ceiling for the second block of sentences, composed of 4% bands. Intelligibility remained high (about 80%) for the 2% bands, and fell to near zero when the widths of the speech bands were decreased to 0.5% of center frequency. For the remaining groups of listeners, who received noise-vocoded stimuli, performance varied dramatically as a function of noise bandwidth. For listeners receiving the noise bands that matched the sub-critical bandwidths of the parent speech bands, intelligibility dropped substantially relative to that obtained with the speech, and most dramatically in the 2% bandwidth condition, with an intelligibility decrease of nearly 60 percentage points. However, this loss of intelligibility was not observed for vocoded stimuli composed of either ERBn or 1/3-octave noise bands. Indeed, intelligibility was generally higher for those vocoded stimuli than for the speech, especially at the most adverse speech bandwidths of 1% and 0.5%. This finding suggests that temporal processing requires that speech envelope cues stimulate a majority of the fibers comprising a critical band. Experiments 2 and 3 examined this hypothesis in greater detail.
In experiment 1, the largest effect of widening the speech-modulated noise bands beyond the spectral limits of their parent speech bands was obtained in the 2% speech bandwidth condition: When the widths of the speech-modulated noise bands were increased from 2% to the width of an ERBn, intelligibility nearly quadrupled (23% vs. 86%). Accordingly, to obtain a more detailed look at the effect of bandwidth in experiment 2, the temporal envelopes extracted from the sixteen 2% speech bands were used to modulate rectangular noise bands that varied from 5% to 80% of an ERBn. The five bandwidths employed (5%, 10%, 20%, 40%, and 80% of ERBn width) were applied to separate 20-sentence sets of 16-band noise-vocoded stimuli. Since accurate estimation of the relative intelligibility of the different bandwidths was crucial in this experiment, the order of presentation for bandwidths was not fixed, but rather was pseudorandomized for each listener, with the restriction that each bandwidth appeared 4 times in each serial position across the 20 listeners.
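One way to satisfy the stated restriction (each of five bandwidths appearing 4 times in each of five serial positions across 20 listeners) is to build the orders from a Latin square. The Python sketch below uses a cyclic square with shuffled labels, repeated four times. This is an assumed construction for illustration only; the text does not describe how the pseudorandomized orders were generated, and a cyclic square yields less varied sequences than a free search over constrained orders would.

```python
import random


def counterbalanced_orders(conditions, n_listeners, seed=0):
    """Return one presentation order per listener such that every condition
    occupies every serial position equally often.

    Uses a cyclic Latin square (an assumed method, not the study's).
    """
    k = len(conditions)
    if n_listeners % k != 0:
        raise ValueError("n_listeners must be a multiple of the number of conditions")
    rng = random.Random(seed)
    labels = list(conditions)
    rng.shuffle(labels)
    # Row r of the square is the label list rotated by r positions
    square = [[labels[(r + c) % k] for c in range(k)] for r in range(k)]
    orders = [list(row) for _ in range(n_listeners // k) for row in square]
    rng.shuffle(orders)  # randomize which listener receives which order
    return orders


BANDWIDTHS = [5, 10, 20, 40, 80]  # percent of ERBn, as in experiment 2
orders = counterbalanced_orders(BANDWIDTHS, n_listeners=20)

if __name__ == "__main__":
    for listener, order in enumerate(orders, start=1):
        print(listener, order)
```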
Figure 6 presents the mean percent intelligibility scores obtained for the five bandwidths presented in experiment 2. It can be seen that intelligibility increased in a roughly linear fashion from about 5% at the minimum bandwidth to about 87% in the 80% ERBn condition. This maximum value was within 1 percentage point of the intelligibility score obtained in experiment 1 for the 100% ERBn noise bands modulated by the 2% speech band envelopes. This finding provides further support for the hypothesis that optimal temporal processing requires that speech envelope cues stimulate a majority of fibers comprising a critical band. Experiment 3 examined the effects of spreading speech envelope cues outside the spectral boundaries of their respective ERBn-wide bands.
The design of experiment 3 was identical to that of experiment 2, with one exception: The widths of the noise bands modulated by the 2% speech band envelopes ranged from 100% to 300% of an ERBn. The results are shown in Figure 7.
Intelligibility in the 100% ERBn condition was 84% in this experiment, which is quite close to the value of 86% obtained for the same condition in experiment 1, which employed a very different experimental design. The mean difference was not statistically significant [F(1,38) = 0.55, p > .45]. It can be seen that intelligibility steadily declined as the speech-envelope cues were spread beyond the boundaries of their respective critical bands, as defined by the ERBn scale.
The results of these experiments indicate that the ERBn scale closely approximates the level of spectral resolution employed in the processing of speech envelope information. The results also indicate that optimal temporal processing requires that envelope cues stimulate a majority of the auditory fibers comprising a critical band. Hence, envelope processing is obligatorily coarse-grained. This limitation may help explain some earlier observations in the literature involving spectrally sparse analogs of natural speech. For example, Souza and Rosen (2009) have reported higher intelligibility for noise-vocoded than sine-vocoded speech when extracted speech envelopes were subjected to 30 Hz rather than 300 Hz lowpass filtering. They suggested that the greater “density” of the noise spectrum under these filtering conditions might facilitate auditory object formation. Alternatively, our results suggest that the greater density of the noise spectrum better ensures optimal processing of envelope cues within critical bands, independent of object formation. The present findings also can explain how the intelligibility of sine-wave speech can be improved by amplitude modulation of the sine-waves that track speech formants (Carrell and Opie, 1992; Lewis and Carrell, 2007). The results of experiments 1 and 2 of this study suggest that the broadening of sine-wave formant trackers through amplitude modulation may improve intelligibility by more fully engaging coarse-grained processing of temporal envelope cues, which can then operate in conjunction with the processing of spectral fine structure.
This work was supported in part by NIH through Grant DC 00208.