The proposed TSM methods were designed to modify, in real time, a speech signal captured by a microphone located near the speaker's mouth or a speech signal sent from a device (e.g. a cellphone or TV). Three TSM methods for real-time speech stretching are proposed: a uniform real-time TSM (algorithm A), described by the authors in an earlier paper [16], and two non-uniform real-time TSMs. One of the non-uniform TSMs (algorithm B) was described in a conference paper [14], while the second provides a novel solution (algorithm C).
All the proposed TSM methods assume that the input signal contains redundant information, i.e. silence passages (pauses between words, sentences, speeches) and prolonged vowels. These parts of the signal may be removed, or at least should not be stretched. Removing them frees extra time in which the stretched speech can be presented.
In addition, as postulated by Coyle [10], Chu [17] and Demol [18], non-uniform TSM was used for methods B and C in order to obtain high-quality, natural-sounding stretched speech. Non-uniform TSM applies different scaling factors to different speech units, i.e. vowels, consonants and phone transitions. The scaling factors are chosen so as to preserve the natural prosody: vowels are stretched with higher factors than consonants, while phone transitions remain intact. The scaling factors also depend on the input speech rate. How they are selected depends on the TSM method; the adjustment procedure is described in the following sections. The block diagram of the proposed real-time TSM method is shown in Figure .
Block diagram of the proposed real-time TSM method.
All of the algorithms used in the content analysis block were described in detail in earlier papers [14], so they are not discussed here. The content analysis consists of a voice activity detection (VAD) algorithm, a vowel detection algorithm, rate of speech (ROS) estimation, stutter detection and phone transition detection. The SOLA (Synchronous Overlap and Add) algorithm was used as the core of the TSM. It has been shown that this algorithm ensures high quality of the stretched speech at low computational complexity [19]. Moreover, SOLA uses a constant analysis time shift and a constant analysis frame length. This allows the content analysis algorithms to be integrated with the TSM procedure in a natural way: each frame of the input signal is first analyzed to identify its content and then, based on the results of the content analysis, the TSM procedure is performed. The parameter determining the amount of time-scale modification is called the scale factor α(t). It is defined by equation (1):
$$\alpha(t) = \frac{S_s}{S_a} \tag{1}$$
where Sa is the time shift of the frame used during the analysis step and Ss is the time shift of the frame used during the synthesis step. If α(t) is greater than 1, the input signal is stretched; if α(t) is lower than 1, the signal is shortened; for α(t) equal to 1, no time-scale modification is performed. Since the TSM is performed only to expand the input signal in time, α(t) takes values equal to or greater than 1.
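As a small numerical illustration of the scale factor, the sketch below derives the synthesis shift from a constant analysis shift and estimates the output duration for a constant α. It assumes the usual SOLA convention that the scale factor is the ratio of the synthesis shift to the analysis shift; the function names and the 10 ms analysis shift are hypothetical and are not taken from the paper.

```python
def synthesis_shift(analysis_shift_s: float, alpha: float) -> float:
    """Synthesis shift Ss for a given analysis shift Sa, assuming alpha = Ss / Sa."""
    if alpha < 1.0:
        # The text restricts the TSM to time expansion only.
        raise ValueError("only time expansion is considered here (alpha >= 1)")
    return alpha * analysis_shift_s


def stretched_duration(duration_s: float, alpha: float) -> float:
    """Approximate output duration when every frame is scaled by the same alpha."""
    return alpha * duration_s


sa = 0.010  # hypothetical 10 ms analysis shift
print(synthesis_shift(sa, 1.5))      # synthesis shift in seconds
print(stretched_duration(2.0, 1.5))  # duration of a 2 s utterance after stretching
```

With α = 1 the synthesis and analysis shifts coincide and the signal passes through unchanged, which matches the case in which no modification is performed.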
Uniform speech stretching (method A)
In this method, the speech signal is stretched using a constant value of the scaling factor. The input signal is time-extended only when voice is detected by the VAD and no vowel prolongation is observed by the vowel detector. Although the input signal as a whole is time-scaled non-uniformly (silence and speech passages are modified using different stretching factors), the speech itself is modified uniformly, with a single value of the stretching factor. The stretching procedure is controlled by the parameter αd, the desired scaling factor, which has to be specified in advance (in the experiment it was set to a constant value of 1.5). Additionally, redundancy in the input signal is eliminated by replacing intervals of silence longer than 200 ms with the time-expanded speech.
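The per-frame decision of method A can be sketched as follows. This is one possible reading of the description above, not the authors' implementation: the `Frame` fields stand in for the outputs of the VAD and vowel detectors, and dropping the part of a silence that exceeds 200 ms is an assumed interpretation of replacing long silences with time-expanded speech.

```python
from dataclasses import dataclass
from typing import Optional

SILENCE_LIMIT_S = 0.2  # silences longer than 200 ms are treated as redundant


@dataclass
class Frame:
    is_speech: bool              # VAD decision (hypothetical detector output)
    is_prolonged_vowel: bool     # vowel-prolongation flag (hypothetical detector output)
    silence_run_s: float = 0.0   # length of the current silence run, in seconds


def method_a_factor(frame: Frame, alpha_d: float = 1.5) -> Optional[float]:
    """Return the scale factor for a frame, or None if the frame is dropped."""
    if frame.is_speech:
        # Prolonged vowels are redundant, so they are passed through unstretched.
        return 1.0 if frame.is_prolonged_vowel else alpha_d
    if frame.silence_run_s > SILENCE_LIMIT_S:
        # Assumed reading: the excess silence is removed to make room for stretched speech.
        return None
    # Short pauses are kept unmodified.
    return 1.0


print(method_a_factor(Frame(is_speech=True, is_prolonged_vowel=False)))
```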
Non-uniform TSM controlled by a scaling factor (method B)
The second method of time-expansion of speech signals follows the same principles as method A, but additionally the scaling factor may vary depending on the input signal content and the ROS. The values of α(t) used in this method are presented in Table . The symbol αd stands for the value of the scaling factor specified by the user. The rate of speech is estimated from the positions of the vowels. Speech with a rate higher than or equal to 5.16 vowels/s is marked as fast. This threshold was selected on the basis of manually labeled utterance rates (slow, fast), for which the average value and standard deviation of the ROS obtained from all the recordings in the database were calculated [14]. Whenever fast speech is detected, higher values of α(t) are used, and for speech with a normal rate these values are reduced. Two additional restrictions were added to ensure that vowels are stretched using values of α(t) not lower than those for consonants: for slow speech, if the calculated value of α(t) is lower than 1, it is set to 1, and for fast speech, if the calculated value of α(t) is lower than 1.1, it is set to 1.1. It is also important that α(t) is not defined for all silence passages, because some of them are removed to ensure synchronization between the input and output signals.
Values of the scaling factor used in method B
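The speech-rate classification and the two lower-bound restrictions of method B can be sketched as below. Only the 5.16 vowels/s threshold and the bounds 1 and 1.1 come from the text; the function names are hypothetical, and the candidate value of α(t) is passed in as an argument because the per-class values of the table are not reproduced here.

```python
FAST_ROS_THRESHOLD = 5.16  # vowels/s, threshold quoted in the text


def is_fast(ros_vowels_per_s: float) -> bool:
    """Speech at or above the threshold is marked as fast."""
    return ros_vowels_per_s >= FAST_ROS_THRESHOLD


def apply_lower_bound(alpha: float, ros_vowels_per_s: float) -> float:
    """Raise a calculated scale factor to the minimum allowed for the detected rate."""
    floor = 1.1 if is_fast(ros_vowels_per_s) else 1.0
    return max(alpha, floor)


print(apply_lower_bound(0.9, 4.0))   # slow speech: raised to the lower bound
print(apply_lower_bound(1.05, 5.5))  # fast speech: raised to the lower bound
```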
Non-uniform TSM controlled by estimated ROS (method C)
The two methods presented above use the scaling factor as the control value of the output speech rate. This is not a natural way of specifying the speech rate since, for the same value of the scaling factor, the stretched speech will have different rates depending on the rate of the input speech. Therefore, the authors propose a method in which a desired rate of speech, ROSd, is used as the control value of the time expansion. The value of ROSd is specified by the user, and the stretched speech has a rate close to ROSd. The signal processing procedure is the same as in algorithm B, but the current value of the scaling factor is calculated separately for every signal frame, according to equations (2)-(4):
where αcons(t) is the scaling factor for the current frame if a consonant was detected, αvowel(t) is the scaling factor for the current frame if a vowel was detected, Δt is the time interval used for the ROS estimation (1.5 s in the experiment), Δtvowel is the vowel duration within the estimation interval, and η is the ratio between the scaling factor used for vowels and the scaling factor used for consonants (1.7 in the experiment).
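To make the roles of these symbols concrete, the sketch below computes a pair of per-frame factors from them. It is not a transcription of equations (2)-(4); it rests on three assumptions stated here and in the code: the overall stretch over the estimation interval equals ROS(t)/ROSd (never below 1), the vowel factor is η times the consonant factor, and the duration-weighted mean of the two factors over Δt equals the overall stretch. The function and argument names are hypothetical.

```python
def method_c_factors(ros, ros_d, dt=1.5, dt_vowel=0.0, eta=1.7):
    """Return (alpha_cons, alpha_vowel) under the assumptions listed in the lead-in."""
    if ros <= 0 or ros_d <= 0:
        raise ValueError("speech rates must be positive")
    if not 0.0 <= dt_vowel <= dt:
        raise ValueError("vowel duration must lie within the estimation interval")
    # Assumption 1: overall stretch needed to bring ROS(t) down to ROSd (expansion only).
    target = max(ros / ros_d, 1.0)
    # Assumptions 2 and 3: alpha_vowel = eta * alpha_cons and
    # dt_vowel * alpha_vowel + (dt - dt_vowel) * alpha_cons = target * dt.
    alpha_cons = target * dt / (eta * dt_vowel + (dt - dt_vowel))
    alpha_vowel = eta * alpha_cons
    return alpha_cons, alpha_vowel


cons, vowel = method_c_factors(ros=4.5, ros_d=3.0, dt_vowel=0.6)
print(cons, vowel)
```

The sketch applies no further lower bound to the individual factors, so restrictions such as those described for method B would have to be added separately.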
Examples of speech stretching obtained using the proposed methods are shown in Figure . In these examples, αd was set to 1.5 (for methods A and B) and ROSd was equal to 3 vowels/s. The same values were used during the speech intelligibility tests described in Section 3. The choice of αd was based on the results presented by Nejime et al. [7], who showed that the highest improvement in speech comprehension could be achieved for αd equal to 1.5. The chosen value of ROSd ensures the same ROS expansion as for methods A and B.
Example of speech stretched using methods A-C.
An analysis of Figure shows that the smallest difference between the duration of the stretched and the original speech is obtained using method B, while method A produces the largest differences in utterance duration. If the detectors find little redundancy in the input signal, the signal can be time-expanded for a relatively long time and the delay between the input and the output signal can grow without bound. To prevent this, the TSM procedure is turned off once the difference between the input and output signals exceeds Δtoff, and the unmodified speech is sent to the output. The value of Δtoff can be defined by the user; this threshold is exceeded much more often for method A than for methods B and C. During the experiments, Δtoff was set to 3 seconds.
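The cut-off mechanism can be illustrated with a simple lag tracker. This is a minimal sketch under stated assumptions rather than the authors' procedure: each input frame carries a scale factor (or `None` for a redundant frame that is removed), the lag grows by (α − 1) times the frame length, and stretching resumes as soon as the lag falls back to the threshold or below, a rule the text does not specify. The function name and the frame representation are hypothetical.

```python
DT_OFF_S = 3.0  # cut-off threshold used in the experiments


def run_with_cutoff(factors, frame_s, dt_off=DT_OFF_S):
    """Apply per-frame scale factors while keeping track of the output lag.

    factors: scale factor per input frame, or None for a removed redundant frame.
    Returns the list of factors actually applied and the final lag in seconds.
    """
    lag = 0.0
    applied = []
    for alpha in factors:
        if alpha is None:
            if lag > 0.0:
                # The redundant frame is replaced by buffered speech, reducing the lag.
                lag = max(0.0, lag - frame_s)
                applied.append(None)
            else:
                # Nothing is buffered, so the frame is passed through unchanged.
                applied.append(1.0)
            continue
        if lag > dt_off:
            # TSM is switched off: unmodified speech goes to the output.
            alpha = 1.0
        lag += (alpha - 1.0) * frame_s
        applied.append(alpha)
    return applied, lag


applied, lag = run_with_cutoff([1.5] * 10, frame_s=1.0)
print(applied, lag)
```

With a constant factor of 1.5 and no redundancy to remove, the lag grows steadily until it exceeds the threshold, after which every remaining frame is passed through unmodified.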