Among the tasks in which machines may simulate human behavior, automatic speech recognition (ASR) has been foremost since the advent of computers. A device that understands speech, however, needs a calculating machine capable of making complex decisions and, in practice, one that can function as rapidly as humans. As a result, ASR has grown rapidly in comparison with other areas of pattern recognition (PR), based in large part on the power of computers to capture a relevant signal and transform it into pertinent information, i.e., to recognize patterns in the speech signal [1].
There has been growing interest in the objective assessment of acoustic variables in dysphonic patients in recent years. Voice pathology detection and classification is a topic that has interested the international voice community [2]. Most of the work in this field concentrates on automatically diagnosing the pathology using digital signal processing methods [3-6]. For example, in the study of Dibazar et al. [3], five different vocal pathologies were detected using mel-frequency cepstral coefficients (MFCC) and fundamental frequencies. In their study, the highest recognition sensitivity was achieved for vocal fold paralysis, while the lowest sensitivity was for hyperfunctional voice disorders.
In another study, by Dubuisson et al. [4], discrimination of normal and pathological voices was analyzed using the correlation between different types of acoustic descriptors. The descriptors were of two types: temporal and spectral. Temporal descriptors included energy, mean, standard deviation, and zero crossing, while spectral descriptors included delta, mean, several moments, spectral decrease, roll-off, etc. Using the spectral decrease and the first spectral tristimulus on the Bark scale, together with their correlation, led to correct classification rates of 94.7% for pathological voices and 89.5% for normal voices with sustained vowels. That is, 94.7% of the pathological voices were classified as pathological and 89.5% of the normal voices were classified as normal. The higher rate for pathological voices is attributed to the use of features inspired by voice pathology assessment and to the much smaller number of normal voice samples compared with pathological samples. The performance of linear predictive coding (LPC)-based spectral analysis in discriminating pathological voices of speakers affected by vocal fold edema was evaluated in the study of Costa et al. [5]. Their results show that the LPC-based cepstral method is a good way to represent changes in the vocal tract caused by vocal fold edema. In another study, glottal noise estimated from voice signals using short-term cepstral analysis was used to discriminate pathological voices from normal voices [6]. It was found that the glottal noise estimate correlated less with jitter and shimmer for pathological voices and not significantly for normal voices. Miyamoto et al. [7] investigated pose-robust audio-visual speech recognition for a person with an articulation disorder resulting from cerebral palsy. Their system used multiple acoustic frames (MAF) as the acoustic feature and an active appearance model (AAM) as the visual feature. The proposed audio-visual method improved the word recognition rate by 7.3% at a signal-to-noise ratio of 5 dB compared with the audio-only method.
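The per-class rates quoted above for Dubuisson et al. [4] are computed separately for each true class, which is why they can differ from a single overall accuracy when the classes are imbalanced. The following is a minimal Python sketch of that calculation. The function name `per_class_rates` and the toy labels are hypothetical and are not data from any cited study.

```python
from collections import Counter


def per_class_rates(true_labels, predicted_labels):
    """Return {class: fraction of that class's samples labelled correctly}."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical toy data (NOT from any cited study):
# 10 pathological samples and only 4 normal samples.
true_labels = ["pathological"] * 10 + ["normal"] * 4
predicted_labels = (
    ["pathological"] * 9 + ["normal"] * 1      # 9 of 10 pathological correct
    + ["normal"] * 3 + ["pathological"] * 1    # 3 of 4 normal correct
)

rates = per_class_rates(true_labels, predicted_labels)
overall = sum(
    t == p for t, p in zip(true_labels, predicted_labels)
) / len(true_labels)

print(rates)    # per-class correct classification rates
print(overall)  # single pooled accuracy over all samples
```

With so few normal samples in the toy data, one misclassified normal voice lowers the normal-class rate far more than one misclassified pathological voice lowers the pathological-class rate.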
All of the above-mentioned studies used only the sustained vowel /a/ as input. A comparative evaluation of sustained vowels and continuous speech for acoustically discriminating pathological voices was carried out by Parsa et al. [8]. Their experiment found that classification of voice pathology was easier for sustained vowels than for continuous speech. On the other hand, in the study of Middag et al. [9], automated intelligibility assessment was performed with context-dependent phonological features using 50 consonant-vowel-consonant (CVC) words from speakers with six different types of voice disorder. Their evaluation revealed that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100. Automatic recognition of Polish words was carried out in the study of Wielgat et al. [10], where the input was speech from Polish children with voice disorders. They used MFCC and human factor cepstral coefficients (HFCC) to recognize words with confusing phonemes. In their experiment, HFCC performed better than MFCC. In a recent work, an automatic recognition system was used to evaluate speech disorders in patients with head and neck cancer, all of whom were native German speakers [11]. Intelligibility was quantified by speech recognition on recordings of a standard text read by laryngectomized patients with cancer of the larynx or hypopharynx and by patients who had suffered from oral cancer. Both patient groups showed significantly lower word recognition rates than an age-matched control group.
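The root mean squared error used above to compare perceived and computed intelligibility can be sketched in a few lines of Python. The function name `rmse` and the four score pairs are illustrative assumptions on a 0 to 100 scale. They are not values from Middag et al. [9].

```python
import math


def rmse(perceived, computed):
    """Root mean squared error between two equal-length score lists."""
    if len(perceived) != len(computed) or not perceived:
        raise ValueError("score lists must be non-empty and of equal length")
    squared_errors = [(p - c) ** 2 for p, c in zip(perceived, computed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical intelligibility scores on a 0-100 scale (illustration only).
perceived = [80.0, 55.0, 30.0, 92.0]   # e.g. listener ratings
computed = [74.0, 63.0, 30.0, 82.0]    # e.g. system estimates

score = rmse(perceived, computed)
print(round(score, 2))
```

Because the errors are squared before averaging, a few large discrepancies raise the score more than many small ones do.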
In the current study, a conventional ASR system was used to evaluate patients with six different types of voice disorder speaking Arabic digits. MFCC and a Gaussian mixture model (GMM)/hidden Markov model (HMM) were used as the features and the classifier, respectively. The recognition results were analyzed by type of disease. The effect of clinical management on performance was also investigated by comparing results before and after management in a subset of the disordered voices. Finally, the first four formants (F1, F2, F3, and F4) of the vowel /a/ present in the digits were extracted to compare formant distortion across the different voice disorders. We believe that this is the first work to examine the accuracy of ASR on Arabic speech from people with pathological voices. In addition, the comparison of ASR performance before and after management (surgical or medical) may be of interest to other language communities now investigating ASR as a means of examining treatment outcomes.
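As background to the MFCC features mentioned above, the mel-scale frequency warping at the core of MFCC can be sketched as follows. This sketch uses one commonly cited formula for the mel scale. The number of filters and the frequency range are illustrative assumptions, since this section does not state the front-end settings used in the study. The helper names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (one commonly used formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# Assumed settings for illustration only: 20 filters between 0 and 8000 Hz.
centers = mel_filter_centers(num_filters=20, low_hz=0.0, high_hz=8000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]

print([round(c) for c in centers])
```

The filters are evenly spaced in mels but increasingly far apart in Hz, so the representation has finer resolution at low frequencies than at high frequencies.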