|Home | About | Journals | Submit | Contact Us | Français|
This paper presents a study on intonation patterns in Cantonese aphasia speech. The speech materials were spontaneous discourse recorded from seven pairs of aphasic and unimpaired speakers. Hidden Markov model based forced alignment was applied to obtain syllable-level time alignments. The pitch level of each syllable was determined and normalized according to the given tone identity of the syllable. Linear regression of the normalized pitch levels was performed to describe the intonation patterns of sentences. It was found that aphasic speech has a higher percentage of sentences with increasing pitch. This trend was found to be more prominent in story-telling than descriptive discourses.
Aphasia is a collection of language disorders caused by damage to specific regions of a human brain. It most commonly affects language and speech production of the patient. Language impairments associated with aphasia can be present in phonology, morphology, semantics, syntax, and pragmatics , . In terms of speech production, aphasia is often manifested by phonemic errors, articulation distortions, and speech disfluencies , . It has been suggested that speech features and paralinguistic symptoms in addition to linguistic characteristics are useful in differentiating disordered speech from unimpaired speech and typifying syndromes of aphasia , , , . Particularly, acoustical analysis is believed to have good potential in providing objective evidences to support medical diagnosis. In the past, acoustical analysis of aphasic speech was often limited to a small number of short utterances that were carefully elicited by oral reading. The acoustical measurements were made manually from speech waveforms and spectrograms. There is a strong demand for effective deployment of state-of-the-art spoken language technologies to support large-scale study.
Cantonese is a major Chinese dialect spoken by tens of millions of people in Southern China, Hong Kong and Macau areas. Prosody of Cantonese speech has attracted tremendous interests from speech researchers, mainly because Cantonese has a complicated tone system. However, efforts on disordered Cantonese speech are rarely seen. Recently a large database of aphasia Cantonese speech, namely Cantonese AphasiaBank, has been developed as part of an inter-disciplinary project on multi-modal and multi-level analysis of Chinese aphasic discourse . In , a hidden Markov model (HMM) based continuous speech recognition system of Cantonese was used for automatic time alignment of parallel aphasia speech and unimpaired speech. Statistical analysis of supra-segmental duration measurements showed that aphasia utterances contain more frequent insertion of pauses and significantly shorter speech chunks than unimpaired speech. A follow-up investigation revealed that the duration difference between content words and function words in aphasia speech is consistently smaller than that in unimpaired speech.
The present study is focused on another important dimension of prosody, namely intonation. We investigate the sentence-level intonation patterns and make comparison between aphasia and unimpaired speech. The intonation pattern is represented by linear regression over normalized pitch levels of all syllables in the sentence. It is found that aphasia utterances tend to have more sentences with pitch increasing.
Like Mandarin or Putonghua, Cantonese is a monosyllabic and tonal language. Each Chinese character is pronounced as a syllable that can be divided into an Initial part and a Final part. Each Cantonese syllable carries a specific lexical tone. When the tone changes, the syllable may correspond to a different character with different meaning. There are 20 Initials, 53 Finals and 9 tones in Cantonese . Figure 1 depicts the pitch patterns of the Cantonese lexical tones. A syllable carrying entering tone must end with an occlusive code /p/, /t/ or /k/, which makes the syllable duration substantially shorter than a syllable with non-entering tone. In terms of pitch level, each of the entering tones coincides with one of the non-entering tones. Therefore in most transcription systems, e.g., the LSHK's JyutPing system , only six distinct tone labels are used. They are numbered from 1 to 6 as shown in Figure 1.
Cantonese AphasiaBank is a multi-modal and multi-level corpus developed jointly by University of Central Florida and University of Hong Kong . The corpus contains speech recordings from 120 unimpaired and 96 aphasic individuals, all being native speakers of Cantonese. The speakers are divided into three age groups (18 - 39; 40 - 59; 60 and above) and two education levels. The recordings were elicited as spontaneous oral narratives on 8 different tasks. The tasks include single picture description, sequential picture description, procedure discourse, story telling and personal monologue, with details given as in Table 1.
All recordings were manually transcribed using the Child Language ANalyses computer program (CLAN; ). CLAN has been widely used in studies of conversational interaction, language learning, and language disorders. The CLAN editor supports functions such as annotation of morpho-syntactic features on linguistic transcripts, adding codes to files, linking transcripts to media, and automatic computation of different indices (e.g., MLU or TTR). The orthographic transcription of each recording was prepared in the form of a sequence of Chinese characters. Non-speech events, such as fillers or unintelligible jargons and speech segments, are also tagged with specific codes.
Seven pairs of age- and gender-matched aphasia and unimpaired speakers were selected from the Cantonese Aphasi-aBank. They include five pairs of male and two pairs of female. Among the seven aphasia speakers, five are anomic aphasia and the other two are transcortical sensory. They are considered to be fluent speakers from speech pathology perspective. There were 112 recordings from the 14 speakers. The average lengths of recordings are 48.6 second and 48.2 second for aphasia and unimpaired speakers respectively.
Automatic time alignment at sub-syllable level was performed on the speech materials using HMM forced alignment technique. The acoustic feature vector was composed of 13 MFCC parameters, and their first and second derivatives, which were computed with a short-time frame size of 25 msec and frame shift of 10 msec. Gender-dependent acoustic models were trained with about 20, 000 utterances from the CUSENT database . Each Initial unit was modeled with a 3-state HMM and each Final was modeled with a 5-state HMM. Each HMM state was represented by a Gaussian mixture model with 16 mixture components.
For each speech recording, the Chinese characters in the corresponding orthographic transcription were first converted to Cantonese syllables. Each syllable was then further decomposed into an Initial and a Final. The process of forced alignment of speech is similar to that of automatic speech recognition. The given sequence of Initial and Final units are used to define a highly restricted search space that contains only one legitimate HMM sequence. The goal of search is to determine the model-level time alignment that best matches the input utterance . The time boundaries of syllables and words could be derived straightforwardly from the time alignments of Initials and Finals. An example of forced alignment result is shown as in Figure 2.
Sentence boundaries were marked manually by inspecting both the orthographic transcriptions and audio recordings. This is an important step because intonation patterns would be established for individual sentences. In principle, a sentence is a sequence of words which conveys a complete thought. However, in this study, some sentences do not complete in meaning. The speaker might start another sentence and switch to a new topic. The speaker might also quote what was supposed to be spoken by the character in the story. In these cases, it was considered that a new sentence was started.
Conjunctions like ‘because’, ‘then’, ‘therefore’, ‘finally’ hint that a new sentence starts. Normally a sentence contains one verb only, except the case when the verb subcategorizes for a clause, e.g., ‘discover’, ‘decide’, ‘believe’, ‘estimate’, etc. In some of the utterances, especially those by aphasic speakers, long pauses or breaks are included. We do not consider these pauses or breaks as sentence boundaries if the speech fragments before and after are expressing the same idea and involve only one verb.
As a result, a total of 1,072 sentences were marked in the 112 recordings. Among them, 419 sentences were from aphasia speakers. The average sentence lengths are 10 syllables and 13 syllables for aphasia and unimpaired speech, respectively.
Pitch values were estimated every 10 msec from the speech signals. We used three existing pitch estimation algorithms: PEFAC , , RAPT ,  and YIN . These algorithms are based on different principles and therefore considered to be complementary to each other. Given the three estimated pitch values, the average of the two closest values was taken as the pitch of the respective frame. Knowing the time alignment of a syllable, we used the median of the pitch values from all voiced frames to represent the pitch level of the syllable.
The pitch level of a syllable is determined by the underlying lexical tone. For example, a syllable carrying Tone 1 is expected to have a higher pitch than one with Tone 3. In continuous speech, the pitch of a syllable depends also on the sentential context. In order to separate the long-term trend of pitch change from the local tone variation, a method of pitch normalization was proposed by Li  in an effort on Cantonese prosody modeling. The basic idea is to convert the raw pitch value of a syllable into a value that is independent of the syllable's tone identity. This can be done by multiplying a conversion ratio. For example, the pitch level of a syllable carrying Tone 1 can be converted into an equivalent pitch of Tone 3 by multiplying a ratio of 0.8.
where f is the original pitch of a syllable, c is the converted pitch, k = 1, 2, …, 6 is the tone identify of the syllable, and Rk is respective conversion ratio. The ratios were obtained by Li  through statistical analysis.
After the conversion, we could consider as if all syllables were carrying the same tone. In other words, the normalized pitch values are independent of the tone identities. The change of these pitch values would reflect the global trend over the sentence.
The intonation pattern of each sentence was obtained by performing linear regression on the normalized pitch values of all syllables in the sentence. The objective is to find the best fit by minimizing the sum of squared errors. Each intonation pattern is described by two parameters: the slope m and the offset b. A positive value of m indicates that the sentence has a tendency of increasing pitch, and a negative value hints a tendency of decreasing pitch.
An example is shown as in Figure 3. The original and the normalized pitch values are marked in red cross and green square respectively. There are two sentences shown in the Figure, one with a declining pitch and the other with an increasing trend.
For all recordings of each subject, the number of sentences with positive slopes is counted. We compare the percentages of positive-sloped sentences between aphasia and unimpaired speakers on different tasks and sentence lengths.
Table 3 shows the percentages of sentences with positive slopes of intonation in aphasia and unimpaired speech utterances. Out of the 419 manually marked sentences from aphasia speakers, 108 (25.78%) show increasing pitch. In unimpaired speech, there are only 69 out of 653 sentences (10.57%) having positive slopes. The significant difference could be partly due to the fact that sentences in aphasic speech are relatively short and short sentences do not show declining pitch in general. Nevertheless, if we focus on only short sentences (8 syllables or shorter), the percentage of sentences with positive slopes in aphasia speech is much higher than that in unimpaired speech.
Table 4 shows the percentages of positive-slope intonations for story-telling and descriptive discourses separately. It can be seen that sentences in story-telling discourses more frequently have rising-pitch intonation.
By analyzing the speech materials from a limited number of speakers, it was observed that aphasia speech contains more sentences with rising-pitch intonation than unimpaired speech. The difference is quite significant. The method of linear regression on tone-normalized pitch values is effective to show the contrast, despite that the derived intonation patterns may be over-simplified and inaccurate. We plan to extend the study by including more speakers and carrying out statistical significance test. Our ultimate goal is to apply intonation analysis to objective assessment of aphasia speech.
This study is supported in part by a grant funded by the National Institutes of Health to Anthony Pak-Hin Kong (PI) and Sam-Po Law (Co-I) (project number: NIH-R01-DC010398).