In this pilot study we sought to determine the reliability and validity of collecting speech and voice acoustical data via telephone transmission for possible future use in large clinical trials. Simultaneous recordings of each participant's speech and voice were made at the point of participation, the local recording (LR), and over a telephone line using a dedicated in-line computerized interactive voice recording system, the remote recording (RR). All voice recordings were made from our laboratory telephone located in Groton, Connecticut to the RR system located in Madison, Wisconsin. All data points were compared on a measure-by-measure basis between the LR and RR recordings. The results suggest that both measures of frequency excursion and of speech motor timing are reliably captured over the telephone. Results are discussed in terms of specific acoustic measures that may be useful and accurately measured via telephone transmission, for examining disease severity and pharmacological intervention for use in a large-scale clinical trial.
A variety of voice acoustical measures have been shown to be potentially sensitive markers of disease severity and therapeutic treatment response in both motor and affective central nervous system (CNS) disorders. These measures, related to either the motor timing aspects of speech or prosodic control of speech, have been employed in the study of diseases ranging from Parkinson's disease (Goberman, Coelho, & Robb, 2002; Goberman & Coelho, 2002a; Goberman & Coelho, 2002b) to Major Depressive Disorder (Ellgring, & Scherer, 1996; Nilsonne, 1987; Nilsonne, Sundberg, Ternstrom, & Askenfelt, 1988; Flint, Black, Campbell-Taylor, Gailey, & Levinton, 1992; Stassen, Kuny, & Hell, 1998; Teasdale, Fogarty, & Williams, 1980). Although recording human voice samples within multi-centre clinical drug trials may at times be desirable, it is a logistically difficult task. Creating ideal recording conditions (e.g., the use of a soundproof booth, on-location recording equipment, and well-trained staff) is typically prohibitive in both cost and time for a study including large numbers of participants across multiple study sites. In an effort to move the field of communication sciences and disorders towards evidence-based practice, and participation in randomized clinical trials, it is first necessary to determine suitable avenues of large-scale data collection. Therefore, this study was conducted to determine the methodological challenges, sensitivity of measurement, and the overall appropriateness of using the telephone to collect potentially useful and disease/disorder-specific voice acoustical data.
The use of the telephone is not a new avenue for the collection of objective speech data. Telephone data collection has a diverse background in documenting lexical stress (van Kuijk, & Boves, 1999) and interview success and perceptual ratings of speech (Sharf, & Lehman, 1984), as well as speech timing in affective disorders such as schizophrenia and mania (Friedman, & Sanders, 1992). However, none of these authors sought to investigate and report on the integrity of the voice signal, and the possibility of degradation and/or alteration of the signal data as a result of telephone transmission.
Due to the wide range of measures that may be useful for the acoustic analysis of disordered speech (see for example Kent, Weismer, Kent, Vorperian, & Duffy, 1999), we have limited our focus to two types of variables (i.e., time-dependent and frequency-dependent) that have previously been shown to be clinically meaningful for understanding several dopamine-related CNS disorders in which we hold an interest. In this initial study, we focused our efforts on two time-dependent measures, speaking rate and voice onset time, and two frequency-related measures, pitch excursion as measured by fundamental frequency (F0) variation and vocal range (Hall, & Yairi, 1997; Kent et al., 1999; Nishio, & Niimi, 2001; Turner, Tjaden, & Weismer, 1995; Weismer, Laures, Jeng, & Kent, 2000).
Speaking rate is a measure of the overall integrity of the speech motor control system (Hall, & Yairi, 1997). This general measure of speech production ability holds promise as a clinical outcome measure for the study of CNS disorders associated with motoric output deficits. It has been documented as a sensitive and differentiating measure in a wide variety of CNS diseases accompanied by dysarthria. For example, persons with amyotrophic lateral sclerosis (ALS) have been shown to produce slower speaking rates whether they are speaking under habitual or fast speaking conditions (Weismer et al., 2000). Additionally, Kleinow, Smith, and Ramig (2001) have demonstrated increased variability of speech rate in persons with idiopathic Parkinson's disease. In a broad study of speaking rate in Japanese persons with dysarthria, Nishio and Niimi (2001) found overall speaking rate to be significantly reduced in all dysarthria types studied, including flaccid, spastic, ataxic, hypokinetic, mixed and unilateral upper motor neuron type. These authors concluded that the simple measure of speaking rate is a sensitive measure of disordered speech motor performance in all of the most common clinically recognized types of dysarthria.
Changes in speaking rates have also been associated with changes in emotional tone in individuals with affective disorders such as depression and negative symptom schizophrenia. In general, speaking rates in these patients have been shown to be slower overall and well-linked to depressive symptomatology (Flint et al., 1992; Teasdale et al., 1980) and negative symptom complex schizophrenia (Alpert, Rosenberg, Pouget, & Shaw, 2000; Alpert, Kotsaftis, & Pouget, 1997; Puschel, Stassen, Bomben, Scharfetter, & Hell, 1997; Shaw, Dong, Lim, Faustman, Pouget, & Alpert, 1999). Decreased pause durations and increased speech rates have been concurrently seen in patients undergoing pharmacological treatment for depression. Findings indicate that signs of mood improvement are associated with concurrent speech changes and that the speech changes may be more indicative of early therapeutic response than other clinical measures (Stassen et al., 1998). Generally, dynamic changes in speaking rate and speech pause time measurements, seen as increased speaking rate and shortened pauses, have mirrored clinical improvements in depression and are strongly correlated with symptomatology and improvement (Ellgring, & Scherer, 1996; Flint, Black, Campbell-Taylor, & Gailey, 1993; Hardy, Jouvent, & Widlocher, 1984; Nilsonne, 1987, 1988; Stassen et al., 1998; Alpert, Pouget, & Silva, 2001).
Voice onset time (VOT) also represents an acoustic measure related to motoric strategy and timing that is associated with changing articulation in the mechanisms of speech production (Borden, Harris, & Raphael, 1994, p. 131; Kent, Weismer et al., 1999, p. 159). This measure is the time from the release of a stop consonant to the onset of periodic vibration in the following vowel. Previous research regarding VOT changes in persons with PD is inconsistent. Forrest, Weismer, and Turner (1989) found increased mean VOT, which they attributed to deficits in the coordination and initiation of movement in the laryngeal musculature. Others have found decreased VOT, which was attributed to rigidity of the laryngeal musculature causing a reduction in vocal fold opening (Flint et al., 1992, 1993; Weismer, 1984). Voice onset time involves the initiation and coordination of voicing in speech and is therefore of interest because it has been suggested that early changes in hypokinetic dysarthria associated with PD are likely to initially involve laryngeal control (Duffy, 1995; Zwirner, & Barnes, 1992). VOT can be measured reliably. In our laboratory, we have found that measures of onset, offset, and total VOT duration, made by independent judges, produce very high inter-rater and intra-rater reliability (r>.95), and similar results have been observed in other laboratories (r=.96; Flint et al., 1993). To date, only one study has investigated VOT in depression and found shorter VOT in depressed participants relative to control subjects; however, these durations were not significantly different from those of persons with PD (Flint et al., 1993). It does not appear that this topic has been investigated in schizophrenia.
Fundamental frequency (F0) variability involves tracking the dynamic fluctuations of vocal inflection over time. The use of automatic computerized software to analyse acoustic properties of speech, such as vocal range and F0 contour, has become increasingly commonplace (Duffy, 1995; Kent et al., 1999; Kent, Vorperian, & Duffy, 1999) and is thought to be reliable in persons with dysarthria and in normal controls when task content is well controlled (Kent, Vorperian et al., 1999). Previous research in persons with depression (cf. Alpert et al., 2001; Talavera, Saiz-Ruiz, & Garcia-Toro, 1994), negative symptom schizophrenia (cf. Alpert et al., 2000), and PD (cf. Flint et al., 1992; Metter, & Hanson, 1986) suggests that F0 variability in these disorders is reduced significantly when compared to typical healthy control populations. These consistent findings of decreased variability of F0 across these disorders have led some researchers to hypothesize that a common underlying mesolimbic-nigrostriatal dopamine depletion may be the root cause (Flint et al., 1993; Talavera et al., 1994).
Two of the authors (M. S. C. and N. R.) served as the subjects in this comparative study and each participant served as his or her own control for each recording session. These participants were a 34-year-old male and a 23-year-old female, both native speakers of American English, in good physical health with no history of neurological deficit, learning disability, or psychiatric disorder. They were assessed by a licensed speech-language pathologist (SLP) as being 100% intelligible in conversation with appropriate speech articulation. The participants were also subjectively judged by the SLP to be free of overt voice disorders or dysarthria.
For each telephone call, the participants systematically executed a series of speech and vocal exercises that constitute a fundamental portion of our acoustics laboratory generic speech and voice collection protocol (see Appendix A). This abridged protocol was specifically designed to extract the four variables of interest (i.e., speaking rate, pitch variability, pitch range, and VOT), and consisted of the following: automatic speech (e.g., counting from 1–40), vocal range with the vowel /a/, diadochokinesis (DDK; /pa pa pa pa pa /), and standard paragraph reading (e.g., The Grandfather Passage). During each recording, participants spoke naturally into the telephone (i.e., the remote recording; RR), with the local recording (LR) microphone placed approximately 10 cm from the speaker's mouth. The participants spoke at a comfortable volume in all recording situations.
Two telephone calls were made from our laboratory in Groton, CT to an interactive voice response (IVR) recording system located in Madison, WI. This system has been adapted, validated and used for the remote capture and collection of data for individuals with depression to study drug treatment response in clinical trials (Mundt, 1997). The IVR is programmed to prompt participants to answer questions and perform tasks specific to a data collection protocol. Responses and performances were digitally recorded at a sampling rate of 8 kHz. The RR sound files, in the form of .wav files, were then sent via e-mail back to the laboratory in CT.
Local recordings were made in a quiet environment using a unidirectional high quality microphone designed for vocal recording (Shure SM-58). Analog to digital conversion was accomplished through the XLR front panel microphone input of the CSL-4400 audio capture device, sampled at 44,100 Hz with 16-bit quantization and saved in .wav file format following the completion of each of the four tasks. Files of interest were then subjected to acoustical analyses.
All acoustic analyses were performed using two speech and voice analysis programs: the freely available Praat (Boersma, & Weenink, 2003) and the commercially available Computerized Speech Laboratory (CSL) main program (Kay Elemetrics, 2001). Because the speech and voice recordings were already in a digital format, the signals were opened and directly analysed regarding the measures of interest (see below).
Overall speaking rate in this investigation refers to the total amount of time needed by the speaker to complete each of two tasks: the passage-reading task and the counting task. This time is calculated by first marking the onset of visible energy in the acoustic spectrogram of the digitized signal. The completion of the reading passage or counting task is then marked by the offset of visible spectral energy. Overall speaking rate is then calculated as the absolute time between the two energy markers in the reading task or counting task.
Pitch variability was measured as a coefficient of variation of F0 within a running speech sample. This was calculated as the standard deviation of F0 divided by the mean of F0 for the segment of interest. The periodicity-to-pitch autocorrelation function of Praat, with a pitch floor of 75 Hz and an automatic time step, was used to derive the mean and standard deviation of F0 for both tasks. The tasks of interest for this measure were the standardized reading passage and the automatic speech task.
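The coefficient of variation described above can be sketched in a few lines of Python. This is an illustration only, not the analysis script used in the study: the function name, the convention that unvoiced frames are marked as 0 or None, the choice of the sample (n - 1) standard deviation, and the F0 values are all assumptions made for the example.

```python
from statistics import mean, stdev

def f0_coefficient_of_variation(f0_hz):
    """Return SD(F0) / mean(F0) over the voiced frames of an F0 track.

    Frames marked 0 or None are treated as unvoiced and skipped
    (an assumed convention, not one stated in the paper).
    """
    voiced = [f for f in f0_hz if f]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    # stdev() is the sample standard deviation (n - 1 denominator);
    # the paper does not say which form was used.
    return stdev(voiced) / mean(voiced)

# Hypothetical F0 track in Hz, not data from this study.
track = [110.0, 0.0, 120.0, 130.0, None, 120.0]
cov = f0_coefficient_of_variation(track)
print(f"F0 coefficient of variation: {cov:.4f}")
```

Because the measure is a ratio of two quantities in Hertz, it is unitless, which makes it convenient for comparing speakers with different habitual pitch.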
Vocal range is a measure of vocal control related to the ability to regulate laryngeal mechanisms regarding frequency, perceptually identified as vocal pitch. In this task participants were instructed to take a deep breath, say /a/ in their typical speaking voice, and gradually raise their pitch until they could not make it any higher. Then participants were instructed to take a deep breath, say /a/ in their typical speaking voice, and gradually lower their pitch until they could not make it any lower. A visible F0 contour was extracted from the Praat periodicity-to-pitch autocorrelation function for the aforementioned exercises. Vocal range was then calculated as the difference between the highest and lowest analyzable periodic frequency measured in Hertz.
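As a minimal sketch of the range calculation just described, the following Python function takes the F0 contours from the upward and downward glides and returns the difference between the highest and lowest analyzable values. The function name, the marking of unanalyzable frames as 0 or None, and the contour values are hypothetical.

```python
def vocal_range_hz(*f0_contours):
    """Return highest minus lowest analyzable F0 (Hz) across the given contours.

    Frames marked 0 or None are treated as unanalyzable and ignored
    (an assumed convention for this example).
    """
    voiced = [f for contour in f0_contours for f in contour if f]
    if not voiced:
        raise ValueError("no analyzable F0 frames")
    return max(voiced) - min(voiced)

# Hypothetical contours in Hz for the two glides, not data from this study.
glide_up = [120.0, 180.0, 260.0, 0.0, 410.0]
glide_down = [118.0, 100.0, 85.0, None, 79.0]
range_hz = vocal_range_hz(glide_up, glide_down)
print(f"Vocal range: {range_hz:.1f} Hz")
```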
VOT was measured from the stop consonant vowel combination production /pa/ from the DDK task. Measurement followed typical acoustical conventions with markers measuring from the point of initial spectrographic evidence of the plosive burst of the stop consonant, to the point of periodic vocal fold vibration signaling the onset of the vowel. The three middle productions of the /pa/ consonant vowel pair were used for each participant.
Overall speaking rates, listed below in Table I, were quite consistently measured for both the automatic speech task and the standard reading passage. The measurements from the two recording conditions differed, at most, by less than two tenths of a per cent in either experimental task. Across all comparisons, the percentage difference in measured duration between conditions ranged from 0% to .1%.
Pitch variability as measured by F0 coefficient of variation (COV), listed in Table II below, was very consistently measured between the LR and RR recording conditions. In three out of the four comparisons, the measurement difference was less than two hundredths of a per cent, with the final measurement at less than five hundredths of a per cent different.
Pitch range in Hertz was also reliably measured in both the LR and RR conditions. Absolute pitch range differences between recording conditions were .32 Hz for the male and 2.21 Hz for the female with an overall percent difference of one tenth of a per cent and four tenths of a per cent respectively (see Table III).
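The LR versus RR comparisons reported in this section reduce to an absolute difference and a percentage difference per measure. The Python sketch below shows one way to compute them. The paper does not state which value served as the denominator, so taking the local recording as the reference is an assumption, and the input values are hypothetical rather than taken from Table III.

```python
def lr_rr_difference(lr_value, rr_value):
    """Return (absolute difference, percentage difference) between recordings.

    The percentage is expressed relative to the local recording (LR),
    which is an assumed convention.
    """
    if lr_value == 0:
        raise ValueError("LR value must be non-zero")
    abs_diff = abs(rr_value - lr_value)
    return abs_diff, 100.0 * abs_diff / abs(lr_value)

# Hypothetical pitch ranges in Hz for one speaker.
abs_diff, pct_diff = lr_rr_difference(lr_value=320.0, rr_value=320.32)
print(f"difference: {abs_diff:.2f} Hz ({pct_diff:.2f}%)")
```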
Measures of VOT that were compared across recording locations included the total VOT for matched productions of the consonant-vowel syllable /pa/. Three such matched measurements were analysed per subject between recording conditions. Difference scores, measured in milliseconds, ranged from a low of one ms to a high of six ms. These results demonstrate that VOT may be measured with a relatively high degree of precision between local and remote recording measurements.
Twenty-five per cent of the measures, ten measures overall, were randomly selected from the LR and RR conditions and reanalysed for the purpose of determining both the inter-rater reliability and intra-rater reliability of the measurement procedures. Pearson product-moment correlations were used to determine the association between the original measurements and the repeated measurements. Correlation coefficients were r² = .95 for within-examiner reliability and r² = .92 for between-rater reliability.
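For readers who want to reproduce this kind of reliability check, the Pearson product-moment correlation between original and repeated measurements can be computed as in the Python sketch below. The function is a plain textbook implementation and the measurement values are hypothetical; they are not the ten measures reanalysed in this study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical VOT measurements in ms: first pass and repeated pass.
original = [62.0, 58.0, 71.0, 65.0, 60.0]
repeated = [63.0, 57.0, 70.0, 66.0, 61.0]
r = pearson_r(original, repeated)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

Squaring r gives the proportion of variance in the repeated measurements accounted for by the original ones.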
We sought to determine the feasibility of using human speech samples that were collected and recorded over the telephone for voice acoustical research. The results should be considered preliminary, due to the limitations of a two subject descriptive design, the small sample size of recordings, and the use of only a single healthy male and a single healthy female. Still, these initial findings demonstrate that quite reliable and accurate measurements can be made whether the acoustical signal is collected over the telephone line or under more ideal laboratory conditions. This was true for measures of speaking rate, pitch variability, pitch range, and VOT, all of which are considered to be useful metrics in the study of affective and dysarthric speech and voice profiles.
The current study is based on an earlier comparative study conducted by our laboratory, in which international telephone calls were remotely and locally recorded as the subject performed vocal and speech exercises similar to those used in this study (Cannizzaro, & Snyder, 2003). Recordings were made from three locations in Western Europe and yielded promising results. While this initial foray served as the inspiration for the current study, we sought to remove certain design flaws, such as inconsistencies in telephones and the modes of transmission (e.g., satellite, differing phone companies), by employing a single telephone with known transmission characteristics in a more controlled setting.
The measurement accuracy of the two time-dependent measures of interest, speaking rate and voice onset time, is reliant on accurate representation and visualization of the speech signal as a complex waveform and as a spectrogram. As long as visible landmarks, such as a plosive burst or the initiation of periodic voicing, are clearly apparent during the analysis, the measurement should be as accurate as the skill level of the person performing the analysis. In all of the comparisons made in this experiment, there were no difficulties in performing these analyses, as the signals of interest were, without exception, clearly discernible from the surrounding signal. This is readily apparent from the close agreement of the measurements we were able to make across a variety of tasks in both the RR and LR conditions. These findings are further supported by the high levels of inter-rater and intra-rater reliability found for both conditions across the tasks of interest.
The accurate quantification of the two frequency-related measures of interest, pitch variation and pitch range in Hertz, is a little more complicated given the known characteristics of the telephone. Telephone frequency response is generally reported to be within the range of 300 Hz to 3000 Hz, acting essentially as a band pass filter rejecting higher and lower frequency transmission (Kent, & Read, 2002, p. 78). In an initial test of the IVR system, we found an accurate frequency portrayal of the telephone transmission from 250 Hz to approximately 3200 Hz, with a steep roll off in frequency response outside of these parameters (Cannizzaro, & Snyder, 2003). Given these qualities, the measurement of frequencies beyond these boundaries, which for this study means those below 300 Hz, is dependent on the algorithms used in the analysis. For this reason, the periodicity-to-pitch autocorrelation function in Praat was used, as it is not dependent on the actual presence of energy at the fundamental frequency to make this measurement. Autocorrelation uses the repeat length of the waveform to determine the fundamental period (P. Boersma, personal communication, November 12, 2003). That is, a complex periodic wave with equally spaced harmonics repeats at the rate of the highest frequency that divides all of its harmonics. Since the voice fundamental frequency is generated by the periodic vibration of the vocal folds, and each harmonic is a simple multiple of that fundamental, the waveform repeats at the same rate as the fundamental frequency of the voice, even when the fundamental itself has been filtered out. As evidenced by our findings, even a wide range of frequency excursion can be accurately assessed in telephone recordings.
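The repeat-length argument can be illustrated with a short Python sketch. It synthesizes a tone that contains only the third through sixth harmonics of 100 Hz (300 to 600 Hz, inside the nominal telephone band) with no energy at 100 Hz itself, and then searches for the lag at which the waveform best matches a shifted copy of itself. This is a deliberately simplified normalized autocorrelation, not Praat's algorithm; the function name, the 500 Hz ceiling, and the synthetic signal are assumptions made for the example, while the 75 Hz floor and the 8 kHz sampling rate follow the values given in the text.

```python
import math

def estimate_f0_autocorr(samples, sample_rate, floor_hz=75.0, ceiling_hz=500.0):
    """Estimate F0 as sample_rate / (lag with the highest normalized autocorrelation)."""
    min_lag = max(1, int(sample_rate / ceiling_hz))
    max_lag = int(sample_rate / floor_hz)
    if len(samples) <= max_lag:
        raise ValueError("signal too short for the requested pitch floor")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        head = samples[:-lag]   # signal without its last `lag` samples
        tail = samples[lag:]    # the same signal shifted by `lag` samples
        dot = sum(a * b for a, b in zip(head, tail))
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        score = dot / norm if norm > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

SAMPLE_RATE = 8000   # Hz, the RR sampling rate reported in the text
TRUE_F0 = 100.0      # Hz, hypothetical voice fundamental

# Harmonics 3 to 6 only: the fundamental and second harmonic are absent,
# loosely mimicking the low-frequency cut of a telephone channel.
signal = [
    sum(math.sin(2 * math.pi * k * TRUE_F0 * n / SAMPLE_RATE) for k in (3, 4, 5, 6))
    for n in range(800)
]

estimate = estimate_f0_autocorr(signal, SAMPLE_RATE)
print(f"Estimated F0: {estimate:.1f} Hz")
```

Although the lowest spectral component in this signal is 300 Hz, the waveform still repeats every 10 ms, so the best-matching lag corresponds to 100 Hz.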
While our initial findings are quite promising regarding the use of the telephone to collect acoustic data, there are some cautions that should be addressed. We chose our measures carefully based on our initial testing of the IVR system with speech and noise signals. No measures of intensity were performed due to the difficulty in accurately calibrating and equating the decibel level of the mouth to telephone and mouth to recorder levels. At the present time, this leaves out a number of measures that may hold potential value for a future clinical trial using voice acoustic data. Also, the clarity of our voice recordings was consistent as only one telephone was used in this experiment. The use of lower quality telephones and other modes of transmission (e.g., cellular or satellite) would also need to be explored before they could be deemed suitable for use in large-scale data collection.
Overall, we were encouraged by these early findings. Future investigations of larger groups of subjects with suspected and diagnosed CNS disorders, as well as healthy controls, will ultimately help establish the feasibility of telephone recording of acoustic data. Similar designs that utilize simultaneous LR and RR recordings in future studies can only serve to further our understanding of how to employ these techniques in large-scale randomized clinical trials.
Supported, in part, by NIH grant R43MH68950 to J. C. M. The authors would also like to acknowledge Ben Barth, without whose technical support the use of the IVR system would not have been possible.
Telephone speech collection module
You wished to know all about my grandfather. Well, he is nearly ninety-three years old. He dresses himself in an ancient black frock coat, usually minus several buttons; yet he still thinks as swiftly as ever. A long, flowing beard clings to his chin, giving those who observe him a pronounced feeling of the utmost respect. When he speaks his voice is just a bit cracked and quivers a trifle. Twice each day he plays skillfully and with zest upon our small organ. Except in the winter when the ooze or snow or ice prevents, he slowly takes a short walk in the open air each day. We have often urged him to walk more and smoke less, but he always answers, “Banana Oil!” Grandfather likes to be modern in his language.