Assessment of prosody is important for diagnosis and remediation of speech and language disorders, for diagnosis of neurological conditions, and for foreign language instruction. Current assessment is largely auditory-perceptual, which has obvious drawbacks; however, automation of assessment faces numerous obstacles. We propose methods for automatically assessing production of lexical stress, focus, phrasing, pragmatic style, and vocal affect. Speech was analyzed from children in six tasks designed to elicit specific prosodic contrasts. The methods involve dynamic and global features, using spectral, fundamental frequency, and temporal information. The automatically computed scores were validated against mean scores from judges who, in all but one task, listened to “prosodic minimal pairs” of recordings, each pair containing two utterances from the same child with approximately the same phonemic material but differing on a specific prosodic dimension, such as stress. The judges identified the prosodic categories of the two utterances and rated the strength of their contrast. For almost all tasks, we found that the automated scores correlated with the mean scores approximately as well as the judges' individual scores. Real-time scores assigned during examination – as is fairly typical in speech assessment – correlated with the mean scores substantially less well than did the automated scores.
Assessment of prosody is important for diagnosis and remediation of speech and language disorders, for diagnosis of certain neurological conditions, as well as for foreign language instruction. This importance stems from the role prosody plays in speech intelligibility and comprehensibility (e.g., Wingfield, 1984; Silverman et al., 1993) and in social acceptance (e.g., McCann & Peppé, 2003; Peppé et al., 2006, 2007), and from prosodic deficits in certain neurological conditions (e.g., stroke: House, Rowe, & Standen, 1987; or Parkinson's Disease: Darley, Aronson, & Brown, 1969a, b; Le Dorze et al., 1998).
Current assessment of speech, including that of prosody, is largely auditory-perceptual. As noted by Kent (1996; also see Kreiman & Gerratt, 1997), the reliability and validity of auditory-perceptual methods are often lower than desirable as the result of multiple factors, such as the difficulty of judging one aspect of speech without interference from other aspects (e.g., nasality judgments in the presence of varying degrees of hoarseness); the intrinsic multidimensional nature of certain judgment categories that require judges to weigh these dimensions (e.g., naturalness); the paucity of reference standards; and the difficulty of setting up truly “blind” judgment situations. Many of these issues are not specific to perceptual judgment of speech; in fact, there is an extensive body of literature on biases and inconsistencies in perceptual judgment going back several decades (e.g., Tversky, 1969).
Presumably, these issues would not be faced by automated (“instrumental”) speech assessment methods. Nevertheless, automated methods have largely been confined to analysis of voice features that are only marginally relevant for prosody (e.g., the Multi-Dimensional Voice Program™, or MDVP; Kay Elemetrics, 1993). What obstacles are standing in the way of developing reliable automated prosody assessment methods?
An obstacle for any method, whether automated or auditory-perceptual, consists of the multiple “levels” of prosodic variability; for each level, one must distinguish deviations from some – generally ill-defined – norm that are acceptable (e.g., due to speaking style) from deviations that are not (e.g., due to disease). One level of variability involves dialect. Work by Grabe and colleagues (Grabe, Post, Nolan, and Farrar, 2000; Grabe and Post, 2002), for example, has shown that prosodic differences between dialects of British English can be as large as prosodic differences between languages. Not surprisingly, dialect differences have been shown to create problems for auditory-perceptual assessment (e.g., assessment of speech naturalness; Mackey, Finn, & Ingham, 1997). Another level of variability involves differences between speakers that are not obviously due to dialect. These differences are sufficiently systematic to provide useful cues for speaker identification (e.g., Sönmez et al., 1998; Adami et al., 2003), and may involve several speaker characteristics, the most obvious of which are gender, age, and social class (e.g., Milroy & Milroy, 1978). At a third level, there is systematic within-speaker variability due to task demands (e.g., Hirschberg, 1995, 2000), social context (e.g., Ofuka et al., 1994, 2000), and emotional state (e.g., Scherer, 2003).
In addition to these forms of variability that are due to systematic factors, there is also variability that is apparently random. For example, in data reported by van Santen and Hirschberg (1994), in a highly confined task in which the speaker had to utter sentences of the type “Now I know <target word>” in a prosodically consistent – and carefully monitored – manner, the initial boundary tone was found to have a range of 30 Hz while the final boundary tone had a range of only 3 Hz; the typical pitch range of these utterances was less than 100 Hz. This form of variability may be less of a challenge for auditory-perceptual methods because these methods may benefit from the human speech perception system's ability to ignore communicatively irrelevant features of speech, but it clearly presents a challenge for automated methods.
There are additional aspects of prosody that pose complications and that are unrelated to variability. One is the intrinsically relative nature of many prosodic cues. For example, durational cues for the lexical stress status of a syllable are not in the form of some absolute duration but of how long or short the duration is compared to what can be expected based on the speaking rate, the segmental makeup of the syllable, and the location of the syllable in the word and phrase (van Santen, 1992; van Santen & Shih, 2000). A second aspect of prosody that poses complications for automated methods is that prosodic contrasts typically involve multiple acoustic features. To continue with the same example, lexical stress is expressed by a combination of duration, pitch, energy, spectral balance (e.g., Klatt, 1976; van Santen, 1992; Sluijter & van Heuven, 1996; van Santen & Niu, 2002; Miao et al., 2006), and additional features due to effects at the glottal level that are not fully captured by these basic acoustic features (e.g., glottal closing and opening slope; Marasek, 1996). Thus, there could be speaker-dependent trade-offs in terms of the relative strengths of these features, necessitating a fundamentally multidimensional approach to automated prosody assessment.
Both the intrinsic relativity of individual prosodic features and the trade-offs between them pose challenges for automated prosody assessment methods. These challenges seem to be fundamentally different from those posed by, for example, vowel production assessment. In a given phonemic context, vowel formant frequencies must lie within fairly narrow ranges in order for the vowel to be perceived as intended. While prosodic categories cannot even remotely be characterized by “point templates” in some conventional acoustic space, point template approaches for phonemic categories used by typical speech recognition systems clearly work rather well, via vector-based acoustic models in conjunction with some initial normalization step (e.g., cepstral normalization, vocal tract length normalization) and making basic allowances for coarticulation (e.g., by using phonemic-context dependent acoustic models).
Despite these obstacles for automated methods, there are obvious drawbacks to relying on auditory-perceptual methods and important advantages to using automated methods. First, we already mentioned validity and reliability issues of auditory-perceptual methods. Second, given the poor access many individuals have to services from speech-language pathologists or foreign language teachers, reliance on computerized prosody remediation or instruction is likely to increase. To be truly useful, such computerized systems should have the capability to provide accurate feedback; this, in turn, requires accurate automated assessment. Third, despite the exquisite sensitivity of human hearing, it is plausible that diagnostically relevant acoustic markers exist whose detection exceeds human capabilities. Detection of some promising markers, such as statistical features of pause durations in the course of a 5-minute speech recording (e.g., Roark et al., 2007), might be cognitively too demanding. Others could have too low an SNR to be humanly detectable. The acoustic feature of jitter, for example, has potential for early detection of certain vocal fold anomalies (e.g., Zhang & Jiang, 2008; Murry & Doherty, 1980) but has fairly high perceptual thresholds, certainly with non-stationary pitch (e.g., Cardozo & Ritsma, 1968). In other words, exclusive reliance on auditory-perceptual procedures hampers the discovery of new diagnostic markers.
We thus conclude that automated measures of assessment of prosody production are much needed, but that constructing such measures faces specific challenges. In our approach, we use a combination of the following design principles that help us address these challenges. (i) Highly constraining elicitation methods (e.g., repeating a particular word with a specific stress pattern) to reduce unwanted prosodic variability due to, for example, contextual effects on speaking style. (ii) A “prosodic minimal pairs” design for all but one task, in which the list of items used to elicit speech consists of randomized pairs that are identical except for the prosodic contrast (e.g., the third item on the list is TAUveeb and the seventh tauVEEB, with capitals indicating word stress). This serves to reduce the impact of confounding speaker characteristics, such as pitch range or vocal tract length; each speaker is his or her own control. (iii) Robust acoustic features that can handle, for example, mispronunciations and pitch tracking errors. (iv) Measures that consist of weighted combinations of multiple, maximally independent acoustic features, thereby allowing speakers to differ in the relative degrees to which they use these features. (v) Measures that include both global and dynamic features. Prosodic contrasts such as word stress are marked by pitch dynamics, while contrasts such as vocal affect can perhaps be characterized by global statistics. (vi) Parameter-poor (and even parameter-free) techniques in which the algorithms themselves either are based on established facts about prosody (e.g., the phrase-final lengthening phenomenon) or are developed in exploratory analyses of a separate data set whose characteristics are quite different from the main data in terms of speakers (e.g., adults and children ages 11-65 vs. children 4-7).
In conjunction with (ii) and (iii), this serves to maximize the portability of the measures in order to minimize the influences of recording conditions, SNR, sample characteristics, and other factors that may be difficult to control across laboratories or clinics. Parameter-rich systems may lack such portability, since the parameter estimates may depend on the idiosyncrasies of the acoustic recording conditions and the training samples.
The goal of this paper is to describe the construction and validation of a number of prosody measures based on these design principles. The speech data were collected as part of an ongoing study of the production and interpretation of prosody in autism, whose aim is to detail prosodic difficulties in autism spectrum disorder, developmental language disorder, and typical development, in the age range of 4-8 years. The current paper focuses on methodology. Elsewhere we have presented preliminary findings on between-group differences on the suite of measures (Tucker Prud'hommeaux et al., 2008; van Santen, Tucker Prud'hommeaux, et al., 2007, 2008).
The tasks used for elicitation include variants and modifications of tasks in the PEPS-C (“Profiling Elements of Prosodic Systems - Children”; Peppé & McCann, 2003) paradigm, as well as of two tasks developed by Paul et al. (2005). In our study of prosody in autism, children complete tasks designed to test both their interpretation and their production of prosody. The present paper considers, in detail, the results of only the tasks related to the production of prosody. Findings are reviewed in four categories: Stress Related Tasks, Phrasing Task, Affect Task, and Pragmatic Style Task.
(i) Stress Related Tasks (Lexical Stress Task, Emphatic Stress Task, and Focus Task). In the Lexical Stress Task (based on Paul et al., 2005; also see Dollaghan & Campbell, 1998), the computer plays a recording of a two-syllable nonsense word such as tauveeb, playfully accompanied by a picture of a thus-named “Martian” life-form. The child's task is to repeat after the recorded voice with the same stress pattern. In the Emphatic Stress Task (Plant & Öster, 1986; Shriberg et al., 2001; Shriberg et al., 2006) the child has to repeat an utterance in which one word is emphasized (“BOB may go home”, “Bob MAY go home”, etc., with capitals indicating emphasis). (Note that in Plant & Öster's procedure, the subject does not repeat an utterance but reads appropriately annotated text aloud.) Finally, in the Focus Task (adapted from the PEPS-C), a recorded voice incorrectly describes a picture of a brightly colored animal with a soccer ball, using the wrong word either for the animal or for the color. The child must correct the computer by putting contrastive stress on the incorrect label. For example, if the voice describes a picture of a black cow as a blue cow, the child responds “No, the BLACK cow has the ball”. (ii) In the Phrasing Task, also adapted from the PEPS-C, the child has to indicate with an appropriate phrase break whether a picture represents three objects (e.g., fruit, salad, and milk) or two objects (e.g., fruit-salad and milk). (iii) In the Affect Task, the child has to say a fixed phrase (“It doesn't matter”) using the emotion (happiness, sadness, anger, or fear) that corresponds to the affect expressed by a picture of a stylized face. These pictures were obtained by digitally modifying Ekman faces (Ekman and Friesen, 1976) into line drawings that only retain features relevant for facial affect. This task is loosely based on a receptive vocal affect task (Berk, Doehring, & Bryans, 1983).
(iv) In the Pragmatic Style Task (based on Paul et al., 2005), the child views a photo of either a baby or an adult and must speak to that person using the appropriate prosody.
These tasks were administered as follows. Each task started with four training trials during which the examiner corrected and, if necessary, modeled the response. In addition, each task was immediately preceded by the corresponding receptive task, thereby providing some degree of additional, “implicit” training. Thus, significant efforts were made to ensure that the child understood the task requirements. During the administration of a task, the child and the examiner were seated at adjacent sides of a low, child-appropriate, rectangular table. The examiner interacted with a laptop computer to control the experiment; the screen of the laptop was not visible to the child. A software package (based on the CSLU Toolkit; Sutton et al., 1998) presented the auditory stimuli via high-quality loudspeakers and visual stimuli on a touch screen, processed the child's touch screen responses, and enabled the examiner to control the sequence of stimuli (specifically, to repeat up to two times items missed due to the child's attention drifting) and to indicate whether the child's response was correct or incorrect. The binary (i.e., correct vs. incorrect) scores obtained in this manner will be called real-time scores. The software also recorded the child's vocal responses and stored these in a data structure containing all events and speech recordings, appropriately synchronized and time-stamped. This data structure, including the recorded speech, can be re-accessed by the examiner after completion of a task to verify the scoring judgments made during the task. We call these scores verified real-time scores. The only real-time scores we will consider in this study are these verified real-time scores.
Participants were 15 children who met criteria for Autism Spectrum Disorder (ASD); 13 children who were considered “typical” in terms of several criteria, discussed below (Typical Development, or TD group); and 15 children who met some but not all criteria for inclusion in the ASD group. There are two reasons for using this heterogeneous sample. First, for the purpose of developing and validating the automated measures, a wide range of performance levels are needed; restriction of the study to the TD group could have created serious restriction of range problems because most children in the TD group perform well on these tasks. Second, it is important to establish that the experimental procedures can be applied to, and generate meaningful scores for, children with neurodevelopmental disorders.
All participants were verbal and had a mean length of utterance of at least 4. The ASD group was “high functioning” (HFA), with full scale IQ of at least 70 (Gillberg & Ehlers, 1998; Siegel, Minshew & Goldstein, 1996), as well as “verbally fluent.” Diagnoses made use of DSM-IV-TR criteria (DSM-IV-TR, 2000) in conjunction with results of the “revised algorithm” derived from the ADOS (Lord et al., 2000; Gotham et al., 2007, 2008), administered and scored by trained, certified clinicians, and results of the Social Communication Questionnaire (SCQ; Berument et al., 1999).
Exclusion criteria for all groups were the presence of any known brain lesion or neurological condition, including cerebral palsy, tuberous sclerosis, and intraventricular hemorrhage in prematurity; the presence of any “hard” neurological sign (e.g., ataxia); orofacial abnormalities (e.g., cleft palate); bilinguality; severe intelligibility impairment; gross sensory or motor impairments; and identified mental retardation. For the TD group, exclusion criteria included, in addition, a history of psychiatric disturbance (e.g., ADHD, Anxiety Disorder), or any family member with ASD or Developmental Language Disorder.
In addition to the child speakers, we also obtained recordings of a superset of the items in the above six tasks from 36 professional and amateur actors, both male and female, ranging in age from 11 to 65. We have used these recordings as a development set for exploratory purposes. Results for one of these individuals, for the affect task, were reported by Klabbers et al. (2007).
Two sets of human listener-judged scores were collected: (1) verified real-time scores, produced by clinicians during examination and later verified; and (2) listener judgments, collected using a web-based randomized perceptual experiment, produced by six naïve, non-clinical individuals.
The four individuals generating the verified real-time scores had clinical or research experience with neurodevelopmental disorders and were extensively trained in administering and scoring the speech elicitation tasks (section 2.1). Their backgrounds were in speech language pathology, phonetics, and clinical child psychology. All were native speakers of American English. Since they tested different subsets of children, no mean examiner scores could be computed. Each of these individuals verified his or her own scores, off-line.
Six individuals participated in the listening experiment portion of the study. They reported normal or corrected-to-normal hearing, were native speakers of American English, and had no clinical or research experience with neurodevelopmental disorders. They were unfamiliar with the study goals and did not know the children. They were paid for their participation. They processed the same sets of stimuli, thus enabling us to compute mean listener scores.
All listener tasks were computer controlled and took place in quiet offices using headsets. In the Affect Task, a listener heard a recording of a child response, decided which of four alternative emotions (happy, sad, angry, or fearful) best described it, and indicated a certainty level on a 0-2 scale (0=possibly, 1=probably, 2=certainly). The four alternatives were indicated using the same set of stylized faces that was used in the task performed by the children, with descriptive verbal labels added. The locations of the four alternatives remained constant throughout the experiment. Listeners also had the option to indicate “I don't know” without selecting a degree of certainty. This option was included in response to listener requests for such an option during a pilot study.
In the remaining five tasks, a listener heard a recording of two child responses forming a prosodic minimal pair (e.g., the two versions of tauveeb, from the same child), decided which member of the pair corresponded to which response alternative, and indicated one of two degrees of certainty (“probably”, “certainly”). Listeners also had the option to indicate “I don't know”. Scores generated by these procedures will be called minimal pairs based scores. While the certainty ratings add to task complexity, listeners expressed the need to be able to express the sharp differences in confidence they experienced.
The listeners had substantial difficulty reliably distinguishing between angry and happy and between sad and fearful. Based on this, we scored the listener responses as follows. We scored a response (to an individual utterance) as positive when the listener chose happy or angry, with full confidence scored as 1, some confidence as 0.66, and a little confidence as 0.33; sad or fearful responses were scored likewise but with corresponding negative values. A response of “I don't know” was scored as 0. We note that this scoring scheme does not take into account what the utterances' intended affects are. Thus, correlations between judges based on these per-utterance scores do not reflect agreement about the appropriateness of the utterances. Recall that each speaker produced 4 versions of a single sentence: happy, sad, angry, and fearful. To obtain per-speaker scores, we combined the scores assigned to that speaker's utterances by adding the mean listener scores for the utterances whose targets were angry or happy and subtracting from this sum the sum of the scores for the utterances whose targets were sad or fearful. This difference score, in contrast to the per-utterance score, does reflect whether a child makes an appropriate contrast between the affects.
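The per-utterance and per-speaker affect scoring described above can be sketched as follows; this is a minimal illustration, and the function names and the "dont_know" label are ours, not part of the original protocol:

```python
def utterance_score(choice, certainty):
    """Score one listener response to one affect utterance.

    choice: "happy", "angry", "sad", "fearful", or "dont_know".
    certainty: 0 (possibly), 1 (probably), or 2 (certainly);
    ignored for "dont_know".
    """
    if choice == "dont_know":
        return 0.0
    magnitude = {0: 0.33, 1: 0.66, 2: 1.0}[certainty]
    # happy/angry responses count as positive, sad/fearful as negative
    sign = 1.0 if choice in ("happy", "angry") else -1.0
    return sign * magnitude


def speaker_score(mean_scores_by_target):
    """Difference score for one speaker's four target affects.

    mean_scores_by_target maps each target affect to the mean listener
    score for the utterance whose target was that affect.
    """
    return (mean_scores_by_target["happy"] + mean_scores_by_target["angry"]
            - mean_scores_by_target["sad"] - mean_scores_by_target["fearful"])
```

A child whose happy/angry utterances sound clearly positive and whose sad/fearful utterances sound clearly negative thus receives a per-speaker score near +4, while a child who makes no contrast receives a score near 0.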
In these tasks, in which the listener judged minimal pairs of utterances, we assign a positive score to a listener response if the listener's choice indicates that a child made the correct contrast between the two utterances and a negative score otherwise. A score of zero indicates that neither utterance was produced with convincingly appropriate prosody. Thus, scores could have the values of 1, 0.5, 0, -0.5, and -1. Per-speaker scores were computed by averaging these scores.
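This minimal-pair scoring can be sketched as follows (function names are ours; `correct=None` stands in for the “I don't know” response):

```python
def pair_score(correct, certainty):
    """Score one listener judgment of a prosodic minimal pair.

    correct: True if the listener's choice implies the child made the
    correct contrast, False if it implies the opposite, None for
    "I don't know" (neither utterance convincingly appropriate).
    certainty: "certainly" or "probably"; ignored when correct is None.
    """
    if correct is None:
        return 0.0
    magnitude = 1.0 if certainty == "certainly" else 0.5
    return magnitude if correct else -magnitude


def per_speaker_score(pair_scores):
    """Average the pair scores assigned to one speaker."""
    return sum(pair_scores) / len(pair_scores)
```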
During an examination, the clinician – who always knew the target prosody – decided whether each response was correct or incorrect, without a certainty rating (correct = 1, incorrect = 0). During verification, the clinician listened to each response and decided whether to accept or reject the decision made during the examination. To make these scores comparable to the minimal pairs based scores, we collected for each minimal pair the two corresponding verified real-time scores, and mapped these onto 1.0, 0, and -1.0 depending on whether both were positive, one was positive, or neither was positive.
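The mapping from a pair of verified real-time scores onto the minimal-pairs scale can be written as follows (a sketch; the function name is ours):

```python
def realtime_pair_score(score_a, score_b):
    """Map the two binary verified real-time scores (1 = correct,
    0 = incorrect) of a minimal pair onto the minimal-pairs scale:
    both correct -> 1.0, one correct -> 0.0, neither correct -> -1.0.
    """
    n_positive = int(score_a > 0) + int(score_b > 0)
    return {2: 1.0, 1: 0.0, 0: -1.0}[n_positive]
```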
For each task, the listener data can be represented as an N × 6 table, where the N rows correspond to the utterances collected across all speakers, the 6 columns to the listeners, and the cells to the listener ratings. This per-utterance table can be condensed into a per-speaker table, by combining for each speaker the ratings of the k utterances as discussed in section 2.5. We extend these tables by adding additional columns for mean listener scores (see below) and verified real-time scores (see Section 2.1).
Two types of mean listener scores were computed. The first type of mean listener score was obtained simply by averaging the six listener columns in the data tables. The second score was obtained by applying Principal Components Analysis to the covariance matrix of the six listener columns, and multiplying these columns by the eigenvector associated with the first principal component. This score is thus a weighted mean of the listener scores, with lower weights for listeners whose scores are less correlated with other listeners (van Santen, 1993). Since we found only minimal differences between the results obtained by these two methods, we only report results obtained with the simple average. Consistent with these minimal differences is the fact that, across the six tasks, the first component explained between 63 and 78% of the variance for the per-utterance data and between 76 and 90% of the variance of the per-subject data. There were no systematic differences between listeners in terms of their loadings on the first principal component. In combination with the similarity of the results obtained between weighted mean listener scores and the simple mean listener scores, we conclude that there were no systematically distinct (“bad”) subgroups of listeners or individual outliers and that we can use the simple mean listener scores.
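Both types of mean listener score can be sketched as follows; this is a minimal illustration using NumPy, and the function name is ours:

```python
import numpy as np

def mean_listener_scores(table):
    """table: (N, 6) array of per-item scores, one column per listener.

    Returns (simple_mean, weighted_mean, explained): the plain average
    across listeners; a weighted mean using the eigenvector of the
    listener covariance matrix with the largest eigenvalue; and the
    fraction of variance explained by that first principal component.
    """
    table = np.asarray(table, dtype=float)
    simple = table.mean(axis=1)
    cov = np.cov(table, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    w = eigvecs[:, -1]                       # first principal component
    if w.sum() < 0:                          # resolve sign indeterminacy
        w = -w
    explained = eigvals[-1] / eigvals.sum()
    return simple, table @ w, explained
```

Listeners whose columns correlate less with the others load less on the first component and therefore receive lower weights, which is the property motivating the weighted score.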
Figures 3–8, left panels, show the (product-moment) correlations between the per-utterance listener scores, mean listener scores, and verified real-time scores. The figures show the ranges of the between-listener correlations and of the correlations between the listeners and the mean listener scores; for the latter, we always excluded the individual listener's data when the mean listener score was computed to avoid the obvious inflation that would otherwise result. Figures 3–8, right panels, present the same analyses for per-speaker data. One could propose that, from a practical perspective, it is these per-speaker scores that matter and not the per-utterance scores.
These results show, first, that all scores have significantly positive correlations with each other (p=0.01, one-tailed). Second, the listener scores correlate more with each other (on average, 0.64 for the per-utterance data and 0.73 for the per-speaker data) and with the mean listener scores (0.75 and 0.83, respectively) than do the verified real-time scores (0.48 and 0.56, respectively).
Several factors could be responsible for the latter, including: frame-of-reference effects due to different subgroups of children being scored by different examiners; the examiners being biased by awareness of the broad capabilities of the child they were scoring; and judgments of individual responses being intrinsically more difficult than judgments of minimal pairs.
Except for the affect task, all correlations were larger for the per-speaker scores than for the per-utterance scores. This was most likely due to the reduction in variance by computing per-speaker averages. In the case of the affect scores, the per-utterance and the per-speaker scores are not strictly comparable because the former do not take into account the appropriateness of the utterance whereas the latter do.
A finding not depicted in the figures is that mean listener scores on the tasks other than the affect task were overwhelmingly positive (i.e., indicating that the child made the correct distinction), whether based on per-utterance data or on per-speaker data. Averaged over these tasks, the percentages of positive responses were 88% and 95%, respectively. This makes an important point: even in cases where the listeners were quite uncertain, as a group they were still able to accurately determine the child's intention. This supports the sensitivity of the prosodic minimal pairs-based methods. The verified real-time scores were less often positive (78% and 89% for the per-utterance and per-speaker data, respectively; using a chi-square test, these percentages are significantly smaller at p<0.01 than the corresponding percentages for the minimal pairs based methods). A methodological consequence of the strong positive bias is that the percentage agreement between judges (i.e., agreement measured in terms of the sign of the scores, ignoring the magnitude) is not meaningful, and that the standard measure for interjudge agreement (Cohen's Kappa; Cohen, 1960) is also not meaningful because this measure becomes unstable as the number of cases in the negative/negative cell approaches zero.
In summary, the data support the claim that the mean listener scores can serve as a gold standard for validating the automated scores because of the good agreement of the individual listeners' scores with each other and with the mean listener scores, and because of the sensitivity of these scores to the children's intended prosodic contrasts. The data show that the verified real-time scores are less reliable, and hence should not be used to evaluate the automated scores.
Pre-processing consisted of the following steps: locating and extracting the child responses from the recordings; determining certain syllable, word, or phonetic segment boundaries (depending on the task and measure used; discussed in detail below); and extracting acoustic features, including (i) fundamental frequency, or F0, using the Snack Sound Toolkit (2006), and (ii) amplitude (in dB) in four formant range based passbands (B1(t): 60-1200 Hz, B2(t): 1200-3100 Hz, B3(t): 3100-4000 Hz, B4(t): 4000-8000 Hz). These passbands are based on Lee et al. (1999, Table III), using formant ranges for 5-year olds. The amplitude trajectories B1(t), …, B4(t) are transformed into two new measures: B(t), the amplitude averaged over the four passbands, defined as 0.25*[B1(t)+ …+ B4(t)]; and a spectral balance vector consisting of bj(t), the mean-corrected energies, defined as bj(t)=Bj(t)-B(t), for j=1,…,4.
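The passband transformation just defined can be sketched as follows (the function name is ours):

```python
import numpy as np

def spectral_balance(band_db):
    """band_db: array of shape (4, T) holding the per-frame amplitudes
    (in dB) of the four passbands B1(t)..B4(t).

    Returns B(t), the amplitude averaged over the four passbands, and
    the spectral balance vector b_j(t) = B_j(t) - B(t).
    """
    B = band_db.mean(axis=0)   # B(t) = 0.25 * (B1 + B2 + B3 + B4)
    b = band_db - B            # mean-corrected energies
    return B, b
```

Note that a constant gain change, which is an additive constant in dB, shifts B(t) but leaves the spectral balance vector unchanged.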
Spectral balance, thus defined, has a number of useful properties. (i) It is gain invariant, i.e., multiplication of the input signal by a constant has no effect. To a first order of approximation, this makes these features robust with respect to factors such as the gain setting of the hardware and the proximity of the speaker to the microphone. The same is obviously not true for B(t). However, the prosodic minimal pairs approach can be expected to reduce the effects of these factors since they do not vary substantially within a session and hence approximately have the same values in the two utterances making up a prosodic minimal pair. (ii) Spectral balance should not be confused with spectral tilt defined as, e.g., the slope of a line fitted to the log power spectrum. As we shall see in the analysis of affect in Section 4.8, certain affects are associated with high values of b1 and b4 and low values of b2 and b3, while other affects have the opposite pattern. This difference cannot be captured by spectral tilt. (iii) The advantage of using formant frequency-based passbands is that it reduces effects on these measures of articulatory variability. If nominally the same vowel is pronounced differently in the two utterances making up a prosodic minimal pair, resulting in different formant frequencies, this will have limited effects on the spectral balance vector because of the way we selected the passbands.
Broadly speaking, in stress related tasks the expected contrast involves a change in the alignment of an acoustic trajectory relative to syllables or words. For example, the F0 peaks in TAUveeb and tauVEEB might be expected to occur in the first and second syllable, respectively. Determining alignment, however, is not nearly as simple. A method that considers which syllable contains a pitch peak will not work because F0 peaks can occur in the stressed or post-stressed syllable depending on the segmental make-up of the syllables and the number of syllables in the (left-headed) foot with which the pitch accent is associated (van Santen & Möbius, 2000; van Santen, Klabbers, & Mishra, 2006). An additional problem is the difficulty of determining in which syllable the F0 peak is located, due to the existence of a flat region in the F0 curve, to the presence of small amounts of jitter, and to the “true” F0 peak being located inside an obstruent where pitch cannot be measured. Our method instead uses a “soft” alignment approach in which we (i) compute a difference curve between the two curves extracted from the utterances making up a prosodic minimal pair and (ii) use a one-dimensional ordinal pattern recognition method based on isotonic regression (Barlow et al., 1972) to determine whether the difference curve is closer to the ordinal pattern consistent with a correct response pair than to the ordinal pattern associated with an incorrect response pair. The intuition behind this approach is straightforward: even under considerable variability in alignment, this difference curve has an up-down-up shape (see Figure 1).
We first explain the ordinal pattern detection method in general form. Consider a sequence of values, x=<x1, …, xn>, and associated weights, w=<w1, …, wn>. The x's, for example, may denote fundamental frequency values sampled at 5 ms frame intervals, and the w's per-frame products of amplitude and voicing probability. An ordinal pattern is defined in terms of a sequence of substrings <1,…,i1>, <i1+1,…,i2>, …, <im-1+1,…,n> of the string <1,…,n>, together with directions denoted u (for up) or d (for down) associated with these substrings. We define the fit to an ordinal pattern udu… as follows.
Similarly for a pattern dud…, with appropriate changes in the inequalities. For the special case of Fit(x,w,u), this equation reduces to the standard isotonic regression (Barlow et al., 1972), given by:
Thus, the measure of fit in Eq. (1) alternately applies isotonic regression and antitonic regression (in which the ≤ signs are replaced by ≥ signs) for any selection of turning points, i1,…,im-1, and jointly optimizes the isotonic and antitonic fits over all selections of these turning points.
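For the special d/u case, the fit computation can be sketched with the standard pool-adjacent-violators algorithm (PAVA) for weighted isotonic regression: the decision reduces to comparing the isotonic (u) and antitonic (d) fit errors of the difference curve. This is an illustrative sketch of the single-direction case only, not of the general multi-turning-point optimization; the function names are ours.

```python
import numpy as np

def pava(x, w, increasing=True):
    """Weighted isotonic regression via pool-adjacent-violators.
    Returns the monotone sequence minimizing sum w_i (x_i - y_i)^2."""
    if not increasing:
        return -pava(-np.asarray(x, float), w, True)
    vals, wts, sizes = [], [], []          # merged blocks, left to right
    for xi, wi in zip(np.asarray(x, float), np.asarray(w, float)):
        vals.append(xi); wts.append(wi); sizes.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            # pool the two violating blocks into their weighted mean
            v = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / (wts[-2] + wts[-1])
            w2, s2 = wts[-2] + wts[-1], sizes[-2] + sizes[-1]
            vals, wts, sizes = vals[:-2] + [v], wts[:-2] + [w2], sizes[:-2] + [s2]
    return np.repeat(vals, sizes)

def fit_error(x, w, increasing=True):
    """Weighted SSE of the best monotone fit, i.e., Fit(x, w, u) or Fit(x, w, d)."""
    y = pava(x, w, increasing)
    return float(np.sum(np.asarray(w, float) * (np.asarray(x, float) - y) ** 2))

def du_score(diff, w):
    """Compare d vs. u fits of a difference curve: negative values mean the
    curve is closer to a downward (d) pattern, positive values upward (u)."""
    return fit_error(diff, w, increasing=False) - fit_error(diff, w, increasing=True)
```

The general udu/dud fit would chain these calls over candidate turning points and take the best joint error over all selections.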
The Dynamic Difference method comprises the following steps. Below, sL(t) and sR(t) are the trajectories that, if the correct prosodic contrast is made, are associated with the left aligned item (e.g., TAUveeb) and the right aligned item (e.g., tauVEEB), respectively:
We computed two dynamic difference measures for the Lexical Stress Task (dud/udu based and d/u based), but only the d/u based measure for the Emphatic Stress Task and the Focus Task (for which we analyzed the inter-vowel-center intervals). The key reason is that the pitch movement is confined to the critical word in the first task, but extends to utterance parts subsequent to the critical word pair in the latter tasks; these parts are generally not well-controlled by the child and hence introduce severe phonemic variability (e.g., the insertion of entire words) that undermines the comparability of the utterances making up a prosodic minimal pair. For consistency, we report only results from the d/u based analyses. (For the Lexical Stress Task, results were essentially the same for the dud/udu and d/u based analyses.)
It is well known that word stress and sentence stress are associated with an increase in duration of the stressed syllable or word (Klatt, 1976). We propose a simple measure, given by (L1R2-L2R1)/(L1R2+L2R1), where Li and Ri denote the duration of the i-th syllable or word in the left (L) and right (R) aligned items, respectively.
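In code, this normalized contrast, which is confined to the [-1, 1] interval, is simply the following (function name and unit conventions are ours):

```python
def duration_contrast(l1, l2, r1, r2):
    """Normalized duration contrast (L1R2 - L2R1) / (L1R2 + L2R1).

    l1, l2: durations (e.g., in ms) of the 1st and 2nd syllable or word in
    the left aligned item; r1, r2: the same for the right aligned item.
    Positive values indicate relative lengthening of the first unit in the
    left item and of the second unit in the right item, as expected when
    the stress contrast is correctly produced.
    """
    return (l1 * r2 - l2 * r1) / (l1 * r2 + l2 * r1)
```

For example, a pair in which the stressed unit is twice as long as the unstressed one in both items yields (2·2 − 1·1)/(2·2 + 1·1) = 0.6.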
The Lexical Stress Ratio (LSR; Shriberg et al., 2003) was designed in the context of assessing the ability to convey lexical stress in trochaic words in children with childhood apraxia of speech. This measure is computed as follows (see Shriberg et al., 2003, for details). For a two-syllable word, w, with stress on the first syllable, the vowel regions are determined in each syllable, and pitch (in Hz) and amplitude are computed. Next, for each of the two syllables (i=1,2) the quantities DwiAwi (amplitude area), DwiF0wi (pitch area), and Dwi are computed, where, for the i-th syllable, Dwi is the duration of the vowel (in ms), Awi the average vowel amplitude (in dB), and F0wi the average fundamental frequency (in Hz) in the vowel region. In the next step, the ratios Dw1Aw1/Dw2Aw2, Dw1F0w1/Dw2F0w2, and Dw1/Dw2 are formed. Finally, these three ratios are combined via
where α, β, and γ were determined by Shriberg et al. (2003) by computing the weights of the factor explaining the largest amount of variance, as produced by factor analysis (applied to a data matrix with speakers as rows, the three ratios as columns, and the per-speaker averages of these rows in the cells). The values of the weights were 0.507, 0.490, and 0.303, respectively. To apply this measure to the prosodic minimal pairs setup, we compute for each word pair (L, R), where L (R) denotes the left (right) aligned item, the measure (LSRL-LSRR)/(LSRL+LSRR).
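The LSR-based minimal-pairs measure can be sketched as follows. The combining equation itself is not reproduced above, so the sketch assumes a weighted sum of the three ratios using the reported factor weights; this may differ from the exact formula in Shriberg et al. (2003), and the function names are ours.

```python
def lsr(d, a, f0, weights=(0.507, 0.490, 0.303)):
    """Lexical Stress Ratio for a two-syllable word (sketch).

    d, a, f0: per-syllable pairs of vowel duration (ms), average vowel
    amplitude (dB), and average F0 (Hz). The three first/second-syllable
    ratios (amplitude area, pitch area, duration) are combined here as an
    assumed weighted sum with the factor-analysis weights alpha, beta,
    gamma reported by Shriberg et al. (2003).
    """
    alpha, beta, gamma = weights
    amp_ratio = (d[0] * a[0]) / (d[1] * a[1])   # Dw1Aw1 / Dw2Aw2
    f0_ratio = (d[0] * f0[0]) / (d[1] * f0[1])  # Dw1F0w1 / Dw2F0w2
    dur_ratio = d[0] / d[1]                     # Dw1 / Dw2
    return alpha * amp_ratio + beta * f0_ratio + gamma * dur_ratio

def lsr_contrast(lsr_left, lsr_right):
    """Minimal-pairs measure (LSR_L - LSR_R) / (LSR_L + LSR_R)."""
    return (lsr_left - lsr_right) / (lsr_left + lsr_right)
```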
To illustrate the value of a minimal pairs based analysis for reducing the impact of confounding speaker variables, we correlated the verified real-time scores with pitch, amplitude, and duration measures applied to individual utterances. These analyses were performed for the lexical stress task.
For pitch and amplitude, we applied the d/u based dynamic difference measure, assuming that when the pitch peak is on the first (second) syllable or word the overall trend in the two-syllable sequence is downward (upward). We also measured the ratio of the durations of the first and second syllable, assuming again that stress causes lengthening. A fundamental problem with these measures is, of course, that they are affected not only by speaker variables (e.g., some children with language disorders are very insecure about the correctness of their answer and express this with a rising intonation, a trend that is partialed out in the minimal pairs analysis but not in the current analysis) but also by the segmental structure of the syllables and words. For example, the second syllable in shinaig contains an intrinsically long diphthong; in addition, because this non-word is spoken in isolation, one may expect utterance final lengthening of the second syllable. These and other factors may conspire to make the proposed, individual-utterance based measures unreliable as a way of assessing the correctness and strengths of prosodic contrasts.
We computed two types of correlations. First, we correlated the measures with the prosodic targets, scoring utterances with target stress on the first (second) syllable as -1 (+1). The correlations with the pitch, amplitude, and duration-based measures were 0.642, 0.527, and 0.235, respectively; the multiple regression correlation was 0.688. (Since the dependent variable was binary, this analysis is equivalent to linear discriminant analysis.) These correlations were similar to those between the mean listener scores and the corresponding minimal pairs based measures, although the multiple regression correlation was substantially higher in the latter case (0.79). These results show that, as noted before in Section 3 based on the real-time scores and on the listener scores, the children generally made appropriate prosodic contrasts. The results also show that the proposed features may be more useful for prosody identification purposes than anticipated.
Second, we correlated the measures with the verified real-time scores. For this purpose, we first separated the utterances into two groups in accordance with the stress location of the target prosody. We then reversed the signs of the measures for the group with target stress on the second syllable. We then correlated these measures with the correct (1) vs. incorrect (0) verified real-time scores. For the first syllable stress group, the correlations with the three measures were 0.125, 0.145, and 0.235, while for the second syllable stress group they were 0.338, 0.342, and 0.210. When the two groups were combined, the correlations were 0.291, 0.212, and 0.195. These results suggest that the correct/incorrect decisions correlated rather poorly with the proposed measures.
In summary, the proposed measures were moderately accurate in identifying the target prosody of an utterance. We hypothesize that this is because the children generally conveyed the target prosody clearly and correctly. However, the measures were not able to predict when an examiner would judge an utterance as being correct or incorrect.
A measure mathematically similar to the duration based measure in Section 4.3 was used, comparing the duration of the first word (e.g., “chocolate”) and the duration of the remainder (“cookies and jam”), with the boundary drawn at the start of the second word (“cookies”). In the case of “chocolate, cookies, and jam”, P1 (the duration of the first word in the phrase boundary condition) is relatively long because it includes the full word “chocolate”, with the final syllable being lengthened by the phrase boundary (e.g., Klatt, 1976), and any pause; in the case of “chocolate-cookies and jam”, N1 (the duration of the first word in the no-phrase boundary condition) is relatively short because the word “chocolate” is often contracted (e.g., to “chocl”) and there is neither phrase-final lengthening nor a pause. Since there is no reason why there would be substantial effects of the phrase boundary on the duration of the remainder (“cookies and jam”), P2 and N2 can be expected to be approximately equal and hence P1/P2 to be larger than N1/N2. A measure confined to the [-1,1] interval is given by (P1N2-P2N1)/(P1N2+P2N1).
We spent substantial time exploring effects on pitch, but found no consistent patterns. We surmise that this may be related to the fact that, while each of the prosodic minimal pairs in this task indeed involves the presence or absence of a phrase boundary, the linguistic nature of the boundary varies significantly across the items. In the two-item list context, each of the initial items (“chocolate cookies,” “chocolate ice cream,” and “fruit salad”) has a different sequence of grammatical components (ADJ-N, ADJ-N-N, N-N), each of which is associated with one or more default stress patterns. We also observed that some children emphasized “chocolate” and “fruit” in the two-item list context, perhaps to distinguish the pictured cookies, ice cream, and salad from the sugar cookies, vanilla ice cream, and vegetable salad that they remembered from previous pictures depicting three items. We speculate that these somewhat subtle distinctions between items may further add to individual differences, and hence to the lack of consistent pitch patterns for this task, because some children in this age range (and in the populations studied) may be able to understand and prosodically express these subtler differences while others may not.
Based on analyses of the actor data, we compute the following features. First, we compute the average over the utterance of the fourth amplitude band, B4(t); we surmise that this band captures the “breathiness” of baby-directed speech that we informally observed in the actor data. Second, we compute a robust maximum value of F0 by ordering all frames in the utterance in terms of the weight function defined by the product of amplitude and voicing probability, and finding the frame that has the largest F0 value among the top 10% of the weight-ordered frames. Duration was not a reliable predictor of Style in the actor data and was not included in the evaluation.
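The robust F0 maximum can be sketched as follows (function name and argument conventions are ours):

```python
import numpy as np

def robust_f0_max(f0, amp, vprob, top_frac=0.10):
    """Robust F0 maximum: the largest F0 value among the top fraction of
    frames ranked by the weight amp * voicing probability, so that spurious
    F0 spikes in weak or unvoiced frames cannot dominate the estimate."""
    f0 = np.asarray(f0, float)
    weight = np.asarray(amp, float) * np.asarray(vprob, float)
    k = max(1, int(np.ceil(top_frac * len(f0))))   # at least one frame
    top = np.argsort(weight)[-k:]                  # k best-weighted frames
    return float(f0[top].max())
```

Restricting the maximum to well-weighted frames discards, for example, octave-jump artifacts at voicing onsets, where amplitude and voicing probability are both low.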
Based on the analysis of the actor data, we decided to include the (b2+b3)-(b1+b4) contrast (see Figure 2), which captures a striking pattern whereby the angry and happy affects have an inverted-U-shaped pattern in terms of the four bands while the sad and fearful affects have the inverse pattern. We also included the same robust F0 maximum as well as the mean amplitude (via B1+B2+B3+B4).
The key evaluation criterion is the degree to which the objective measures, separately and in combination, can predict the mean listener scores, as measured by the product-moment correlation. There are many ways in which, for a given task, the multiple acoustic measures can be combined, including neural nets, support vector machines, and other linear or non-linear methods. We decided to use linear regression because of its robustness and its small number of parameters, which may benefit portability. To avoid the positive bias in estimating the multiple regression's R-squared statistic, we use a training/test procedure in which we estimate the regression weights on a subset of the data (training set) and evaluate the multiple correlations on the remaining data (test set). At the same time, in order to reduce restriction-of-range artifacts that could lead to underestimates of the true correlations between observed and predicted mean listener scores, we select training and test sets with a sub-sampling procedure whose aim is to minimize differences in variance between training set, test set, and the overall set. Toward this end, we construct the test and training subsets by rank-ordering the data on the mean listener scores, and creating n pairs of test sets and training sets, (Tei,Tri), where the i-th test set, Tei, contains items i, n+i, 2n+i, …, and the corresponding training set, Tri, the remaining items. Typically, we use n=10. This results in the test sets and training sets having approximately the same variances as the entire data sample. We report correlations between automated scores and mean listener scores by computing the median of these correlations over multiple selections of test sets.
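The rank-ordered, interleaved construction of the n test/training pairs can be sketched as follows (function name is ours):

```python
def interleaved_splits(scores, n=10):
    """Build n (test, train) index pairs by rank-ordering items on the mean
    listener score and assigning every n-th ranked item to the i-th test
    set, so each split roughly preserves the full sample's variance."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    splits = []
    for i in range(n):
        test = order[i::n]                  # items ranked i, n+i, 2n+i, ...
        held_out = set(test)
        train = [j for j in order if j not in held_out]
        splits.append((test, train))
    return splits
```

Each item appears in exactly one test set, and because consecutive ranks are spread across different test sets, the score distributions of training set, test set, and the overall sample remain close.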
The automated methods proposed in this paper succeeded in approximating human ratings in reliability, as assessed via correlations with auditory-perceptual mean listener ratings. In addition, the objective measures were superior to the conventional method of assessment in which the examiner makes real-time judgments and verifies these offline. These automated methods could be of immediate practical value in terms of substantial labor savings and enhanced reliability.
More important, however, are the principles that underlie the methods and that were spelled out in the Introduction, including the usage of highly specific elicitation methods; a prosodic minimal pairs design; robust acoustic features; capturing both global and dynamic features; and weighted combinations of multiple, maximally independent acoustic features. We are currently developing additional methods based on these same principles for other tasks in our protocol.
Several issues need to be addressed as we move forward. First, we would like to develop methods, based on this first generation, that detect new markers of neurological disorders – markers that are not audible to the human ear. Such markers include acoustic features whose SNR is too low for them to be humanly detectable, as well as features that are too complex, such as those based on statistical properties of relatively long speech fragments. Second, our methods, while not requiring human judgment, are not fully automatic, because they require human labeling and segmentation, some at the word level and others at the phonetic segment level. In theory, automatic segmentation methods may make similar mistakes in segmenting the two utterances making up a prosodic minimal pair (e.g., locating the vowel-nasal boundary too early in both SHInaig and shiNAIG), thereby possibly canceling the effects of the error, but this needs to be demonstrated with actual automatic segmentation systems. Third, the methods need to be extended to unrestricted speech. Key to preserving the prosodic minimal pairs feature of our methods will be the creation of algorithms that detect quasi prosodic minimal pairs in broadly elicited speech samples: pairs of prosodically contrastive words or phrases that are phonemically similar but not identical. Finally, we need to develop methods that assess not only whether a speaker can consistently express a certain prosodic contrast but also whether the speaker expresses this contrast with an appropriate balance of prosodic features.
We thank Sue Peppé for making available the pictorial stimuli and for granting permission to us for creating new versions of several PEPS-C tasks; Lawrence Shriberg for helpful comments on an earlier draft of the paper, in particular on the LSR section; Rhea Paul for helpful comments on an earlier draft of the paper and for suggesting the Lexical Stress and Pragmatic Style Tasks; the clinical staff at CSLU (Beth Langhorst, Rachel Coulston, and Robbyn Sanger Hahn) and at Yale University (Nancy Fredine, Moira Lewis, Allyson Lee) for data collection; senior programmer Jacques de Villiers for the data collection software and data management architecture; Meg Mitchell and Justin Holguin for speech transcription; and the parents and children for participating in the study. This research was supported by a grant from the National Institute on Deafness and Other Communication Disorders, NIDCD 1R01DC007129-01; a grant from the National Science Foundation, IIS-0205731; by a Student Fellowship from AutismSpeaks to Emily Tucker Prud'hommeaux; and by an Innovative Technology for Autism grant from AutismSpeaks. The views herein are those of the authors and reflect the views neither of the funding agencies nor of any of the individuals acknowledged.
Disclosure: This manuscript has not been published in whole or substantial part by another publisher and is not currently under review by another journal.
1Preliminary results of this work were presented as posters at IMFAR 2007 and IMFAR 2008.
2We use nonsense words rather than attested trochaic/iambic pairs (e.g., contest-N vs. contest-V) because such pairs are not identical phonetically and are not generally part of a typical child's vocabulary. We do recognize, as one reviewer pointed out, that inherent properties of the component phonemes may trigger the perception of stress in the absence of meaning. The minimal-pair presentation should reduce the likelihood of this phenomenon.
3We thank one of the reviewers for making this point.