Learners of a second language practice their pronunciation by listening to and imitating utterances from native speakers. Recent research has shown that choosing a well-matched native speaker to imitate can have a positive impact on pronunciation training. Here we propose a voice-transformation technique that can be used to generate the (arguably) ideal voice to imitate: the learner's own voice with a native accent. Our work extends previous research, which suggests that providing learners with prosodically corrected versions of their utterances can be a suitable form of feedback in computer assisted pronunciation training. Our technique provides a conversion of both prosodic and segmental characteristics by means of a pitch-synchronous decomposition of speech into glottal excitation and spectral envelope. We apply the technique to a corpus containing parallel recordings of foreign-accented and native-accented utterances, and validate the resulting accent conversions through a series of perceptual experiments. Our results indicate that the technique can reduce foreign accentedness without significantly altering the voice quality properties of the foreign speaker. Finally, we propose a pedagogical strategy for integrating accent conversion as a form of behavioral shaping in computer assisted pronunciation training.
Despite years or decades of immersion in a new culture, older learners of a second language (L2) typically speak with a so-called “foreign accent,” sometimes despite concerted efforts at improving pronunciation. Similar learning phenomena have been observed in the animal world: a critical period exists beyond which animals cannot learn certain behaviors, e.g. bird singing, nest building, courting. In analogy with this phenomenon, Penfield and Roberts (1959), and later Lenneberg (1967), proposed the concept of a critical period for language acquisition. Initially proposed for the acquisition of a first language, this critical period (roughly between the age of two and puberty) has also been studied in the context of L2 acquisition (Major, 2001). Among the many aspects of proficiency in a second language (e.g. lexical, syntactic, semantic, phonological), native-like pronunciation is the most severely affected by a critical period because of the neuromusculatory basis of speech production (Scovel, 1988). Thus, according to this theory, foreign-accented production is unavoidable if a second language is learned beyond the critical period years.
To address this gloomy outlook on L2 pronunciation, many authors argue that what really matters is that the speech be intelligible, rather than accent-free (Neri et al., 2002; Pennington, 1999). However, although foreign accentedness does not necessarily affect a person’s ability to be understood (Munro and Derwing, 1995), foreign-accented speakers can be subjected to discriminatory attitudes and negative stereotypes (Anisfeld et al., 1962; Arthur et al., 1974; Lippi-Green, 1997; Ryan and Carranza, 1975; Schairer, 1992). Thus, by achieving near-native pronunciation, L2 learners stand more to gain than just better intelligibility. A second, and more direct, counter-argument to the critical period hypothesis has been provided by a number of studies showing that native-like pronunciation can be achieved by adults learning the second language well beyond puberty (Bongaerts, 1999). Nonetheless, the proportion of such native-like L2 speakers is believed to be small: between 0.1% and 3% (Markham, 1997). Given the small probability of attaining native-like pronunciation, it is unrealistic to believe that an L2 learner should hold this as their ultimate goal. However, most L2 learners can make significant strides towards reducing their accent and possibly achieving near-native performance. According to Bongaerts (1999), several factors contribute to the success of such L2 speakers: (1) a high motivation to achieve accent-free pronunciation, (2) unlimited access to L2 speech, and (3) intensive training in L2 perception and L2 production. These characteristics suggest that computer assisted pronunciation training (CAPT) is an ideal medium for the attainment of near-native pronunciation.
Although not as effective as human instruction, CAPT offers several features that make it advantageous in classroom settings (Neri et al., 2002; Pennington, 1999). Most notably, CAPT allows users to follow personalized lessons, at their own pace, and practice as often as they like. One study (Murray, 1999) showed that users are more comfortable practicing pronunciation in a private setting, where they can avoid anxiety and embarrassment. Users are also more likely to practice when and where it is convenient. The most praised systems are those that incorporate Automatic Speech Recognition (ASR) (e.g. FLUENCY (Eskenazi and Hansma, 1998), ISLE (Menzel et al., 2000), and Talk to Me (Auralog, 2002)) because they are able to provide users with objective and consistent (though not always correct) feedback. CAPT systems also keep several speakers in their databases, which helps listeners improve their listening comprehension (McAllister, 1998).
Despite the potential advantages of CAPT, the technology remains controversial for reasons beyond the debate surrounding the critical period hypothesis (Pennington, 1999). As noted by Neri et al. (2002), part of the problem is that many commercial products have chosen technological novelty over pedagogical value. For instance, a product may provide a display of the learner’s utterance (e.g. a speech waveform or a spectrogram) against that from a native speaker. These visualizations are not only difficult to interpret for non-specialists but are also misleading: two utterances can have different acoustic representations despite having been pronounced correctly.1 A second criticism stems from the limitations of ASR technology when used for detecting pronunciation errors and evaluating pronunciation quality (Neri et al., 2003); CAPT is a challenging domain for ASR because of the inherent variability of foreign-accented speech. ASR errors not only frustrate and mislead the learner but also, and more importantly, undermine their trust in the CAPT tool (Levy, 1997; Wachowicz and Scott, 1999).
Feedback is a critical component in pronunciation training; unfortunately, research data on the effectiveness of various feedback strategies is scarce (Neri et al., 2002). Hansen (2006) prescribes four critical criteria for feedback in CAPT: feedback should be easy to understand (comprehensive), it should determine whether the correct phoneme was used (qualitative) and whether the phoneme was of the correct length (quantitative), and it should suggest actions for improvement (corrective). CAPT systems employing ASR technology usually satisfy the first three requirements, but often have difficulty providing meaningful corrective suggestions. One of the more ambitious CAPT systems to date (ISLE; Menzel et al., 2000) satisfied all four of Hansen's (2006) criteria, but poor ASR accuracy ultimately limited its adoption. ASR errors can be so disruptive to the learner that Wachowicz and Scott (1999) have suggested that CAPT systems should rely on implicit rather than explicit feedback. As an example, recasts (rephrasings of the incorrectly pronounced utterance) have been shown to be superior to explicit correction of phonological errors (Lyster, 2001).
During the last two decades, a handful of studies have suggested that it would be beneficial for L2 students to be able to listen to their own voices producing native-accented utterances (Jilka and Möhler, 1998; Sundström, 1998; Tang et al., 2001; Watson and Kewley-Port, 1989). The rationale is that, by stripping away information that is only related to the teacher’s voice quality, accent conversion makes it easier for students to perceive differences between their accented utterances and their ideal accent-free counterparts. In addition, it can be argued that accent-corrected utterances provide a form of feedback that is implicit (Wachowicz and Scott, 1999), corrective, and encouraging. A series of previous studies support this view.
Nagano and Ozawa (1990) evaluated a prosodic-conversion method for the purpose of teaching English pronunciation to Japanese learners. One group of students was trained to mimic utterances from a reference English speaker, whereas a second group was trained to mimic utterances of their own voices, previously modified to match the prosody of the reference English speaker. Pre- and post-training utterances from both groups of students were evaluated by native English listeners. Post-training utterances from the second group of students were rated as more native-like than those from the first group. More recently, Bissiri et al. (2006) investigated the use of prosodic modification to teach German prosody to Italian speakers. Their results were consistent with those of Nagano and Ozawa (1990), and indicate that the learner’s own voice (with corrected prosody) is a more effective form of feedback than prerecorded utterances from a German native speaker. Peabody and Seneff (2006) proposed a similar strategy to teach pronunciation of a tonal language (Mandarin), a problem that is very challenging for students whose native language is non-tonal (English). For this purpose, the authors used three different datasets of Mandarin utterances: a corpus produced by native speakers, and two corpora produced by L2 speakers. Using a phase vocoder, the authors transformed the pitch contour of L2 utterances to match the tonal shapes of native utterances. The transformed L2 utterances were twice as likely to be classified correctly by a pattern classifier. This result is not surprising since the classifier was trained with tones, but it does highlight the importance of prosody and its ability to indicate accent. Anecdotal support for the use of accent conversion is also provided by studies of categorical speech perception and production. In particular, Repp and Williams (1987) compared the accuracy of speakers imitating isolated vowels in two continua: [u]-[i] and [i]-[æ]. Their results indicate that speakers were more accurate when imitating their own (earlier) productions of those vowels than when imitating vowels produced by a speech synthesizer.
More recently, a few CAPT tools have begun to incorporate prosodic-conversion capabilities. These tools allow L2 learners to re-synthesize their own utterances with a native prosody, either through a manual editing procedure (Martin, 2004) or with automated algorithms (GenevaLogic, 2007). Proper intonation and stress are critical because they provide a temporal structure that helps the listener parse the continuous speech waveform (Celce-Murcia et al., 1996). Thus, a number of authors have suggested that prosody should be emphasized early on in teaching a second language (Chun, 1998; Eskenazi, 1999). However, speech intelligibility can also degrade as a result of segmental/spectral errors (Rogers and Dalby, 1996), which indicates that both segmental and supra-segmental features should be considered in pronunciation training (Derwing et al., 1998b). This suggests that full accent conversion (i.e. prosodic and segmental) would be beneficial in teaching pronunciation of a foreign language.
Probst et al. (2002) investigated the relationship between the student/teacher voice similarity and pronunciation improvement. Results from this study showed that learners who imitated a well-matched speaker improved their pronunciation more than those who imitated a poor match, suggesting the existence of a user-dependent “golden speaker.” Thus, one can argue that full accent conversion would provide learners with the optimal “golden speaker”: their native-accented selves. As a step towards this goal, this manuscript describes a speech processing method that can be used to transform foreign-accented utterances into their native counterparts, and provides a thorough validation of the method through a series of perceptual tests. In addition, we discuss implementation issues and propose a pedagogical strategy that would integrate accent conversion into computer assisted pronunciation training as a form of behavioral shaping (Kewley-Port and Watson, 1994; Watson and Kewley-Port, 1989).
What constitutes a foreign accent? A foreign accent can be defined as deviations from the expected acoustic (e.g. formants) and prosodic (e.g. intonation, duration, and rate) norms of a language. According to the modulation theory of speech (Traunmüller, 1994), a speaker’s utterance results from the modulation of a voice quality carrier with linguistic gestures. In this context, Traunmüller identifies the carrier as the organic aspects of a voice that “reflect the morphological between-speaker variations in the dimensions of speech,” such as those that are determined by physical factors (e.g. larynx size and vocal tract length). Thus, in analogy with the source/filter theory of speech production (Fant, 1960), which decomposes a speech signal into excitation and vocal tract resonances, modulation theory suggests that one could deconvolve an utterance into its voice quality carrier and its linguistic gestures. According to this view, then, a foreign accent may be removed from an utterance by extracting its voice quality carrier and convolving it with the linguistic gestures of a native-accented counterpart.
In contrast with voice conversion, which seeks to transform utterances from a speaker so they sound as if another speaker had produced them (Abe et al., 1988; Arslan and Talkin, 1997; Childers et al., 1989; Kain and Macon, 1998; Sundermann et al., 2003; Turk and Arslan, 2006), accent conversion seeks to transform only those features of an utterance that contribute to accent while maintaining those that carry the identity of the speaker. Accent conversion is a relatively new concept; as a result, only a handful of studies have been published on the subject. Yan et al. (2004) proposed an accent-synthesis method based on formant warping. First, the authors developed a formant tracker based on hidden Markov models and linear predictive coding, and applied it to a corpus containing several regional English accents (British, Australian, and American). Analysis of the formant trajectories revealed systematic differences in the vowel formant space for the three regional accents. Second, the authors re-synthesized utterances by warping formants from a foreign accent onto the formants of a native accent; pitch-scale and time-scale modifications were also applied. An ABX test showed that 75% of the re-synthesized utterances were perceived as having the native accent, which indicates that segmental accent conversion is feasible. More recently, Huckvale and Yanagisawa (2007) used an English text-to-speech (TTS) system to simulate English-accented Japanese utterances; foreign accentedness was achieved by transcribing Japanese phonemes with their closest English counterparts. The authors then evaluated the intelligibility of a Japanese TTS against the English TTS, and against several prosodic and segmental transformations of the English TTS. Their results showed that both segmental and prosodic transformations are required to improve significantly the intelligibility of English-accented Japanese utterances.
Our work differs from the study of Yan et al. (2004) in two respects. First, our accent conversion method uses a spectral envelope vocoder, which makes it more suitable than formant tracking for unvoiced segments. Second, we evaluate not only the accentedness of the re-synthesized speech but also the perceived identity of the resulting speaker. The latter is critical because a successful accent conversion model should preserve the identity of the foreign-accented speaker. In contrast with Huckvale and Yanagisawa (2007), our study is performed on natural speech, and focuses on accentedness and identity rather than on intelligibility; as noted by Munro and Derwing (1995), a strong foreign accent does not necessarily limit the intelligibility of the speaker.
The remaining sections of this article are organized as follows. Section 4 provides an overview of the speech modification framework (FD-PSOLA) adopted for this work, and describes our method of accent conversion. Section 5 describes the perceptual protocol we have employed to evaluate our method along three dimensions: foreign accentedness, speaker identity, and signal quality; results from these perceptual experiments are analyzed in Section 6. The article concludes with a discussion of our findings and the implications of accent conversion in CAPT.
Our accent conversion transformation is based on the general framework of Pitch-Synchronous Overlap and Add (PSOLA) (Moulines and Charpentier, 1990). Several versions of PSOLA have been proposed in the literature, including Fourier-domain FD-PSOLA, linear-prediction LP-PSOLA, and time-domain TD-PSOLA (Moulines and Charpentier, 1990; Moulines and Laroche, 1995). These algorithms perform comparably under modest modification factors, but FD-PSOLA is the most robust to spectral distortion during the pitch modification step. For this reason, and despite its higher computational requirements, FD-PSOLA was adopted for this work.
FD-PSOLA operates in three stages: analysis, modification, and synthesis. During the analysis stage, the speech signal is decomposed into a series of pitch-synchronous short-time analysis windows; our implementation uses a pitch-marking algorithm (Kounoudes et al., 2002) to estimate instants of glottal closure. Each analysis window is framed with a Hanning window, and transformed into the frequency domain.2 As a result, all pitch-synchronous short-time spectra are represented with the same length (e.g. 2048 frequencies in our implementation).
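The analysis stage can be pictured with a minimal Python sketch. It assumes that pitch marks are already available as sample indices, that each analysis window spans two pitch periods (from the previous mark to the next one, a common PSOLA convention that the text does not confirm for this implementation), and that frames are zero-padded to a fixed length before the Fourier transform. The function name and its arguments are illustrative only.

```python
import math

def pitch_synchronous_frames(signal, pitch_marks, fft_size=2048):
    """Cut Hann-windowed frames around each interior pitch mark.

    Assumption: each frame runs from the previous mark to the next one
    (two pitch periods) and is zero-padded to fft_size samples.
    """
    frames = []
    for prev_mark, next_mark in zip(pitch_marks, pitch_marks[2:]):
        segment = signal[prev_mark:next_mark]
        length = len(segment)
        if length < 2 or length > fft_size:
            raise ValueError("frame length must be between 2 and fft_size")
        # Symmetric Hann window over the whole two-period segment.
        window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
                  for i in range(length)]
        frame = [s * w for s, w in zip(segment, window)]
        # Zero-padding gives every frame the same number of samples,
        # hence the same number of frequency bins after the transform.
        frame.extend([0.0] * (fft_size - length))
        frames.append(frame)
    return frames

# Toy usage: a constant signal with a pitch mark every 100 samples.
frames = pitch_synchronous_frames([1.0] * 400, [0, 100, 200, 300])
```

Each returned frame would then be passed to a Fourier transform; that step is omitted here to keep the sketch free of external dependencies.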
In the modification stage, the short-time spectra and their locations are modified to meet the desired pitch and timing (i.e. those of the native speaker, in our case). This modification consists of three steps. First, a new set of synthesis pitch marks is defined according to the native pitch and timing. Second, the short-time spectra are copied onto the synthesis pitch marks, duplicating or deleting spectra as needed. Third, the short-time spectra are transformed to match the new pitch period. Since we operate in the frequency domain, this last step is equivalent to resampling, i.e. spectral compression lowers the pitch and expansion raises it. However, naïve compression of the spectrum also shifts speech formants. For this reason, we first flatten the spectrum with a spectral envelope vocoder (SEEVOC) (Paul, 1981). We also use a spectral folding technique (Makhoul and Berouti, 1979) to regenerate high frequency components that are lost when performing spectral compression (Fig. 1b). Finally, we multiply the flattened spectrum by the SEEVOC spectral envelope estimate, thus restoring its original resonances (Fig. 1c).
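The first two steps of the modification stage (placing synthesis pitch marks and copying spectra onto them) can be sketched as follows. This is a simplified illustration, not the implementation used in this work: it assumes a constant time-scale factor and a constant target pitch period over the segment, and it selects the nearest analysis frame for each synthesis mark. The function name is hypothetical.

```python
import math

def map_synthesis_marks(analysis_marks, alpha, target_period):
    """Place synthesis pitch marks and pick a source frame for each one.

    Assumptions: alpha (time-scale factor) and target_period (in samples)
    are constant over the segment; in practice both vary over time.
    """
    start = analysis_marks[0]
    duration = analysis_marks[-1] - start
    count = int(math.floor(alpha * duration / target_period + 1e-9)) + 1
    synthesis_marks = [k * target_period for k in range(count)]
    mapping = []
    for mark in synthesis_marks:
        source_time = start + mark / alpha  # position in the original signal
        nearest = min(range(len(analysis_marks)),
                      key=lambda i: abs(analysis_marks[i] - source_time))
        mapping.append(nearest)
    return synthesis_marks, mapping

# Stretching by a factor of 2 at an unchanged period duplicates frames.
stretched_marks, stretched_map = map_synthesis_marks([0, 100, 200, 300, 400], 2.0, 100)
# Compressing by a factor of 0.5 deletes frames.
compressed_marks, compressed_map = map_synthesis_marks([0, 100, 200, 300, 400], 0.5, 100)
```

In the stretched case some analysis frames are reused for consecutive synthesis marks, and in the compressed case some frames are skipped, which is the duplication and deletion described in the second step.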
The modified short-time spectra are finally transformed back to the time domain, and combined by means of a least-squared-error signal estimation criterion:

$$x(n) = \frac{\sum_{m} w(m-n)\, y_m(n)}{\sum_{m} w^{2}(m-n)}$$

where $y_m(n)$ is the inverse Fourier transform of the short-time spectra at time m and w(m − n) is the windowing function (e.g. Hanning) (Griffin and Lim, 1984).
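A least-squared-error overlap-add in the sense of Griffin and Lim (1984) divides the window-weighted sum of the time-domain frames by the sum of squared windows at each sample. The sketch below illustrates this under simplifying assumptions: all frames share one fixed-length window (pitch-synchronous frames would have varying lengths), and a Hann-like window with a half-sample offset is used so that no window sample is exactly zero. Function names are illustrative.

```python
import math

def offset_hann(length):
    """Hann-like window sampled at half-sample offsets (never exactly zero)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 0.5) / length)
            for i in range(length)]

def lsee_overlap_add(frames, starts, window, out_len, eps=1e-12):
    """Combine time-domain frames: sum(w * y) / sum(w ** 2) per sample."""
    numerator = [0.0] * out_len
    denominator = [0.0] * out_len
    for frame, start in zip(frames, starts):
        for i, (y, w) in enumerate(zip(frame, window)):
            n = start + i
            if 0 <= n < out_len:
                numerator[n] += w * y
                denominator[n] += w * w
    # Samples not covered by any frame are set to zero.
    return [a / b if b > eps else 0.0 for a, b in zip(numerator, denominator)]

# Toy check: unmodified windowed frames should reproduce the input signal.
signal = [math.sin(0.1 * n) for n in range(64)]
window = offset_hann(16)
starts = list(range(0, 49, 4))
frames = [[window[i] * signal[s + i] for i in range(16)] for s in starts]
rebuilt = lsee_overlap_add(frames, starts, window, len(signal))
```

When the frames have not been modified, the weighted sum cancels the analysis window exactly, so the rebuilt signal matches the input wherever at least one frame covers the sample.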
For convenience, we will call the second-language (foreign) speaker of American English the learner, and the native speaker of American English the teacher. We also assume that parallel English utterances are available from both speakers.
Our accent transformation method consists of two steps. First, prosodic conversion is performed by modifying the phoneme durations and pitch contour of the learner utterance to follow those of the teacher. Second, formants from the learner utterance are replaced with those from the teacher. Although they are described separately here, these two steps are performed simultaneously in our implementation.
To perform time-scale conversion, we assume that the speech has been phonetically segmented by hand or with a forced-alignment tool (Sphinx, 2001; Young, 1993). From these phonetic segments, the ratio of teacher-to-learner durations is used to specify a time-scale modification factor α for the learner on a phoneme-by-phoneme basis; as prescribed by Moulines and Laroche (1995), we limit time-scale factors to the range of α = [0.25, 4].
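As a small illustration of the duration ratio and its limits, the following sketch computes one time-scale factor per phoneme from hypothetical learner and teacher durations. The function name and the example values are assumptions made for illustration.

```python
def time_scale_factors(learner_durations, teacher_durations, lo=0.25, hi=4.0):
    """Teacher-to-learner duration ratio per phoneme, limited to [lo, hi]."""
    if len(learner_durations) != len(teacher_durations):
        raise ValueError("both utterances must have the same phoneme sequence")
    factors = []
    for learner, teacher in zip(learner_durations, teacher_durations):
        if learner <= 0:
            raise ValueError("learner durations must be positive")
        factors.append(min(hi, max(lo, teacher / learner)))
    return factors

# Hypothetical phoneme durations in seconds (not taken from the corpus).
alphas = time_scale_factors([0.10, 0.20, 0.05], [0.20, 0.10, 0.50])
```

Here the third phoneme would need a factor of 10, which is limited to 4.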
Our pitch-scale modification combines the pitch dynamics of the teacher with the pitch baseline of the learner. This is achieved by replacing the pitch contour of the learner utterance with a transformed (i.e. shifted and scaled) version of the pitch contour of the teacher utterance. For this purpose, we first estimate average pitch values for the learner and for the teacher from a corpus of utterances. Next, we define a piecewise-linear time-warping, ΨLT(f(t)), to align learner and teacher utterances at phoneme boundaries. Finally, given the pitch contours of the specific learner and teacher utterances to be converted, we define a pitch-scale factor β as the ratio of the time-warped, transformed teacher contour to the learner contour, and we limit pitch-scale factors to the range of β = [0.5, 2]. This process allows us to preserve speaker identity by maintaining a reasonable pitch baseline and range (Compton, 1963; Sambur, 1975), while acquiring the pitch dynamics of the teacher, which provides important cues to native accentedness (Arslan and Hansen, 1997; Munro, 1995; Vieru-Dimulescu and Mareüil, 2005). Once the time-scale and pitch-scale modification parameters (α,β) are calculated, standard FD-PSOLA is used to perform the prosodic conversion.
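The sketch below shows one way a frame-wise pitch-scale factor could be computed. It rests on assumptions that the text does not confirm: the teacher contour is taken to be already time-warped onto the learner's frames, the transformation of the teacher contour is modeled as a simple shift by the difference of the two speakers' average pitch, and unvoiced frames (marked with a pitch of 0) are left unchanged. The function name is hypothetical.

```python
def pitch_scale_factors(f0_learner, f0_teacher_warped,
                        mean_learner, mean_teacher, lo=0.5, hi=2.0):
    """Frame-wise ratio of target pitch to learner pitch, limited to [lo, hi].

    Assumption: the target is the warped teacher contour moved to the
    learner's baseline by subtracting mean_teacher and adding mean_learner.
    """
    factors = []
    for learner_f0, teacher_f0 in zip(f0_learner, f0_teacher_warped):
        if learner_f0 <= 0 or teacher_f0 <= 0:
            factors.append(1.0)  # unvoiced frame: no pitch modification
            continue
        target = teacher_f0 - mean_teacher + mean_learner
        factors.append(min(hi, max(lo, target / learner_f0)))
    return factors

# Hypothetical contours in Hz (not taken from the corpus).
betas = pitch_scale_factors([100.0, 100.0, 0.0], [220.0, 180.0, 200.0],
                            mean_learner=100.0, mean_teacher=200.0)
```

With these values the learner keeps a baseline of 100 Hz while following the teacher's rise and fall of 20 Hz around the teacher's own mean.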
Our segmental accent conversion stage assumes that the glottal excitation signal is largely responsible for voice quality, whereas the filter contributes to most of the linguistic information. Thus, our strategy consists of combining the teacher’s spectral envelope (filter) with the learner’s glottal excitation. FD-PSOLA allows us to perform this step in a straightforward fashion: in the final step illustrated in Fig. 1c, we multiply the learner’s flat spectra by the teacher’s envelope rather than by the learner’s envelope. In order to reduce speaker-dependent information in the teacher’s spectral envelope, we also perform Vocal Tract Length Normalization (VTLN) using a piecewise linear function defined by the average formant pairs of the two speakers (see Fig. 2) (Sundermann et al., 2003). These formant locations are estimated with Praat (Boersma and Weenink, 2007) over the entire corpus. The result is a signal that consists of the learner’s excitation, and the teacher’s spectral envelope normalized to the learner’s vocal tract length.
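A piecewise-linear frequency warping defined by pairs of average formants can be sketched as below. The anchoring of the function at 0 Hz and at the Nyquist frequency is an assumption, as are the function name and the example formant values; the text only states that the function is defined by the average formant pairs of the two speakers.

```python
def vtln_warp(freq_hz, teacher_formants, learner_formants, nyquist_hz):
    """Map a teacher frequency onto the learner's frequency axis.

    Assumption: the piecewise-linear function passes through (0, 0),
    each (teacher formant, learner formant) pair, and (nyquist, nyquist).
    """
    xs = [0.0] + list(teacher_formants) + [nyquist_hz]
    ys = [0.0] + list(learner_formants) + [nyquist_hz]
    for i in range(len(xs) - 1):
        if xs[i] <= freq_hz <= xs[i + 1]:
            slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            return ys[i] + slope * (freq_hz - xs[i])
    raise ValueError("frequency outside [0, nyquist]")

# Hypothetical average formants in Hz, for illustration only.
teacher = [500.0, 1500.0, 2500.0]
learner = [600.0, 1800.0, 2700.0]
warped = [vtln_warp(f, teacher, learner, 8000.0)
          for f in (250.0, 500.0, 1000.0, 8000.0)]
```

Applying such a mapping to the frequency axis of the teacher's spectral envelope moves the teacher's average formant positions onto those of the learner before the envelope is multiplied with the learner's flattened spectrum.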
The proposed accent conversion method was evaluated through a series of perceptual experiments. We were interested in determining (1) the degree of reduction in foreign accent that could be achieved with the model, and (2) the extent to which the transformation preserved the identity of the original speaker. To establish the relative contribution of segmental and prosodic information, these two factors were manipulated independently, resulting in three accent conversions: prosodic only, segmental only, and both. Original utterances from both foreign and native speakers were tested as well, resulting in five stimulus conditions (see Table 1). Sample video files for the five conditions are available as Supplemental material (1–5.mpg and rev1–5.mpg), and spectrograms of the three primary conditions (1, 4 and 5) are shown in Fig. 3.
One hundred and ninety one participants were recruited from the undergraduate pool maintained by the Department of Psychology at Texas A&M University. All participants were native speakers of American English and had no hearing or language impairments. Two speakers were selected from the CMU_ARCTIC database (Kominek and Black, 2003): ksp_indianmale and rms_usmale2. Given that our participants were native speakers of American English, utterances from ksp_indianmale were treated as the foreign-accented learner, and utterances from rms_usmale2 were treated as the native-accented teacher.3 The same twenty sentences were chosen for each of the five conditions, or 100 unique utterances. Audio stimuli were presented via headphones.
Thirty-nine students participated in a 25-minute scaled-rating test to establish the degree of accentedness of individual utterances. Following Munro and Derwing (1994), participants responded on a 7-point Empirically Grounded, Well-Anchored (EGWA) scale (0 = not at all accented; 2 = slightly accented; 4 = quite a bit accented; 6 = extremely accented) (Pelham and Blanton, 2007). Each participant rated all 100 utterances.
Forty-three students participated in a 25-minute Mean Opinion Score test to obtain a numerical indication of the perceived quality (e.g. lack of distortions) of the recorded/synthesized utterances. Following Kain and Macon (1998), participants heard an utterance and were asked to indicate the acoustic quality of the stimulus on a standard MOS scale from 1 (bad) to 5 (excellent), where “excellent” was defined as a sound that had no distortions. Before the test began, students listened to examples of sounds with various accepted MOS values. This task included feedback, which allowed students to calibrate themselves to the reference scores. Each participant rated all 100 utterances.
Forty-three students participated in a 25-minute speaker identification test. Following Kreiman and Papcun (1991), participants heard two linguistically different4 utterances presented consecutively, and were instructed to “focus on those aspects of the voice that determine identity.” Participants were asked to determine if the two sentences were produced by the same speaker or by two different speakers, and to rate their confidence on a 7-point EGWA scale (0 = not at all confident; 2 = slightly confident; 4 = quite a bit confident; 6 = extremely confident). These two responses were then converted into a 15-point perceptual score from 0 to 14 (Table 2). Each participant listened to 60 pairs of utterances.5
Sixty-six students participated in a 25-minute speaker identification test similar to that in Section 5.3, except that utterances were played backwards. Although the instructions in the previous experiment clearly stated that participants should focus on the identity-related aspects of the speaker voices, we suspected that it would be difficult for participants to ignore linguistic cues in those utterances. Fortunately, reversed speech removes most of the linguistic cues (e.g. language, vocabulary, and accent) that may be used to identify a speaker, while retaining the pitch, pitch range, speaking rate, and vocal quality of the speaker, which can be used to identify familiar and unfamiliar voices (Sheffert et al., 2002; van Lancker et al., 1985). Thus, this reversed-speech identification test allowed us to determine the extent to which the accentedness of our speakers (or possibly distortions resulting from the re-synthesis) had been used as a cue in the previous identification study.
Results from the foreign accentedness experiment are summarized in Fig. 4a. Original recordings from the foreign speaker received the highest average accent rating (4.85), while native speaker recordings had the lowest average rating (0.15). The prosodic transformation decreased the perceived accent slightly (4.83), but this change was not statistically significant; t(38) = 0.38, n.s. On the other hand, the segmental transformation lowered the rating to 1.97; t(38) = 24.14, p < 0.01. When used in concert, both transformations yielded an average score of 1.79; t(38) = 24.06, p < 0.01. Both of these reductions were statistically significant.
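The reported statistics have 38 degrees of freedom, which is consistent with a comparison paired across the 39 participants; the text does not name the test, so treating it as a paired t-test is an assumption. Under that assumption, the statistic can be sketched as follows, using made-up ratings rather than data from the experiment.

```python
import math

def paired_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the per-listener differences.
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(variance / n), n - 1

# Hypothetical mean ratings from four listeners (not data from the study).
t_stat, dof = paired_t([5, 6, 7, 8], [3, 5, 4, 6])
```

In this toy case the four differences average 2 rating points, giving a t statistic of about 4.9 with 3 degrees of freedom.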
Results from the acoustic quality experiment are summarized in Fig. 4b. Original recordings from the native speaker received the highest average quality rating (4.84), while the unmodified foreign speaker averaged a lower rating (4.0); this difference was statistically significant; t(42) = 6.68, p < 0.01. This lower rating may have been caused by differences in the recording conditions for both speakers, but it is also possible that subjects penalized the “quality” of non-native speech because it was less intelligible. All transformations lowered the quality ratings: starting from the original baseline (4.0), the prosodic and segmental transformations reduced quality ratings to 2.96 and 2.67, respectively; these differences were statistically significant (t(42) = 12.49, p < 0.01 and t(42) = 9.19, p < 0.01), with respect to the rating of foreign-accented utterances.
The identity experiments yield a collection of perceptual distances between pairs of utterances (0/14: the participant was extremely confident that the speakers were the same/ different). Because only the relative distance between stimuli is available, we resort to multi-dimensional scaling (MDS) to find a low-dimensional visualization (e.g. 2D) of the data that preserves those pair-wise distances; see (Matsumoto et al., 1973) for a classical use of MDS in speech perception. Namely, we use ISOMAP (Tenenbaum et al., 2000), an MDS technique that attempts to preserve the geodesic distance6 between stimuli. For clarity, technical details of ISOMAP are included in Appendix A.
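The geodesic distances that ISOMAP tries to preserve can be approximated by linking each stimulus to its nearest neighbours and then computing shortest paths through the resulting graph. The sketch below illustrates only that step (the subsequent low-dimensional embedding is omitted); the neighbourhood size, the use of the Floyd-Warshall algorithm, and the toy distance matrix are assumptions made for illustration.

```python
def geodesic_distances(dist, k):
    """Shortest-path distances over a symmetric k-nearest-neighbour graph."""
    n = len(dist)
    inf = float("inf")
    geo = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Keep only the k closest stimuli as direct links.
        for j in sorted(others, key=lambda j: dist[i][j])[:k]:
            geo[i][j] = geo[j][i] = dist[i][j]
    # Floyd-Warshall: relax every pair through every intermediate node.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if geo[i][m] + geo[m][j] < geo[i][j]:
                    geo[i][j] = geo[i][m] + geo[m][j]
    return geo

# Hypothetical perceptual distances between four stimuli on a curved path.
toy = [[0.0, 1.0, 1.8, 2.0],
       [1.0, 0.0, 1.0, 1.8],
       [1.8, 1.0, 0.0, 1.0],
       [2.0, 1.8, 1.0, 0.0]]
geo = geodesic_distances(toy, k=1)
```

In the toy matrix the direct distance between the first and last stimuli is 2.0, whereas the distance measured along the chain of neighbours is 3.0, which is the quantity an ISOMAP embedding would aim to reproduce.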
ISOMAP visualizations of the identity tests are shown in Fig. 5. In the case of forward speech, samples from conditions 1 and 2 map closely together in the manifold. Thus, this result indicates that the prosodic transformation had only a small effect on the perceived identity of the speakers. On the other hand, samples of learner utterances (condition 1) and their segmental transformations (conditions 3 and 4) are clearly separated in the ISOMAP manifold. This result indicates that participants were able to distinguish between learner and segmentally-transformed utterances, which suggests that they perceived the latter as a “third” speaker; note that this type of inference is not possible with the ABX tests commonly used in voice conversion. However, all samples containing the learner’s glottal excitation (conditions 1–4) appear to map on a linear subspace that is separate from the teacher utterances (condition 5), which indicates that the former are perceived as being closer to each other than to the teacher. In fact, by calculating the average Euclidean distance across conditions, we find that this “third” speaker (conditions 3–4) is perceived to be three times closer to the learner (condition 1) than to the teacher (condition 5).
Results from the reversed-speech experiment, shown in Fig. 5b, indicate that participants were unable to differentiate between conditions 1–2 and conditions 3–4. These results support our hypothesis, namely, that participants in the forward speech experiment had identified conditions 3–4 as a “third” speaker because of the association between accentedness and speaker identity. Since most linguistic cues (including accent) are not accessible with reversed speech, participants perceive conditions 1–4 as utterances from the same speaker.
Interestingly, the ISOMAP embedding in both cases (though more clearly with forward speech) can be interpreted in terms of the source-filter theory. As shown in Fig. 5, the first dimension separates samples in condition 5, which uses the teacher’s glottal excitation, from samples in the remaining conditions, which use the learner’s glottal excitation. In contrast, the second dimension separates samples in conditions 1–2, which employ the learner’s filter, from samples in conditions 3–5, which employ the teacher’s filter.
The perceptual results presented in the previous section indicate that our accent conversion approach can reduce the perceived foreign accentedness of an utterance (by about two-thirds) while preserving information that is unique to the voice quality of the speaker. Thus, these results support our choice of a spectral envelope vocoder to decompose utterances into their voice quality and linguistic components. Although foreign-accented utterances (condition 1) are already perceived as being of lower quality, the technique itself introduces perceivable distortions, as indicated by the lower quality ratings for conditions 2–4. This result could be attributed to several factors, including segmentation/alignment errors, voicing differences between speakers, and phase distortions that result from combining glottal excitation with spectral envelope from different speakers. Our results of accentedness seem to underplay the importance of prosody when compared with other studies (Jilka and Möhler, 1998; Nagano and Ozawa, 1990). This could be a consequence of the elicitation procedure used in the ARCTIC database (Kominek and Black, 2003), since read speech is more prosodically flat than spontaneous or conversational speech (Kenny et al., 1998).
Identity tests with forward speech indicate that the segmental transformations (with or without prosodic transformation) are perceived as a third speaker. This third speaker disappears, however, when participants are asked to discriminate reversed speech.7 One could argue that the emergence of a third speaker on forward speech is merely the result of distortions introduced by the segmental transformation; these distortions are imperceptible when utterances are played backward, which may explain why the third speaker “disappears” with reversed speech. In other words, accentedness and acoustic quality would be confounded in our experiments. This view, however, is inconsistent with the acoustic quality ratings obtained in the second experiment. As shown in Fig. 4b, quality ratings for condition 2 are similar to those of conditions 3–4, rather than to those of condition 1; if participants had used acoustic quality as a cue in the identification study, condition 2 would have been perceived also as belonging to the third speaker. Thus, our identification experiments with forward and reverse speech indicate that participants used not only organic cues (voice quality) but also linguistic cues (accentedness) to discriminate speakers. This suggests that something is inevitably lost in the identity of a speaker when accent conversion is performed. After all, would foreign-born public figures (e.g. Arnold Schwarzenegger, Javier Bardem) be recognized as themselves without their distinct accents?
As discussed in Section 2, several studies have suggested the use of speech modification as a training tool for second-language pronunciation, and have shown promising results (Bissiri et al., 2006; Nagano and Ozawa, 1990; Peabody and Seneff, 2006). In addition, new CAPT tools have also begun to incorporate speech modification capabilities (GenevaLogic, 2007; Martin, 2004). This previous work has focused on time-scale and pitch-scale modifications, arguably because of the impact that prosody has on foreign accent and intelligibility. However, segmental pronunciation errors are also detrimental to intelligibility (Rogers and Dalby, 1996), and both aspects of pronunciation should be considered during training (Derwing et al., 1998a).
Our work has focused on developing a method for full (i.e. prosodic and segmental) accent conversion, and characterizing the model on three perceptual criteria: foreign accentedness, speaker identity, and acoustic quality. While our perceptual results are encouraging, the proposed accent conversion model has yet to be validated for the purposes of pronunciation training. To this end, our immediate issues deal with the implementation of the accent conversion method as a CAPT tool:
In addition to these technical challenges, special attention will have to be paid to feedback and pedagogical issues. Fortunately, earlier work by Kewley-Port and Watson (Kewley-Port and Watson, 1994; Watson and Kewley-Port, 1989) provides a framework for integrating accent conversion in CAPT. Namely, the authors proposed a three-dimensional taxonomy of CAPT systems according to their feedback strategy. The first two dimensions characterize feedback in terms of the level of detail (e.g. a spectrogram of the learner’s utterance versus a quality rating) and type of physical media (e.g. acoustic vs. visual). The third dimension is quite relevant to our work, because it characterizes feedback according to the standard against which the system evaluates productions of the learner. Two types of references are considered in the taxonomy: a normative standard (e.g. the teacher’s speech) and actual samples of the learner’s speech.8 The authors argue that using the student’s own voice as a standard can be considered as an attempt to incorporate “behavioral shaping” procedures into CAPT. In behavioral shaping, the teacher asks the students to compare their utterances against their previous efforts rather than against a separate standard. This is accomplished by keeping track of the student’s “best” utterances, and using them as a reference. Kewley-Port and Watson argue that using a normative reference can be detrimental in the early stages of training, when the student’s utterances are very distant from the ideal pronunciation. Instead, by using a “floating” reference (i.e. one that adapts to the performance of the learner), the teacher can provide carefully graded evaluations of the learner’s performance and guide him towards the ultimate goal. Accent conversion provides a mechanism for implementing such behavioral shaping procedures. 
Namely, by convolving the learner’s glottal excitation with a “morph” between the learner’s and teacher’s spectral envelope (as opposed to using the teacher’s envelope), accent conversion provides a continuum of transformations. During the early stages of learning, the system would provide the learner with transformed utterances that have less ambitious prosodic and segmental goals. The rationale is that these intermediate teachers would provide more realistic (though still challenging) goals for the user to imitate. As each of these intermediate teachers was met, the transformation would be updated using the latest, best pronunciation of the user. Continuing in this iterative manner, the training process can be seen as a trajectory in a two-dimensional imitation space composed of the increasingly better productions of the learner (behavioral shaping) and a continuum of accent transformations (morphing). This process is illustrated in Fig. 6.
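As a rough illustration of the morphing idea, the sketch below interpolates between two spectral envelopes with a single weight. Everything in it is an assumption made for illustration: the function names `morph_envelope` and `apply_envelope`, the weight `alpha`, and the choice of log-magnitude (geometric) interpolation are hypothetical, since the text does not specify how the morph would be computed. Convolution in the time domain is represented here as a per-bin multiplication of magnitude spectra.

```python
import math

def morph_envelope(learner_env, teacher_env, alpha):
    """Interpolate two magnitude envelopes bin by bin in the log domain.

    alpha = 0 returns the learner's envelope, alpha = 1 the teacher's.
    Log-domain interpolation is an assumed choice, not the paper's method.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(learner_env) != len(teacher_env):
        raise ValueError("envelopes must have the same number of bins")
    return [
        math.exp((1.0 - alpha) * math.log(l) + alpha * math.log(t))
        for l, t in zip(learner_env, teacher_env)
    ]

def apply_envelope(excitation_mag, envelope):
    """Shape the excitation magnitude spectrum with an envelope.

    Time-domain convolution corresponds to this per-bin product.
    """
    return [e * h for e, h in zip(excitation_mag, envelope)]

# Toy envelopes with four frequency bins (strictly positive magnitudes).
learner = [1.0, 4.0, 2.0, 0.5]
teacher = [4.0, 1.0, 2.0, 2.0]
flat_excitation = [1.0, 1.0, 1.0, 1.0]

# A hypothetical shaping schedule: each step moves closer to the teacher.
for alpha in (0.25, 0.5, 0.75, 1.0):
    target = apply_envelope(flat_excitation, morph_envelope(learner, teacher, alpha))
    print(alpha, [round(x, 3) for x in target])
```

Interpolating magnitudes bin by bin cross-fades the two envelopes rather than shifting formant peaks, so a practical system would likely need a formant-aware interpolation; the sketch only shows how a single weight can index a continuum of intermediate targets.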
We believe that accent conversion could play a significant role in the next generation of computer assisted pronunciation training tools. Our method is based on the assumption that accent is contained in the prosody and formant structure of an utterance, whereas speaker identity is captured by vocal tract length and glottal shape characteristics. Our method employs FD-PSOLA to adapt the speaking rate and pitch of the learner towards those of the teacher, and a segmental transformation to replace the spectral envelope of the learner with that of the normalized teacher. These techniques achieved a significant reduction in foreign accent while preserving the voice quality of the speaker. Our results also reveal a strong connection between accent and identity, which suggests a tradeoff between reducing accentedness and preserving speaker identity. Our perceptual results, coupled with previous research showing the benefit of prosodic manipulation in pronunciation training, suggest that full accent conversion (segmental and prosodic) can be a successful form of implicit feedback in computer assisted pronunciation training.
Hart Blanton is gratefully acknowledged for his suggestions regarding the EGWA scale for the perceptual ratings. These experiments were performed in his laboratory, for which we are also “6: extremely grateful.” We would also like to acknowledge an anonymous reviewer for providing an alternative explanation of the results in Fig. 5b.
To perform multi-dimensional scaling, we first create a (100 × 100) matrix containing the average perceptual distance between any two of the 100 utterances. Shown in Fig. 7a as an image (darker colors indicate larger perceptual distances between the corresponding pair of utterances), this matrix is sparse due to the large number of utterance pairs (10,000) relative to the number of participants. To guard against outliers, we eliminate any utterance pairs that have been rated by only one participant. We use an ε-neighborhood with a radius of 7 perceptual units9 to define a local connectivity graph; the resulting local distance matrix is shown in Fig. 7b. Geodesic distances between every pair of utterances are then estimated using Dijkstra’s shortest paths algorithm (Dijkstra, 1959), which results in the fully connected distance matrix D shown in Fig. 7c.
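The graph construction and geodesic estimation described above can be sketched as follows. This is a minimal illustration on a toy 4 × 4 matrix, not the code used in the study: the function names are hypothetical, unrated pairs are assumed to be marked with `None`, and the radius of 7 is taken from the text.

```python
import heapq
import math

def epsilon_graph(dist, eps):
    """Keep only pairs rated at or below eps; None marks unrated pairs."""
    n = len(dist)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            d = dist[i][j]
            if i != j and d is not None and d <= eps:
                graph[i].append((j, d))
    return graph

def dijkstra(graph, source):
    """Shortest-path distance from source to every node of the graph."""
    best = {node: math.inf for node in graph}
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < best[v]:
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return best

def geodesic_matrix(dist, eps):
    """Fully connected matrix of graph distances (inf if unreachable)."""
    graph = epsilon_graph(dist, eps)
    n = len(dist)
    rows = [dijkstra(graph, i) for i in range(n)]
    return [[rows[i][j] for j in range(n)] for i in range(n)]

# Toy perceptual distances: pair (0, 2) and pair (1, 3) were never rated,
# and pair (0, 3) exceeds the radius, so it is routed through 1 and 2.
toy = [
    [0, 3, None, 15],
    [3, 0, 4, None],
    [None, 4, 0, 5],
    [15, None, 5, 0],
]
D = geodesic_matrix(toy, eps=7)
for row in D:
    print(row)
```

In the toy matrix, the distance between utterances 0 and 3 becomes the length of the path 0 → 1 → 2 → 3, which is how distant or unrated pairs receive an estimated distance.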
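Following Tenenbaum et al. (2000), we apply an operator τ(·) to matrix D, which converts distances into inner products:
$$\tau(D) = -\frac{HSH}{2}$$
where S is a matrix containing the squared distances found in D (i.e. $S_{ij} = D_{ij}^2$), H is the centering matrix
$$H = I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^T$$
$I_N$ is an identity matrix, $\mathbf{1}$ is a column vector of N ones, and N = 100. The ith component $y_i$ of the d-dimensional embedding (i.e. the coordinates of the N utterances on the ith dimension of the embedding) is found by
$$y_i = \sqrt{\lambda_i}\, v_i$$
where $\lambda_i$ is the ith largest eigenvalue of τ(D) and $v_i$ is the corresponding eigenvector.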
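A minimal numerical sketch of this embedding step is given below, assuming NumPy is available. It follows the classical multidimensional scaling convention of Tenenbaum et al. (2000): double-center the squared distances, then scale each of the leading eigenvectors by the square root of its eigenvalue. The function name `classical_mds` and the three-point example are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def classical_mds(D, d=2):
    """Embed N samples in d dimensions from an N x N distance matrix D."""
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    S = D ** 2                               # squared distances
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    B = -H @ S @ H / 2.0                     # tau(D): inner products
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]    # indices of the d largest
    lam = np.clip(eigvals[order], 0.0, None) # guard against tiny negatives
    # Column i holds the coordinates of all N samples on dimension i.
    return eigvecs[:, order] * np.sqrt(lam)

# Three points forming a 3-4-5 right triangle in the plane.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
Y = classical_mds(D, d=2)
D_embedded = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.round(D_embedded, 6))
```

For distances that are exactly Euclidean, as in this toy case, the embedding reproduces the input distances up to rotation and reflection; with perceptual ratings the reconstruction would only be approximate.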
Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.specom.2008.11.004.
1. On the other hand, it has been shown that displaying pitch contours improves intonation training, and that audio-visual feedback also improves prosody and segmental accuracy (Hincks, 2003).
2. Our implementation follows the recommended window length of four times the local pitch period for voiced segments or a constant 10 ms for unvoiced segments (Moulines and Charpentier, 1990).
3. In voice conversion, the choice of speakers is known to have a significant impact on the quality of the output (Turk and Arslan, 2005); we suspect that accent conversion is no different, but have not yet determined those factors that predict the success of the transformation. This is one of our immediate priorities as it must be investigated before a robust, learner-independent training tool is created.
4. The 20 distinct sentences were divided into two sets (1–10 and 11–20) to ensure that pairs were linguistically unique. Presentation was counterbalanced across sets (i.e. a sentence from the first set was not always played first).
5. All possible pairings can be expressed as a 5 × 5 matrix. To ensure that all pairs were sampled with the same frequency, diagonal elements in this matrix (i.e. same–same pairings) were sampled twice as often as off-diagonal elements, thus leading to 60 pairs (=(25 + 5) × 2 repetitions).
6. ISOMAP assumes that samples exist on an intrinsically low-dimensional surface – a manifold. The geodesic distance is defined as the Euclidean distance between samples measured over this manifold. In ISOMAP, the geodesic distance is estimated as the shortest path in a graph where nodes represent samples and edges indicate neighboring samples.
7. One could argue that the results in Fig. 5b contain three clusters, but that they are more spread out due to the increased difficulty of the task. The fact remains that, even with reversed speech, subjects are able to discriminate teacher utterances from other utterances, whereas they can hardly discriminate learner utterances from their accent-converted versions.
8. The authors also advance that “another approach to generate client-specific standards would be to generate a speech model or template that could be based on the resonant characteristics of the client’s own vocal tract,” a view that is supportive of the work presented in this manuscript.
9. Scores of 0–7 indicate pairs of utterances that participants believed to have been produced by the same speaker.