Speech perception (SP) most commonly refers to the perceptual mapping from the highly variable acoustic speech signal to a linguistic representation, whether it be phonemes, diphones, syllables, or words. This is an example of categorization, in that potentially discriminable speech sounds are assigned to functionally equivalent classes. In this tutorial, we present some of the main challenges to our understanding of the categorization of speech sounds and the conceptualization of SP that has resulted from these challenges. We focus here on issues and experiments that define open research questions relevant to phoneme categorization, arguing that SP is best understood as perceptual categorization, a position that places SP in direct contact with research from other areas of perception and cognition.
Spoken syllables may persist in the world for mere tenths of a second. Yet, as adult listeners, we are able to gather a great deal of information from these fleeting acoustic signals. We may apprehend the physical location of the speaker, the speaker's gender, regional dialect, age, emotional state, or identity. These spatial and indexical factors are conveyed by the acoustic speech signal in parallel with the linguistic message of the speaker (Abercrombie, 1967). Although these factors are of much interest in their own right, speech perception (SP) most commonly refers to the perceptual mapping from acoustic signal to some linguistic representation, such as phonemes, diphones, syllables, words, and so forth.
Most of the research in the field of SP has focused on the mapping from the acoustic speech signal to phonemes, the smallest linguistic unit that changes meaning within a particular language (e.g., /r/ and /l/ as in rake vs. lake), with the often implicit assumption that phoneme representations are a necessary step in the comprehension of spoken language. The transformation from acoustics to phonemes occurs so rapidly and automatically that it mostly escapes our notice (Näätänen & Winkler, 1999). Yet this apparent ease masks the complexity of the speech signal and the remarkable challenges inherent in phoneme perception.
As a starting point, one might presume that phoneme perception is accomplished by detecting characteristics in the acoustic signal that correspond to each phoneme or by comparing a phoneme template in memory with segments of the incoming signal. In fact, this was the presumption in the early days of SP, starting in the 1940s (see Liberman, 1996), and it led to the hope that machine speech recognition was on the horizon. However, it became clear rather quickly that SP was not a simple detection or match-to-pattern task (Liberman, Delattre, & Cooper, 1952). Although there has been a wealth of studies documenting the acoustic “cues” that can signal the identity of different phonemes (see Stevens, 2000, for a review), there is significant variability in the relationship of these cues to the intended phonemes of a speaker and the perceived phonemes of a listener. The variability is due to a multitude of sources, including differences in speaker anatomy and physiology (Fant, 1966), differences in speaking rate (Gay, 1978; Miller & Baer, 1983), effects of the surrounding phonetic context (Kent & Minifie, 1977; Öhman, 1966), and effects of the acoustic environment such as noise or reverberation (Houtgast & Steeneken, 1973). The end result of all of these sources of variability is that there appear to be few or no invariant acoustic cues to phoneme identity (Cooper, Delattre, Liberman, Borst, & Gerstman, 1952; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; but see Blumstein & Stevens, 1981, for a possible exception). This means that listeners cannot accomplish SP by simply detecting the presence or absence of cues.
In place of a simple match-to-sample or detection approach, SP is now often conceived of as a complex categorization task accomplished within a highly multidimensional space. One can conceptualize a segment of the speech signal as a point in this space representing values across multiple acoustic dimensions. In most cases, the dimensions of this space are continuous acoustic variables such as fundamental frequency, formant frequency, formant transition duration, and so forth. That is, speech stimuli are represented by continuous values, as opposed to binary values of the presence or absence of some feature. SP is the process that maps from this space onto representations of phonemes or linguistic features that subsequently define the phoneme (Jakobson, Fant, & Halle, 1952). This is an example of categorization, in that potentially discriminable sounds are assigned to functionally equivalent classes (Massaro, 1987).
An early example of such an acoustic space representation for phoneme classes is present in Peterson and Barney (1952), where vowel productions by adult males and females and children were displayed in terms of first and second formant (F1 and F2) frequencies. This simple distribution map demonstrates that exemplars of particular phonemes tend to cluster together in acoustic space (e.g., instances of the vowel /i/ as in heat tend to have low F1s and high F2s), but there is a tremendous amount of overlap among the distributions of different vowels owing to variability in speech productions (see also Hillenbrand, Getty, Clark, & Wheeler, 1995, for an update on these vowel measures, and Lisker & Abramson, 1964, for overlap in consonant voicing distributions). Presumably, listeners have to determine boundaries in order to parse these acoustic spaces and perceive the intended phonemes despite acoustic variability. Whereas there are a few auditory perceptual discontinuities that may aid in parsing acoustic space into categories in some cases (Holt, Lotto, & Diehl, 2004; Pisoni, 1977; Steinschneider et al., 2005), for the vast majority of cases listeners must determine the boundaries among phoneme categories on the basis of their experience with the language.
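The clustering and boundary-drawing described above can be illustrated with a minimal sketch. The category centers below are round, hypothetical values chosen for illustration, not measurements from Peterson and Barney (1952), and the nearest-center rule is only one simple, assumed way of parsing an F1–F2 space.

```python
import math

# Hypothetical category centers in (F1, F2) space, in Hz.
# Round illustrative values, not measured data.
VOWEL_CENTERS = {
    "i": (300.0, 2300.0),  # as in "heat": low F1, high F2
    "a": (750.0, 1100.0),  # high F1, low F2
    "u": (320.0, 900.0),   # low F1, low F2
}

def categorize(f1, f2, centers=VOWEL_CENTERS):
    """Assign a token to the category whose center is nearest (Euclidean, Hz)."""
    return min(centers, key=lambda v: math.dist((f1, f2), centers[v]))

# Hypothetical tokens scattered around the centers.
tokens = [(280.0, 2250.0), (350.0, 2400.0), (700.0, 1200.0), (330.0, 1000.0)]
labels = [categorize(f1, f2) for f1, f2 in tokens]
```

Acoustically different tokens receive the same label here, which is the sense in which discriminable sounds are assigned to functionally equivalent classes. With overlapping distributions, any fixed boundary of this kind would misassign some tokens, which is why the boundaries must be tuned by experience.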
Unfortunately, even a perceptual categorization approach to SP does not provide easy answers to many of the questions regarding phoneme perception. In this tutorial, we present some of the main challenges to our understanding of the categorization of speech sounds, as well as the development of our conceptualization of SP that has resulted from these challenges. Because it is not possible to exhaustively review 60+ years of research and theory here, we focus on issues and experiments that define open research questions.
A major problem of mapping from multidimensional acoustic distributions to phonemes is that some of the variability in the acoustic input space is relevant to the linguistic message, some of the variability is related to characteristics of the speaker, and some of the variability is noise. To further complicate things, variation on any particular acoustic dimension could be the result of any of these sources, depending on the context. The pitch (fundamental frequency, f0) of the vowel in the utterance /ba/, for example, may be linguistically insignificant as it varies with the sex and age of the speaker (Klatt & Klatt, 1990), but relative pitch does serve as a linguistically reliable cue to /ba/ versus /pa/, with /pa/ having a higher pitch relative to /ba/ (House & Fairbanks, 1953).
Voice pitch is one of as many as 16 cues that can distinguish /ba/ from /pa/ (Lisker, 1986). Whereas any of these multiple cues may be informative for speech categorization, the perceptual effectiveness of each cue varies. For example, when categorizing consonants such as /b/, /d/, and /g/, American English listeners make greater use of differences in formant transitions as opposed to frequency information in the noise burst that precedes the transitions even though both cues reliably covary with the consonants (Francis, Baldwin, & Nusbaum, 2000). Of significance, listeners' relative reliance on particular acoustic cues changes across development (see, e.g., Nittrouer, 2004) and varies depending on the listener's native language (e.g., Iverson et al., 2003). Thus, establishing the mapping from an acoustic input space to a perceptual space is a developmental process that depends on language experience.
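One way to picture differences in the perceptual effectiveness of cues is as weights in a combination rule. The sketch below is an assumption made for illustration, not a model taken from the studies cited above. It combines a heavily weighted primary cue (voice onset time is used as the assumed example) with a lightly weighted secondary cue (relative f0) in a logistic function. The weights, the standardized cue values, and the function name are all hypothetical.

```python
import math

def p_voiceless(vot_z, f0_z, w_vot=4.0, w_f0=1.0, bias=0.0):
    """Probability of a /pa/ response from two standardized cue values.

    The weights are hypothetical; a larger weight means that the listener
    relies more heavily on that cue.
    """
    evidence = w_vot * vot_z + w_f0 * f0_z + bias
    return 1.0 / (1.0 + math.exp(-evidence))

# Clear primary cue: the secondary cue changes the response very little.
clear_low_f0 = p_voiceless(vot_z=1.0, f0_z=-1.0)
clear_high_f0 = p_voiceless(vot_z=1.0, f0_z=1.0)

# Ambiguous primary cue: the secondary cue decides the response.
ambiguous_low_f0 = p_voiceless(vot_z=0.0, f0_z=-1.0)
ambiguous_high_f0 = p_voiceless(vot_z=0.0, f0_z=1.0)

# A listener who gives f0 no weight is unaffected by it.
no_f0_weight = p_voiceless(vot_z=0.0, f0_z=1.0, w_f0=0.0)
```

Under this view, developmental and cross-language differences in cue use correspond to different weight settings applied to the same acoustic input.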
For several months after birth, normal-hearing infants appear to parse the speech input space in the same manner (see Kuhl, 2004, and Werker & Tees, 1999, for reviews). No matter the linguistic environment in which they are developing, the basic characteristics of the human auditory system's response to speech signals dictate perception. Since speech sounds must be discriminably different enough from one another to reliably convey meaning, languages have evolved inventories of speech sounds that exploit basic human auditory function (Diehl & Lindblom, 2004; Lindblom, 1986). Thus, young infants tend to discriminate nearly any speech distinction they are presented (Kuhl, 2004). However, by the first birthday, experience with the regularities of the native language restructures the perceptual space to which speech input maps (Werker & Tees, 1984). By this time, infants developing in English-speaking environments perceive the same sounds differently, for example, than do infants developing in Swedish-speaking environments (Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). Infants appear to have parsed the perceptual space, finding regularity relevant to the native language amid considerable acoustic variability across other dimensions.
These changes have been described as a “warping” of the perceptual space (Kuhl et al., 2008). If we imagine perceptual space as a multidimensional topography, the perceptual landscape can be described as relatively flat in early infancy, with any discontinuities arising from discontinuities in human auditory processing. With experience with the native language environment, the perceptual space is warped to reflect the regularities of the native speech input space (Kuhl, 2000; Spivey, 2007), and infants begin to perceive speech relative to the characteristics of the native language rather than solely according to psycho-acoustic properties. The groundwork for reorganizing the perceptual space according to the regularities of the native language input thus begins in infancy (see Kuhl, 2000), although development of speech categories continues through childhood (see Walley, 2005). Although the development of speech categories is now widely documented, research is just beginning to uncover the learning mechanisms that guide this experience-dependent process.
A natural question that arises is, How does the initial categorization parsing based on one's native language affect the ability to learn a second language? A popular example of this issue comes from comparing English, which distinguishes /r/ from /l/, and Japanese, which does not use /r/ and /l/ to distinguish meaning and instead possesses a single lateral flap (Ladefoged & Maddieson, 1996), which overlaps with /r/ and /l/ in an acoustic space defined by the onset frequencies of the second (F2) and third (F3) formants (Lotto, Sato, & Diehl, 2004). Thus, English listeners must parse the perceptual space to best capture the linguistically relevant acoustic variability distinguishing /r/ from /l/, whereas Japanese listeners need not parse the space in quite the same manner, because variability in this region of the perceptual space is not relevant to Japanese (Best & Strange, 1992). Once the perceptual system commits to a parse of the perceptual space, there are long-term consequences for SP; the experience that we have with the sounds of our native language fundamentally shapes how we hear speech. Specifically, between-category sensitivity (e.g., an English listener distinguishing the consonants of rock and lock) is preserved, whereas within-category sensitivity (distinguishing two acoustically different instances of rock) is attenuated (Kuhl et al., 1992; Werker, 1994). This surely benefits our ability to communicate in a native language, but it has consequences for adults' perception and acquisition of nonnative speech categories.
An example of this is the difficulty native Japanese listeners have in perceiving English /r/ versus /l/ (Goto, 1971; Miyawaki et al., 1975). Although Japanese adults can improve their English /r/–/l/ perception and production (e.g., Bradlow, Nygaard, & Pisoni, 1999; Bradlow & Pisoni, 1999; Logan, Lively, & Pisoni, 1991; McCandliss, Fiez, Protopapas, Conway, & McClelland, 2002), it may take decades of English experience for native Japanese listeners to approach native levels of perceptual performance with English /r/–/l/ (Flege, Yeni-Komshian, & Liu, 1999), and even then, there are large individual differences in achievement (see, e.g., Slevc & Miyake, 2006). Native Japanese listeners' perceptual space has been tuned for the regularities of Japanese, and this organization is not entirely compatible with the speech input space of English.
The phenomenon of difficulty in perceiving nonnative speech categories demonstrates that speech is perceived through the lens of native language categories. Indeed, electrophysiological evidence suggests that the influence of categorization on SP is evident at very early stages of stimulus processing (e.g., Näätänen et al., 1997; Sharma & Dorman, 2000; Winkler et al., 1999; Zhang, Kuhl, Imada, Kotani, & Pruitt, 2001). The difficulties are greatest for nonnative sounds similar to native categories (Best, 1994; Flege, 1995; Harnsberger, 2001), suggesting that the warping of the perceptual space by the first language especially influences SP of acoustically similar nonnative sounds. Although the difficulties appear to be related to the age of category acquisition (Lenneberg, 1967), with adults having greater perceptual difficulty than younger listeners, much evidence suggests that this is related more to the length and degree of immersion in the second language environment than to maturation (e.g., Flege, 1995; Flege et al., 1999). Moreover, the perceptual changes introduced by parsing the perceptual space seem not to involve a loss of auditory sensitivity, since with sensitive measures adults can demonstrate an ability to distinguish difficult nonnative speech categories (Werker & Tees, 1984).
It is important to distinguish the description of SP as categorization from the notion that SP is categorical. Opening almost any perception or cognition textbook to the section on speech, one is likely to find an illustration displaying perhaps the best-known pattern of SP outside the field, categorical perception (CP; see Wolfe et al., 2008). In a typical CP experiment, a series of speech sounds varying in equal physical steps along some acoustic dimension is presented to listeners, whose task is to classify them as two or more phonemes. Typically, the proportion of each category response does not vary gradually with the change in acoustic parameters. Instead, there is an abrupt shift from consistent labeling of the stimuli as one phoneme to consistent labeling as a competing phoneme across a small change in the acoustics. This is one of three hallmarks of the phenomenon of CP. A second defining characteristic of CP is the pattern of discrimination across the acoustic speech series. When listeners discriminate pairs of stimuli along the series, the resulting function is discontinuous. Discrimination is nearly perfect for stimuli that lie on opposite sides of the sharp identification/categorization boundary, whereas discrimination is very poor for pairs of stimuli that are equally acoustically distinct but lie on the same side of the identification/categorization boundary. The final characteristic of CP is that identification/categorization performance predicts discrimination performance; speech sounds that are given the same label (e.g., “ba”) are difficult to discriminate, whereas those given different labels are discriminated with high accuracy (see Harnad, 1987; Studdert-Kennedy, Liberman, Harris, & Cooper, 1970).
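The three hallmarks can be made concrete with a small sketch. It assumes a logistic identification function over a hypothetical nine-step series, with an arbitrary boundary location and slope. It also uses the label-only prediction of ABX discrimination commonly associated with the early CP literature, P(correct) = 0.5 + 0.5(p_i − p_j)², under which listeners are assumed to discriminate two stimuli only to the extent that they label them differently. The function names and parameter values are illustrative assumptions.

```python
import math

def p_label_a(step, boundary=5.0, slope=2.5):
    """Logistic identification function: P(label A) at a given series step."""
    return 1.0 / (1.0 + math.exp(slope * (step - boundary)))

def predicted_abx(step_i, step_j):
    """Label-only prediction of ABX accuracy for a pair of stimuli."""
    diff = p_label_a(step_i) - p_label_a(step_j)
    return 0.5 + 0.5 * diff ** 2

# Identification across the nine-step series.
identification = [p_label_a(s) for s in range(1, 10)]

# Predicted discrimination for all two-step pairs: (1, 3), (2, 4), ..., (7, 9).
discrimination = {(s, s + 2): predicted_abx(s, s + 2) for s in range(1, 8)}
best_pair = max(discrimination, key=discrimination.get)
```

In this sketch, predicted accuracy stays near chance for pairs drawn from the same side of the boundary and peaks for the pair that straddles it, even though every pair is separated by the same number of physical steps.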
CP was formerly thought to be a peculiarity of SP (Liberman, 1957; Liberman, Harris, Hoffman, & Griffith, 1957) and was among several perceptual phenomena that have had great impact on speech theories. Its interpretation served to ignite debates over the objects of SP and the mechanisms that support their processing (see Diehl, Lotto, & Holt, 2004, for a review). However, CP has since been observed for perception of human faces (Beale & Keil, 1995) and facial expressions (Bimler & Kirkland, 2001), music intervals (see Krumhansl, 1991, for a review), and artificial stimuli that participants learn to categorize in laboratory tasks (Livingston, Andrews, & Harnad, 1998). It is observed in the behavior of nonhuman animals as well (see Kluender, Lotto, & Holt, 2005, for a review). Moreover, the prototypical pattern of CP is not observed for all speech sounds. Its patterns are much weaker for vowels than for stop consonants like /b/ and /p/, for example (Pisoni, 1973), and sensitive methods for measuring discrimination or discrimination training can cause the peaks in discrimination at the boundaries to disappear even for consonants (Carney, Widin, & Viemeister, 1977; Samuel, 1977). Rather than a speech-specific phenomenon, CP is a far more general characteristic of how perceptual systems respond to experience with regularities in the environment (Damper & Harnad, 2000) and, perhaps, of how time-varying signals are accommodated in perceptual memory (Mirman, Holt, & McClelland, 2004). Thus, the theoretical implications associated with CP (such as the proposition that it is a speech-specific phenomenon or that it is a qualitatively different sort of perceptual process) have not withstood empirical scrutiny.
However, although much of the controversy about the interpretation of CP has settled, CP has left an indelible mark on thinking about SP (perhaps especially among those outside the immediate field of SP). The sharp identification functions of CP are characterized by their steep boundary, but also by the relative flatness of the function within categories giving the appearance that, within a speech category, tokens are equivalent and that their acoustic variability is uninformative to the perceptual system. The classic CP pattern of responses suggests that the mapping from acoustics to speech label is discrete, such that acoustically variable instances of /ba/, for example, are mapped to “ba” irrespective of the acoustic nuances of a particular /ba/, its speaker, or its context.
Relatedly, one of the ways in which CP has left its mark is that descriptions of SP tend to describe speech identification instead of speech categorization. On the face of it, this seems a small difference, especially since these terms are often used interchangeably in the SP literature. However, identification (at least as it is used in other categorization literatures) is a decision about an object's unique identity that requires discrimination between similar objects. Categorization, on the other hand, reflects a decision about an object's type or kind requiring generalization across the perceptually discriminable physical variability of a class of objects (Palmeri & Gauthier, 2004). Whereas CP, with its suggested insensitivity to intracategory variability, is consistent with identification, there is much evidence that the facts of SP are better captured by categorization.
For example, when one exploits measures more continuous than the binary responses typical of CP tasks (e.g., was that sound /ba/ or /da/?), listeners' behavior suggests the rich internal structure of speech categories. Listeners rate some exemplars as “better” instances of a speech category than others (e.g., Iverson & Kuhl, 1995; Kuhl, 1991; Volaitis & Miller, 1992). Eyetracking paradigms further reveal that fine-grained acoustic details of an utterance affect its categorization (e.g., McMurray, Aslin, Tanenhaus, Spivey, & Subik, 2008; McMurray, Tanenhaus, & Aslin, 2002). It seems that the appearance of phonetic homogeneity in CP is largely a result of the binary response labels of CP identification tasks (Lotto & Holt, 2000). Furthermore, SP is affected by the familiarity of the voice that utters a token (Nygaard & Pisoni, 1998), suggesting that fine-grained acoustic details are retained in addition to phonemic labels. This more detailed information persists to influence word-level knowledge (Hawkins, 2003; McMurray et al., 2002) and memory (Goldinger, 1996, 1998). It appears that SP is not completely based on discrete, arbitrary labels such as phonemes (Lotto & Holt, 2000). Therefore, it is likely to be more productive to consider the mapping from the multidimensional input space to a perceptual space that has been studied by SP research as categorization rather than as categorical.
If SP is really a case of perceptual categorization, then our understanding of speech communication could benefit from what we know about general categorization processes. In fact, many of the models that have been successful for visual categorization have been applied to speech sound categorization, including classic prototype (Samuel, 1982), decision bound (Maddox, Molis, & Diehl, 2002; Nearey, 1990), and exemplar (Johnson, 1997) models. However, although perceptual categorization has long been studied in the cognitive sciences (see, e.g., Cohen & Lefebvre, 2005, for a review), the categorization challenges presented by speech signals are somewhat different from those for the visual categories that are more often studied: The speech input space is composed of mostly continuous acoustic dimensions that must be parsed into categories; there is typically no single cue that is necessary or sufficient for defining category membership; speech category exemplars are inherently temporal in nature, thereby limiting side-by-side comparisons; and information for speech categories is spread across time, thus creating segmentation issues. The evidence that exists suggests that these differences matter in understanding SP (Mirman et al., 2004). Unfortunately, the literature available to guide our understanding of the processes, abilities, and constraints of general auditory categorization is quite limited (but see Goudbeek, Smits, Swingley, & Cutler, 2005; Goudbeek, Swingley, & Smits, 2009; Guenther, Husain, Cohen, & Shinn-Cunningham, 1999; Holt & Lotto, 2006; Holt et al., 2004; Mirman et al., 2004; Wade & Holt, 2005a). Further research in auditory cognition will be needed in order to discover how auditory categorization and learning, in general, advance and limit SP (see Holt & Lotto, 2008).
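The three model classes named above differ in what they assume is stored about a category. The sketch below gives generic textbook forms of each on a single acoustic dimension. These are illustrative assumptions, not the specific models of Samuel (1982), Nearey (1990), Maddox et al. (2002), or Johnson (1997), and the training tokens, sensitivity parameter, and boundary value are hypothetical.

```python
import math

# Hypothetical labeled tokens on one acoustic dimension (arbitrary units).
# Category A includes one atypical token at 6.0.
EXEMPLARS = {
    "A": [1.0, 1.5, 2.0, 2.5, 6.0],
    "B": [5.0, 5.5, 6.5, 7.0, 7.5],
}

def prototype_model(x):
    """Choose the category whose mean (prototype) is closest to x."""
    means = {c: sum(v) / len(v) for c, v in EXEMPLARS.items()}
    return min(means, key=lambda c: abs(x - means[c]))

def exemplar_model(x, sensitivity=4.0):
    """Choose the category with the greatest summed similarity to x."""
    def summed_similarity(c):
        return sum(math.exp(-sensitivity * abs(x - e)) for e in EXEMPLARS[c])
    return max(EXEMPLARS, key=summed_similarity)

def decision_bound_model(x, bound=4.5):
    """Respond according to the side of a fixed boundary on which x falls."""
    return "A" if x < bound else "B"

# A typical token and a token identical to the atypical A exemplar.
typical = [m(2.0) for m in (prototype_model, exemplar_model, decision_bound_model)]
atypical = [m(6.0) for m in (prototype_model, exemplar_model, decision_bound_model)]
```

The models agree on the typical token but can diverge on the atypical one, because only the exemplar model retains the individual stored instance. Differences of this kind are what make the model classes empirically distinguishable.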
The preceding description of SP as perceptual categorization illustrates some of the complexities in mapping from acoustics to phonemes. The reader may at this point find these complexities to be challenging but not particularly daunting. However, there is an additional level of complexity to phoneme categorization that has kept researchers busy for 60+ years. The problem was summed up well years ago by Repp and Liberman (1987) when they said that “phonetic categories are flexible” (p. 90). That is, phonetic categorization is extremely context sensitive.
One way in which context influences SP is that how speech sounds are labeled changes as a function of both the overall makeup of the stimulus set and the surrounding phonetic context. Even in classic CP tasks, the range of stimulus exemplars presented during the CP task influences the observed position of the category boundary along the stimulus series (Brady & Darwin, 1978; Rosen, 1979). The presence of comparison categories available in a task (/r/ and /l/ vs. /r/ and /l/ and /w/, for example) also influences the mapping to speech categories (Ingvalson, 2008). Thus, identical signals may be categorized as different speech sounds, depending on the characteristics of the other signals in the set in which they appear.
Adjacent phonetic context also strongly influences how a particular acoustic speech signal is categorized. For example, a syllable may be perceived as a /ga/ when preceded by the syllable /al/, but as a /da/ when preceded by /ar/ (Mann, 1980). Context dependence in SP is even observed “backward” in time, such that sounds that follow a target speech sound may influence how listeners categorize the target (e.g., Mann & Repp, 1980). The rate of speech (Miller & Liberman, 1979; Summerfield, 1981) or the acoustic characteristics of the voice that produces a preceding sentence also influence how speech is categorized. Ladefoged and Broadbent (1957) demonstrated that they could shift a perceived target word from “bit” to “bet” by changing the acoustics of a preceding carrier phrase (e.g., raising or lowering the F1 frequencies in the phrase “Please say what this word is”). Even nonspeech contexts that mimic spectral or temporal characteristics of speech signals, but are not perceived as speech, influence speech categorization (e.g., Holt, 2005; Lotto & Kluender, 1998; Wade & Holt, 2005b). The fact that nonspeech signals shift the mapping from speech acoustics to perceptual space demonstrates that general auditory processes are involved in relating speech signals and their contexts. Effects of context also occur at multiple levels. SP can be shifted by phonotactic (Pitt & McQueen, 1998; Samuel & Pitt, 2003), lexical (Magnuson, McMurray, Tanenhaus, & Aslin, 2003; McClelland & Elman, 1986), and semantic (Borsky, Tuller, & Shapiro, 1998; Connine, 1987) context, indicating the possibility of an influence of feedback from higher level representations onto speech categorization (see McClelland, Mirman, & Holt, 2006, and Norris, McQueen, & Cutler, 2000, for reviews and debate).
So what are the cues that allow listeners to reliably map from speech input to perception of native language categories? This is a difficult question to answer, because, as described above, the “cues” for SP change radically with task and context. This fact has long been acknowledged in the literature and studied as, for example, trading relations, in which specific acoustic cues “trade” off one another to be more or less dominant in signaling particular speech categories (e.g., Oden & Massaro, 1978; Repp, 1982). However, our attempts to relate a set of cues as the definitive signals of speech categories ultimately may be misplaced, precisely because of the inherent flexibility of SP. Listeners have exquisite sensitivity to the regularity present in acoustic signals, including speech, and appear to dynamically adjust perception to characteristics of this regularity. Moreover, the nature of this regularity appears to be task dependent; the same speech stimulus set is perceived quite differently as the task varies. This suggests that the “cues” of speech categorization, to some extent, are determined online.
Perhaps the most convincing demonstrations of the flexibility of SP come from studies demonstrating that listeners can maintain veridical perception in the face of radical distortions of the speech signal. The upshot of this work is that there do not appear to be acoustic dimensions or features that are absolutely necessary for SP. Listeners can understand a signal of three sine waves following the center frequencies of the first three formants in so-called sine-wave speech, despite the loss of the harmonic structure and fine-grained acoustic detail (Remez, Rubin, Pisoni, & Carrell, 1981). In this case, the spectral envelope defined by the formant frequencies and the temporal envelope defined by the changes in the overall amplitude of the signal across time are maintained. However, listeners can also maintain veridical SP when the spectral envelope and harmonic structure are distorted, as in the case of noise-vocoded speech (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005; Hervais-Adelman, Davis, Johnsrude, & Carlyon, 2008; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). This distortion involves dividing the signal into a small number of frequency bands and replacing acoustic information in those bands with noise that maintains the slow amplitude changes (typically less than 50 Hz) of the frequency band.
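The envelope-replacement step of noise vocoding can be sketched for a single frequency band. The sketch makes several simplifying assumptions that should not be read as the procedure of the studies cited above: the band-splitting filters are omitted (the input is taken to be one already-filtered band), the slow amplitude envelope is estimated by rectification followed by a moving average rather than by a proper low-pass filter, and the noise carrier is not refiltered into the band. The function names, sample rate, and test signal are hypothetical.

```python
import math
import random

def amplitude_envelope(samples, window):
    """Rectify the signal, then smooth it with a centered moving average."""
    rectified = [abs(s) for s in samples]
    half = window // 2
    envelope = []
    for i in range(len(rectified)):
        segment = rectified[max(0, i - half): i + half + 1]
        envelope.append(sum(segment) / len(segment))
    return envelope

def envelope_on_noise(band_samples, window, seed=0):
    """Replace a band's fine structure with noise that keeps its envelope."""
    rng = random.Random(seed)  # seeded so that the sketch is repeatable
    envelope = amplitude_envelope(band_samples, window)
    return [e * rng.uniform(-1.0, 1.0) for e in envelope]

SAMPLE_RATE = 8000  # Hz, assumed for illustration

# Hypothetical band signal: 50 msec of silence, then 50 msec of a 1-kHz tone.
band = [0.0] * 400 + [
    math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(400)
]

# A window of roughly 20 msec smooths away fast amplitude fluctuations.
vocoded_band = envelope_on_noise(band, window=160)
```

The output carries no trace of the 1-kHz periodicity of the input, yet it is silent where the band was silent and energetic where the band was energetic. A full vocoder would repeat this step for each band and sum the results.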
Noise-vocoded speech is similar, in some aspects, to the signal presented to listeners with cochlear implants, particularly in its destruction of frequency resolution and harmonic detail. The amazing perceptual performance of some listeners with cochlear implants is one of the most remarkable demonstrations of SP flexibility. Despite the major differences in the signal conveyed by a cochlear implant versus ordinary auditory processing, some implanted listeners achieve normal-level SP for sounds presented in quiet (e.g., Wilson & Dorman, 2007). With some training, normal-hearing listeners can also achieve reasonably good SP performance with severely time-compressed (Dupoux & Green, 1997; Pallier, Sebastián-Gallés, Dupoux, Christophe, & Mehler, 1998), spectrally shifted (Fu & Galvin, 2003), or highly synthetic (Greenspan, Nusbaum, & Pisoni, 1988) speech signals. One can even divide the signal into 50-msec chunks, reverse each of these chunks in time (so that the chunks maintain their order, but are each reversals of original chunks), and maintain nearly 100% intelligibility (Saberi & Perrott, 1999). We can maintain normal conversations on phones with bandwidths between 300 and 3000 Hz, which might suggest that all of the important information in speech is in this frequency band. Yet listeners can achieve nearly 90% correct categorization performance for consonants when the signal is filtered to contain information only below 800 Hz and above 4000 Hz (Lippmann, 1996).
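The chunk-reversal manipulation described above is simple enough to state as a short sketch. The function name and the toy signal are hypothetical, and the code operates on a plain list of samples rather than on any particular audio format.

```python
def locally_time_reverse(samples, sample_rate, chunk_ms=50):
    """Reverse each consecutive chunk in time while keeping chunk order."""
    chunk = max(1, int(sample_rate * chunk_ms / 1000))
    out = []
    for start in range(0, len(samples), chunk):
        out.extend(reversed(samples[start:start + chunk]))
    return out

# Toy signal: 10 samples at 1000 Hz with 4-msec chunks (4 samples per chunk).
toy = list(range(10))
reversed_toy = locally_time_reverse(toy, sample_rate=1000, chunk_ms=4)

# Applying the manipulation a second time restores the original signal.
restored = locally_time_reverse(reversed_toy, sample_rate=1000, chunk_ms=4)
```

The chunks stay in their original order, so slow, syllable-scale amplitude changes are largely preserved even though the fine temporal structure within each chunk is inverted.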
What does this mean for SP? It is common in the literature to see a constellation of acoustic cues associated with a speech category. This makes sense in many cases, because the task is constant, acoustics are relatively unambiguous, context is neutral, and perception is consistent. However, given the flexibility of SP detailed above, it is clear that we cannot hope to provide a definitive a priori description of the acoustic cues and dimensions that will be mapped to particular phonemes. A major challenge for SP researchers is to determine what kinds of processes allow listeners to maintain consistent perceptual performance in the face of varying acoustics and listening conditions.
Most models of language presume a mapping from acoustics to phoneme, with phonemes mapping to higher level language representations such as words (e.g., McClelland & Elman, 1986; Norris et al., 2000). However, it is worth keeping in mind that the evidence for speech categorization as a necessary stage of processing in everyday speech communication is not especially strong. For example, Broca's aphasia (which is produced by diffuse damage to the left frontal regions of the brain causing severe motor speech deficits while leaving speech recognition intact; Goodglass, Kaplan, & Barresi, 2001) may leave listeners impaired on SP tasks like classic syllable identification and discrimination CP (Blumstein, 1995), but this deficit doubly dissociates from impairments on speech recognition (e.g., comprehending words; Miceli, Gainotti, Caltagirone, & Masullo, 1980). Thus, the kinds of tasks that require listeners to make explicit use of phonetic information may tap differentially into processes such as attention, executive processing, or working memory in comparison with ordinary speech communication (see Hickok & Poeppel, 2007).
Spoken language possesses information and regularity at multiple levels. A single utterance of cupcake, for example, conveys indexical characteristics of the speaker's gender, whether she is familiar to the listener, her emotion, and her sociolinguistic background. It conveys information for the phonetic categories /kʌpkek/. Moreover, we recognize it as a real English word and link it to our semantic knowledge of cupcakes. This brief acoustic signal conveys much potential information.
It is important to remember, however, that the tasks we use to study SP differentially tap into this information. The kinds of identification and discrimination tasks that create canonical CP data highlight phonetic-level processing in identifying and differentiating /kʌ/ versus /gʌ/, whereas a lexical decision task highlights word-level knowledge of “cupcake.” Moreover, listeners make greater use of fine phonetic detail when nonwords outnumber words in a stimulus set, but lexical influences predominate when the task is biased toward word recognition with a greater proportion of words (Mirman, McClelland, Holt, & Magnuson, 2008). In SP research, the kinds of tasks and stimulus sets that we present shape the perceptual processing that we observe.
Everyday speech perception “in the wild” is likely to tap into a broader set of processes than those captured in individual laboratory tasks. This is not to suggest that adult (or even infant or animal) listeners cannot categorize speech; there is abundant evidence that they can. Rather, these data suggest that the cognitive and perceptual processes involved in speech categorization and those involved in online perception of fluent speech may not be one and the same. Although this possibility is not always acknowledged in SP research, it is significant to our ultimate understanding of how SP relates to spoken language more generally.
At first blush, the caveat above would seem to diminish the importance of studying and understanding speech categorization. On the contrary, however, the 60+ year history of SP research and its documentation of the multidimensional acoustic cues that covary with speech categories have provided what might be an unparalleled understanding of a natural, complex, ecologically valid perceptual categorization space (Kluender, 1994). Even the perceptual dimensions of faces—another prominent ecologically relevant perceptual category space—have not been studied in this detail. What is more, categorization within the highly multidimensional “speech space” (to compare to the “face space” considered in visual face categorization; Valentine, 1991) is completely dependent on experience with a native language. Perhaps no other domain is so rich in its potential for understanding perceptual categorization.
There remains much to learn. Beyond informing our understanding of perceptual categorization and auditory processing, generally speaking, SP extends to many core areas of cognitive science. As categorization, SP offers a platform from which to investigate development (Kuhl, 2004), learning (Holt, Lotto, & Kluender, 1998), adult plasticity (McClelland, 2001), and the prospect of critical periods in human learning (Flege, 1995). The multiple sources of information that covary with the acoustic speech signal provide an opportunity for understanding cross-modal integration (Massaro, 1998) and the role of feedback in language processing (McClelland et al., 2006). Classic issues of cognitive science such as working memory (Frankish, 2008), attention (Francis & Nusbaum, 2002), and the interplay of production and perception (Galantucci, Fowler, & Turvey, 2006) are all pieces of the puzzle in understanding SP. Moreover, the special status of speech as a human communication signal provides an opportunity for even further significant extensions. Research is just beginning to uncover how social cues support speech category acquisition (Kuhl, 2007) and how personality variables may predict the degree to which information in the speech signal is integrated (Stewart & Ota, 2008).
Studying SP informs us also about the general characteristics of auditory perception and cognition. Our understanding of auditory processing has come largely from studies of simple sounds such as tones, clicks, and noise bursts. By contrast, speech is much more like the complex sounds that our auditory systems have evolved to process (Lewicki, 2002; Smith & Lewicki, 2006). As such, it is perhaps even better situated to reveal the nature of relatively poorly understood (at least in comparison with vision) processes of auditory perception and cognition. Already, studying speech categorization has provided information about the kinds of processing that the auditory system must accomplish (e.g., Holt, 2005). SP, with its complex, multidimensional input space and experience-dependent perceptual space, can reveal characteristics of general auditory processing that are just not apparent with simple acoustic stimuli.
SP is traditionally studied as the mapping from acoustics to phonemes. We have argued here that this process is best understood as one of perceptual categorization, a position that places SP in direct contact with research from other areas of perception and cognition. The study of SP has long been relegated to the periphery of cognitive science as a “special” perceptual system that can tell us little about general issues of human behavior. The latest research in SP guides us away from this classic way of thinking and toward three considerations: categorization rather than identification, the regularity that exists amid variable speech acoustics as a source of rich information, and the online adaptive nature of speech categorization. These issues place SP in a central position in the cognitive and perceptual sciences.
The authors were supported by collaborative awards from the National Science Foundation (BCS0746067) and the National Institutes of Health (R01DC004674).
1. Speech is not conveyed solely by sound. SP research has studied the influence of other important sources of information, especially visual information from the face (for a review, see Colin & Radeau, 2003). Some have argued that SP is best considered amodal (Rosenblum, 2005), whereas others have fruitfully used speech as a means of investigating multimodal integration from separate sources of information (Massaro, 1998). Nonetheless, SP is possible when only acoustic information is present (e.g., over a telephone), and since the majority of SP research has focused on the acoustic mapping, we highlight it in this review.