Speech and language are considered uniquely human abilities: animals have communication systems, but they do not match human linguistic skills in terms of recursive structure and combinatorial power. Yet, in evolution, spoken language must have emerged from neural mechanisms at least partially available in animals. In this paper, we will demonstrate how our understanding of speech perception, one important facet of language, has profited from findings and theory in nonhuman primate studies. Chief among these are physiological and anatomical studies showing that primate auditory cortex, across species, shows patterns of hierarchical structure, topographic mapping and streams of functional processing. We will identify roles for different cortical areas in the perceptual processing of speech and review functional imaging work in humans that bears on our understanding of how the brain decodes and monitors speech. A new model connects structures in the temporal, frontal and parietal lobes linking speech perception and production.
Our understanding of speech processing has both benefited and suffered from developments in neuroscience. The basic brain areas important for speech perception and production were established in the nineteenth century, and although our conception of their exact anatomy and function has changed substantially, some of the findings of Broca1 and Wernicke2 still stand (Supplementary Discussion 1 and Supplementary Fig. 1 online). What has lagged behind is a good model of how the brain decodes spoken language and how speech perception and speech production are linked. For example, the frameworks for cortical processes and pathways have taken longer to form in audition than in vision, and animal models of language have severe limitations3. The evolution of speech and language is likely to have depended on neural systems available in other primate brains. In this paper, we will demonstrate how our understanding of speech perception, one important facet of language, has profited from work in nonhuman primate studies.
A decade ago, it was suggested that auditory cortical processing pathways are organized dually, similar to those in the visual cortex (Fig. 1)4,5: that one main pathway projects from each of the primary sensory areas into posterior parietal cortex, another pathway into anterior temporal cortex. As in the visual system6, the posterior parietal pathway was hypothesized to subserve spatial processing in audition while the temporal pathway subserved the identification of complex patterns or objects. Per the directions of their projections in the auditory system, these pathways were referred to as the postero-dorsal and antero-ventral streams, respectively.
Anatomical tract tracing studies in monkeys support separate anterior and posterior projection streams in auditory cortex7,8. The long-range connections from the surrounding belt areas project from anterior belt directly to ventrolateral prefrontal cortex (PFC) and from the caudal (posterior) belt to dorsolateral PFC9. This latter finding provided evidence, on both anatomical and functional grounds10,11, for ventral and dorsal processing streams within auditory cortex. Single-unit studies in the lateral belt areas of macaques provided more direct functional evidence for this dual processing scheme. Tian et al.12 found that when species-specific communication sounds are presented in varying spatial locations, neurons in the antero-lateral belt (area AL) are more specific for the type of monkey call. By contrast, neurons in the caudo-lateral belt (area CL) are more responsive to spatial location than neurons in core or anterior belt. This result indicates that ‘what’ processing dissociates from ‘where’ processing in rhesus monkey auditory cortex.
The dual-stream hypothesis has found support from other studies13,14. Recanzone and co-workers15 found a tighter correlation of neuronal activity and sound localization in caudal belt, supporting a posterior ‘where’ stream. Lewis and Van Essen16 described a direct auditory projection from the posterior superior temporal (pST) region to the ventral inferior parietal (VIP) area in the posterior parietal cortex of the monkey. Single-unit as well as imaging studies in monkeys also reveal functional specialization17–21.
Functional magnetic resonance imaging in nonhuman primates identified, first, tonotopic maps on the superior temporal plane and gyrus22 and, then, a ‘voice region’ in the anterior part of the superior temporal gyrus23, a voice region that projects further to the anterior superior temporal sulcus and ventrolateral PFC24. Reversible cortical inactivation (using cortical cooling) in cat auditory cortex25 found that inactivating anterior areas leads to a deterioration of auditory pattern discrimination, whereas inactivating posterior areas impairs spatial discrimination. These studies corroborate the notion that an antero-ventral processing stream forms the substrate for the recognition of auditory objects, including communication sounds, whereas a postero-dorsal stream includes spatial perception as at least one of its functions.
Hierarchical organization in the cerebral cortex combines elements of serial as well as parallel processing: ‘lower’ cortical areas with simpler receptive-field organization, such as sensory core areas, project to ‘higher’ areas with increasingly complex response properties, such as belt, parabelt and PFC regions. These complex properties are generated by convergence and summation (Box 1 and Fig. 2). Parallel processing principles in hierarchical organization are evident in that specialized cortical areas (‘maps’) with related functions (corresponding to submodalities or modules) are bundled into parallel processing ‘streams’. Furthermore, highly interconnected neural networks, dynamically modulated by different task demands, may also exist within hierarchical processing structures, and well known feedback connections are sometimes not sufficiently accounted for in hierarchical models.
Functional specialization and streams of processing are central to theories of hierarchical organization. Cortical specialization is generated by specificity at the level of single neurons. Their complex response properties are in turn generated by convergence from lower-order neurons and nonlinear summation—‘combination sensitivity’ (Fig. 2). Discovered originally in bats97 and songbirds98, combination sensitivity has been demonstrated in nonhuman primates as well17. It is a fundamental mechanism for generating highly selective neurons (or small networks), as required for speech perception. Such higher-order specificity is generated by combining input from lower-level neurons specific to relatively simple features. Thus combination sensitivity is an example of hierarchical processing at the cellular level. Because it necessitates single-neuron recording techniques, it can only be explored in animal models. Therefore, it is an example of how animal research in general has led to an understanding of speech perception at the cellular level and how animal models will remain necessary to obtain a complete understanding of the neural mechanisms of speech perception that goes beyond localization of function.
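The principle of convergence followed by nonlinear summation can be illustrated with a minimal sketch. It is a toy abstraction and not a model taken from the cited studies: the feature names, the binary lower-order responses and the threshold value are all hypothetical, and a thresholded sum is only one of several nonlinearities that would produce the same qualitative effect.

```python
# Toy illustration of combination sensitivity (all names and values are hypothetical).

def feature_response(stimulus, feature):
    """Lower-order unit: responds 1.0 if its preferred feature is present, else 0.0."""
    return 1.0 if feature in stimulus else 0.0


def combination_sensitive_response(stimulus, features, threshold):
    """Higher-order unit: sums its converging inputs, then applies a threshold.

    With a threshold above the response to any single feature, the unit is
    silent for each feature alone and fires only for the combination.
    """
    summed = sum(feature_response(stimulus, f) for f in features)
    return max(0.0, summed - threshold)


features = ("upward_sweep", "downward_sweep")  # assumed preferred features
threshold = 1.5                                # assumed; between 1 and 2 inputs

only_up = combination_sensitive_response({"upward_sweep"}, features, threshold)
only_down = combination_sensitive_response({"downward_sweep"}, features, threshold)
both = combination_sensitive_response(
    {"upward_sweep", "downward_sweep"}, features, threshold
)

print(only_up, only_down, both)
```

In this sketch the response to the combination exceeds the sum of the responses to the individual features, which is the defining signature of a combination-sensitive unit.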
In addition to the ‘what/where’ model in vision6, Goodale and Milner26 proposed that two pathways subserve behaviors related to perception and action. The auditory ventral pathway role in perception is largely consistent with a ‘what’ pathway, whereas the dorsal pathway takes on a sensorimotor role involved in action (‘how’), including spatial analysis. Fuster27 advocates a similar distinction with regard to PFC and unites the two pathways into a perception–action cycle. We argue here that the ‘what/where’ and ‘perception/action’ theories differ mainly in emphasis.
The concepts of auditory streams of processing can be a powerful framework for understanding functional imaging studies of speech perception28,29 and for understanding aphasic stroke3. Human studies also confirm the role of the postero-dorsal stream in the perception of auditory space and motion (see refs. 30 and 14 for review). But do more than two processing streams exist31 (Fig. 3)? The posterior superior temporal gyrus and inferior parietal cortex have long been implicated in the processing of speech and language, and ignoring these reports (Supplementary Discussion 1) and assigning an exclusively spatial function to the postero-dorsal auditory stream would be unwise. It is therefore essential to discuss how the planum temporale, the temporoparietal junction and the inferior parietal cortex are involved in speech and language, and whether we can assign a common computational function to the postero-dorsal stream that encompasses both spatial and language functions.
A meta-analysis of imaging studies of speech processing32 reports an antero-lateral gradient along which the complexity of preferred stimuli increases, from tones and noise bursts to words and sentences. As in nonhuman primates, frequency responses show tonotopy, while core regions responding to tones are surrounded by belt areas preferring band-pass noise bursts33. Using high-field scanners, multiple tonotopic fields34 and multiple processing levels (core, belt and parabelt)35 can be identified in human auditory cortex.
This sort of hierarchical organization in the antero-ventral auditory pathway of humans is important in auditory pattern recognition and object identification. As in animal models, preferred features of lower-order neurons combine to create selectivity for increasingly complex sounds36,37, and regions can be seen that are specialized in different auditory object classes (A.M. Leaver and J.P.R., unpublished data)38,39. Developments in how we conceive the structure of auditory objects40,41 will help extend these kinds of investigations. Like their visual counterparts, auditory objects coexist based on many attributes, such as timbre, pitch and loudness, that give each its distinctive perceptual identity41.
Within speech perception, there is evidence that speech sounds are hierarchically encoded, as the anterior superior temporal cortex responds as a function of speech intelligibility, and not stimulus complexity alone42–44. Similarly, Liebenthal et al.45 and Obleser et al.46 showed that the left middle and anterior superior temporal sulcus is more responsive to consonant–vowel syllables than auditory baselines. Thus, regions within the ‘what’ stream show the first clear responses to abstract, linguistic information in speech. Within these speech-specific regions of anterior superior temporal cortex, there may be subregions selective for particular speech-sound classes, such as vowels38,46, raising the possibility that phonetic maps have some anatomical implementation in anterior temporal lobe areas.
Activity related to speaker recognition also exists in antero-lateral temporal lobe areas39, sometimes extending into midtemporal regions as well. These human voice regions may be homologous, according to crude topological criteria, to monkey areas23 mentioned above. This human ‘voice area’ in the anterior auditory fields seems to process detailed spectral properties of talkers47. Notably, speech perception and voice discrimination dissociate clinically, suggesting that the two are supported by different systems within the anterior and middle temporal lobes.
An important problem in the task of speech perception is that of invariance against distortions in the scale of frequency (for example, pitch changes; Fig. 4a) or time (for example, compressions). For example, noise-vocoded speech, which simulates aspects of speech after cochlear implantation, is quite coarse in its spectro-temporal representation48 (Fig. 4b); it is, however, readily intelligible after a brief training session. Perceptual invariance is also important in the perception of normal speech, as the ‘same’ phoneme can be acoustically very different (owing to coarticulation) and still be identified as the same sound49: the sound /s/ is different at the start of “sue” than at the start of “see,” but remains an /s/.
These examples of perceptual constancy are computationally difficult to solve. This ability to deal with invariance problems is not unique to speech or audition; it is a hallmark of all higher cortical perceptual systems. The structural and functional organization of the anterior-ventral streams in both the visual and auditory systems could illustrate how the cerebral cortex solves this problem. For example, it has been suggested that visual categories are formed in the lateral PFC50, which receives input from higher-order object representations in the anterior temporal lobe10. In audition, using species-specific communication sounds, Romanski et al.51 found clusters of neurons in the macaque ventrolateral PFC encoding similar complex calls, and category-specific cells encoding single semantic categories have also been reported52. In humans, rapid adaptation studies with functional MRI in the visual system have recently led to similar conclusions53. The invariance problem in speech perception may be solved in the inferior frontal cortex, or by interactions between inferior frontal and anterior superior temporal cortex.
Speech perception and production are left-lateralized in the human brain (for example, refs. 3,42,54), and there is considerable interest in the neural basis of this (for example, ref. 55). Hemispheric specialization is an important feature of the human brain, particularly in relation to speech and spatial processing. It remains to be seen to what extent animal models can contribute to our understanding of these asymmetries.
Evidence for a postero-dorsal stream in auditory spatial processing is just as strong, if not stronger, in the human as in nonhuman primates. Stroke studies as well as modern neuroimaging have shown that spatial processing in the temporo-parietal cortex is often right-lateralized in humans, contralateral to language. Generally, spatial neglect is more frequent and severe after damage to the right hemisphere. We cannot discuss all pertinent results in this focus paper, but we refer the reader to other reviews (refs. 14,30; see also Supplementary Discussion 2 online).
The pST region (or planum temporale) in humans (and the dorsal stream emanating from it) has classically been assigned a role in speech perception56. This contradicts the evidence for a spatial role for pST, as well as a more anterior location for speech sound decoding, as discussed above (see also Supplementary Discussions 1 and 2). One unifying view is that the planum temporale is generally involved in the processing of spectro-temporally complex sounds46, which includes music processing57. According to this view, the planum temporale operates as a ‘computational hub’58.
The inferior parietal lobule (IPL), particularly the angular and supramarginal gyri (Brodmann areas 39 and 40), has also been linked to linguistic functions59, such as the ‘phonological-articulatory loop’60. Functional imaging has confirmed this role, though activity varies with working memory task load61,62. However, the IPL does not seem to be driven by acoustic processing of speech: the angular gyrus (together with extensive prefrontal activation) is recruited when higher-order linguistic factors improve speech comprehension63, rather than by acoustic influences on intelligibility. Thus the parietal cortex is associated with more domain-general, linguistic factors in speech comprehension, rather than acoustic or phonetic processing.
There is now neurophysiological evidence that auditory caudal belt areas are not solely responsive to auditory input but show multimodal responses64,65: both caudal medial and lateral belt fields receive input from somatosensory and multisensory cortex. Thus any spatial transformations conducted in the postero-dorsal stream may be based on a multisensory reference frame66,67.
These multisensory responses in caudal auditory areas may underlie some functional specificity in humans. Several studies of silent articulation68 and nonspeech auditory stimuli69 find activation in a posterior medial planum temporale region, within the postero-dorsal stream. The medial planum temporale in man70 has been associated with the representation of templates for “doable” articulations and sounds (not limited to speech sounds). This approach can be compared to the “affordance” model of Gibson71,72, in which objects and events are described in terms of action possibilities. Such a sensorimotor role for the dorsal stream is consistent with the notion of an “action” stream in vision26. The concept can be extended to auditory-motor transformations in verbal working memory tasks73,74 that involve articulatory representations60,75. The postero-medial planum temporale area has also been identified as a key node for the control of speech production54, as it shows a response to somatosensory input from articulators.
There is considerable neural convergence between speech perception and production systems. For example, the postero-medial planum temporale area described in the previous section is an auditory area important in the motor act of articulation. Conversely, real or imagined speech sounds and music result in activation within premotor areas important in overt production of speech76 and music77,78. Within auditory areas, monkey studies have shown that auditory neurons are suppressed during the monkey's own vocalizations79,80. This finding is consistent with results from humans indicating that superior temporal areas are suppressed during speech production81,82 and that the response to one's own voice is always less than the response to someone else's.
At one level these findings may simply reflect the ways that sensory responses to actions caused by oneself are always differently processed from those caused by the actions of others83, and this may support mechanisms important in differentiating between one's own voice and the voices of others. In primate studies, however, auditory neurons that are suppressed during vocalizations are often more activated if the sound of the vocalizations is distorted80. This might indicate a specific role for these auditory responses in the comparison of feedforward and feedback information from the motor and auditory system during speech production84. Distorting speech production in real time reveals enhanced activation in bilateral (posterior temporal) auditory fields to distorted feedback85. New work using high-resolution diffusion tensor imaging in humans has revealed that there are direct projections from the pars opercularis of Broca's area (Brodmann area 44) to the IPL86, in addition to the ones from ventral premotor cortex87. With the known connections between parietal cortex and posterior auditory fields, this could form the basis for feed-forward connections between speech production areas and posterior temporal auditory areas (Fig. 5).
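The proposed comparison between predicted and actual auditory feedback can be sketched as a simple comparator. This is a hedged toy example, assuming that the response of a feedback-monitoring unit grows linearly with the mismatch between heard and predicted pitch; the function name, the response levels and the gain are hypothetical and are not derived from the recordings cited above.

```python
# Toy comparator for self-produced versus external sounds (hypothetical parameters).

def monitoring_response(heard_pitch, predicted_pitch=None,
                        external_level=1.0, suppressed_level=0.2, gain=0.01):
    """Return a toy response level for an auditory unit.

    Without a prediction (an externally generated sound) the unit responds at
    external_level. With a prediction (a self-produced sound) the response is
    suppressed, then raised in proportion to the mismatch in Hz.
    """
    if predicted_pitch is None:
        return external_level
    mismatch = abs(heard_pitch - predicted_pitch)
    return suppressed_level + gain * mismatch


listening = monitoring_response(200.0)                          # someone else's voice
speaking = monitoring_response(200.0, predicted_pitch=200.0)    # own voice, as predicted
distorted = monitoring_response(230.0, predicted_pitch=200.0)   # own voice, shifted feedback

print(listening, speaking, distorted)
```

Under these assumptions the unit is suppressed during accurately predicted self-produced sound and more active when the feedback is distorted, matching the qualitative pattern described in the text.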
The dual-stream processing model in audition4,5 has been a useful construct in hearing research, perceptual physiology and, in particular, psycholinguistics, where it has spawned several further models73,74 that have tried to accommodate specific results from this field. The role of a ventral stream in hierarchical processing of objects, as in the visual system, is now widely accepted. Specifically for speech, anterior regions of the superior temporal cortex respond to native speech sounds and intelligible speech, and these sounds are mapped along phonological parameter domains. By contrast, early posterior regions in and around the planum temporale are involved in the processing of many different types of complex sound. Later posterior regions participate in the processing of auditory space and motion but seem to integrate input from several other modalities as well.
Although evidence is strong for the role of the dorsal pathway (including pST) in space processing, the dorsal pathway needs to accommodate speech and language functions as well. Spatial transformations may be one example of fast adaptations used by ‘internal models’ or ‘emulators’, as first developed in motor control theory. Within these models, ‘forward models’ (predictors) can be used to predict the consequences of actions, whereas ‘inverse models’ (controllers) determine the motor commands required to produce a desired outcome88. More recently, forward models have been used to describe the predictive nature of perception and imagery89. The IPL could provide an ideal interface, where feed-forward signals from motor preparatory networks in the inferior frontal cortex and premotor cortex (PMC) can be matched with feedback signals from sensory areas72.
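The distinction between forward and inverse models can be made concrete with a minimal sketch. It assumes, purely for illustration, a one-dimensional linear relationship between a motor command and its sensory outcome; the class name and the gain and offset values are hypothetical, and real sensorimotor mappings are neither linear nor one-dimensional.

```python
# Toy forward (predictor) and inverse (controller) models for a linear mapping.
from dataclasses import dataclass


@dataclass
class LinearInternalModel:
    gain: float    # assumed scaling from motor command to sensory outcome
    offset: float  # assumed constant sensory offset

    def forward(self, command):
        """Predictor: the sensory outcome expected from a motor command."""
        return self.gain * command + self.offset

    def inverse(self, desired_outcome):
        """Controller: the motor command expected to yield a desired outcome."""
        return (desired_outcome - self.offset) / self.gain


model = LinearInternalModel(gain=2.0, offset=1.0)

command = model.inverse(7.0)        # command needed to reach the target outcome
predicted = model.forward(command)  # outcome the predictor expects from that command

print(command, predicted)
```

When the two models are consistent, passing the controller's command through the predictor recovers the desired outcome; a difference between this prediction and the actual feedback is the mismatch that an interface such as the IPL is hypothesized to detect.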
In speech perception and production, projections from articulatory networks in Broca's area and PMC to the IPL and pST interact with signals from auditory cortex (Fig. 5). The feed-forward projection from Brodmann area 44 (and ventral PMC) may provide an efference copy in the classical sense of von Holst and Mittelstaedt90, informing the sensory system of motor articulations that are about to happen. This occurs in anticipation of a motor signal if the behavior is enacted, or as imagery if it is not. The activity arriving in the IPL and pST from frontal areas anticipates the sensory consequences of action. The feedback signal coming to the IPL from pST, conversely, could be considered an “afference copy”91 with relatively short latencies and high temporal precision92—a sparse but fast primal sketch of ongoing sensory events93 that are compared with the predictive motor signal in the IPL at every instance.
‘Internal model’ structures in the brain are generally thought to enable smooth sequential motor behaviors, from visuospatial reaching to articulation of speech. The goal of these models is to minimize the resulting error signal through adaptive mechanisms. At the same time, these motor behaviors also support aspects of perception, such as stabilization of the retinal image and disambiguation of phonological information, thus switching between forward and inverse modes. As Indefrey and Levelt94 point out, spoken language “constantly operates a dual system, perceiving and producing utterances. These systems not only alternate, but in many cases they partially or wholly operate in concert.” What is more, both spatial processing and real-time speech processing make use of the same internal model structures.
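Error minimization through adaptation can be illustrated with a delta-rule update, used here as one simple stand-in for the adaptive mechanisms mentioned above; the text does not specify a learning rule. The sketch assumes a forward model with a single gain parameter, and the function name, learning rate and data are all hypothetical.

```python
# Toy adaptation of a forward model's gain by a delta rule (hypothetical values).

def adapt_gain(commands, feedback, gain=1.0, learning_rate=0.05, epochs=200):
    """Adjust the gain so that gain * command approaches the observed feedback."""
    for _ in range(epochs):
        for command, observed in zip(commands, feedback):
            error = observed - gain * command         # prediction error
            gain += learning_rate * error * command   # delta-rule update
    return gain


def mean_abs_error(commands, feedback, gain):
    """Average magnitude of the prediction error for a given gain."""
    errors = [abs(y - gain * u) for u, y in zip(commands, feedback)]
    return sum(errors) / len(errors)


true_gain = 1.5                                   # assumed property of the toy system
commands = [0.5, 1.0, 1.5, 2.0]
feedback = [true_gain * u for u in commands]      # noise-free feedback for simplicity

initial_error = mean_abs_error(commands, feedback, 1.0)
learned_gain = adapt_gain(commands, feedback)
final_error = mean_abs_error(commands, feedback, learned_gain)

print(initial_error, learned_gain, final_error)
```

With each pass the prediction error shrinks, so the learned gain converges on the gain that generated the feedback.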
In summary, our new model of the auditory cortical pathways builds on the previous model of dual processing pathways for object identification and spatial analysis5,6, but integrates the spatial (dorsal) pathway with findings from speech and music processing as well. The model is based on neuroanatomical data from nonhuman primates, operating under the assumption that mechanisms of speech and language in humans have built on structures available in other primates. Finally, our new model extends beyond speech processing74 and applies in a very general sense to both vision and audition, in its relationship with previous models of perception and action26,27.
We wish to thank D. Klemm for help with graphic design and T. Tan for help with editing. The work was supported by grants from the US National Institutes of Health (R01NS52494) and the US National Science Foundation (BCS-0519127 and PIRE-OISE-0730255) to J.P.R., and by Wellcome Trust Grant WT074414MA to S.K.S.
Note: Supplementary information is available on the Nature Neuroscience website.