It is widely accepted that human speech is fundamentally a multisensory behavior, with face-to-face communication perceived through both the visual and auditory channels. Such multisensory speech perception is evident even at the earliest stages of human cognitive development (Gogate et al., 2001; Patterson et al., 2003); its integration across the two modalities is ubiquitous and automatic (McGurk et al., 1976), and at the neural level, audiovisual speech integration occurs at the ‘earliest’ stages of cortical processing (Ghazanfar et al., 2006a). Indeed, there are strong arguments which suggest that multisensory speech is the primary mode of speech perception and is not a capacity that is “piggybacked” on to auditory speech perception (Rosenblum, 2005). This implies that the perceptual mechanisms, neurophysiology and evolution of speech perception are based on primitives which are not tied to a single sensory modality (Romanski et al., 2009). The essence of these ideas is shared by many investigators in the domain of perception (Fowler, 2004; Liberman et al., 1985; Meltzoff et al., 1997), but has only caught on belatedly for those of us who study the auditory cortex.
The auditory cortex of primates consists of numerous fields (Hackett et al., 1998; Petkov et al., 2006), which in turn are connected to numerous auditory-related areas in the frontal and temporal lobes (Romanski et al., 2009). At least a subset of these fields appears to be homologous across monkeys and great apes (including humans) (Hackett et al., 2001). These fields are delineated largely by their tonotopic organization and anatomical criteria. The reasons why there are so many areas are not known, and how each of them, together or separately, relates to behavior is also somewhat of a mystery. That they must be involved in multiple auditory-related behaviors is a given. The fundamental question is thus: how do these multiple auditory areas mediate specific behaviors through their interactions with each other and with other sensory and motor systems?
The question posed above has an underlying assumption that the different auditory cortical fields do not each have a specific function that is used in all types of auditory-related behavior. Rather, their roles (or weights) as nodes in a larger network change according to whatever specific behavior is being mediated. In this review, I focus on one behavior, vocal communication, to illustrate how multiple fields of auditory cortex in non-human primates (hereafter, primates) may play a role in the perception and production of this multisensory behavior (see Kayser et al., this volume, for a review of auditory cortical organization and its relationship to visual and somatosensory inputs). I will begin by briefly presenting evidence that primates do indeed link visual and auditory communication signals, and then describe how such perception may be mediated by the auditory cortex through its connections with association areas. I will then speculate as to how proprioceptive, somatosensory and motor inputs into auditory cortex also play a role in vocal communication. Finally, I will conclude with the suggestion that the use of cytoarchitecture and tonotopy represents one way of defining auditory cortical organization, and that there may be behavior-specific functional organizations that do not neatly fall within the delineation of different auditory cortical fields using these methods. Identifying specific behaviors a priori is the key to illuminating such putative organizational schemes.
Human and primate vocalizations are produced by coordinated movements of the lungs, larynx (vocal folds), and the supralaryngeal vocal tract (Fitch et al., 1995; Ghazanfar et al., 2008a). The vocal tract consists of the column of air derived from the pharynx, mouth and nasal cavity. In humans, speech-related vocal tract motion results in the predictable deformation of the face around the oral aperture and other parts of the face (Jiang et al., 2002; Yehia et al., 1998; Yehia et al., 2002). For example, human adults automatically link high-pitched sounds to facial postures producing an /i/ sound and low-pitched sounds to faces producing an /a/ sound (Kuhl et al., 1991). In primate vocal production, there is a similar link between acoustic output and facial dynamics. Different macaque monkey vocalizations are produced with unique lip configurations and mandibular positions and the motion of such articulators influences the acoustics of the signal (Hauser et al., 1994; Hauser et al., 1993). Coo calls, like /u/ in speech, are produced with the lips protruded, while screams, like the /i/ in speech, are produced with the lips retracted (Figure 1). Thus, it is likely that many of the facial motion cues that humans use for speech-reading are present in other primates as well.
Given that both humans and other extant primates use both facial and vocal expressions as communication signals, it is perhaps not surprising that many primates other than humans recognize the correspondence between the visual and auditory components of vocal signals. Macaque monkeys (Macaca mulatta), capuchins (Cebus apella) and chimpanzees (Pan troglodytes) all recognize auditory-visual correspondences between their various vocalizations (Evans et al., 2005; Ghazanfar et al., 2003; Izumi et al., 2004; Parr, 2004). For example, rhesus monkeys readily match the facial expressions of ‘coo’ and ‘threat’ calls with their associated vocal components (Ghazanfar et al., 2003). Perhaps more pertinent, rhesus monkeys can also segregate competing voices in a chorus of coos, much as humans might with speech in a cocktail party scenario, and match them to the correct number of individuals seen cooing on a video screen (Jordan et al., 2005). Finally, macaque monkeys use formants (i.e., vocal tract resonances) as acoustic cues to assess age-related body size differences among conspecifics (Ghazanfar et al., 2007b). They do so by linking across modalities the body size information embedded in the formant spacing of vocalizations (Fitch, 1997) with the visual size of animals who are likely to produce such vocalizations (Ghazanfar et al., 2007b).
Traditionally, the linking of vision with audition in the multisensory vocal perception described above would be attributed to the functions of association areas such as the superior temporal sulcus in the temporal lobe or the principal and intraparietal sulci located in the frontal and parietal lobes, respectively. Although these regions may certainly play important roles (see below), they are certainly not necessary for all types of multisensory behaviors (Ettlinger et al., 1990), nor are they the sole regions for multisensory convergence (Driver et al., 2008; Ghazanfar et al., 2006a). The auditory cortex, in particular, has many potential sources of visual inputs (Ghazanfar et al., 2006a), and this is borne out in the increasing number of studies demonstrating visual modulation of auditory cortical activity (Bizley et al., 2007; Ghazanfar et al., 2008b; Ghazanfar et al., 2005; Kayser et al., 2008; Kayser et al., 2007; Schroeder et al., 2002). Here, I focus on those auditory cortical studies investigating face/voice integration specifically.
Recordings from both primary and lateral belt auditory cortex reveal that responses to the voice are influenced by the presence of a dynamic face (Ghazanfar et al., 2008b; Ghazanfar et al., 2005). Monkey subjects viewing unimodal and bimodal versions of two different species-typical vocalizations (‘coos’ and ‘grunts’) show both enhanced and suppressed local field potential (LFP) responses in the bimodal condition relative to the unimodal auditory condition (Ghazanfar et al., 2005). Consistent with evoked potential studies in humans (Besle et al., 2004; van Wassenhove et al., 2005), the combination of faces and voices led to integrative responses (significantly different from unimodal responses) in the vast majority of auditory cortical sites—both in primary auditory cortex and the lateral belt auditory cortex. These data demonstrated that LFP signals in the auditory cortex are capable of multisensory integration of facial and vocal signals in monkeys (Ghazanfar et al., 2005), a finding subsequently confirmed at the single-unit level in the lateral belt cortex as well (Ghazanfar et al., 2008b) (Figure 2A). By ‘integration’, I simply mean that bimodal stimuli elicit significantly enhanced or suppressed responses relative to the best (strongest) response elicited by unimodal stimuli.
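This operational definition of integration can be sketched in code. A minimal illustration, with hypothetical response values rather than data from the studies cited:

```python
def enhancement_index(av, a, v):
    """Percent change of the bimodal (AV) response relative to the
    best (strongest) unimodal response -- the 'integration' criterion
    described in the text. Positive values indicate enhancement,
    negative values suppression."""
    best_unimodal = max(a, v)
    return 100.0 * (av - best_unimodal) / best_unimodal

# Hypothetical mean evoked-response magnitudes (arbitrary units):
print(enhancement_index(av=18.0, a=12.0, v=3.0))  # 50.0 (enhanced)
print(enhancement_index(av=8.0, a=12.0, v=3.0))   # negative (suppressed)
```

In the actual experiments, of course, a response counts as integrative only if the bimodal/unimodal difference is statistically significant across trials; the index above captures just the sign and magnitude convention.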
The specificity of face/voice integrative responses was tested by replacing the dynamic faces with dynamic discs which mimicked the aperture and displacement of the mouth. In human psychophysical experiments, such artificial dynamic stimuli can still lead to enhanced speech detection, but not to the same degree as a real face (Bernstein et al., 2004; Schwartz et al., 2004). When cortical sites or single units were tested with dynamic discs, far less integration was seen when compared to the real monkey faces (Ghazanfar et al., 2008b; Ghazanfar et al., 2005) (Figure 2). This was true primarily for the lateral belt auditory cortex (LFPs and single units) and was observed to a lesser extent in the primary auditory cortex (LFPs only). This is perhaps not surprising given that the lateral belt is well known for its responsiveness, and to some degree selectivity, to vocalizations and other complex stimuli (Rauschecker et al., 1995; Recanzone, 2008). (See Ghazanfar et al., 1999; Ghazanfar et al., 2001a for review of vocal responses in auditory cortex.)
Unexpectedly, grunt vocalizations were over-represented relative to coos in terms of enhanced multisensory LFP responses (Ghazanfar et al., 2005). As coos and grunts are both produced frequently in a variety of affiliative contexts and are broadband spectrally, the differential representation cannot be attributed to experience, valence or the frequency tuning of neurons. One remaining possibility is that this differential representation may reflect a behaviorally-relevant distinction, as coos and grunts differ in their direction of expression and range. Coos are generally contact calls rarely directed toward any particular individual. In contrast, grunts are often directed towards individuals in one-on-one situations, often during social approaches as in baboons and vervet monkeys (Cheney et al., 1982; Palombit et al., 1999). Given their production at close range and context, grunts may produce a stronger face/voice association than coo calls. This distinction appeared to be reflected in the pattern of significant multisensory responses in auditory cortex; that is, this multisensory bias towards grunt calls may be related to the fact that grunts (relative to coos) are often produced during intimate, one-to-one social interactions.
The “face-specific” visual influence on the lateral belt auditory cortex raises the question of its anatomical source. Although there are multiple possible sources of visual input to auditory cortex (Ghazanfar et al., 2006a), the STS is likely to be a prominent one, particularly for integrating faces and voices, for the following reasons. First, there are reciprocal connections between the STS and the lateral belt and other parts of auditory cortex (Barnes et al., 1992; Seltzer et al., 1994). Second, neurons in the STS are sensitive to both faces and biological motion (Harries et al., 1991; Oram et al., 1994). Finally, the STS is known to be multisensory (Barraclough et al., 2005; Benevento et al., 1977; Bruce et al., 1981; Chandrasekaran et al., 2009; Schroeder et al., 2002). One mechanism for establishing whether auditory cortex and the STS interact at the functional level is to measure their temporal correlations as a function of stimulus condition. Concurrent recordings of LFPs and spiking activity in the lateral belt of auditory cortex and the upper bank of the STS revealed that functional interactions, in the form of gamma band (>30Hz) correlations, between these two regions increased in strength during presentations of faces and voices together relative to the unimodal conditions (Ghazanfar et al., 2008b) (Figure 3A). Furthermore, these interactions were not solely modulations of response strength, as phase relationships were significantly less variable (tighter) in the multisensory conditions (Figure 3B).
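One common way to quantify such across-trial phase consistency between two simultaneously recorded signals is a phase-locking value (PLV). The sketch below is a generic illustration on synthetic data, not the analysis pipeline of the cited study:

```python
import numpy as np
from scipy.signal import hilbert

def phase_locking_value(x_trials, y_trials):
    """Across-trial phase consistency between two signals.
    Inputs are (n_trials, n_samples) arrays, assumed already
    band-pass filtered (e.g., gamma band, >30 Hz). Returns the
    PLV at each time point: 1 = phase difference perfectly
    consistent across trials, ~0 = random phase relationship."""
    dphi = np.angle(hilbert(x_trials, axis=1)) - np.angle(hilbert(y_trials, axis=1))
    return np.abs(np.mean(np.exp(1j * dphi), axis=0))

# Toy check: two noisy copies of the same 40 Hz oscillation are
# strongly locked; two independent noise signals are not.
rng = np.random.default_rng(0)
t = np.arange(0, 0.5, 1 / 1000.0)           # 500 ms sampled at 1 kHz
sig = np.sin(2 * np.pi * 40 * t)
a = sig + 0.1 * rng.standard_normal((50, t.size))
b = sig + 0.1 * rng.standard_normal((50, t.size))
print(phase_locking_value(a, b).mean())      # near 1
print(phase_locking_value(
    rng.standard_normal((50, t.size)),
    rng.standard_normal((50, t.size))).mean())  # near 0
```

A tighter (less variable) phase relationship in the bimodal condition, as in Figure 3B, would show up here as a higher PLV, independent of any change in response amplitude.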
The influence of the STS on auditory cortex was not merely on its gamma oscillations. Spiking activity seems to be modulated, but not ‘driven’, by on-going activity arising from the STS. Three lines of evidence suggest this scenario. First, visual influences on single neurons were most robust when in the form of dynamic faces and were only apparent when neurons had a significant response to a vocalization (i.e., there were no overt responses to faces alone). Second, these integrative responses were often “face-specific” and had a wide distribution of latencies, which suggested that the face signal was an ongoing signal that influenced auditory responses (Ghazanfar et al., 2008b). Finally, this hypothesis for an ongoing signal is supported by the sustained gamma band activity between auditory cortex and STS and by a spike-field coherence analysis of the relationship between auditory cortical spiking activity and gamma oscillations from the STS (Ghazanfar et al., 2008b) (Figure 3C).
Both the auditory cortex and the STS have multiple bands of oscillatory activity generated in response to stimuli that may mediate different functions (Chandrasekaran et al., 2009; Lakatos et al., 2005). Thus, interactions between the auditory cortex and the STS are not limited to spiking activity and high frequency gamma oscillations. Below 20Hz, and in response to naturalistic audiovisual stimuli, there are directed interactions from auditory cortex to STS, while above 20Hz (but below the gamma range), there are directed interactions from STS to auditory cortex (Kayser et al., 2009). Given that different frequency bands in the STS integrate faces and voices in distinct ways (Chandrasekaran et al., 2009), it’s possible that these lower frequency interactions between the STS and auditory cortex also represent distinct multisensory processing channels.
Two things should be noted here. The first is that functional interactions between STS and auditory cortex are not likely to occur solely during the presentation of faces with voices. Other congruent, behaviorally-salient audiovisual events such as looming signals (Cappe et al., 2009; Gordon et al., 2005; Maier et al., 2004) or other temporally coincident signals may elicit similar functional interactions (Maier et al., 2008; Noesselt et al., 2007). The second is that there are other areas that, consistent with their connectivity and response properties (e.g., sensitivity to faces and voices), could also (and very likely) have a visual influence on auditory cortex. These include the ventrolateral prefrontal cortex (Romanski et al., 2005; Sugihara et al., 2006) and the amygdala (Gothard et al., 2007; Kuraoka et al., 2007). It is not known whether STS, for instance, plays a more influential role than these two other ‘face sensitive’ areas. Indeed, it may be that all three play very different roles in face/voice integration. What is missing is a direct link between multisensory behaviors and neural activity—that is the only way to assess the true contributions of these regions, along with auditory cortex, in vocal behavior.
Humans and other primates readily link facial expressions with appropriate, congruent vocal expressions. What cues they use to make such matches is not known. One method for investigating such behavioral strategies is the measurement of eye movement patterns. When human subjects are given no task or instruction regarding what acoustic cues to attend, they will consistently look at the eye region more than the mouth when viewing videos of human speakers (Klin et al., 2002). Macaque monkeys exhibit the exact same strategy. The eye movement patterns of monkeys viewing conspecifics producing vocalizations reveal that monkeys spend most of their time inspecting the eye region relative to the mouth (Ghazanfar et al., 2006b) (Figure 4A). When they did fixate on the mouth, their fixations were highly correlated with the onset of mouth movements (Figure 4B). This, too, was highly reminiscent of human strategies: subjects asked to identify words increased their fixations onto the mouth region with the onset of facial motion (Lansing et al., 2003).
Somewhat surprisingly, activity in both primary auditory cortex and belt areas is influenced by eye position. When the spatial tuning of primary auditory cortical neurons is measured with the eyes gazing in different directions, ~30% of the neurons are affected by the position of the eyes (Werner-Reiss et al., 2003). Similarly, when LFP-derived current-source density activity was measured from auditory cortex (both primary auditory cortex and caudal belt regions), eye position significantly modulated auditory-evoked amplitude in about 80% of sites (Fu et al., 2004). These eye-position effects occurred mainly in the upper cortical layers, suggesting that the signal is fed back from another cortical area. One possible source is the frontal eye field (FEF) in the frontal lobe, the medial portion of which generates relatively long saccades (Robinson et al., 1969) and is interconnected with both the STS (Schall et al., 1995; Seltzer et al., 1989) and multiple regions of the auditory cortex (Hackett et al., 1999; Romanski et al., 1999; Schall et al., 1995).
It does not take a huge stretch of the imagination to link these auditory cortical processes to the oculomotor strategy for looking at vocalizing faces. A dynamic, vocalizing face is a complex sequence of sensory events, but one that elicits fairly stereotypical eye movements: we and other primates fixate on the eyes, saccade to the mouth when it begins to move, and then saccade back to the eyes. Is there a simple scenario that could link the proprioceptive eye-position effects in the auditory cortex with its face/voice integrative properties (Ghazanfar et al., 2007a)? Reframing (ever so slightly) the hypothesis of Schroeder and colleagues (Lakatos et al., 2007; Schroeder et al., 2008), one possibility is that a fixation at the onset of mouth movement sends a signal to the auditory cortex that resets the phase of an on-going oscillation. This proprioceptive signal thus primes the auditory cortex to amplify or suppress (depending on the timing) a subsequent auditory signal originating from the mouth. Given that mouth movements precede the voiced components of both human (Abry et al., 1996) and monkey vocalizations (Chandrasekaran et al., 2009; Ghazanfar et al., 2005), the temporal order of visual to proprioceptive to auditory signals is consistent with this idea. This hypothesis is also supported (though indirectly) by the finding that the sign of face/voice integration in the auditory cortex and the STS is influenced by the timing of mouth movements relative to the onset of the voice (Chandrasekaran et al., 2009; Ghazanfar et al., 2005).
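The phase-reset idea above can be caricatured numerically. In the toy sketch below, the oscillation frequency, the reset-to-peak convention and the gain range are all invented for illustration and are not empirical values:

```python
import math

def response_gain(reset_time, sound_onset, freq=8.0):
    """Toy caricature of the phase-reset hypothesis: an event at
    `reset_time` (e.g., a fixation at mouth-movement onset) resets an
    ongoing oscillation of `freq` Hz to its excitability peak; the
    gain applied to a sound arriving at `sound_onset` then depends on
    the oscillation's phase at that moment. Times are in seconds;
    every number here is illustrative."""
    phase = 2 * math.pi * freq * (sound_onset - reset_time)
    return 1.0 + 0.5 * math.cos(phase)  # 1.5 at the peak, 0.5 at the trough

# A voice arriving one full cycle after the reset lands on the
# excitability peak; one arriving half a cycle later lands on the trough.
print(response_gain(0.0, 1 / 8.0))    # amplified (gain ~1.5)
print(response_gain(0.0, 1 / 16.0))   # suppressed (gain ~0.5)
```

The point of the sketch is simply that a single reset event plus the visual-to-auditory lag determines whether the voice arrives at a high- or low-excitability phase, which is the logic behind the timing-dependent enhancement and suppression described in the text.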
Numerous lines of both physiological and anatomical evidence demonstrate that at least some regions of the auditory cortex respond to touch as well as sound (Fu et al., 2003; Hackett et al., 2007a; Hackett et al., 2007b; Kayser et al., 2005; Lakatos et al., 2007; Schroeder et al., 2002; Smiley et al., 2007). Yet, the sense of touch is not something we normally associate with vocal communication. It can, however, influence what we hear under certain circumstances. For example, kinesthetic feedback from one’s own speech movements also integrates with heard speech (Sams et al., 2005). More directly, if a robotic device is used to artificially deform the facial skin of subjects in a way that mimics the deformation seen during speech production, then subjects actually hear speech differently (Ito et al., 2009). Surprisingly, perception varies systematically with speech-like patterns of skin deformation, implicating a robust somatosensory influence on auditory processing under normal conditions (Ito et al., 2009).
The somatosensory system’s influence on the auditory system may also occur during vocal learning. When a mechanical load is applied to the jaw, causing a slight protrusion, as subjects repeat words (‘saw’, ‘say’, ‘sass’ and ‘sane’) it can alter somatosensory feedback without changing the acoustics of the words (Tremblay et al., 2003). Measuring adaptation in the jaw trajectory after many trials revealed that subjects learn to change their jaw trajectories so that they are similar to the pre-load trajectory—despite not hearing anything different. This strongly implicates a role for somatosensory feedback that parallels the role for auditory feedback in guiding vocal production (Jones et al., 2003; Jones et al., 2005). Indeed, the very same learning effects are observed with deaf subjects when they turn their hearing aids off (Nasir et al., 2008).
While the substrates for these somatosensory-auditory effects have not been explored directly, interactions between the somatosensory system and the auditory cortex are a likely source for the phenomena described above, for the following reasons. First, many auditory cortical fields respond to, or are modulated by, tactile inputs (Fu et al., 2003; Kayser et al., 2005; Schroeder et al., 2001). Second, there are intercortical connections between somatosensory areas and the auditory cortex (Cappe et al., 2005; de la Mothe et al., 2006; Smiley et al., 2006). Third, auditory area CM, where many auditory-tactile responses seem to converge, is directly connected to somatosensory areas in the retroinsular cortex and the granular insula (de la Mothe et al., 2006; Smiley et al., 2006). Oddly enough, a parallel influence of audition on somatosensory areas has also been reported: neurons in the “somatosensory” insula readily and selectively respond to vocalizations (Beiser, 1998; Remedios et al., 2009). Finally, the tactile receptive fields of neurons in auditory cortical area CM are confined to the upper body, primarily the face and neck regions (areas consisting of, or covering, the vocal tract) (Fu et al., 2003) (Figure 5), and the primary somatosensory cortical (area 3b) representation for the tongue (a vocal tract articulator) projects to auditory areas in the lower bank of the lateral sulcus (Iyengar et al., 2007). All of these facts lend further credibility to the putative role of somatosensory-auditory interactions during vocal production and perception.
Like humans, other primates also adjust their vocal output according to what they hear. For example, macaques, marmosets (Callithrix jacchus), and cotton-top tamarins (Saguinus oedipus) adjust the loudness, timing and acoustic structure of their vocalizations depending on background noise levels and patterns (Brumm et al., 2004; Egnor et al., 2006a; Egnor et al., 2006b; Egnor et al., 2007; Sinnott et al., 1975). The specific number of syllables and temporal modulations in heard conspecific calls can also differentially trigger vocal production in tamarins (Ghazanfar et al., 2001b; Ghazanfar et al., 2002). Thus, auditory feedback is also very important for nonhuman primates, and altering such feedback can influence neurons in the auditory cortex (Eliades et al., 2008). At this time, however, no experiments have investigated whether somatosensory feedback plays a role in shaping vocal production. The neurophysiological and neuroanatomical data described above suggest that it is not unreasonable to think that it does.
The putative neural processes underlying multisensory vocal communication in primates call to mind what the philosopher Andy Clark refers to as “action-oriented” representations (Clark, 1997). In generic terms, action-oriented representations simultaneously describe aspects of the world and prescribe possible actions. They are poised between pure control (motor) structures and passive (sensory) representations of the external world. For neural representations of primate vocal communication, this suggests that the laryngeal, articulatory and respiratory movements during vocalizations are inseparable from the visual, auditory and somatosensory processes that accompany vocal perception. This idea seems to fit well with the data reviewed above and suggests an alternative way of thinking about auditory cortical organization.
Typically, we think of auditory cortex as a set of very discrete fields, most of which can be defined by a tonotopic map. These physiological maps often correspond to cytoarchitectural and hodological signatures as well. It is possible, however, that this is just one of many possible schemes for auditory cortical organization (albeit an important one). One alternative is that different behaviors, such as multisensory vocal communication, each have their own organizational scheme super-imposed on these tonotopic maps, but not necessarily in one-to-one fashion. This is almost pure speculation at this point, but the main reason for thinking it a possibility is that many of the multisensory and vocalization-related response properties for communication have no relationship to the tonotopic maps for, or frequency tuning of neurons in, those regions. For example, there is no reported relationship between the frequency tuning of auditory cortical neurons and the influence of eye position (Fu et al., 2004; Werner-Reiss et al., 2003), somatosensory receptive fields (Fu et al., 2003; Schroeder et al., 2001), or face/voice integration (Ghazanfar et al., 2005; Ghazanfar, unpublished observations). Likewise, the influence of vocal feedback on auditory cortex has no relationship to the underlying frequency tuning of neurons (Eliades et al., 2008). At a more global level, somatosensory and visual influences on the auditory cortex seem to take the form of a gradient, extending in the posterior-to-anterior direction, rather than having a discrete influence on particular tonotopically-defined subsets (Kayser et al., 2005; Kayser et al., 2007). Finally, the representation for pitch processing, important for vocal recognition, spans the low frequency borders of two core auditory cortical areas (primary auditory cortex and area R), violating the often implicit “single area, single function” rule (Bendor et al., 2005).
The idea that there may be multiple behaviorally-specific auditory cortical organizations is very similar to the one recently put forth regarding the organization of the various somatotopically-defined motor cortical areas (Graziano et al., 2007).
To summarize, vocal communication is a fundamentally multisensory behavior, and this will be reflected in the different roles brain regions play in mediating this behavior. Auditory cortex is illustrative, being influenced by visual, somatosensory, proprioceptive and motor signals during vocal communication. In all, I hope that the data reviewed above suggest that investigating auditory cortex through the lens of a specific behavior may lead to a much clearer picture of its functions and dynamic organization.