Early Studies in Old and New World Primates
The PFC has long been thought to play a role in the processing of complex and especially communication-relevant sounds. For more than a century, the inferior frontal gyrus in the human brain (including Broca’s area) has been linked with speech and language processes (Broca 1861
). Neuroimaging studies of the human brain have shown activation of ventrolateral frontal lobe areas such as Brodmann’s areas 44, 45, and 47 in auditory working memory, phonological processing, comprehension, and semantic judgment (Buckner et al. 1995
, Demb et al. 1995
, Fiez et al. 1996
, Stromswold et al. 1996
, Zatorre et al. 1996
, Gabrieli et al. 1998
, Stevens et al. 1998
, Friederici et al. 2003
). In animal studies, some investigators have reported that large lesions of lateral frontal cortical regions (which include the sulcus principalis region) in primates disrupt performance of auditory discrimination tasks (Weiskrantz & Mishkin 1958
, Gross & Weiskrantz 1962
, Gross 1963
, Goldman & Rosvold 1970).
Several studies have demonstrated that neurons in the PFC respond to auditory stimuli or are active during auditory tasks in Old and New World primates (Newman & Lindsley 1976
; Ito 1982
; Azuma & Suzuki 1984
; Vaadia et al. 1986
; Watanabe 1986
; Tanila et al. 1992
; Russo & Bruce 1994
; Bodner et al. 1996
), but extensive analysis of the encoding of complex sounds at the single-cell level was lacking. In these studies, weakly responsive auditory neurons were found sporadically and were distributed across a wide region of the PFC. Few of the early electrophysiological studies observed robust auditory responses in macaque prefrontal neurons; a single study noted phasic responses to click stimuli in the lateral orbital cortex, area 12o (Benevento et al. 1977
). The paucity of auditory activity in the PFC of nonhuman primates in earlier studies may reflect two factors. First, investigators often confined electrode penetrations to the caudal and dorsolateral PFC (Ito 1982
; Azuma & Suzuki 1984
; Tanila et al. 1992
; Bodner et al. 1996
), where presumptive auditory inputs to the frontal lobe are more dispersed. Second, recent data have shown that the auditory responsive zone in the macaque ventrolateral prefrontal cortex (VLPFC) is small, making it difficult to locate unless anatomical data or other physiological landmarks are used (O’Scalaidhe et al. 1997
; Romanski et al. 1999a).
Auditory Responsive Domain in VLPFC
The demonstration of direct projections from physiologically characterized regions of the auditory belt cortex to distinct prefrontal targets has helped guide electrophysiological recording studies in the PFC (Romanski et al. 1999b
). Using this anatomical information, an auditory responsive domain has been defined within areas 12/47 and 45 (collectively referred to as VLPFC) of the primate PFC (Romanski & Goldman-Rakic 2002
). Neurons in VLPFC are responsive to complex acoustic stimuli including, but not limited to, species-specific vocalizations. The auditory-responsive neurons are located adjacent to a face-responsive region that has been previously described (Wilson et al. 1993
; O’Scalaidhe et al. 1997
), which suggests the possibility of multimodal interactions, discussed below. VLPFC auditory neurons are located in an area that has been shown to receive acoustic afferents from ventral stream auditory neurons in the anterior belt, the parabelt, and the dorsal bank of the STS (Hackett et al. 1999
; Romanski et al. 1999a
; Diehl et al. 2008
). The discovery of complex auditory responses in the macaque VLPFC is in line with human fMRI studies indicating that a homologous region of the human brain, area 47 (pars orbitalis), is activated specifically by human vocal sounds compared with animal and nonvocal sounds (Fecteau et al. 2005
). The precise homology between the monkey prefrontal auditory area and auditory processing areas of the human brain has not been definitively characterized and awaits further study.
Figure 6 Location of auditory responsive cells in the ventrolateral prefrontal cortex (VLPFC). On the left, three coronal sections with electrode tracks (red lines) are shown with locations of auditory responsive cells indicated (black tics).
The initial study reporting auditory responses in the nonhuman primate VLPFC characterized the auditory responsive cells as being responsive to several types of complex sounds, including species-specific vocalizations, human speech sounds, environmental sounds, and other complex acoustic stimuli (Romanski & Goldman-Rakic 2002
). Whereas 74% of the auditory neurons in this study responded to vocalizations, fewer than 10% of cells responded to pure tones or noise stimuli. Some neurons had phasic responses that peaked at the onset of the stimulus, whereas other cells produced sustained responses that lasted the duration of the auditory stimulus or sometimes beyond it (Figure 7). The demonstration of neurons in the VLPFC responsive to complex auditory stimuli expanded the known circuitry for complex auditory processing and prompted a number of research questions about the role of the PFC in auditory processing.
Figure 7 Types of responses to auditory stimuli by prefrontal neurons. The responses of two single units to three different exemplars of auditory stimuli are shown in raster and histogram plots.
Representation of Vocalizations in VLPFC
Since the localization of a discrete sound-processing region in the PFC of nonhuman primates, research has focused on determining what neurons in this prefrontal area encode: perhaps neurons at higher levels of the auditory hierarchy process complex stimuli more abstractly, or more selectively, than do lower-order sensory neurons. As mentioned above, studies have shown that VLPFC auditory neurons do not readily respond to simple acoustic stimuli such as pure tones (Romanski & Goldman-Rakic 2002
) but are robustly responsive to vocalizations and other complex sounds (Averbeck & Romanski 2004
, Gifford et al. 2005
, Romanski et al. 2005
, Russ et al. 2008
). Are these higher-order auditory neurons more likely to process the referential meaning of communication sounds, or the complex acoustic features that make up these and other sounds? PET and fMRI studies have suggested that the human inferior frontal gyrus, or ventral frontal lobe, plays a role in semantic processing (Demb et al. 1995
, Poldrack et al. 1999
). Do nonhuman primates have a frontal lobe homologue that contains neurons encoding the referents of particular vocalizations?
In playback experiments using rhesus macaque vocalizations, monkeys respond behaviorally in a similar manner to vocalizations that share a functional referent, regardless of their acoustic similarity (Hauser 1998
, Gifford et al. 2003
). Thus the neural circuit guiding this behavior may include the VLPFC, which receives auditory information and is involved in a number of complex cognitive processes. However, it is highly unlikely that individual neurons are semantic detectors. Such representations, in which a neuron responds very selectively to a specific stimulus (so-called grandmother cells) or to a set of related stimuli, are rarely found in the macaque brain, where distributed representations are the norm (see, for example, Averbeck et al. 2003
). However, rather sparse representations have recently been reported in the human medial temporal lobe (Quiroga et al. 2008
), where a single cell may respond best to various referents of a single famous individual.
In a series of studies (Romanski et al. 2005
, Averbeck & Romanski 2006
), investigators have examined the coding properties of VLPFC neurons with respect to macaque vocalizations. VLPFC neurons were tested with a behaviorally and acoustically categorized library of Macaca mulatta calls (Hauser 1996
, Hauser & Marler 1993
), which contained exemplars from each of 10 identified call categories. Neurons in this study responded to between 1 and 4 of the different calls (Figure 8a; Romanski et al. 2005). It is interesting to note the similarity of call selectivity in VLPFC and in the lateral belt auditory cortex, where call preference indices also range between 1 and 4 for 75% of the recorded population (Figure 8b; Tian et al. 2001). Although these analyses suggest that prefrontal neurons are not simply call detectors, where a call detector is defined as a cell that responds with a high degree of selectivity to only a single vocalization, they do not offer direct evidence about whether neurons provide information about multiple call categories.
Figure 8 A comparison of selectivity to vocalizations in (a) the ventrolateral prefrontal cortex (VLPFC) and (b) the anterior lateral belt (AL). The number of cells responding to one or more vocalizations is shown, on the basis of each neuron’s half-peak response to all stimuli.
A number of studies have related information to stimulus selectivity or, more importantly, vocalization selectivity. Although the call selectivity index and information (specifically Shannon information; Cover & Thomas 2006) are both derived from firing rates, the relationship between the two measures is indirect. For example, if a neuron responded to 5 calls from a list of 10 but gave a different response to each of those 5 calls, it might show a selectivity index of one, two, three, four, or five, even though it was actually providing considerable information about all five of the calls to which it responded. In contrast, a neuron that responded to five of the calls with a similar response to all five would show a selectivity of five, but it would provide less information, because its response would indicate only whether one of the five responsive or one of the five nonresponsive calls had been presented.
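A minimal sketch may make this dissociation concrete. The code below simulates two hypothetical neurons with the response profiles just described, computes a half-peak selectivity index, and estimates Shannon information by Monte Carlo; the mean rates, Gaussian noise model, and binning scheme are illustrative assumptions, not details taken from the studies discussed.

```python
# Toy illustration (not data from the studies discussed): two simulated
# neurons responsive to the same five calls can have different half-peak
# selectivity indices and carry different amounts of information.
import numpy as np

def selectivity_index(mean_rates):
    """Number of calls whose mean rate reaches half the peak rate."""
    r = np.asarray(mean_rates, dtype=float)
    return int(np.sum(r >= r.max() / 2.0))

def stimulus_information(mean_rates, noise_sd=2.0, n_trials=20000, n_bins=8, seed=0):
    """Monte Carlo estimate of Shannon information between stimulus identity
    and the binned single-trial rate, assuming Gaussian trial noise."""
    rng = np.random.default_rng(seed)
    r = np.asarray(mean_rates, dtype=float)
    stims = rng.integers(len(r), size=n_trials)
    resp = r[stims] + noise_sd * rng.standard_normal(n_trials)
    edges = np.quantile(resp, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(resp, edges)
    joint = np.zeros((len(r), n_bins))
    np.add.at(joint, (stims, bins), 1)
    joint /= joint.sum()
    marg = joint.sum(1, keepdims=True) @ joint.sum(0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / marg[nz])))

# Neuron A: graded responses to 5 of 10 calls -> half-peak index below 5,
# yet informative about all five calls to which it responds.
neuron_a = [20, 16, 12, 8, 6, 0, 0, 0, 0, 0]
# Neuron B: identical responses to the same 5 calls -> index of exactly 5,
# but it only signals "responsive call" versus "nonresponsive call".
neuron_b = [20, 20, 20, 20, 20, 0, 0, 0, 0, 0]

for name, rates in (("A", neuron_a), ("B", neuron_b)):
    print(name, selectivity_index(rates),
          round(stimulus_information(rates), 2), "bits")
```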
Thus, decoding analyses and associated information theoretic techniques were used to examine the amount of information single neurons provide about individual stimuli (Romanski et al. 2005
). Specifically, this analysis characterized the number of stimuli that a single cell could discriminate and how well it could discriminate them. These analyses showed that single cells, on average, could correctly classify their best call in ~55% of individual trials, where chance is 10%. Performance for the second- and third-best calls fell quickly to ~32% and ~22% (Figure 9). The information estimates showed similar values and decreased accordingly. Thus, although single neurons certainly were not detecting individual calls, their classification performance dropped off quickly beyond their best few stimuli. This result is similar to the encoding of faces by temporal lobe “face” cells (Rolls & Tovee 1995).
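The logic of such a single-cell decoding analysis can be sketched in a few lines. The toy example below runs on simulated Poisson spike counts rather than recorded data: it performs leave-one-out classification of which of 10 stimuli was presented and then rank-orders the stimuli by decoding accuracy, in the spirit of the tuning curve in Figure 9.

```python
# Toy single-cell decoding analysis on simulated data (not recordings):
# classify which of 10 calls was presented from one neuron's spike count,
# using leave-one-out nearest-class-mean classification.
import numpy as np

rng = np.random.default_rng(1)
n_stim, n_trials = 10, 20
tuning = rng.uniform(2, 20, size=n_stim)               # hypothetical mean rates
counts = rng.poisson(tuning, size=(n_trials, n_stim))  # trials x stimuli

correct = np.zeros(n_stim)
for t in range(n_trials):
    mask = np.ones(n_trials, dtype=bool)
    mask[t] = False
    means = counts[mask].mean(axis=0)   # class means, held-out trial excluded
    for s in range(n_stim):
        pred = np.argmin(np.abs(counts[t, s] - means))
        correct[s] += (pred == s)

pct = 100 * correct / n_trials
for rank, s in enumerate(np.argsort(pct)[::-1], start=1):
    print(f"rank {rank}: stimulus {s}, {pct[s]:.0f}% correct (chance = 10%)")
```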
Figure 9 Total information (in bits) and average percent correct (as percent × 0.01). The graph shows a tuning curve for the population average of VLPFC cells rank-ordered according to optimum vocalization. On average, VLPFC cells contain 1.3 bits of information.
Auditory responsive neurons in the VLPFC have been further examined by determining how they classify different vocalization types. Investigators have used a hierarchical cluster analysis based on the neural response to exemplars from each of 10 classes of vocalizations. In this analysis, stimuli that gave rise to similar firing rates in prefrontal neurons were clustered together, and stimuli that gave rise to different firing rates fell into different clusters (Figure 10). After dendrograms were fit to individual neurons, a consensus tree (Margush & McMorris 1981) was built from the clusters that occurred most frequently across neurons. Because different neurons were tested with different lists of vocalization exemplars from each call category, any consistent clustering of particular stimuli across lists would suggest that VLPFC neurons were responding similarly to those classes of stimuli, rather than to individual tokens. A simplified version of this clustering logic is sketched below.
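The sketch that follows illustrates the per-neuron clustering step and a majority-rule tally of how often call types co-cluster across neurons, a simplified stand-in for the consensus-tree construction of Margush & McMorris (1981). The simulated firing rates and the choice of four clusters per neuron are arbitrary assumptions for illustration.

```python
# Simplified sketch of the clustering analysis: cluster 10 call types by
# each (simulated) neuron's mean firing rates, then count how often pairs
# of call types co-cluster across the population.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

call_types = ["coo", "warble", "grunt", "aggressive call", "copulation scream",
              "shrill bark", "girney", "gekker", "bark", "harmonic arch"]
rng = np.random.default_rng(0)
n_neurons, n_calls = 50, len(call_types)
co_cluster = np.zeros((n_calls, n_calls))

for _ in range(n_neurons):
    rates = rng.gamma(2.0, 4.0, size=n_calls)         # hypothetical mean rates
    tree = linkage(rates[:, None], method="average")  # dendrogram, this neuron
    labels = fcluster(tree, t=4, criterion="maxclust")
    co_cluster += labels[:, None] == labels[None, :]

co_cluster /= n_neurons
# Pairs that co-cluster in a majority of neurons would be candidates for
# the consensus groupings described in the text.
i, j = np.triu_indices(n_calls, k=1)
for a, b in zip(i, j):
    if co_cluster[a, b] > 0.5:
        print(f"{call_types[a]} ~ {call_types[b]}: {co_cluster[a, b]:.2f}")
```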
Figure 10 Typical prefrontal responses to macaque vocalizations and cluster analysis of mean responses. (a, d) The neuronal response to 5 of 10 vocalization stimuli presented during passive fixation (spike density function). (b, e) Mean responses.
At the population level, we found a few classes of stimuli that often clustered together. For example, aggressive calls and grunts, coos and warbles, and copulation screams and shrill barks all clustered together relatively often. When the vocalizations themselves were analyzed for similar spectral structure, several of the same clusters emerged, notably warbles and coos, and aggressive calls and grunts. Thus, the clusters that were consistent across the neural population were all composed of stimuli that were acoustically similar. Using other analysis methods, Tian et al. (2001) found similar results in lateral belt auditory cortex responses to a subset of rhesus macaque vocalizations: lateral belt neurons tended to respond in a similar manner to calls that had similar acoustic morphology.
Not all studies agree on the way vocalizations are categorized by neurons in the frontal lobe. One study examined responses to vocalization sequences in which one vocalization was presented several times and was then followed by a different vocalization that differed either semantically or acoustically (similar to the habituation/dishabituation task used previously; Saffran et al. 1996
, Gifford et al. 2003
). Although this task differs in important ways from the task used in Romanski et al. (2005)
, both are passive listening tasks. Gifford et al. (2005)
), however, found that summed population neural responses tended to differ at transitions between semantically different categories but not at transitions between semantically similar categories. Summed population responses, though, lose much of the information present in individual neural responses (Reich et al. 2001
, Montani et al. 2007
). Thus, there may be information at the single-cell level that is lost when population responses are examined. Moreover, although the early responses to some semantically similar calls appeared similar, after ~200 ms the neural responses appeared to diverge; this is about the same time that acoustic differences between the relevant stimuli emerge. Furthermore, to truly distinguish semantic categorization from acoustic categorization, it seems important to show that sounds that have similar acoustic morphology but differ in semantic context do not evoke similar responses, which has not yet been demonstrated. Finally, it may not be possible to dissociate acoustics and semantics, because some calls that share a number of acoustic features may be uttered under similar behavioral contexts (Hauser & Marler 1993).
Another recent study used neural decoding approaches to compare how well superior temporal gyrus (STG) and VLPFC neurons discriminate among the 10 vocalization classes (Russ et al. 2008
). This study showed that STG neurons carry more information about the vocalizations than do VLPFC neurons and also suggested that information about stimuli was maximal with extremely small bin sizes. It is difficult to interpret this result, however, because one needs more trials than parameters to estimate decoding models, usually by at least a factor of 10 (Averbeck 2009
), unless regularization techniques are being used (Machens et al. 2004
). The small bin sizes in the analysis used in Russ et al. (2008)
), however, led to the opposite situation: with a bin size of 2 ms over a window of 700 ms, estimating mean responses to 10 stimuli requires 3,500 parameters (350 bins × 10 stimuli), whereas only on the order of several hundred trials in total were available to fit these models. Similar analyses (Averbeck & Romanski 2006
) have found that a bin size of 60 ms is optimal using twofold cross validation, which minimizes the problem of overfitting. Overfitting is often still present with leave-one-out cross validation.
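The arithmetic behind this parameter-counting argument is easy to make explicit. The sketch below, a hypothetical helper rather than any published analysis code, counts the mean parameters of a bin-by-bin decoder and applies the trials-to-parameters rule of thumb cited above.

```python
# Worked check of the parameter-counting argument: a decoder that estimates
# one mean response per time bin per stimulus needs roughly ten times as
# many trials as parameters (Averbeck 2009) unless it is regularized.
def decoder_params(window_ms, bin_ms, n_stimuli):
    """Number of mean parameters for a bin-by-bin decoding model."""
    n_bins = window_ms // bin_ms
    return n_bins * n_stimuli

for bin_ms in (2, 20, 60):
    p = decoder_params(window_ms=700, bin_ms=bin_ms, n_stimuli=10)
    print(f"{bin_ms:>2} ms bins: {p:>4} parameters, ~{10 * p} trials recommended")

# 2 ms bins -> 3500 parameters (~35,000 trials), far beyond the several
# hundred trials available; 60 ms bins -> ~110 parameters, a tractable fit.
```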
Selectivity in VLPFC neurons is also somewhat controversial. Russ et al. (2008) calculated a preference index for VLPFC neurons and reported that prefrontal neurons responded to more than 5 of 10 vocalizations, indicating very little selectivity. This result stands in contrast to that of Romanski et al. (2005), in which neurons that had first been shown to be responsive to vocalizations had an average preference index of 2–3 call types out of 10. The apparent lack of selectivity (i.e., high call preference index values) in some studies could be due to the inclusion of neurons that were not responding to the vocalizations at all, a practice that would inflate this metric.
Computational Approaches to Understanding Vocalization Feature Coding in VLPFC
Although the previous analyses have determined that prefrontal neurons respond to complex sounds including vocalizations, they still do not address the question of which features of the stimuli are actually driving the neural responses. One way to address this question is to use a feature elimination approach (Rolls et al. 1987
, Rauschecker et al. 1995
, Tanaka 1997
, Rauschecker 1998a
, Kayaert et al. 2003
). In this approach, one starts with a complex stimulus that robustly drives the neuron and then removes features from the stimulus. If the neuron still responds to the reduced stimulus, the remaining features must be driving the neural response. This approach has been used in studies of marmoset primary auditory cortex neurons and of macaque belt auditory cortex (Wang 2000
; Rauschecker et al. 1995
; Rauschecker 1998a
One approach to this question is to use principal and independent component analysis (PCA/ICA) to rigorously define feature dimensions (Averbeck & Romanski 2004
). PCA identifies features that correspond closely to the dominant spectral or second-order components of the calls, for example, the formants, whereas ICA identifies features related to components beyond second order. Specifically, ICA can be used to extract features that retain bi-spectral (thirdor higher-order) nonlinear components of the calls. This study showed that after projecting each call into a subset of the principal or independent components, the dominant Fourier components seen in the spectrogram were preserved (). The independent components, however, retained power across multiple harmonically related frequencies (). Furthermore, examination of the bi-coherence, which shows phase locking across harmonically related frequencies, showed that the ICA subspace tended to retain phase information across multiple harmonically related frequencies, whereas the PCA retained phase information only at frequencies that had more power, which tended to be lower frequencies, and phase may be highly important for stimulus identification (Oppenheim & Lim 1981
Figure 12 Principal and independent component filtering of a coo. (a) Spectrogram of an unfiltered coo. (b) Spectrogram of a coo after projecting into the first 10 principal components. (c) Spectrogram of a coo after projecting into the first 10 independent components.
Preliminary neurophysiological evidence suggests that the independent components preserve more of the features that are important to the neural responses. In many cases, the neural response to an ICA-filtered call was similar to the response to the original call, and in some cases the response to the ICA-filtered call was even stronger than the response to the unfiltered call (Figure 13). This result could occur if some of the feature dimensions present in the original call were actually suppressive, in which case removing them would result in a stronger response. Thus, this approach provides a tool for uncovering the stimulus features in macaque vocalizations that may be driving neural responses. Furthermore, it allows one to compare the ability of PCA and ICA to identify the features of vocalizations that are most relevant to auditory neurons.
Figure 13 Single neuron responses to original and filtered calls. (a) Response of a single VLPFC neuron to a shrill bark and to the same shrill bark filtered with either the first 10 principal or independent components. (b, d) Mean firing rates to original and filtered calls.
Complementary to the work on feature elimination, other techniques have been used to identify which features of the vocalizations are driving the time-varying neural responses (Averbeck & Romanski 2006
, Cohen et al. 2007
). In this work, rather than taking a feature-elimination approach to preserve neural responses to reduced stimuli, investigators compared models that attempt to predict bin-by-bin neural responses from the time course of the vocalizations. In one study, Averbeck & Romanski (2006) first examined the features of the vocalizations that were useful for discriminating among the classes of calls, because these are likely to be the behaviorally relevant features and therefore the features being processed by the auditory system. Models were then used to examine how these features might be encoded in the responses of single neurons.
This study found that global features of the auditory stimuli, including frequency and temporal contrast, were not highly useful for discriminating among stimuli (Figure 14). Dynamic features of the stimuli captured by a hidden Markov model (HMM), however, were much more effective (~40% discrimination performance for global features versus ~75% for dynamic features). Thus, the HMM captured much more of the statistical structure of the vocalizations necessary to distinguish among them.
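Class-conditional HMM discrimination of this kind can be sketched compactly. The example below uses the hmmlearn package on simulated two-dimensional feature trajectories (stand-ins for spectral features, with three classes instead of ten for brevity): it fits one Gaussian HMM per call class and labels a new sequence by the model with the highest likelihood.

```python
# Sketch of class-conditional HMM discrimination on simulated feature
# trajectories: fit one HMM per call class, then classify a new call by
# maximum likelihood across the class models.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def make_call(offset, length=50):
    """Hypothetical 2-D feature trajectory for one call (simulated)."""
    drift = np.cumsum(0.1 * rng.standard_normal((length, 2)), axis=0)
    return drift + offset

# Three hypothetical call classes, each defined by a feature offset.
classes = {c: [make_call(np.array([2.0 * c, -2.0 * c])) for _ in range(20)]
           for c in range(3)}

models = {}
for c, calls in classes.items():
    X = np.vstack(calls)                       # stacked training sequences
    lengths = [len(x) for x in calls]          # sequence boundaries
    models[c] = GaussianHMM(n_components=3, random_state=0).fit(X, lengths)

test = make_call(np.array([2.0, -2.0]))        # a new call from class 1
scores = {c: m.score(test) for c, m in models.items()}
print("predicted class:", max(scores, key=scores.get))
```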
Figure 14 Temporal and spectral modulation of coo and gekker. (a, top) Spectrogram of coo. (Second from top) Modulation spectra of coo (i.e., Fourier transform of the spectrogram). (Third from top) Average frequency modulation, computed by averaging across the time axis.
The hypothesis that VLPFC neural responses could be accounted for by the HMM was subsequently examined. The HMM produced, for each vocalization, an estimate of the probability that the vocalization came from each of the 10 classes as a function of time. Thus, the HMM maps the time-frequency representation of the vocalization onto a time-probability representation. Because there is a time-frequency representation and a corresponding time-probability representation, linear transformations of either representation to predict neural responses could be examined. The linear transformation of the time-frequency representation is often known as a spectral-temporal receptive field (STRF). The linear transformation of the probability representation was analogously called a linear-probabilistic receptive field (LPRF). The LPRF was better able to predict the neural responses than was the STRF (Averbeck & Romanski 2006
). This finding suggests that the time-probability representation may be closer to the input of VLPFC neurons than a time-frequency representation would be, if one assumes that neurons cannot compute highly nonlinear functions of their inputs.
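This model comparison reduces to fitting two linear encoding models of the same binned response. The sketch below simulates a neuron whose firing follows the class probabilities (so the LPRF-style model should win by construction) and compares ridge-regression fits from a time-frequency matrix and a time-probability matrix under twofold cross-validation; all data are simulated and serve only to illustrate the comparison.

```python
# Sketch of the STRF vs. LPRF comparison: predict binned firing rates from
# either a time-frequency representation or a time-probability representation
# using ridge regression. All data are simulated for illustration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_bins, n_freqs, n_classes = 200, 16, 10

spec = rng.random((n_bins, n_freqs))                  # time-frequency input
logits = rng.standard_normal((n_bins, n_classes))
prob = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Simulated neuron whose rate is a linear function of class probabilities,
# so the LPRF model should outperform the STRF model by construction.
w = rng.standard_normal(n_classes)
rate = prob @ w + 0.3 * rng.standard_normal(n_bins)

for name, X in (("STRF (time-frequency)", spec),
                ("LPRF (time-probability)", prob)):
    r2 = cross_val_score(Ridge(alpha=1.0), X, rate, cv=2).mean()
    print(f"{name}: twofold cross-validated R^2 = {r2:.2f}")
```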
Spectrogram (a) and time-probability plot (b) of a copulation scream. Time probability values were generated by the Hidden Markov Model.
Much of the work just described has developed out of an interaction between theoretical and experimental approaches. This work can be further developed in several directions, both theoretical and experimental. For example, one of the most fruitful ways to assess sensory processing is to try to understand how signals evolve across synapses. Therefore, a better understanding of the representation of vocalizations in the areas that send acoustic information to VLPFC, using the same tools that have been previously used to study VLPFC, would be highly useful. For example, it is well known that spectral-temporal receptive fields can accurately characterize representations very early in the auditory system, whereas the representation in VLPFC can be better characterized using an HMM, which is related specifically to categories of macaque vocalizations. Where in the brain does the processing or computation take place that develops this categorical representation from the spectral-temporal representation? Has it happened already at the level of the sensory thalamus, or is it carried out at the cortical level? Recordings at early stages of auditory processing, using similar stimulus sets and analyzed with the HMM, could help resolve this question. Clarification in this area could have direct relevance for understanding why strokes or lesions at various stages of auditory processing induce particular deficits.
Moreover, although much of the work up to this point has focused on the sensory representation in VLPFC, there may also be an associated motor representation, or this sensory representation may be important for motor processing at some level. Although some studies have claimed that the ventral premotor area in the macaque (area F5), which contains mirror neurons, is the evolutionary precursor to the human language system (Rizzolatti & Arbib 1998
), several features of VLPFC suggest that it too could be an important evolutionary precursor to human language areas. First, it clearly contains a representation of vocalizations, which are a more likely candidate precursor to human language than are hand movements. Second, the VLPFC representation is just ventral to a significant motor-sequence representation in caudal area 46 (Isoda & Tanji 2003
, Averbeck et al. 2006
). Third, previous studies in nonhuman primates have suggested that VLPFC may be important in the association of a stimulus with a particular motor response, known as conditional association or action selection (Petrides 1985
, Rushworth et al. 2005
). One might consider the processes of articulation and phonation as an elaborate series of conditional associations between specific acoustic stimuli and precise articulatory movements. Thus, the combination of a motor-sequence representation and a vocalization representation may provide important components of a macaque system that could be a precursor to the human language system.
Future studies that include recordings in more natural settings, where callers and listeners can be monitored electrophysiologically, may allow us to determine which features, behavioral or acoustic, are processed by VLPFC neurons. Combining vocalizations with the corresponding facial gestures may also offer clues to semantic processing by the PFC. This possibility is intriguing because of the discovery that some VLPFC neurons are, in fact, multisensory (Sugihara et al. 2006
) and respond to both particular vocalizations and the accompanying facial gesture.