Mouse ultrasonic vocalizations (USVs) are often used as behavioral readouts of internal states, to measure effects of social and pharmacological manipulations, and for behavioral phenotyping of mouse models for neuropsychiatric and neurodegenerative disorders. However, little is known about the neurobiological mechanisms of rodent USV production. Here we discuss the available data to assess whether male mouse song behavior and the supporting brain circuits resemble those of known vocal non-learning or vocal learning species. Recent neurobiology studies have demonstrated that the mouse USV brain system includes motor cortex and striatal regions, and that the vocal motor cortex sends a direct sparse projection to the brainstem vocal motor nucleus ambiguus, a projection thought to be unique to humans among mammals. Recent behavioral studies have reported opposing conclusions on mouse vocal plasticity, including vocal ontogeny changes in USVs over early development that might not be explained by innate maturation processes, evidence for and against a role for auditory feedback in developing and maintaining normal mouse USVs, and evidence for and against limited vocal imitation of song pitch. To reconcile these findings, we suggest that the trait of vocal learning may not be dichotomous but encompass a broad set of behavioral and neural traits we call the continuum hypothesis, and that mice possess some of the traits associated with a capacity for limited vocal learning.
Laboratory mice (Mus musculus) and rats (Rattus norvegicus) participate in a significant amount of communication using ultrasonic vocalizations (USVs) produced at frequencies ranging from 30-110 kHz (Costantini & D'Amato, 2006; Portfors, 2007). Traditionally, two types of USVs have been studied in laboratory rodents as measures of internal states: pup isolation calls (Branchi, Santucci, & Alleva, 2001; Brudzynski, Kehoe, & Callahan, 1999; D'Amato, Scalera, Sarli, & Moles, 2005; Elwood & Keeling, 1982; Hahn, Hewitt, Adams, & Trully, 1987; Hofer & Shair, 1992; Ise & Ohta, 2009; Noirot & Pye, 1969; Sales & Smith, 1978; Wöhr, Dalhoff, et al., 2008a) and adult USVs in aversive or rewarding conditions (Brudzynski, 2007; 2009; Burgdorf et al., 2007; Knutson, Burgdorf, & Panksepp, 2002; Wöhr, Houx, et al., 2008b). Reliable elicitation of isolation calls by quantifiable stimuli and a well characterized developmental trajectory have made pup USVs a useful tool for testing the effects of anxiogenic or anxiolytic compounds (Dirks et al., 2002; Fish, Faccidomo, Gupta, & Miczek, 2004; Fish, Sekinda, Ferrari, Dirks, & Miczek, 2000) and for phenotyping mouse models of neuropsychiatric disorders associated with deficits in vocal communication (Scattoni, Crawley, & Ricceri, 2009).
Adult mouse USVs appear to both signal internal emotional states and facilitate social communication during non-aggressive encounters (Gourbal, Barthelemy, Petit, & Gabrion, 2004; Moles, Costantini, Garbugino, Zanettini, & D'Amato, 2007; Portfors, 2007). The most well characterized adult mouse USVs are those produced by males in a mating context. Males of many strains produce long bouts of USVs during courtship of a female and after copulation (Costantini & D'Amato, 2006; Gourbal et al., 2004; Nyby, 1983; Portfors, 2007). Male courtship USVs are sexually selective, and pheromones present in female urine are a strong and sufficient trigger (Guo & Holy, 2007). In two-choice experiments females responded with approach behavior preferentially to adult male USVs over pup isolation calls (Hammerschmidt, Radyushkin, Ehrenreich, & Fischer, 2009; Musolf, Hoffmann, & Penn, 2010), and spent more time with vocalizing males (Pomerantz, Nunez, & Bean, 1983).
Although the general occurrence of male mouse USVs has been known for decades, the spectro-temporal and syntactic features of male courtship USVs were only recently analyzed in depth. Holy and Guo showed that courtship USVs from different males contain identifiable syllable types produced in regular temporal patterns that differed between individuals (Holy & Guo, 2005). Moreover, the long strings of syllables they recorded sounded remarkably similar to some bird songs when the pitch of the USVs was shifted to the human audible frequency range and played in real time (Supplementary Audio 1). After observing the complexity of mouse USVs, individual differences, and their similarity to some birdsongs, many researchers wondered what the neural substrate for USV production is, whether mice might share central control mechanisms for vocalization with vocal learning species like songbirds and humans, and whether mouse vocalizations are innate or learned.
The generally accepted list of vocal learning species includes three lineages of birds (songbirds, parrots, hummingbirds) and up to five lineages of mammals (humans, cetaceans [dolphins and whales], bats, elephants, and pinnipeds [sea lions and seals]) (Janik & Slater, 1997; 1997; Jarvis, 2004; 2004; Schusterman, 2008; Schusterman & Reichmuth, 2008). This vocal learning ability, which includes the ability to modify the spectral and syntactic composition of vocalizations, is a rare trait that serves as a critical substrate for human speech (Doupe & Kuhl, 1999; Jarvis, 2004; Marler, 1970a). It has been well studied in humans and songbirds because songbirds display a capacity for vocal mimicry using a process similar to human speech acquisition (Doupe & Kuhl, 1999; Marler, 1970a) and some species are easy to breed and study in the laboratory. Underlying the vocal learning process in both humans and song learning birds are specialized forebrain circuits so far not found in species that produce only innate vocalizations, despite decades of searching for them (Jarvis, 2004; Jürgens, 2009). Even closely related non-human primate species reportedly lack the behavioral and neural elements classically associated with a capacity for vocal learning (Hammerschmidt, Freudenstein, & Jürgens, 2001; Janik & Slater, 1997; Jürgens, 2009). Like non-human primates, mice have been assumed to be vocal non-learners (Enard et al., 2009; Fischer & Hammerschmidt, 2010; Jarvis, 2004), but this had not been tested. Here we discuss the concepts of innate versus learned vocal communication, give an overview of the neural pathways involved, critically review recent studies that have approached the issue of vocalization in mice (Arriaga, Zhou, & Jarvis, 2012; Chabout et al., 2012; Grimsley, Monaghan, & Wenstrup, 2011; Hammerschmidt et al., 2012; Kikusui et al., 2011), address some conflicting views, and propose avenues for reconciliation. 
The views we propose will be relevant to all studies on innate and learned vocal communication in vertebrates.
Many animals communicate by broadcasting species-typical acoustic signals including insects, frogs, birds, and mammals. However, not all of these sounds are classically defined vocalizations, which are produced by a vocal organ. The vocal organ in birds is the syrinx, and it is the larynx in frogs and most mammals. Dolphins, a marine mammal, are believed to vocalize using specialized nasal sacs in addition to the larynx (Madsen, Jensen, Carder, & Ridgway, 2012). Gross laryngeal anatomy is well conserved among mammals, including between mouse and human, and most of the cartilages and muscles are similarly positioned in both species (Harrison, 1995; Thomas, Stemple, Andreatta, & Andrade, 2009). Premotor signals to the larynx are transmitted via the superior and recurrent laryngeal nerves, and their shared root is the brainstem nucleus ambiguus (Amb). Mouse USVs are most likely generated by the larynx, as revealed in laryngeal nerve transection and electrophysiology studies. Bilaterally severing the recurrent laryngeal nerve abolishes pup and adult USVs (Nunez, Pomerantz, Bean, & Youngstrom, 1985; Roberts, 1975). Electrical recordings in anesthetized rats show that a majority of the Amb motoneurons recorded display tonic bursts tightly coupled to and preceding sound production by 46 ms (Yajima, Hayashi, & Yoshii, 1982). Similar results were obtained for extracellular recordings in awake Southern pigtailed macaques (Macaca nemestrina), with bursts in Amb associated with variations in vocal output preceding vocalization by 100-200 ms (Yajima & Larson, 1993). Preliminary observations indicate that the explanted mouse larynx is capable of producing sounds displaying the non-linear dynamics characteristic of natural USVs (Berquist, Ho, & Metzner, 2010). However, these sounds were in the human audible spectrum and it remains unclear if they depend on vibrations of the vocal folds or a whistle mechanism. 
Other body parts can be used to produce sounds, such as the lips for lip smacking or whistling and wing beats in insects, but only the larynx and syrinx are known to have the capacity to produce the complex imitated vocalization repertoire observed in humans and song learning birds (Hauser & Konishi, 1999).
Vocalizations can take many forms, the parameters of which are often heavily determined by the production and perceptual mechanisms of the sender and receiver of the acoustic signals. Example spectrograms of a spoken human sentence, songs of a male zebra finch (Taeniopygia guttata) and canary (Serinus canaria), call of a ringdove (Streptopelia risoria), predator alarm call of a vervet monkey (Chlorocebus pygerythrus), and courtship USV of a male mouse reveal the diversity of sounds generated by laryngeal and syringeal mechanisms (Fig. 1). An example recording of a male mouse song shifted into frequencies audible to humans and slowed to highlight the pitch transitions can be heard in Supplementary Audio 2. A sonogram representing 1 second from the same USV bout is shown below (Fig. 1e). These USVs are typically composed of whistle-like syllables that are more similar to the vocalizations of dolphins, some songbirds like canaries, and several primate species, like marmosets. Spectrally, these USVs are unlike the typical vocalizations of zebra finches, parrots, and humans; however, such differences do not preclude them from being used to model mechanisms of vocal production across species.
Many species produce a diverse repertoire of vocalizations that can include calls, songs, “laughter”, and cries. We review some important classifications and describe how they may relate to mouse USVs.
Notes are the most basic acoustic unit, and are formed by a single continuous sound with gradual variations in fundamental frequency. One or more notes can be combined to form Calls and Syllables, which are reproducible single acoustic units separated by periods of silence. Although syllables are structurally similar to calls, we distinguish them from calls by patterns of usage. Calls are typically produced in isolation or in short bursts and may obtain semantic content on their own (Seyfarth, Cheney, & Marler, 1980). Syllables, however, derive their classification from being included in a larger unit representing a longer series of rapidly produced vocalizations of varying types. A reproducible series of syllables with a relatively fixed order is labeled a ‘motif’. By clustering units into motifs, an animal with a repertoire of only a few syllables can generate a wide variety of larger communication units. In this classification scheme, syllables can be void of specific meaning themselves, and they would not necessarily serve a communication function if produced in isolation. This distinction is not always entirely clear. For example, the long call of male zebra finches can function alone as a contact call or be incorporated into a motif that is reproduced in song bouts (Zann, 1990). In this case, the same unit could be labeled a call or a syllable depending on the context of production.
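The idea that a motif is a recurring, relatively fixed series of syllables can be illustrated with a minimal Python sketch. This is not a method from any of the studies cited here; it simply counts repeated fixed-length runs of syllable labels and treats the recurring ones as candidate motifs. The function name `candidate_motifs`, the motif length, and the minimum repeat count are hypothetical choices made for the example.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def candidate_motifs(
    syllables: Sequence[str],
    length: int = 3,      # hypothetical motif length, in syllables
    min_count: int = 2,   # hypothetical minimum number of occurrences
) -> Dict[Tuple[str, ...], int]:
    """Count every run of `length` consecutive syllable labels and
    return the runs that occur at least `min_count` times."""
    counts = Counter(
        tuple(syllables[i:i + length])
        for i in range(len(syllables) - length + 1)
    )
    return {motif: n for motif, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # A made-up sequence of syllable labels, not real data.
    bout = list("ABCABCAD")
    print(candidate_motifs(bout))
```

Overlapping runs are counted separately in this sketch, so a real analysis would likely need a stricter criterion for what counts as a reproducible sequence.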
Adult mouse USVs feature reproducible sound units that different groups have categorized by their spectral morphology (Fig. 2) (Arriaga et al., 2012; Grimsley et al., 2011; Holy & Guo, 2005; Scattoni, Gandhy, Ricceri, & Crawley, 2008). Most of these units are frequently produced in long sequences containing different types of sound units, and some simple motifs (Holy & Guo, 2005). We will call these recurring units of adult male USVs ‘syllables’ because they are grouped into non-random series, rarely produced in isolation, and there is no evidence that they serve a communication function individually.
In a study by our laboratory on mouse USVs produced in response to female urine, we used a modified version of the Holy & Guo categorization method to identify 8 common and 3-4 rare (<1% of repertoire) syllable types produced by adult males of the B6D2F1/J (BxD) and C57BL/6J (B6) strains (Fig. 2) (Arriaga et al., 2012). The first major morphological distinction between syllable types under this classification scheme is the presence or absence of an instantaneous ‘pitch jump’ separating notes within a syllable. Thus, the morphologically simplest syllable type does not contain any pitch jumps (Type A in Fig. 2). For syllables containing pitch jumps, each jump marks the end of one note and the beginning of the next note. Two-note syllables are identified by a single upward or downward pitch-jump (Types B & C in Fig. 2, respectively). Similarly, more complex syllables are identified by the series of upward and downward pitch jumps occurring as the fundamental frequency varies between notes of higher and lower pitch (Types D – H in Fig. 2).
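The pitch-jump criterion described above can be sketched in Python under some loudly stated assumptions: the syllable is given as a list of fundamental-frequency samples in kHz, a "jump" is any sample-to-sample change of at least a threshold, and the 10 kHz default is a hypothetical value chosen for illustration, not one taken from Arriaga et al. (2012). Only Types A, B, and C are mapped explicitly, because the text does not specify which jump sequence corresponds to each of Types D through H.

```python
from typing import List, Sequence


def pitch_jumps(freqs_khz: Sequence[float],
                jump_threshold_khz: float = 10.0) -> List[str]:
    """Return 'u' (up) or 'd' (down) for each sample-to-sample change
    whose magnitude reaches the (hypothetical) jump threshold."""
    jumps = []
    for prev, cur in zip(freqs_khz, freqs_khz[1:]):
        delta = cur - prev
        if delta >= jump_threshold_khz:
            jumps.append("u")
        elif delta <= -jump_threshold_khz:
            jumps.append("d")
    return jumps


def classify_syllable(freqs_khz: Sequence[float],
                      jump_threshold_khz: float = 10.0) -> str:
    """Label a syllable by its sequence of pitch jumps."""
    jumps = pitch_jumps(freqs_khz, jump_threshold_khz)
    if not jumps:
        return "A"            # no pitch jumps: single note
    if jumps == ["u"]:
        return "B"            # one upward jump: two notes
    if jumps == ["d"]:
        return "C"            # one downward jump: two notes
    # More than one jump: somewhere among Types D-H; the exact
    # mapping is not given in the text, so report the sequence.
    return "D-H:" + "".join(jumps)


if __name__ == "__main__":
    print(classify_syllable([70.0, 71.0, 72.0]))
    print(classify_syllable([60.0, 61.0, 80.0, 81.0]))
    print(classify_syllable([60.0, 80.0, 60.0]))
```

A jump that is smeared across several samples would be missed by this sample-to-sample test, so the sketch assumes the frequency trace is sampled coarsely enough that each jump falls between two adjacent samples.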
Other researchers have categorized syllables differently, including grouping some of these types and splitting others into sub-types according to the pitch trajectory or note duration (Fischer & Hammerschmidt, 2010; Grimsley et al., 2011; Kikusui et al., 2011; Scattoni et al., 2008). For example, the single note contained in our Type A syllables can have short or long duration, and long notes can be further split based on a downward, upward, chevron-shaped, complex, or flat trajectory (Fig. 3). Our Type B & C syllables have been lumped by others into a two-note super-category (Fischer & Hammerschmidt, 2010; Grimsley et al., 2011; Kikusui et al., 2011; Scattoni et al., 2008) despite having clearly distinct morphologies. Similarly, our syllable Types D through H have been grouped into a ‘Frequency Steps’ super-category (Scattoni et al., 2008), or into a more than one jump category (Kikusui et al., 2011), and one study grouped all syllable types containing pitch jumps (Types B through H) into a ‘whistles with pitch jumps’ category (Fischer & Hammerschmidt, 2010). A combination of pitch jump sequences and frequency contours may be necessary to accurately capture the variability of mouse vocal behavior. The number and sequence of pitch jumps can serve as an initial discriminator, followed by a refined categorization based on frequency contours and duration, as described for syllable type A. However, arbitrarily grouping syllables with measurably different numbers and sequences of pitch jumps obscures real variability in vocal behavior and may complicate subsequent analysis of heterogeneous syllable categories. The issue of classification is an active area of investigation that has not yet reached consensus. A conference was recently held at the Institut Pasteur in Paris, France in April 2012 to address problems of syllable/note classification (http://www.ura2182.cnrs-bellevue.fr/workshop_usv/). 
Until a robust classification scheme is developed, negative results must be interpreted with caution due to the possibility of improper classification (grouping very different syllables) masking real effects.
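The second stage of the two-stage scheme suggested above (pitch jumps first, then contour and duration) can be sketched for single-note syllables. This is only an illustration: the function name `refine_single_note`, the 10 ms cutoff for "short" notes, and the 3 kHz range below which a note is called "flat" are all hypothetical values, and the contour rules are one simple way to separate the downward, upward, chevron-shaped, complex, and flat trajectories named in the text.

```python
from typing import Sequence


def refine_single_note(freqs_khz: Sequence[float],
                       duration_ms: float,
                       short_ms: float = 10.0,   # hypothetical cutoff
                       flat_khz: float = 3.0) -> str:  # hypothetical cutoff
    """Sub-classify a syllable with no pitch jumps by duration and contour."""
    if not freqs_khz:
        raise ValueError("need at least one frequency sample")
    if duration_ms < short_ms:
        return "short"
    lo, hi = min(freqs_khz), max(freqs_khz)
    if hi - lo < flat_khz:
        return "flat"
    diffs = [b - a for a, b in zip(freqs_khz, freqs_khz[1:])]
    if all(d >= 0 for d in diffs):
        return "upward"
    if all(d <= 0 for d in diffs):
        return "downward"
    # Chevron: rises to a single interior peak, then falls.
    peak = list(freqs_khz).index(hi)
    if (0 < peak < len(freqs_khz) - 1
            and all(d >= 0 for d in diffs[:peak])
            and all(d <= 0 for d in diffs[peak:])):
        return "chevron"
    return "complex"


if __name__ == "__main__":
    print(refine_single_note([60.0, 70.0, 80.0, 70.0, 60.0], duration_ms=30.0))
```

Because the cutoffs are arbitrary, changing them moves syllables between categories, which is one concrete way that the choice of classification scheme can mask or create apparent effects.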
A song is a set of vocalizations, often elaborate, delivered periodically and sometimes with a rhythm. Songs may be produced spontaneously or in response to an external stimulus such as the presence of a conspecific. Songs typically contain multiple syllable types, or categories of reproducible vocalizations distinct from other vocalizations comprising the song. To distinguish a series of syllables in a song from a succession of calls we will apply the sensu strictissimo definition used previously (Holy & Guo, 2005) and borrowed from Broughton (Broughton, 1963):
‘a sound of animal origin which is not both accidental and meaningless’
‘a series of notes, generally of more than one type, uttered in succession and so related as to form a recognizable sequence or pattern in time,’
‘a complete succession of periods or phrases’
Holy and Guo's analysis of the spectro-temporal features of male courtship USVs demonstrated that these vocalizations satisfy all conditions required for classification as song (Holy & Guo, 2005). Visually, the song-like quality of male mouse courtship USVs can be appreciated in spectrograms of longer sequences (Fig. 4). Acoustically, when the pitch of courtship USVs is shifted to the audible spectrum they sound very similar to some birdsongs in both temporal and melodic structure (Supplementary Audio 1). The behavioral responses of conspecifics also provide clues that male mouse songs are distinct from calls. Males often do not sing in isolation or to other males, but are triggered to sing by the presence of a female or female urine (Guo & Holy, 2007; Musolf et al., 2010; Nyby, 1983). Female mice can distinguish male songs from pup isolation calls (Hammerschmidt et al., 2009). Given the choice, females selectively approach the source of the songs instead of the source of the isolation calls. Preference for male songs is striking given that pup calls are considered a very strong and reliable stimulus, and the frequency ranges of the two signals overlap significantly. Moreover, a separate choice experiment reported a slight tendency for females to prefer the songs of non-kin males (Musolf et al., 2010), further suggesting that individual songs are distinguishable and could serve an important social and reproductive function.
Despite the structural and behavioral evidence that they meet the sensu strictissimo definition of song, some researchers still prefer to refer to them as very long sequences of calls. We leave it to the reader to decide based on the evidence, but for the purpose of this review we will simply refer to these vocalizations as “mouse songs”. This designation does not necessarily imply learning. For example, the songs of some suboscine passerine birds are innate, although they share structural and behavioral characteristics with the learned songs of oscine songbirds (Kroodsma & Konishi, 1991). Likewise some calls of oscine songbirds are learned (Simpson & Vicario, 1990). Given the various learning strategies described, the multiple functions of vocal signals, and the existence of innate songs, our working definition of song deliberately excludes ontogeny and focuses primarily on phenotype.
Many types of learning are possible within the framework of vocal communication systems. Thus, it is important not only to determine what learning capabilities are present in the mouse vocal system, but also to distinguish which types of learning are most relevant to studies of speech learning in humans. Three types of learning are related to vocal communication systems: auditory comprehension learning, vocal usage learning, and vocal production learning (Egnor & Hauser, 2004; Janik & Slater, 1997; 2000; Jarvis, 2004; Schusterman, 2008).
Auditory comprehension learning is an auditory learning strategy characterized by the ability to associate a particular sound with an appropriate behavioral response or objects in the environment (Janik & Slater, 1997; Jarvis, 2004). Comprehension learning capabilities are broadly distributed among vertebrates. For example, dogs (Canis lupus familiaris) can be trained to respond to the human word ‘sit’; however, the vocal part of this training process is restricted to the act of correctly identifying the word through auditory learning. Learning in this case does not extend to the vocal production (i.e. vocal motor) component. Dogs don't learn to produce the word ‘sit’ by adaptively modifying motor commands to achieve the required sequence of laryngeal and respiratory patterns. However, some motor behaviors can be associated with a learned auditory cue. For example, the dog's typical learned response to the verbal command ‘sit’ is the motor act of sitting on the hindquarters.
Vocal usage learning is characterized by the ability to learn when and where, but not how, to produce vocalizations in a specific social or environmental context. Usage learning does not require acoustic vocal imitation. A well-studied example of usage learning is the alarm call repertoire of vervet monkeys produced in response to specific predator threats. An eagle (Polemaetus bellicosus) in the sky, a leopard (Panthera pardus) in the trees, and a python (Python sebae) on the ground will elicit different species-typical calls, and a young vervet monkey must learn through experience when it is appropriate to produce each call (Seyfarth et al., 1980). However, the spectral content of alarm calls is thought to be innately determined (Seyfarth & Cheney, 1986). Learning is restricted to the context or ‘when’ of production, but the ‘how’ is inflexible. Usage learning and comprehension learning are often intimately linked. For example, it is critical that a young vervet monkey learn not only which call to produce in response to each predator, but also learn the appropriate predator-specific defensive behavior to produce upon hearing each call. The leopard-specific call triggers retreat into the trees, and the eagle-specific call causes listeners to hide in the dense bush (Seyfarth et al., 1980). The learned association of auditory cues with effective predator defense strategies is similar to the training of a dog's behavior to verbal commands.
In contrast, vocal production learning is the ability to generate experience-dependent modifications of acoustic signals, and is considered the most relevant to the study of human speech (Janik & Slater, 1997; Jarvis, 2004). Strictly defined, production learning excludes changes in the amplitude and duration of vocalizations because they rely on control of respiratory patterns rather than control of the musculature of the vocal organ (Janik & Slater, 1997). In this context, the most dramatic and well-studied examples of vocal production learning are song learning in birds and speech learning in humans. Birdsong and speech share many features: auditory acquisition of learning templates, dependence on auditory feedback for learning and maintenance of learned vocalizations, temporally restrictive critical periods for learning, and specialized forebrain networks for vocal control (Doupe & Kuhl, 1999; Jarvis, 2004; Marler, 1970a). Because of these important similarities, songbirds have become the dominant neuroethological animal models for vocal learning studies.
One consequence of the intense focus on the songbird model is a situation where the meaning of the term ‘vocal learning’ has been restricted to refer exclusively to learning vocalizations de novo with reference to an externally acquired model, as occurs for birdsong and speech learning. Certainly, this type of vocal mimicry is the most relevant to study for modeling and understanding the process of human speech acquisition. However, we believe this represents an overly restrictive definition of vocal learning that ignores many other factors and strategies that can be used to adaptively modify the spectral content of vocalizations. For example, white-crowned sparrows (Zonotrichia leucophrys) that normally learn songs from a tutor will still produce novel songs despite having been raised in social isolation (Konishi, 1985). This process of generating an isolate song without previous instruction, or adding novel parts to a tutored song has been called improvisation (Janik & Slater, 1997; Konishi, 1964; Kroodsma, Houlihan, Fallon, & Wells, 1997; Marler, 1997).
Improvisation is one of the simplest ways that animals may change their vocalizations without explicit need for a tutored model. Using improvisation, an animal could rely on internal preference or the response of conspecifics to guide the learning process. Therefore, it is important to evaluate the relative roles of improvisation and imitation in any vocal learning species. In some experiments, grey catbirds (Dumetella carolinensis), which are a type of songbird, often failed to copy song models and routinely generated normal songs with novel elements not present in the template (Kroodsma et al., 1997). More strikingly, when the abnormal song of a socially isolated adult zebra finch was used as the tutor template, the tutored juveniles modified the song to more closely match a more typical finch song (Fehér, Wang, Saar, Mitra, & Tchernichovski, 2009). Accumulation of corrective improvisations over 5 generations was sufficient to transform the isolate song to a normal-sounding zebra finch song. Preferential learning by improvisation was performed even though all the birds should be perfectly capable of mechanically reproducing the isolate songs heard.
In some vocal learning species determination of what is worth learning is shaped by individuals other than the one learning. For example, non-singing female cowbirds (Molothrus ater) exert a strong sexual selection on male song development by selectively reinforcing song variants with their wing displays (West & King, 1988). The effect of female selection is so strong that both tutored and untutored males develop different songs depending on the preferences of co-housed females from different sub-species (King, 1983). Experiments with Pacific walruses (Odobenus rosmarus divergens) demonstrated that the preferences of human trainers could also reinforce novel vocal behavior (Schusterman & Reichmuth, 2008). Using a contingency learning paradigm, walruses were rewarded with fish when a vocalization was judged by the human trainer to be significantly different than the preceding vocalization. Under stimulus control, sounds in the existing repertoire were elaborated with pitch and contour changes, and several novel vocalizations emerged that had not been heard before.
It is clear that mimicry is not the only viable strategy for vocal production learning. Indeed, different strategies could have been necessary for different species to transition from generating exclusively innate sounds to generating novel sounds. For these reasons, we subscribe to the view proposed by Konishi (Konishi, 1985) by accepting as production learning the development of any vocalizations that depend on auditory feedback for the development or maintenance of spectral content. Under this definition it is the reliance on auditory feedback to control the vocal organ and guide the trajectory of sound development that is most important. Of secondary importance is whether the trajectory results in convergence toward or divergence from an external model, the emergence of internal preferences, or acquisition of a social or food reward.
It has been proposed that two different, but converging pathways are involved in the production of learned and innate vocalizations (Jarvis, 2004; Jürgens, 2009; Simonyan & Horwitz, 2011; Wild, 1994; 1997). According to this division of labor, innate calls are programmed by a phylogenetically older brainstem pathway, and the forebrain influences the context (i.e. usage) of calling but not acoustic structure. In contrast, control of the spectral content of learned calls would be given over to a phylogenetically more recent vocal pathway driven directly by forebrain premotor structures — the so-called Kuypers/Jürgens hypothesis (Fitch, Huber, & Bugnyar, 2010).
The brain pathway for programming acoustically innate vocalizations includes midbrain premotor structures and medullary motoneuron pools for motor control of phonation and respiration. This pathway has been found in all vocalizing avian and mammalian species studied to date, and homologous pathways can even be found in vocalizing fish (Bass & McKibben, 2003; Jürgens, 2009; Kittelberger, Land, & Bass, 2006; Wild, 1997). In both vocal learning and non-learning birds, this innate vocal circuit comprises the dorsomedial nucleus (DM) in the midbrain that projects to multiple medullary nuclei including the parabrachial region (PBr), the expiratory premotor nucleus retroambigualis (RAm), and the tracheosyringeal part of the hypoglossal nucleus (XIIts) that innervates the syrinx (Fig. 5) (Wild, 1997). The analogous vocal circuit in mammalian brains comprises the caudal periaqueductal gray (PAG) in the midbrain which projects to brainstem respiratory premotor nuclei including RAm for control of respiration, and cranial nerve nuclei including Amb that directly innervates the larynx (Fig. 5) (Ennis, Xu, & Rizvi, 1997; Jürgens, 1998; 2002a; 2009; Mantyh, 1983).
These pathways have been identified in two well-studied non-human primate models of vocalizations, the squirrel monkey (Saimiri sciureus) and rhesus macaque (Macaca mulatta). Decades of work by Uwe Jürgens and colleagues using anatomical tracing (Dujardin & Jürgens, 2005; Hannig & Jürgens, 2005; Jürgens, 1982; 1983; 1984; Jürgens & Alipour, 2002; Müller-Preuss & Jürgens, 1976; Müller-Preuss, Newman, & Jürgens, 1980; Simonyan & Jürgens, 2002; 2003; 2004; 2005; Thoms & Jürgens, 1987), brain imaging (Jürgens, Ehrenreich, & de Lanerolle, 2002), electrophysiology (Düsterhöft, Häusler, & Jürgens, 2003; Hage & Jürgens, 2006a; 2006b; Jürgens, 2002a; Lüthe, Häusler, & Jürgens, 2000), electrical (Jürgens & Ploog, 1970) and chemical (Lu & Jürgens, 1993) brain activation, lesions (Jürgens & Pratt, 1979; Jürgens, Kirzinger, & Cramon, 1982; Kirzinger, 1985; Kirzinger & Jürgens, 1982; 1985), and reversible inactivations (Jürgens & Ehrenreich, 2007; Siebert & Jürgens, 2003) has produced a detailed description of the pathways involved in controlling innate primate vocalizations (Jürgens, 2009). The general conclusions drawn from this body of work are as follows: 1) limbic regions regulating arousal and the drive to vocalize including the amygdala and anterior cingulate cortex converge on the PAG; 2) the PAG serves a gating function to activate motor programs for specific calls associated with different arousal states; and 3) the spectral structure of calls is primarily determined at the level of medullary premotor circuits that coordinate the activity of phonatory motoneuron pools in various cranial nerve nuclei (Jürgens, 1998; 1998; 2002b; 2009; 2009; Jürgens & Alipour, 2002). Lesions of the anterior cingulate cortex or amygdala do not eliminate the ability to produce the innate vocalizations, but reduce the motivation to vocalize and to do so in the appropriate context. 
However, lesioning or blocking the PAG or Amb eliminates production of innate vocalizations (Floody & DeBold, 2004; Jürgens & Ehrenreich, 2007; Jürgens & Pratt, 1979; Kirzinger & Jürgens, 1985; Siebert & Jürgens, 2003). These findings suggest that what is truly indispensable for vocalization is the PAG and downstream circuits of the brainstem.
In addition to the limbic-midbrain-brainstem pathway for innate vocal production, vocal-learning species have evolved cortico-bulbar pathways and cortico-basal ganglia-thalamic loops for generating and learning novel vocalizations, respectively. Although the gross anatomy of avian and mammalian forebrains is remarkably different (nucleated in birds and layered in mammals) there are some general principles shared among all vocal learning systems (Jarvis, 2004; Jarvis et al., 2005).
Learned song in birds is controlled by a hierarchically organized pre-motor control pathway contained within two nuclei of the caudal telencephalon that sends direct and indirect output to the vocal motoneurons of the brainstem located in XIIts (Wild, 1997). In songbirds, this premotor pathway begins with the nucleus HVC (used as the proper name), from which a specific subset of projection neurons innervates the robust nucleus of the arcopallium (RA) (Foster & Bottjer, 1998; Nottebohm, Stokes, & Leonard, 1976). These RA-projecting neurons appear to encode the timing of song via a sparse code that coordinates the bursting activity of neuron ensembles in RA (Fee, Kozhevnikov, & Hahnloser, 2004; Hahnloser, Kozhevnikov, & Fee, 2002; Leonardo & Fee, 2005; Yu & Margoliash, 1996). RA projects to various midbrain and brainstem nuclei including DM of the innate call generating pathway, the respiratory premotor nucleus RAm, Amb, and the motoneurons of XIIts that control the vocal organ (Nottebohm et al., 1976; Wild, 1993). These direct downstream targets of RA make it well positioned to allow forebrain control over the activity of respiratory, laryngeal, and syringeal muscle groups during vocalization. A similarly connected hierarchical vocal premotor pathway was found in the forebrain of parrots (Durand, Heaton, Amateau, & Brauth, 1997; Jarvis, 2004; Jarvis & Mello, 2000; Paton, Manogue, & Nottebohm, 1981; Striedter, 1994; 1994) and hummingbirds (Gahr, 2000; Jarvis et al., 2000). In parrots the pathway involves analogous projections from the central nucleus of the lateral nidopallium (NCL) to the central nucleus of the anterior arcopallium (AAc), which projects in turn to midbrain and brainstem vocal nuclei (Durand et al., 1997; Striedter, 1994). In hummingbirds, a nucleus similar in location and cytoarchitecture to songbird HVC was found called the vocal nucleus of the lateral nidopallium (VLN) or HB-HVC (Gahr, 2000; Jarvis et al., 2000). 
HB-HVC sends descending projections to the vocal nucleus of the arcopallium (VA) also called HB-RA, which resembles songbird RA and innervates XIIts (Gahr, 2000; Jarvis et al., 2000). In contrast, no such forebrain nuclei or direct projections from the arcopallium have been found in vocal non-learning birds, such as pigeons and chickens (Wada, Sakaguchi, Jarvis, & Hagiwara, 2004; Wild, 1997).
Among mammals, projections from primary motor cortex to phonatory brainstem nuclei have only been found in primates. In a comparative study of projections from the motor cortical tongue area to the hypoglossal nucleus (XII) that innervates the tongue muscles, it was observed that the density of the projection varies between primate species (Jürgens & Alipour, 2002). Rhesus macaques have a relatively denser projection than squirrel monkeys, and saddle-back tamarins (Saguinus fuscicollis) have putative fibers of passage but no terminals in XII. In chimpanzees (Pan troglodytes) (Kuypers, 1958a) and humans (Kuypers, 1958b) projections to XII are dense. By contrast, no motor cortical projection to XII was observed in tree shrews (Tupaia belangeri) (Jürgens & Alipour, 2002), cats (Felis catus) (Kuypers, 1958c), or rats (Travers & Norgren, 1983). A direct motor cortical vocal pathway, consisting of a direct cortical projection to the laryngeal motoneurons in Amb had only been found in humans among mammals (Iwatsubo, Kuzuhara, Kanemitsu, Shimada, & Toyokura, 1990; Kuypers, 1958d; 1958b; Simonyan & Jürgens, 2003). This distribution of cortico-bulbar projections to XII and Amb has been interpreted as a progressive increase in cortical innervation in phylogenetically newer primate species leading to improved vocal abilities (Jürgens & Alipour, 2002). This interpretation reflects the general assumption that presence of direct cortical input to phonatory motor nuclei determines the level of vocal abilities. Indeed, the presence of a direct motor cortical/pallial vocal pathway in vocal learning birds and humans has been proposed by many researchers as one of the key neural transformations in the evolution of spoken-language and learned song (Deacon, 2007; Fischer & Hammerschmidt, 2010; Fitch et al., 2010; Jarvis, 2004; Jürgens et al., 1982; Kirzinger & Jürgens, 1982; Okanoya, 2004; Simonyan & Horwitz, 2011; Simonyan & Jürgens, 2003).
In songbirds, there is a cortico-basal ganglia-thalamic loop dedicated to vocalization called the anterior forebrain pathway (AFP). Premotor input to the AFP comes from a distinct subset of HVC projection neurons that innervate a region of the anteromedial striatum specialized for vocal learning called Area X (Foster & Bottjer, 1998; Nottebohm et al., 1976). Area X sends a GABAergic projection to the dorsolateral anterior thalamic nucleus (DLM), which projects in turn to the lateral magnocellular nucleus of the anterior nidopallium (LMAN) (Bottjer, Halsema, Brown, & Miesner, 1989; Okuhata & Saito, 1987; Person, Gale, Farries, & Perkel, 2008). LMAN then projects back to Area X forming a cortico-striatal-thalamic loop specialized for vocalization (Okuhata & Saito, 1987). A similar second medial AFP loop has been proposed, which comprises a projection from Area X to the dorsomedial nucleus of the posterior thalamus (DMP), then to the medial magnocellular nucleus of the anterior nidopallium (MMAN) (Kubikova, Turner, & Jarvis, 2007). LMAN and MMAN are the output nuclei of the AFP, projecting to RA (Nottebohm, Paton, & Kelley, 1982) and HVC (Foster & Bottjer, 1998), respectively. These outputs allow the AFP to modulate the ongoing activity of the direct HVC-RA premotor circuit (Kao, Doupe, & Brainard, 2005). Lesions and chemical inactivation of MAN nuclei and Area X revealed that the AFP is not required for singing, but is critical for generating the acoustic variability necessary for vocal exploration in normal song learning (Bottjer, Miesner, & Arnold, 1984; Foster & Bottjer, 2001; Nottebohm et al., 1976; Olveczky, Andalman, & Fee, 2005; Scharff & Nottebohm, 1991), social context-dependent modulation of song (Kao et al., 2005; 2005; Kao & Brainard, 2006), experimentally-induced song deterioration (Brainard & Doupe, 2000; Williams & Mehta, 1999), and modulation of activity and singing-driven gene regulation of HVC and RA (Kubikova et al., 2007; Olveczky et al., 2005).
A similar recurrent cortico-basal ganglia-thalamic pathway was found in the forebrain of parrots, except that NLC (HVC analog) does not project to the basal ganglia song nucleus (MMSt) (Durand et al., 1997; Jarvis & Mello, 2000); instead the ventral portion of the RA analog (AACv) projects to the LMAN analog (Durand et al., 1997). In hummingbirds, analogous basal ganglia and cortical regions have been found to be active during song production (Jarvis et al., 2000). The connectivity between these AFP-like regions has not been established in hummingbirds except for the projection from the proposed LMAN analog to the RA analog, which is similar to the oscine and parrot song systems (Gahr, 2000). Thus, the general design of several similarly arranged discrete forebrain nuclei forming a direct forebrain premotor pathway modulated by a recurrent basal ganglia loop seems to be a universal feature among independently derived lineages of avian vocal learners (Jarvis, 2004).
In humans, cortical, basal ganglia, and thalamic vocalization-related brain regions have typically been identified with functional neuroimaging techniques during speech production or brain lesion case studies (Jürgens, 2002b; Ludlow, 2005). In contrast, vocalization-specific neural activity in vocal non-learning mammalian species had been demonstrated only in limbic, midbrain and brainstem circuits (Hage & Jürgens, 2006a; 2006b; Jürgens, 2002a; 2009; Wild, 1997; 1997). In non-human primates, electrical micro-stimulation of a specific premotor cortical region in area 6 produced movement of the vocal folds (Hast, Fischer, Wetzel, & Thompson, 1974). Tract tracing studies of this putative laryngeal premotor region revealed extensive subcortical projections to the basal ganglia, thalamus, pons and medulla (Simonyan & Jürgens, 2003). However, chemically inactivating these connecting structures does not abolish vocal fold movements elicited by motor cortical stimulation (Jürgens & Ehrenreich, 2007). Moreover, lesions to prefrontal and primary motor cortex (Aitken, 1981; Kirzinger & Jürgens, 1982; Sutton, Larson, & Lindeman, 1974) or globus pallidus (MacLean, 1978) do not produce changes in the structure of vocalizations in monkeys, but abolish learned volitional vocalizations in humans (Jürgens, 2002b). Therefore, it is questionable that these structures play a role in the programming of monkey vocalization, but they may serve other laryngeal functions in non-vocal behaviors like swallowing.
We were unaware of any previous studies attempting to define vocal premotor forebrain circuits in mice, so we addressed this issue first (Arriaga et al., 2012). We looked for motor-driven singing-regulated expression of activity-dependent immediate early genes using a similar experimental design as previous studies that identified seven similar forebrain song nuclei among the three lineages of song learning birds (Jarvis et al., 2000; Jarvis & Mello, 2000; Jarvis & Nottebohm, 1997). We found that relative to the non-singing treatment groups, male mice that produced USVs expressed higher levels of mRNA for two immediate early genes (IEGs), egr-1 and arc, bilaterally in restricted regions of the primary motor (M1) and premotor (M2) cortices, adjacent anterior cingulate cortex (Cg), and subjacent anterodorsal striatum (ADSt) (Fig. 6a-b). Importantly, similar amounts of egr-1 and arc expression were observed for mice singing with intact hearing and mice singing after deafening. Moreover, playback of mouse songs in the absence of active singing did not induce similar IEG expression in these forebrain regions. These results indicate that the greater levels of mRNA expression in these regions were not caused by auditory processing during singing. Instead, the results show that singing-induced expression of activity-dependent IEGs in motor cortical, limbic, and striatal regions of the mouse brain is motor-driven. This pattern of vocal motor specific activity is similar to what is observed in the songbird song system during singing (Jarvis & Nottebohm, 1997), but had not been previously shown in the forebrain of a non-human mammal.
Two recent studies claimed to find cortical activation during vocalization in marmosets (Callithrix jacchus) by examining brain expression patterns of egr-1 (Simões et al., 2010) and c-fos (Miller, DiMauro, Pistorio, Hendry, & Wang, 2010). In the first study, expression levels of egr-1 were measured in prefrontal cortex of two groups of animals that heard playbacks of conspecific calls and either vocalized or remained silent (Simões et al., 2010). Higher numbers of egr-1 immunopositive cells were observed in ventral and dorsal prefrontal cortex when animals vocalized than when they remained silent. However, given the audio-motor nature of the task it is difficult to separate the relative effects of sensory processing and preparation of the motor program for vocalization. The second study attempted to distinguish between sensory, motor, and sensorimotor integration effects by including a treatment group that vocalized without hearing any conspecific playbacks (Miller et al., 2010). Interestingly, this production-only group showed the lowest amount of c-fos induction for the majority of prefrontal sites tested. The animals that showed the highest levels of induction overall were those that only heard playbacks of calls. There was one area in the dorsal prefrontal cortex where the expression levels for the vocal production group matched the levels seen in other adjacent areas for the vocal perception group; however, it is still not possible to eliminate auditory feedback induced activation of this region.
Another recent study used PET imaging to identify activation of the inferior frontal gyrus (Broca's area analog) in chimpanzees while they simultaneously produced vocalizations and hand gestures. The level of activation was greater than when the animals gestured without vocalizing (Taglialatela, Russell, Schaeffer, & Hopkins, 2011). Another study that recorded neuronal activity in macaques suggests that when the monkeys produce conditioned innate vocalizations, some neurons are activated in the ventral premotor cortex (Coudé et al., 2011). However, these neurons did not fire when the animals vocalized spontaneously, indicating that they do not encode motor commands for the vocalizations.
The authors of these studies concluded that this is the first time vocalizing-driven activity has been found in the non-cingulate cortex of a non-human primate. However, it is still possible that activity observed in vocalizing groups was largely due to sensory processing of conspecific calls, the animals hearing themselves vocalize, or other features of the vocalizing setting. A control group vocalizing after deafening, like the one included in our study on mice, is required to exclude the first two alternatives. Such studies may not be feasible due to ethical concerns regarding deafening experiments in primates. Neurophysiology experiments also need to demonstrate whether there is premotor neural firing for spontaneous vocalization, and whether the recorded regions are analogous to the motor cortical areas that are critical for production of learned vocalizations in humans and songbirds. Therefore, until another approach is developed, it remains to be determined whether cortical regions associated with vocal production in humans also control natural vocal production in non-human primates.
Mice have been assumed to lack a direct cortico-bulbar projection to Amb (Fischer & Hammerschmidt, 2010; Jarvis, 2004); however, this assumption had also never been experimentally tested until our recent study (Arriaga et al., 2012). To test the possibility of M1 input to the vocal premotor system, we performed neural tracing experiments in mice using the retrograde trans-synaptic tracer pseudorabies virus (PRV-Bartha) expressing enhanced green fluorescent protein (eGFP) injected into the cricothyroid and lateral cricoarytenoid laryngeal muscles in order to trace premotor brain pathways that converge on Amb. By approximately 4 days post-injection, a pattern of labeling was observed consistent with known connectivity in mammals (Jürgens, 2002b), including rodents (van Daele & Cassell, 2009). The PRV spread to a set of regions in the midbrain and limbic system with known roles in the control of innate species-specific calls and respiration (Jürgens, 2002b): the medullary reticular formation, spinal trigeminal nucleus, and solitary nucleus of the brainstem; PAG and ventral tegmentum of the midbrain; throughout the hypothalamus; and the amygdalopiriform transition area and central amygdala in the telencephalon. At the same survival time, only two neocortical regions were reliably labeled: 1) a population of layer V pyramidal neurons in M1 within the motor cortex region that exhibited robust singing-driven IEG expression (Fig. 6c-d); and 2) a small number of layer III neurons in the insular cortex (IC).
The relatively short latency at which PRV label was observed in M1 suggested that perhaps it projects directly to Amb. To test this hypothesis, we injected BDA into the M1 region identified by PRV tracing, and injected cholera toxin subunit b (CTb) into the cricothyroid and lateral cricoarytenoid laryngeal muscles (Arriaga et al., 2012). This dual tracing technique permitted visualization of motor cortical axons as well as laryngeal motoneuron somata and dendrites from the same animals. We found that the singing-activated portion of M1 projects directly to Amb. There were fine caliber M1 axons that exited the pyramidal tract, extended laterally to the zone where Amb motoneuronal cell bodies were located, and terminated on labeled motoneurons (Fig. 6e). Compared to songbirds (Wild, 1993) and the limited data on humans (Iwatsubo et al., 1990), the mouse M1 connection was much sparser; there appeared to be no more than one or two axons per connected motor neuron.
This region of M1 also projects densely to the region of ADSt that displayed a singing-related increase of IEG expression, and connects reciprocally to the ipsilateral ventral lateral nucleus of the thalamus (VL). These two projections are likely to form part of a cortico-striatal-thalamic loop for vocalization similar to those reported in humans and song learning birds; however, neither the striatal projection to globus pallidus nor the pallidal projection to thalamus has been confirmed for this circuit in mice. The tracer injections in M1 also showed that this region receives a projection from neurons of the ipsilateral secondary auditory cortex (Fig. 6f). The cell bodies of these secondary auditory cortex neurons were in layer III. This projection still needs to be confirmed in the anterograde direction.
The combined retrograde and anterograde tracing patterns show that mice have a cortical vocal premotor circuit that projects directly to vocal motoneurons in the brainstem, the anterior striatum and thalamus, and it may receive a projection from secondary auditory cortex. These features are similar to those of known vocal production circuits in humans and song learning birds (Fig. 5). These findings suggest that a cortico-bulbar projection to vocal motoneurons is not unique to vocal learning birds and humans amongst mammals, as previously thought (Deacon, 2007; Fischer & Hammerschmidt, 2010; Fitch et al., 2010; Jarvis, 2004; Jürgens, 1982; Jürgens et al., 1982; Kirzinger & Jürgens, 1982; Okanoya, 2004; Simonyan & Horwitz, 2011; Simonyan & Jürgens, 2003).
Like input from motor cortex, auditory experience seems to be more important for the production of learned vocalizations than innate calls. In humans and songbirds auditory experience plays a critical role at multiple stages in the ontogeny of vocal behavior: 1) a sensory phase during which an auditory memory or ‘template’ is formed following exposure to an appropriate model; 2) a sensorimotor phase during which vocal output is monitored and compared to the model in a guided learning process; 3) an adult maintenance phase during which auditory feedback is used to maintain vocal output over the long-term (Doupe & Kuhl, 1999; Marler, 1970a). We posit that the main difference between learning by imitation and improvisation is the dependence on the first stage. In imitation, the model or template is acquired externally. In improvisation there is no external model against which to measure progress, so another instructive signal must guide the learning process; however, this strategy likely involves a similar mechanism of auditory self-monitoring followed by selection and retention of preferred learned features. Auditory experience is critical under either learning paradigm. Accordingly, experiments testing for vocal learning have typically focused on modifying, disrupting, or removing auditory information at the various developmental phases. We briefly review the results from known vocal learning and non-learning species, then discuss results from recent studies performed on mice.
It has been demonstrated in various mammalian (Hammerschmidt et al., 2001; Romand & Ehret, 1984; Talmage-Riggs, Winter, Ploog, & Mayer, 1972) and avian (Konishi, 1964; Kroodsma & Konishi, 1991; Nottebohm & Nottebohm, 1971) species that the acoustic structure of innate vocalizations does not depend on auditory experience at any developmental stage. Eastern phoebes (Sayornis phoebe), a sub-oscine vocal non-learning songbird species, develop normal species-specific songs after being mechanically deafened by cochlear removal before the onset of singing behavior (Kroodsma & Konishi, 1991), despite being very closely related to vocal learning songbirds. Similar results have been reported in the more distantly related ringdove (Nottebohm & Nottebohm, 1971) and chicken (Konishi, 1963). In non-human primates, neither hereditary deafness (Hammerschmidt et al., 2001) nor deafening by cochlear coagulation (Talmage-Riggs et al., 1972) affects normal vocal behavior. Unsurprisingly, the less severe auditory deprivation caused by social isolation also has no reported effect on monkey call spectral structure (Hammerschmidt et al., 2001; Winter, Handley, Ploog, & Schott, 1973). Even innate calls in male zebra finches, a vocal learner, are not affected by deafening (Simpson & Vicario, 1990).
In contrast, learned vocalizations are susceptible to elimination or disturbance of auditory feedback at various stages in development. In songbirds early deafening in the sensory acquisition (Marler & Waser, 1977) or sensorimotor phase of song learning (Konishi, 1965a; 1965b) has a dramatic effect, resulting in severely degraded songs characterized by a small repertoire with highly variable and unstable notes. Songbirds raised in social isolation develop highly abnormal ‘isolate song’ (Marler, 1970b; 1970a; Marler & Waser, 1977). Taken together these findings reveal that songbirds need to hear others to learn what to mimic and themselves to practice their own copy. But songbirds continue to depend on auditory information even after learning and stabilizing normal songs. For example, adult Bengalese and zebra finches suffer rapid deterioration of syntax and phonology when deafened (Horita, Wada, & Jarvis, 2008; Lombardino & Nottebohm, 2000; Okanoya & Yamaguchi, 1997; Woolley & Rubel, 1997). Even the milder treatment of disrupting auditory feedback signals in real-time without deafening is sufficient to cause a destabilization of learned song features (Leonardo & Konishi, 1999; Sakata & Brainard, 2006). Thus, songbirds clearly rely heavily on auditory experience throughout the entire song development process, including for maintenance and stabilization of songs learned early in life.
Human speech shares with birdsong a dependence on auditory information throughout life (Doupe & Kuhl, 1999). For example, early language deprivation by social isolation severely disrupts speech acquisition (Fromkin, Krashen, Curtiss, Rigler, & Rigler, 1974). In this regard, humans and some songbirds (Marler, 1970a; Thorpe, 1958) are subject to sensitive periods for vocal development. But the reliance on auditory feedback does not end when the sensitive period closes. Post-lingually deaf patients suffer a degradation of speech sounds that results in decreased control of phonation, disrupted prosody, and abnormal suprasegmental properties of sentences, with younger patients being more strongly afflicted (Waldstein, 1990). Thus, vocal learners seem to make use of auditory feedback to calibrate the fine phonetic control required to produce high-quality vocalizations even after the waning of a robust vocal learning ability.
Our laboratory and several others have been conducting behavioral studies in mice to test for the presence of features found in vocal learning mammals and birds (Arriaga et al., 2012; Grimsley et al., 2011; Hammerschmidt et al., 2012; Kikusui et al., 2011). We first focused on the role of auditory input. Based on the data from vocal learning and non-learning species discussed previously, we reasoned that if male mice learn any aspect of their courtship vocalizations, then they should require auditory information in order to maintain the spectral quality of songs. However, if songs are innate, then they should not be affected by deafening. We tested this hypothesis by mechanically deafening adult mice (Arriaga et al., 2012). Over the course of eight months after deafening the songs of the deaf mice became spectrally distorted with some noisy looking syllables and less spectral purity than songs of sham-operated controls (Fig. 7a-b). We wondered if the noisier syllables were due to deaf mice possibly singing louder and causing microphone recording distortion, but found that the vocalizations were not on average louder than pre-deafened song. The pitch of deaf mice songs had also increased such that 6 - 8 months after surgery they were reliably singing at a significantly higher frequency relative to both their own pre-deafening levels and those of hearing-intact controls.
The average increase in mean pitch of post-deafening mouse songs was comparable to the 4-6 kHz increase in USVs reported for deafened horseshoe bats, an accepted vocal learning species (Rübsamen & Schäfer, 1990). The combined effects on pitch and spectral purity were similar in character and timing to changes in vocalizations observed in post-lingually deaf humans and mechanically deafened song-learning birds (Brainard & Doupe, 2000; Heaton, Dooling, & Farabaugh, 1999; Waldstein, 1990; Watanabe, Eda-Fujiwara, & Kimura, 2006; Woolley & Rubel, 1997).
We also compared the songs of normal hearing-intact B6 males to those of males congenitally deaf due to loss of inner ear hair cells within several days after birth resulting from knockout (KO) of the caspase 3 gene (CASP3) (Takahashi et al., 2001). We found that these mice showed larger differences in their song syllables compared to the wild type (Fig. 7c-d). Some syllables were highly degraded and barely recognizable, but with some resemblance to normal syllable categories. The changes included producing a higher proportion of the simpler Type A syllable, lower mean frequency of the pitch, greater standard deviation of the pitch, and lower spectral purity. The changes in the CASP3 KO animals' songs are the largest that we are aware of for any genetically manipulated animal. However, we could still recognize features of the songs and syllables, indicative of an innate component to mouse songs.
A similar study using a mouse strain congenitally deaf due to knockout of the otoferlin gene generated on a mixed background (129 ola and B6) found no differences in the number of syllables/calls produced between deaf and hearing-intact mice (Fig. 7e-f) (Hammerschmidt et al., 2012). The study also found no differences in duration and amplitude, which were not affected by hearing status in our studies, but also did not find differences in pitch, although only a subset of pitch measures was assessed. From this negative result, the authors conclude that it is questionable whether mice could be used as models for vocal learning.
We offer two explanations for the differing results of the deafening studies: 1) the mechanical deafening of adults and the CASP3 KO caused changes in mouse songs due to some variable other than loss of hearing; or 2) the methods used to analyze the otoferlin knockout mouse songs did not capture changes in the songs seen in our study. In our study, sham operated mice did not show changes in song like those in the mechanically deafened group, suggesting that disruption of the facial musculature does not explain the differences. The CASP3 gene serves many functions in different neurons, and knocking it out could have affected other brain pathways or phonatory musculature. However, CASP3 knockout did not produce overt motor deficits. For the second explanation, in the otoferlin knockout study, the syllables were split into only 2-3 super categories. We believe that this method groups syllables with great morphological and spectral differences, thereby potentially increasing the variability within each category. As a result, this approach risks masking effects that might be better detected by analyzing syllable types individually. For analyses on amplitude, they did split the syllables into more categories and did not find differences in the amplitude before and after deafening, similar to our own study. Another methodological difference is that the otoferlin study introduced an awake behaving female into the recording chamber to elicit male songs. Because females also produce some ultrasounds, it is possible that the otoferlin knockout mouse song recordings were contaminated with vocalizations from hearing-intact females. Moreover, the study did not report data for the three acoustic features that showed the greatest differences in our deafening experiments (mean pitch, standard deviation of the pitch distribution, and spectral purity).
We believe reconciling these differences will require standardizing the experimental designs, syllable classification schemes, and spectral analysis techniques across laboratories. Until then, the methodological issues make it difficult to draw strong conclusions from the current set of different deafening results, and thus we believe the possibility of auditory dependence for normal mouse song development remains open.
Deafening-induced song deterioration alone does not demonstrate presence or absence of the vocal learning ability, but it is a strong indication that this ability may be present; to date, destabilization of vocal production after deafening has only been observed in vocal learners. However, these observations remain correlative and not diagnostic. Diagnostic tests require demonstrating some form of vocal production learning, the subject of the next section.
Imitation of another species' vocalizations when cross-fostered, such as parrots raised by humans who then imitate human speech, is the gold standard for demonstrating vocal learning. However, not even all known vocal learning species have the ability to imitate other species, and successful cultural transfer of song elements under cross-fostering can require optimal social and developmental conditions. For example, juvenile zebra finches will imitate Bengalese finch songs when raised exclusively with Bengalese finches. Yet, young zebra finches show an innate predisposition to learn their own species song when given a choice between a Bengalese finch foster-father and a zebra finch (Clayton, 1987).
A recent study conducted a cross-fostering experiment with two strains of mice (B6 & BALB/c) that sing at different pitches, and have different distributions of syllable types in their repertoires (Kikusui et al., 2011). They cross-fostered young mice from post-natal day 0 to 21 and then scored the acoustic and syntactic structure of their songs as adults. They did not find any changes in the pitch and syllable distribution of the songs of the cross-fostered mice (Fig. 8a-b). Therefore, the authors concluded that the strains were not able to imitate each other's songs and interpreted this negative result as evidence that mouse songs are innate.
Three recent studies, including one by our own lab, have found some evidence of adaptive vocal modification of mouse USVs by examining acoustic changes that occur over the course of development (Grimsley et al., 2011), after temporary social isolation (Chabout et al., 2012), or after being housed with another male mouse with a different song in a competitive social condition (Arriaga et al., 2012). The former two showed developmental or social experience changes that could not be easily explained by innate developmental vocal trajectories, and the latter demonstrated song pitch convergence that possibly resulted from imitation.
The first study analyzed the development of CBA/CaJ mouse pup isolation calls from post-natal day 5 to post-natal day 13 and compared them to adult USVs (Grimsley et al., 2011). Using a syllable classification scheme similar to that described earlier in this review (Scattoni et al., 2008) they report changes in repertoire composition over early development (Fig. 8c-d). Notes that were flat, or contained 1 frequency jump dominated the repertoire on post-natal day 5 and post-natal day 7. From post-natal day 9 to post-natal day 13, notes with 2 frequency steps were most common. This was very different from the adult repertoire, which was dominated by one-note syllables with an upward, flat, or chevron-like trajectory. Although the relative proportions of syllables varied, all types were produced from post-natal day 7 through adulthood. The authors used a Zipf's statistic to compare the complexity of the repertoire over different developmental ages. They found that complexity steadily increased from post-natal day 5 to post-natal day 13, resulting in a more diverse and less repetitious sequence of syllables with greater higher-order structure.
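A rank-frequency (Zipf) statistic of the kind mentioned above can be illustrated with a minimal sketch. The exact procedure used by Grimsley et al. (2011) is not described here, so the code below assumes the generic form of the measure: the least-squares slope of log10(count) against log10(rank) across syllable types. The function name `zipf_slope` and the example syllable labels are hypothetical.

```python
import math
from collections import Counter


def zipf_slope(syllables):
    """Least-squares slope of log10(count) vs. log10(rank) over syllable types.

    Assumption: this is the generic rank-frequency slope, not necessarily the
    exact statistic computed in the cited study.
    """
    # Count each syllable type and order the counts from most to least common.
    counts = sorted(Counter(syllables).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two syllable types to fit a slope")

    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(count) for count in counts]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical sequence in which counts fall off as 1/rank (6, 3, 2).
example = ["flat"] * 6 + ["chevron"] * 3 + ["jump"] * 2
print(zipf_slope(example))
```

Under this formulation a slope of zero means every syllable type is used equally often, and increasingly negative slopes mean the sequence is dominated by a few repeated types.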
Developmental changes in syllable morphology were also reported. Generally, the duration of pup syllables tended to decrease with age. For example, the distributions of flat syllable and chevron-shaped syllable durations were tighter and had a lower mean for adult vocalizations compared to pup vocalizations. Peak frequencies of both pup syllable types were distributed bi-modally over a broad frequency range, but adult syllable peak frequencies were normally distributed over a more restricted range with a lower mean. The narrowing of the peak frequency range resulted from exclusion of the higher and lower margins of the pup peak frequency distribution for both syllable types, and syllables with dominant frequencies above 100 kHz were common in pups but rare in adults. Although the developmental trajectory of each specific syllable type varied, overall, adult syllables were shorter in duration and lower in pitch than pup syllables.
The authors concluded that the complex spectro-temporal, repertoire composition, and sequencing changes observed in mouse syllables over development could indicate a learning process, whereby pups learn to produce syllables and sequences that permit identification and more reliable retrieval, and adults differentiate themselves from pups (Grimsley et al., 2011). Alternatively, there could be some complex innate maturation processes that cause the developmental patterns observed, an explanation proposed by other researchers (Hammerschmidt et al., 2012). Indeed, the authors do recognize that these data are descriptive and do not test for vocal learning capabilities, and they suggest examining vocal ontogeny in the absence of auditory feedback.
A later study found that the adult repertoire composition and some acoustic features (duration and peak frequency) of individual syllables are context-dependent (Chabout et al., 2012). Adult male mice isolated for three weeks produced significantly different songs than group-housed mice (Fig. 8e-f). Although not explicitly mentioned by the authors, the repertoire composition changes could represent a case of vocal usage learning through social experience. The peak frequency changes could represent vocal production learning through social experience. One issue the authors raise is that they were unable to separate the vocalizations of the two mice in the dyadic social recording situation. Nevertheless, these findings suggest that social isolation of young animals could strongly affect the development of a normal song repertoire.
The closest evidence for some form of vocal mimicry in mice comes from our study showing syllable pitch convergence (Arriaga et al., 2012). Although overt mimicry of novel sounds is considered the gold standard for vocal learning, some researchers argue that a more limited form of vocal imitation should also be considered whereby the spectral content of innately specified conspecific calls converges (Egnor & Hauser, 2004; Snowdon, 2009; Tyack, 2008). We considered that mouse songs are produced in a mating context, and tried cross-housing sexually mature males from different strains (B6 and BxD) in a sexually competitive environment (Arriaga et al., 2012). Before crossing, the average pitch of songs from B6 and BxD males segregated into two non-overlapping distributions. After pairs of males from different strains were housed together with a BxD or B6 female, the males showed a significant convergence in pitch over the course of 8 weeks, independent of the strain of the female present (Fig. 8g). In particular, the pitch of all B6 animals shifted downward and that of some BxD animals shifted upward, such that after 8 weeks of cross-housing the pitches of BxD and B6 songs were no longer statistically distinguishable. Before crossing, the mean pitch difference between pairs was 8.6 ± 0.51 kHz. By 3 weeks after crossing the mean pitch difference had decreased significantly, and it continued to decline to a global minimum difference of 2.1 ± 1.4 kHz at 8 weeks (Fig. 8h). Importantly, after 8 weeks of cross-housing most of the animals had reduced the pitch difference from their specific cage mate by more than 80 percent, and many of the pairs had converged to within 1 kHz of each other's pitch (Fig. 8i).
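As a rough arithmetic illustration of the convergence figures above, the sketch below computes the percent reduction in pitch difference. Only the two group means (8.6 kHz before crossing and 2.1 kHz at 8 weeks) come from the text. The function name `percent_reduction` and the single-pair values are hypothetical and are included only to show how a per-pair criterion such as "more than 80 percent" or "within 1 kHz" could be checked.

```python
def percent_reduction(before_khz, after_khz):
    """Percent decrease in pitch difference relative to the starting difference."""
    if before_khz <= 0:
        raise ValueError("starting pitch difference must be positive")
    return 100.0 * (before_khz - after_khz) / before_khz


# Group means reported in the text: 8.6 kHz before crossing, 2.1 kHz at 8 weeks.
group_reduction = percent_reduction(8.6, 2.1)
print(f"group mean reduction: {group_reduction:.1f}%")  # 75.6%

# Hypothetical single pair (values invented for illustration only).
pair_before_khz, pair_after_khz = 9.0, 0.8
pair_reduction = percent_reduction(pair_before_khz, pair_after_khz)
exceeds_80_percent = pair_reduction > 80.0
within_1_khz = pair_after_khz < 1.0
print(exceeds_80_percent, within_1_khz)
```

The reduction in the group mean (about 76 percent) need not equal the typical per-pair reduction, because the per-pair percentages are computed against each pair's own starting difference.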
The results of cross-housing pairs of BxD and B6 males support the hypothesis that mice are capable of copying some features of another male's songs. The changes observed were made to an existing note type shared between both strains. Therefore, the reported change is akin to vocal convergence reported in bats. The pitch of echolocation calls of young greater horseshoe bats (Rhinolophus ferrumequinum) correlates strongly with the pitch of their mother's calls (Jones & Ransome, 1993). Because the pitch of a mother's calls varies with her age, the correlation with her offspring's pitch is likely to result from their learning her pitch. When female greater spear-nosed bats (Phyllostomus hastatus) were transferred to a new social group, both the residents of the group and the new members changed the spectro-temporal features of their existing screech calls to converge on a similar call (Boughman, 1998). A recent study on greater sac-winged bats (Saccopteryx bilineata) showed similar convergence of young male calls onto a tutor father's call (Knörnschild, Nagy, Metz, Mayer, & Helversen, 2010). It is unknown if the pre-convergence bat calls are innately specified or learned, but the changes are more striking than those reported for call convergence in non-human primates. Call convergence in non-human primates is based mostly on observations of within-group similarity and geographical variation in call features (Janik & Slater, 1997; Snowdon, 2009; Tyack, 2008). Some experimental evidence has been reported for pygmy marmosets (Cebuella pygmaea) that minimized spectral differences between each other's calls when new male/female pairs were housed in a cage together (Snowdon & Elowson, 1999).
Although the syllables we tested in mice were not novel, convergence does require the transfer of vocal elements between individuals and may reflect a rudimentary ability that could have been expanded to include production of novel elements. The finding that B6 males changed as a group but the BxD males were relatively unaffected by cross-housing supports our hypothesis of sexual competition. We noted that the BxD males tend to be larger and sing more than the B6 males. Therefore, the greater shift in pitch by the B6 males could reflect a tendency to try to match the pitch of a more dominant singer in the presence of a female. Another possibility is that the females co-housed with the pairs provided a selection force in the direction of their preferred range for both BxD and B6 males. While females could certainly provide a reinforcing stimulus for convergence, as in the case of cowbirds (King, 1983; West & King, 1988), the close approximation of the BxD male's pitch in most B6/BxD pairs analyzed at 8 weeks post-crossing suggests that the B6 males were likely guided by auditory information.
The pitch-matching results (Arriaga et al., 2012) contradict the findings of the previously mentioned cross-fostering study (Kikusui et al., 2011). We believe the differences between studies could be explained by experimental design. First, the learning paradigm used for cross-fostering (Kikusui et al., 2011) did not ensure or test for vocal production by the foster father. Absence of tutor song production would prevent the young males from acquiring a template to mimic. Second, the cross-fostered mice were tutored at a very early age (if at all) and for a very short period (21 days). For more than half of that period the pups' ear canals are closed, effectively leaving only 9 days of full auditory experience. In the pitch-matching study (Arriaga et al., 2012), the mice required at least 4-6 weeks of co-housing to begin showing pitch convergence. Lastly, prior to testing, the cross-fostered mice (Kikusui et al., 2011) were returned to group housing in an acoustically unshielded colony for a much longer period (50 to 120 days) than the cross-fostering phase. Thus, the juveniles had more potential auditory experience with the songs of their own strain than with those of the foster father. Given the demonstrated predisposition of vocal learning species for learning their own species-typical songs, if mice are vocal learners, it is possible that the cross-fostered mice actively selected songs of their own strain for imitation during mixed housing. The mice in the pitch-matching study (Arriaga et al., 2012) were never returned to group housing during the experiment and were acoustically shielded from the songs of mice other than their cage-mate. Given the differences in design between the initial cross-fostering and pitch-matching studies (Kikusui et al., 2011; Arriaga et al., 2012), we believe that the available evidence supports the possibility of mouse song pitch learning by imitation or by improvisation.
This perspective report has examined the underlying neural circuits that support production of ultrasonic courtship songs of male laboratory mice, and described some basic capabilities of adult mice to modify and maintain the spectral content of their songs. Some of the currently available data indicate that laboratory mice possess a combination of neural and behavioral features that had previously been reported only in humans and song learning birds. Some of these findings are being reported for the first time in non-human mammals. Further investigations will be necessary to reconcile the conflicting conclusions on auditory feedback and mouse song imitation. The discovery of brain regions and pathways involved in mouse song production should aid interpretation of past studies and inform the design of future studies investigating the effects of social, genetic, and pharmacological manipulation on vocal behavior. Additionally, the discovery of a sparse direct cortical projection to the vocal motor nucleus ambiguus, input to motor cortex from secondary auditory cortex, a controversial requirement for auditory feedback, and a capacity for adaptive vocal modification based on social experience should inform studies investigating the distribution, development and evolution of the rare vocal learning trait. Below we propose such neurobiological and behavioral experiments to help advance the field.
The singing-associated forebrain pathways described in this review included brain regions and connectivity similar to cortico-striatal-thalamic loops for song learning in birds and proposed loops for speech learning in humans (Fig. 5) (Jarvis, 2004; Jürgens, 2009; Lieberman, 2001). To test this idea, future experiments should investigate the proposed connections between dorsolateral striatum and the thalamus, which are likely to go through the globus pallidus. Further investigation should also test whether the corticostriatal-thalamic circuit is dedicated to vocalization as in songbirds, a hypothesis that is difficult to test in human subjects. It is also possible that these circuits serve a non-motor function as suggested by neural activity recorded in monkey premotor cortex before and during conditioned but not spontaneous vocalizations (Coudé et al., 2011).
The direct forebrain projection to Amb in mice appears much less robust than in vocal learning birds (Wild, 1993). The analogous projection in humans also appears sparse relative to songbirds (Iwatsubo et al., 1990; Kuypers, 1958b) but stronger than in mice. We propose that density of direct motoneuron innervation could be a contributing factor to the degree of vocal learning complexity, as this aspect is known to correlate with the level of manual dexterity across mammalian species (Lemon, 2008). A recent study in rats that injected PRV-Bartha into the laryngeal muscles, the same back-tracing technique employed in our mouse study, also found some labeled motor cortical cells (van Daele & Cassell, 2009), as reviewed here for mice. However, the authors found fewer, isolated labeled cells in primary motor cortex at a later survival time (more than 120 hours after injection into laryngeal muscles). They suggested a weak and indirect connection between M1 and Amb, and proposed instead that laryngeal motor cortex is located laterally in the insular cortex; however, they did not demonstrate whether the connection was indirect or discuss the possible implications of these findings. We did so for mice, and suggest that rats might have a rudimentary projection. The presence of a direct cortico-bulbar connection from motor cortex suggests that mice, if not rodents generally, share a neuroanatomical feature with humans not found thus far in our closest primate relatives. Finding this projection in mice makes us wonder if a similar projection may have been missed in past studies on non-human primates. Although Kuypers stated later that non-human primates lack such a direct projection (Kuypers, 1982), his first study using the neural degeneration technique in chimpanzee and macaque did state (but not show) that after M1 lesions: “Only very few, if any, degenerating elements were found among the cells of the ambiguus nuclei.” (Kuypers, 1958a)
Our experiments suggest a need for re-evaluation of a possible direct motor cortical projection to Amb in non-human primates. We performed our tracing experiments by working our way up from the laryngeal muscles, whereas the studies performed in non-human primates worked their way down from the cortex (Jürgens & Ehrenreich, 2007; Simonyan & Jürgens, 2003). We believe future investigations should try a similar approach by injecting transsynaptic tracers into the laryngeal muscles of non-human primates.
Future studies in mice should test whether the motor cortical axons detected on Amb laryngeal motoneurons make functional synaptic connections. This can be accomplished with electron microscopy, a technique that was employed previously to identify the only known direct cortico-bulbar connection to brainstem motoneurons in rodents, from vibrissa motor cortex to VII (Grinevich, Brecht, & Osten, 2005).
The pitch convergence after cross-strain pairings in adult mice is more pronounced than what has been reported previously for non-human primates (Snowdon & Elowson, 1999). Although the data from primates and mice are very different in nature and scale, we believe that together they could indicate a general property of limited vocal learning among mammals that was missed in prior investigations (beyond the changes to amplitude and duration that have been observed in many animal vocalizations). Furthermore, the nature and timing of the pitch convergence was similar to what has been reported for calls in bats and dolphins (Boughman, 1998; Knörnschild et al., 2010; Smolker & Pepper, 1999; Watwood, Tyack, & Wells, 2004). The results of our experiments suggest that mice are capable of at least limited vocal learning in the form of vocal convergence of existing call types. A major difference in our experiments relative to Kikusui et al. (2011) was that we cross-housed animals for up to 8 weeks, whereas Kikusui et al. cross-housed them for no more than 3 weeks. At 3 weeks, we did not yet see a significant group effect. To reconcile these findings, future work should be conducted on cross-fostering or tutoring for 8 weeks or longer. Future work should also investigate whether the learning abilities of mice extend beyond modification of innate templates to the generation of novel sounds or learning syllable sequences. The most convincing evidence of vocal learning would come through successful tutoring of spectral features from heterospecific, artificial, or anthropogenic sounds.
As a supplement to non-human primate studies and complement to songbird studies of vocal communication, mouse models can clearly serve to cover some gaps in understanding the molecular basis of vocal production, social communication dysfunctions, and the evolution of brain systems that form the basic substrates of speech. However, more work is necessary to establish how useful mouse models will be in studying the process of vocal learning. This will be chiefly determined by whether the vocal learning capabilities of mice extend beyond the limits of pitch convergence, but it also requires a clear definition of what constitutes vocal learning.
The current framework for classifying vocal learning and non-learning species presents a dichotomous scheme whereby a species is either: 1) a vocal mimic with the associated neuroanatomical traits shared among all vocal learning species studied to date; or 2) a vocal non-learner producing innate vocalizations without the associated neuroanatomical and developmental characteristics of learners. This schema overlooks some problematic examples, such as species that develop novel vocalizations without mimicry, and the mouse, which does not appear to fully fit either category. Therefore, we propose a new scheme that we believe more accurately reflects the biophysical, ontogenetic, molecular and neuroanatomical evidence: the Continuum Hypothesis.
Examples for most of the proposed categories have already been presented in this review. Based on the available data, we believe that mice should be classified in Group 1d, along with bats. Both mice and bats appear able to adaptively modify existing syllables based on experience. This represents a limited form of vocal learning. Humans and song learning birds belong to Group 2b, which can be further divided into closed-ended and open-ended learners. The latter group continues to learn as adults. Each behavioral phenotype above will likely be associated with a particular type of neural architecture, as proposed below.
The combination of behavioral and neuroanatomical studies proposed will allow researchers to begin testing for a link between the degree of vocal learning capabilities exhibited by various species and the distinct features of the underlying neural systems for vocalization. Properly classifying a species under this scheme will require both behavioral and neuroanatomical investigations of that species. We predict that species able to modify the spectral content of songs will feature a direct motor cortical projection to Amb or XIIts.
Several recent studies have examined the effects of manipulating genes associated with speech disorders in non-human animal models. The most widely known studies investigated FoxP2, a transcription factor gene required for normal speech acquisition in humans and song acquisition in songbirds (Fisher & Scharff, 2009; Haesler et al., 2007; 2004; Lai, Fisher, Hurst, Vargha-Khadem, & Monaco, 2001). Mutating the FoxP2 gene, and introducing the human variant in mice, produced small changes in amplitude and pitch (Enard et al., 2009; Gaub, Groszer, Fisher, & Ehret, 2010). However, these studies did not employ the vocal behavior and neurobiological framework we present in this review, and the authors did not have information about the vocal neural circuits described here when interpreting the effects of FoxP2. With this information, investigators can now ask if FoxP2 expression in the vocalization-activated striatal region in mice is required for pitch convergence, and whether changing the FoxP2 variant expressed in M1 (Hikosoka et al., 2010) alters the strength of the projection to Amb. The identification of a direct M1 to Amb connection opens the possibility of studying the molecular basis for specifying this projection, which is considered one of the most critical steps in the evolution of vocal learning. Identification of the genetic factors involved in developing this connection might even allow for inducing a connection de novo in non-learning species, enhancing the projection in species with limited learning abilities, and perhaps recovering vocal learning abilities after brain injury in species that already learn vocalizations.