A sizable literature on the neuroimaging of speech production has reliably shown activations in the orofacial region of the primary motor cortex. These activations have invariably been interpreted as reflecting “mouth” functioning and thus articulation. We used functional magnetic resonance imaging to compare an overt speech task with tongue movement, lip movement, and vowel phonation. The results showed that the strongest motor activation for speech was the somatotopic larynx area of the motor cortex, thus reflecting the significant contribution of phonation to speech production. In order to analyze further the phonatory component of speech, we performed a voxel-based meta-analysis of neuroimaging studies of syllable-singing (11 studies) and compared the results with a previously-published meta-analysis of oral reading (11 studies), showing again a strong overlap in the larynx motor area. Overall, these findings highlight the under-recognized presence of phonation in imaging studies of speech production, and support the role of the larynx motor cortex in mediating the “melodicity” of speech.
Phonation is an important “umbrella” process when thinking about human vocalization, taking account of much of the segmental aspect of speech, of suprasegmental processes like intonation (Ladd, 1996) and lexical tone (Yip, 2002), and of singing (Sundberg, 1987). Modulation of the pitch and duration of voiced sounds underlies the melodic and rhythmic aspects of speech. The older literature on intonation employed the term “melodicity” to refer to the basic acoustic stream of voicing that occurs during speech production (Fónagy, 1981; Fónagy & Magdics, 1963).
Standard models of vocal production posit the existence of a vocal “source” – i.e., subglottal air pressure from the lungs producing vibration of the vocal folds in the airstream – followed by “filtering” of the source’s sound wave by a series of articulators in the oral and nasal cavities, to ultimately select out certain resonant frequencies in that wave. While all vowels and most consonants require phonation, some consonants can be generated in a voiceless fashion. For fricatives like the /s/ sound, this can simply involve the generation of broadband noise at the larynx in the absence of periodic vocal-fold vibration. However, the majority of the speech stream is phonated. For many languages, the proportion of a spoken sentence’s duration taken up by vowels alone is 40–50% (Ramus, Nespor, & Mehler, 1999). This does not take into account the degree of phonation that comes from voiced consonants, which would make the overall voiced component of a sentence’s duration even higher.
While phonation is a critical component of speech, neuroimaging studies have rarely recognized this point. Imaging studies of speech production reliably show activity in the ventral part of the precentral gyrus – corresponding with the somatotopic “orofacial” region of the motor and premotor cortices – and this activation has almost invariably been interpreted as reflecting articulation (e.g., Fox et al., 2001). The strong, if unspoken, assumption is that speech is first and foremost an articulatory process. Most studies that have sought to examine phonatory aspects of speech have (1) been perceptual rather than production studies (although see Barrett, Pike, & Paus, 2004), and (2) focused on suprasegmental processes like prosody or lexical tone rather than the basic speech stream. A handful of studies have tried to distinguish brain areas for articulation and phonation. For example, Murphy et al. (1997) compared vocalization of a simple phrase with silent mouthing of the phrase (to reveal phonation) and with mouth-closed vocalization of the phrase using the /a/ vowel alone (to reveal articulation). Their primary interest was in examining brain areas involved in respiration for speech. They identified a bilateral region of the sensorimotor cortex that was more active when speech breathing was involved than simple mouthing. Likewise, Terumitsu, Fujii, Suzuki, Kwee, and Nakada (2006) used independent components analysis (ICA) to contrast vocalization of a string of labial syllables with silent articulation of the string without voicing of the vowels or consonants. Their analysis revealed a bilateral region close to the classical tongue region associated with tongue movement and a left-dominant area dorsal to that involved in phonation.
Recent work from our lab has led to the characterization of a somatotopic representation of the larynx in the human motor cortex (Brown, Ngan, & Liotti, 2008). Related work from another lab has shown that this same general region contains a representation of the expiratory muscles as well (Loucks, Poletto, Simonyan, Reynolds, & Ludlow, 2007; Simonyan, Saad, Loucks, Poletto, & Ludlow, 2007). In fact, this area is very close to that which Murphy et al. (1997) associated with speech breathing. (For simplicity, we will refer to this general area as the “larynx motor cortex” in this article.) Hence, the two major components that comprise the vocal source appear to be in close proximity in the motor cortex, perhaps reflecting a unique cortical-level type of respiratory/phonatory coupling specific to human vocalization; for almost all other species, this coupling occurs in the brainstem alone (Jürgens, 2002). Given that our fMRI study showed that the larynx motor cortex was activated comparably by vocal and non-vocal laryngeal tasks (i.e., vocal-fold adduction alone), this area would seem like a good candidate for being a regulator of the melodicity of complex human vocalizations such as speaking and singing.
In order to examine the phonatory component of speech, we analyzed motor cortex activations for a speech production task in comparison to elemental control tasks for tongue movement, lip movement, and monotone vowel phonation, with the intent of looking for potential additivity. In a second study, we used activation likelihood estimation (ALE) meta-analysis to compare a previously-published meta-analysis of word production (Turkeltaub, Eden, Jones, & Zeffiro, 2002) with a new meta-analysis of simple phonation, namely syllable-singing. The goal of the combined analysis was to characterize the neural contribution of phonation to speech production, a point that has been absent in most previous neuroimaging analyses of speech production.
Sixteen subjects (eight males, eight females), with a mean age of 28.4 years (ranging from 21 to 49 years), participated in the study after giving their informed consent (Clinical Research Ethics Board, University of British Columbia). Each individual was without neurological, psychiatric or audiological illness. Subjects were all fluent English speakers but were unselected with regard to handedness. Three of the subjects were left-handed.
Subjects performed six oral tasks (one task per fMRI scan), each one according to a simple blocked design of 16 s of a resting condition and 16 s of an oral task. The task order was randomized across subjects. All tasks were performed with the eyes open. Four of the tasks are described in this study. (1) Speech. Subjects read passages aloud from the medieval epic poem Beowulf with the teeth together, and thus with no jaw movement. Subjects were trained to read the passages at a very slow pace (1–2 syllables per second) so as to make the rate more comparable with the following three comparison tasks. (2) Monotone-phonation using the schwa vowel. Subjects were instructed to sing a comfortable pitch of their choice using the schwa vowel, with the teeth together but with a very small lip opening to permit oral air flow and avoid humming. Hence, articulatory changes should have been minimal within the task-blocks, as well as between the task and rest blocks. After each 4–6-note breath cycle, subjects were to take a gentle, controlled inspiration through the mouth. The recommended rate of vocalization was 1 Hz. This could be considered equivalently as a monovowel or monotone task. (3) Lip protrusion. Subjects were instructed to pucker their lips and then return them to a resting position, and to do so at a rate of roughly 1 Hz. They were encouraged to make a small gesture and to avoid contracting other facial muscles. (4) Vertical tongue movement within the mouth. Subjects were instructed to move the tip of their tongue from the floor of the mouth to the hard palate with the lips together but with the teeth just slightly separated so as to create adequate space for tongue movement. The recommended rate was 1 Hz. The results for the last two tasks are partially described in Brown et al. (2008). 
Subjects underwent a 30-min training session on a day prior to the scanning session in order to learn how to perform the tasks in a highly controlled manner with a minimum of head or body movement.
Magnetic resonance images were acquired with a Philips Achieva 3-Tesla MRI at the MRI Research Centre of the University of British Columbia in Vancouver. The subject’s head was firmly secured using a custom head holder and “memory” pillow. Ear plugs were used to help block out scanner noise. Subjects performed each task as 16 s epochs of an oral task alternating with 16 s epochs of rest during the course of a 6′24″ scan. During all tasks but speech, the name of the task (e.g., “Lips”), positioned above a cross-hair, was projected from an LCD projector onto a screen mounted at the head of the MRI table, with an angled mirror on the head coil reflecting text from the screen into the participant’s field of view. During the speech task, short passages from Beowulf were projected; a different passage was presented during each task epoch. During the rest periods for all tasks but speech, the word “Rest”, positioned above a cross-hair, was projected onto the screen. During the rest periods for the speech task, an abstract line drawing was projected so as to subtract out visual activations as much as possible, as pilot testing showed that the cross-hair alone did not achieve this. All stimuli were created and presented using Presentation software (Neurobehavioral Systems, Albany, CA).
Functional images sensitive to the “blood oxygen level dependent” (BOLD) signal were collected with a gradient echo sequence (TR = 2000 ms, TE = 30 ms, flip angle 90°, 36 slices, 3 mm slice thickness, 1 mm gap, matrix = 80 × 80, field of view = 240 mm, voxel size 3 mm isotropic), effectively covering the whole brain (145 mm of axial extent). A total of 192 brain volumes was acquired over 6′24″ of scan time, corresponding with 12 alternations between 16 s epochs of rest and 16 s epochs of task.
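The acquisition timing stated above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that recomputes the scan duration and the number of rest/task cycles from the reported TR, volume count, and epoch length; the variable names are illustrative and are not taken from the original study.

```python
# Timing parameters as reported in the text.
TR_S = 2.0        # repetition time in seconds (TR = 2000 ms)
N_VOLUMES = 192   # brain volumes acquired per scan
BLOCK_S = 16.0    # duration of each rest or task epoch in seconds

scan_s = TR_S * N_VOLUMES                    # total scan time in seconds
minutes, seconds = divmod(int(scan_s), 60)   # split into minutes and seconds
cycles = scan_s / (2 * BLOCK_S)              # one cycle = one rest epoch + one task epoch
volumes_per_block = BLOCK_S / TR_S           # volumes acquired within each epoch

print(f"scan time: {minutes} min {seconds:02d} s")
print(f"rest/task cycles: {cycles:g}")
print(f"volumes per epoch: {volumes_per_block:g}")
```

The numbers agree with the reported 6′24″ scan and 12 alternations, with eight volumes falling in each epoch.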
Functional images were reconstructed offline, and the scan series was realigned and motion corrected using the methods in SPM2 (Wellcome Department of Cognitive Neurology, University College London, UK), as implemented in Matlab (Mathworks, Natick, MA). While subject motion was a concern for this study, analysis of the realignment parameters indicated that translation and rotation corrections did not exceed an acceptable level of 1.5 mm and 1.5°, respectively, for any of the participants. Following realignment, a mean functional image was computed for each run. The mean image was normalized to the Montreal Neurological Institute (MNI) template (Friston et al., 1995a, 1995b), and this transformation was then applied to the corresponding functional series. The normalized functional images (4 mm isotropic voxels) were smoothed with an 8 mm (full-width-at-half-maximum) isotropic Gaussian filter. The BOLD response for each task-block was modeled as the convolution of a 16 s boxcar with a synthetic hemodynamic response function composed of two gamma functions. Beta weights associated with the modeled hemodynamic responses were computed to fit the observed BOLD-signal time course in each voxel for each subject using the general linear model, as implemented in SPM2. Each subject’s data were processed using a fixed-effects analysis, corrected for multiple comparisons using family-wise error, with a threshold of p < 0.05 (t > 4.99) and no extent threshold. Contrast images for each task-versus-rest analysis for each subject were brought forward into a random effects analysis, where a significance level of p < 0.025 was employed (“false discovery rate” correction for multiple comparisons for the whole brain; Genovese, Lazar, & Nichols, 2002) and no extent threshold. The critical t value varied across contrasts and was: t > 3.59 for speech, t > 4.36 for tongue movement, t > 4.47 for lip movement, and t > 5.07 for monotone-phonation.
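The block regressor described above (a 16 s boxcar convolved with a two-gamma hemodynamic response function) can be illustrated with a short sketch. This is not the SPM2 implementation. The gamma shape parameters (6 and 16) and the undershoot ratio (1/6) are commonly cited canonical defaults and are assumed here because the text does not state them, and the assumption that each scan begins with a rest epoch is likewise ours.

```python
import math

TR_S = 2.0        # repetition time in seconds, as reported
N_VOLUMES = 192   # volumes per scan, as reported
BLOCK_S = 16.0    # epoch length in seconds, as reported

def double_gamma_hrf(t, peak_shape=6.0, under_shape=16.0, ratio=1.0 / 6.0):
    """Two-gamma HRF at time t (seconds); parameter values are assumed defaults."""
    if t <= 0:
        return 0.0
    peak = t ** (peak_shape - 1) * math.exp(-t) / math.gamma(peak_shape)
    undershoot = t ** (under_shape - 1) * math.exp(-t) / math.gamma(under_shape)
    return peak - ratio * undershoot

def boxcar(n_volumes, tr, block):
    """1.0 during task epochs, 0.0 during rest; assumes the scan starts with rest."""
    return [1.0 if int((i * tr) // block) % 2 == 1 else 0.0 for i in range(n_volumes)]

def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of the signal."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k in range(min(i + 1, len(kernel))):
            acc += signal[i - k] * kernel[k]
        out.append(acc)
    return out

hrf = [double_gamma_hrf(k * TR_S) for k in range(16)]  # 32 s kernel sampled at the TR
box = boxcar(N_VOLUMES, TR_S, BLOCK_S)
regressor = convolve(box, hrf)                         # predicted BOLD time course
```

In a general linear model, a regressor of this kind (plus a constant term) would be fitted to each voxel's time course to obtain the beta weight for the task.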
MNI coordinates were converted into the coordinates of Talairach and Tournoux (1988) using a non-linear transformation, as implemented in the WFU PickAtlas (Maldjian, Laurienti, Kraft, & Burdette, 2003) and based on the method of Brett (imaging.mrc-cbu.cam.ac.uk/imaging/MniTalairach), except for the case of the cerebellum, where MNI coordinates were retained because of errors incurred by coordinate conversion.
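For readers unfamiliar with the Brett transform, the sketch below shows its general form: a piecewise-linear mapping with different coefficients above and below the plane of the anterior commissure. The coefficients are those commonly quoted for Brett's mni2tal routine and should be checked against the original source before use. The function name is hypothetical, and this is not the WFU PickAtlas code.

```python
def mni_to_talairach_brett(x, y, z):
    """Approximate MNI -> Talairach conversion (coefficients as commonly quoted)."""
    x_t = 0.9900 * x
    if z >= 0:
        # Coefficients applied at or above the anterior commissure plane.
        y_t = 0.9688 * y + 0.0460 * z
        z_t = -0.0485 * y + 0.9189 * z
    else:
        # Coefficients applied below the anterior commissure plane.
        y_t = 0.9688 * y + 0.0420 * z
        z_t = -0.0485 * y + 0.8390 * z
    return (x_t, y_t, z_t)

# Illustrative coordinate only, not a peak reported in this study.
print(mni_to_talairach_brett(-40.0, -10.0, 30.0))
```

Because the mapping changes at z = 0 and was derived for the cerebrum, it is plausible that it performs poorly for inferior structures, which is consistent with the decision to leave cerebellar coordinates in MNI space.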
Meta-analysis of 11 published studies of syllable-singing was performed using activation likelihood estimation (ALE) analysis. The studies are listed in Table 2. Our inclusion criteria were: (1) that the papers provided either Talairach or MNI coordinates for their activation foci (hence excluding Özdemir, Norton, & Schlaug, 2006); (2) that all of the brain was imaged; (3) that only syllables were sung, but no words or sentences (hence excluding Jeffries, Fritz, & Braun, 2003, and Kleber, Birbaumer, Veit, Trevorrow, & Lotze, 2007); and (4) that overt phonation was used as part of the task (hence excluding all studies of covert production). (5) We decided to exclude articles that only presented high-level contrasts, i.e., no contrast to a low-level control condition such as rest or a perceptual baseline, as we wanted to place our focus on motor activations. On this basis, we excluded the article of Saito, Ishii, Yagi, Tatsumi, and Mizusawa (2006).
Coordinates for activation foci from conditional contrasts were taken from the original publications. No deactivations were examined in the meta-analysis, as none of the papers reported them. We used the implementation of ALE (Laird et al., 2005a) that is contained within the BrainMap database (http://brainmap.org; Fox & Lancaster, 2002; Laird et al., 2005b). MNI coordinates were automatically converted to Talairach coordinates using the method of Brett cited above. All coordinates were then blurred with a full-width-at-half-maximum of 12 mm. The ALE statistic was computed for every voxel in the brain according to the algorithm developed by Turkeltaub et al. (2002). A permutation test using 5000 permutations was performed to determine the statistical significance of the ALE results, which were thresholded at p < 0.05 using the “false discovery rate” correction for multiple comparisons (Laird et al., 2005a). The ALE maps presented in Fig. 2 are shown overlaid onto an anatomical template generated by spatially normalizing the International Consortium for Brain Mapping (ICBM) template to Talairach space (Kochunov et al., 2002).
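The core idea of the ALE statistic can be sketched as follows: each reported focus is modeled as a three-dimensional Gaussian probability distribution, and the ALE value at a voxel is the probability that at least one focus lies within that voxel. This is a simplified illustration of the approach described by Turkeltaub et al. (2002), not the BrainMap implementation. The 2 mm voxel size and the example foci are assumptions made purely for illustration.

```python
import math

FWHM_MM = 12.0  # smoothing kernel reported in the text
# Convert full-width-at-half-maximum to the Gaussian standard deviation.
SIGMA_MM = FWHM_MM / (2.0 * math.sqrt(2.0 * math.log(2.0)))
VOXEL_VOLUME_MM3 = 8.0  # assumption: 2 mm isotropic voxels

def focus_probability(voxel, focus, sigma=SIGMA_MM, voxel_volume=VOXEL_VOLUME_MM3):
    """Probability that a given focus is located within the given voxel."""
    d2 = sum((v - f) ** 2 for v, f in zip(voxel, focus))
    density = math.exp(-d2 / (2.0 * sigma ** 2)) / ((2.0 * math.pi) ** 1.5 * sigma ** 3)
    return density * voxel_volume

def ale(voxel, foci):
    """ALE value: probability that at least one focus lies within the voxel."""
    p_none = 1.0
    for focus in foci:
        p_none *= 1.0 - focus_probability(voxel, focus)
    return 1.0 - p_none

# Hypothetical foci (Talairach mm), for illustration only.
example_foci = [(-44, -10, 34), (-46, -12, 32), (50, -8, 30)]
near = ale((-45, -11, 33), example_foci)   # voxel close to two of the foci
far = ale((0, 40, 0), example_foci)        # voxel far from all foci
print(near > far)
```

In the full procedure, the resulting ALE map is compared against a null distribution built by permutation (5000 permutations here) before thresholding.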
Fig. 2 also shows the results of a re-analysis of the Turkeltaub et al. (2002) reading data that we use as a comparison for the results of the syllable-singing meta-analysis. The data are different in two respects compared to the original publication. First, our analysis used a false discovery rate threshold of 0.05 based on 5000 permutations in order to correct for multiple comparisons, whereas the original analysis was uncorrected and used a threshold of p < 0.0001 based on 1000 permutations; and second, MNI coordinates were presented in the Turkeltaub analysis, whereas we converted MNI coordinates into Talairach space. In addition, Table 3 presents submaxima for several of the major peaks, some of which are not reported in the original publication. In order to make the syllable-singing and reading analyses more comparable, we applied an extent threshold of 400 mm³ to the reading analysis, corresponding to the smallest cluster reported for the syllable-singing analysis. Finally, although a new transformation procedure for converting MNI coordinates to Talairach coordinates was published (Lancaster et al., 2007) and was implemented in the ALE procedure, we chose to use the Brett procedure in this analysis for two reasons. First, we wanted the fMRI results to correspond with previously-published results from this data set (Brown et al., 2008), which used the Brett transform. Second, we wanted the reading meta-analysis coordinates to match as closely as possible the published coordinates in Turkeltaub et al. (2002).
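The false discovery rate correction mentioned above is typically based on the step-up procedure of Benjamini and Hochberg, as discussed for neuroimaging by Genovese et al. (2002). The sketch below shows that procedure in its basic form. It is an illustration under that assumption, not the code used in the ALE software, and the p-values are invented for the example.

```python
def fdr_threshold(p_values, q=0.05):
    """Largest p-value passing the Benjamini-Hochberg criterion, or None if none pass."""
    ranked = sorted(p_values)
    m = len(ranked)
    threshold = None
    for k, p in enumerate(ranked, start=1):
        # Keep the largest ranked p-value that falls at or below its criterion.
        if p <= q * k / m:
            threshold = p
    return threshold

# Invented voxel-wise p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(fdr_threshold(example_p, q=0.05))
```

Voxels with p-values at or below the returned threshold are declared significant. Unlike a fixed uncorrected threshold such as p < 0.0001, the cutoff adapts to the distribution of p-values in the map.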
An analysis of the speech task vs. rest (Fig. 1 and Table 1) showed bilateral activations in the part of the motor/premotor cortex that Brown et al. (2008) identified as the larynx representation, showing ventromedial peaks (slice at z of 32) and dorsolateral peaks (slice at z of 40). A second major activation focus in the motor cortex was found in the Rolandic operculum, which we showed previously contains, at least in part, the ventral portion of the somatotopic tongue representation (Brown et al., 2008), thus reflecting articulation. Examination of the peaks at z slice 32 shows that there is much activity smeared lateral to the larynx peak. This most likely represents the labial contribution to speaking, although SPM did not identify a separate focus of activation here. Additional motor activations were seen in the supplementary motor area (SMA) and two distinct regions of the cerebellum bilaterally, namely lobules VI and VIIIA. Auditory activations were seen bilaterally in both the anterior and posterior parts of the superior temporal gyrus (STG) and sulcus, including those involved in voice perception (Belin, Zatorre, Lafaille, Ahad, & Pike, 2000). Most of the sensorimotor activations for speech were bilateral except for a left-hemisphere focus in area Spt in the posterior part of the STG.
The bottom part of Fig. 1 shows direct subtractions of tongue movement, lip movement, or monotone-phonation from the speech task, with an emphasis on the motor cortex. Subtraction of either tongue or lip movement from speech revealed a residual peak in the ventromedial larynx area, suggestive of the role of this area in phonation. Subtraction of tongue movement, but not lip movement or phonation, eliminated activity in the Rolandic operculum, suggestive of a primary role of this region in tongue movement rather than phonation. The phonation condition was the least effective subtraction control, as it failed to appreciably subtract out larynx activity from the speech condition. The potential reasons for this are discussed below. Overall, these results show that individual components of speech can be eliminated using a subtractive approach, hence arguing for a basic additivity of the speech system as well as for the common recruitment of motor-cortical regions by speech and non-speech articulator movements.
A second approach was taken to look at the melodicity of speech, namely a comparison between a previously-published meta-analysis of overt speech (i.e., oral reading of word lists) and our own meta-analysis of 11 studies of syllable-singing. A total of 283 foci from these studies were used in the ALE meta-analysis. In contrast to the minimalist fMRI phonation task used here, almost all of the previously published phonatory tasks had a definite articulatory component to them, using syllables like /da/ and /pa/ (see Table 2 for details). The major ALE clusters are shown in Fig. 2a, with Talairach coordinates presented in Table 3.
The two meta-analyses showed common activations in both the larynx motor cortex, indicative of phonation, and the Rolandic operculum, indicative of articulation. It is interesting to note that, compared to the simple schwa vowel used in our fMRI monotone task, almost all of the tasks in the syllable-singing meta-analysis used syllables that involved articulatory transitions between consonants and vowels. Hence, while we did not see the Rolandic operculum in the fMRI phonation task, this area did indeed show up strongly in the meta-analysis, most likely reflecting the occurrence of articulation in these tasks. The meta-analysis findings bolster the fMRI results in highlighting the role of the larynx motor cortex in basic melodicity as well as in permitting a somatotopic assignment of phonation and lingual articulation to two regions of the motor cortex. Since syllable-singing reproduced much of the activation pattern of complex speech, it appears that, at the motoric level, speech is indeed a combination of its phonatory and articulatory components.
As a final step, we performed a comparison between the syllable-singing meta-analysis and a previously-published meta-analysis of 11 studies of overt word-reading (Turkeltaub et al., 2002), as shown in Fig. 2b and Table 4. Table 4 shows the strong overlap in activity between the peak coordinates of the syllable-singing and reading meta-analyses. The vast majority of the foci in the reading meta-analysis were present in the syllable-singing meta-analysis, again with a substantial overlap in the larynx motor cortex, this time having a good match to the ventromedial peak. Highly similar results were obtained when the syllable-singing meta-analysis was compared with another meta-analysis of overt reading, namely the data of Brown, Ingham, Ingham, Laird, and Fox (2005), looking at fluent control subjects in eight published studies of stuttering (data not shown). We chose to focus here on the Turkeltaub analysis because it was based on 172 foci, compared with only 73 foci for the stuttering controls; the overall profile was nonetheless very similar.
In this study, we attempted to look at speech in a somatotopic manner, and especially to illuminate the role of phonation in speech production. We use these analyses to formulate a general model of vocalization in the brain.
Our previous fMRI study (Brown et al., 2008) established a representation of the larynx in the motor cortex, one which overlaps an area involved in voluntary control of expiration (Loucks et al., 2007; Simonyan et al., 2007). Using this motor cortex focus as a reference, we were able to demonstrate for the first time that connected speech gives its principal motor cortex activation in the larynx area, thereby supporting the notion that much of the speech signal is voiced, including all vowels and a majority of consonants. Previous neuroimaging studies on speech production have not made this point about phonation, and have instead talked about activity in the “mouth” or “face” area of the motor cortex (e.g., Fox et al., 2001), with the implication being that speech is mainly articulatory. Knowing the location of the larynx area, we were able to interpret residual activations in the motor strip as being related to articulation, mainly in the Rolandic operculum for tongue movement and the region lateral to the larynx area for lip movement. This is a first step toward a somatotopic dissection of phonation and articulation in the cortical motor system. The study of Terumitsu et al. (2006) seemed poised to make the same point, in that the authors compared phonated vs. mouthed versions of the same polysyllable string. However, their analyses did not involve a direct contrast between the voiced and unvoiced tasks, and what they called “phonation” in their ICA analysis included articulation as well as phonation, as evidenced by ICA clusters in the Rolandic operculum.
The results with the speech task match very closely the findings of two voxel-based meta-analyses of overt reading. Turkeltaub et al. (2002) published an activation likelihood estimation (ALE) meta-analysis of 11 studies of oral reading, and found the region of greatest concordance across these studies in the motor cortex to be at −48, −12, 36, and 44, −10, 34, very close to our ventromedial speech peaks at −40, −12, 30, and 44, −10, 30. Likewise, Brown et al. (2005) performed two parallel ALE meta-analyses of eight studies of oral reading in stutterers and fluent controls, respectively. The peak M1 activations for the control subjects were at −49, −9, 32, and 54, −10, 34, and those for the stutterers were at −45, −16, 31, and 48, −12, 32, again quite close to the ventromedial M1 peaks for the speech task in this study. Both of these meta-analyses identified the larynx area as a major location of activation during oral reading. They also found bilateral activations at the Rolandic operculum, very close to our fMRI tongue coordinates. Hence, the general pattern seen to emerge from imaging studies of speech production is two major sites of activation in the motor cortex: the larynx area deep in the central sulcus, and the tongue area in the Rolandic operculum. While other processes are clearly critical for speech production – not least muscular activity in the lips, velum and pharynx – larynx and tongue activities might be the most readily identifiable ones because of their distance in the motor cortex. For example, in our previous imaging study (Brown et al., 2008), lip and tongue movement showed a region of overlapping activity, although this was dorsal to the tongue-related region of the Rolandic operculum.
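To make the proximity of these peaks concrete, the Euclidean distances between the coordinates quoted above can be computed directly. The sketch below uses only the coordinates given in the preceding paragraph; the dictionary labels are illustrative.

```python
import math

# Ventromedial speech peaks from the present fMRI study (left, right).
speech = {"left": (-40, -12, 30), "right": (44, -10, 30)}

# Motor cortex peaks from the two reading meta-analyses, as quoted in the text.
meta = {
    "Turkeltaub et al. (2002)": {"left": (-48, -12, 36), "right": (44, -10, 34)},
    "Brown et al. (2005), controls": {"left": (-49, -9, 32), "right": (54, -10, 34)},
    "Brown et al. (2005), stutterers": {"left": (-45, -16, 31), "right": (48, -12, 32)},
}

distances = {
    (study, side): math.dist(speech[side], peak)
    for study, peaks in meta.items()
    for side, peak in peaks.items()
}

for (study, side), d in distances.items():
    print(f"{study}, {side}: {d:.1f} mm")
```

All six distances fall at or below roughly 11 mm, which is less than the 12 mm full-width-at-half-maximum used to blur the foci in the ALE analyses.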
Given our separability of activity in the larynx area and Rolandic operculum during simple phonation and tongue movement, respectively, and their combination during speech (in addition to presumptive activity in lip-related areas), there does seem to be a basic additivity of phonation and articulation that comes into play during speech production. Looking to the subtraction analysis, we obtained mixed results. While tongue and lip movement nicely subtracted out articulation-related activity in the motor cortex during speech, the monotone-phonation task was not very effective at subtracting out the larynx peak of speech. Interestingly, a similar result was found in the study of Murphy et al. (1997). Their contrast was better matched than ours in that they compared the vocalization of a phrase with mouth-closed vocalizing of the same phrase using the /a/ vowel. Hence, much about the melody and rhythm of the original phrase should have been contained in the unarticulated version. Their subtraction revealed bilateral peaks in the sensorimotor cortex quite close to the ventromedial larynx area. Why might the larynx activation during speech be difficult to subtract out with phonatory control tasks, especially given the efficiency of the subtraction of articulatory areas using articulatory control tasks? One speculation is that co-articulation during speech production may activate the larynx area in a much stronger manner than tasks that involve a single articulatory posture, such as during the monovowel tasks used in this study and that of Murphy et al. (1997). Likewise, speech tasks show an oscillatory cycling between voiced and unvoiced sounds that is not seen in the control tasks. Given the overlap in the larynx coordinates between the reading and syllable-singing meta-analyses, the effect that we and Murphy et al. are seeing is most likely quantitative rather than qualitative.
Further work is needed to clarify this point, not least an analysis of potential neural sub-domains within the larynx motor cortex for vocal-fold tension vs. relaxation, and abduction vs. adduction.
We would like to consolidate the results of the fMRI experiment and meta-analyses into a model of vocalization (Fig. 3), one that focuses on the generation of sounds at the vocal source, and hence phonation. A very similar model of vocal production is presented by Bohland and Guenther (2006), as discussed below. The fMRI monotone-phonation task as well as the 5-note phonation task used in Brown et al. (2008) were designed to be as pure a model of phonation as possible, minimizing the contribution of articulation to the brain activations. We would like to consider the activation pattern of these tasks as a minimal model of “primary” areas for phonation, and then contrast that with data from the fMRI speech task and the two meta-analyses in order to characterize “secondary” areas that may tap more into articulation or general orofacial functioning than phonation.
The primary vocal circuit consists principally of three motor areas: (1) the larynx motor cortex and associated premotor cortex; (2) lobule VI of the cerebellum; and (3) the SMA. The primary auditory areas are Heschl’s gyrus and the auditory association cortex of the posterior STG, including area Spt. Secondary vocal areas include: (1) the Rolandic operculum (the ventral part of the motor cortex, hence included with the M1/premotor box in the figure), (2) the putamen and ventral thalamus, (3) the cingulate motor area, and (4) the frontal operculum/anterior insula. In Fig. 3, primary areas are shown with shaded boxes, and secondary areas with white boxes. The connectivity model in the diagram is largely based on the connections of the larynx motor cortex in the Rhesus monkey (Simonyan & Jürgens, 2002, 2003, 2005a, 2005b), in which most of the areas listed are reciprocally connected with the motor cortex, the exceptions being the cerebellum and putamen, which feed back to the cortex indirectly via the ventral thalamus. Regarding auditory areas, it is not known if they project directly to the motor cortex or if they have to pass through a relay like Broca’s area, as posited in the standard Geschwind model of speech (Catani, Jones, & ffytche, 2005). In the monkey, there is a minor projection from the posterior STG to the larynx motor cortex (Simonyan & Jürgens, 2002, 2005b); hence, such a pathway could exist in humans as well. Preliminary diffusion tensor imaging work from our lab suggests that there is indeed direct connectivity between temporoparietal auditory areas and the orofacial precentral gyrus via the arcuate fasciculus (unpublished observations).
Lobule VI of the posterior cerebellum showed the highest ALE score of any brain region in the meta-analysis. Somatotopic analysis has demonstrated that this is indeed an orofacial part of the cerebellum (Grodd, Hülsmann, Lotze, Wildgruber, & Erb, 2001). We showed that this region is activated by lip movement and tongue movement as well as vocalization. Hence, while this region seems to be an obligatory component of the vocal circuit, there is probably little about it that is voice-specific, although there may be somatotopic sub-domains for each effector within this general area. This stands in contrast to lobule VIIIA of the ventral cerebellum, which was activated by both speech and monotone-phonation but which did not show activity for lip and tongue movement (although see Watanabe et al. (2004) for activity in this region during tongue movement) or show ALE foci in either meta-analysis. Given that half of the studies in the syllable-singing meta-analysis were PET studies, and given the fact that many older PET machines had an axial span of only 10 cm, it is likely that the ventral part of the cerebellum was cut off in many of the studies used in the meta-analysis (e.g., Brown, Martinez, Hodges, Fox, & Parsons, 2004; Brown, Martinez, & Parsons, 2006; see Petacchi, Laird, Fox, & Bower, 2005, for a discussion of this topic). Hence, lobule VIIIA may be a brain area that has been under-represented in studies of overt vocalization thus far and may therefore be expected to appear with greater frequency in future publications of speech and song (e.g., Bohland & Guenther, 2006; Riecker et al., 2005).
The SMA is one of a handful of brain areas which when lesioned can give rise to mutism, and stimulation of this area can elicit vocalization in humans but not monkeys (Jürgens, 2002). The SMA is organized somatotopically (Fontaine, Capelle, & Duffau, 2002) but, as with lobule VI of the cerebellum, there is no information as to whether there are effector-specific zones within the somatotopic orofacial area of the SMA. While the SMA is routinely activated in studies of both speech and song production, its exact role is unclear. The SMA is classically associated with activities like bimanual coordination (Carson, 2005), and stimulation of the SMA can lead to simultaneous activation of linked effectors, such as the whole arm. It is thus reasonable to presume that the SMA plays some role in the sequential coordination of effectors during vocal production, although this area is clearly activated when single effectors such as the lips or tongue are used. In their qualitative meta-analysis of 82 studies of single-word processing, Indefrey and Levelt (2004) argued that the SMA was involved in articulatory planning. In addition, they found that the SMA was active in both covert and overt word-reading tasks. In support of this, the SMA has also been found to be active in many studies of covert singing (Callan et al., 2006; Halpern & Zatorre, 1999; Riecker, Ackermann, Wildgruber, Dogil, & Grodd, 2000a). Hence, the SMA plays some role in motor planning, motor sequencing, and/or sensorimotor integration, but the exact role in vocalization is not well understood.
The putamen gave one of the most complex profiles of any area in these analyses. No activity was seen for the fMRI monotone-phonation task, whereas there was a strong left-hemisphere focus for speech. That said, the putamen showed very strong ALE foci bilaterally in the syllable-singing meta-analysis and reasonably good concordance across its contributing studies, hence creating an inconsistency between the fMRI study and the meta-analysis. One potential resolution to this inconsistency is to posit that the putamen is more important for articulation than phonation. In the fMRI study, we found more putamen activity for lip movement and tongue movement than for simple phonation. Likewise, many studies have shown activity in the putamen during lip movement, tongue movement, and voluntary swallowing (Corfield et al., 1999; Gerardin et al., 2003; Martin et al., 2004; Rotte, Kanowski, & Heinze, 2002; Watanabe et al., 2004). One problem with this interpretation is that damage to the basal ganglia circuit gives rise to severe dysphonia in addition to articulatory problems (Merati et al., 2005). This would seem to suggest that the basal ganglia play a direct role in phonation. It is interesting in this regard that the only major voice therapy that seems successful at ameliorating Parkinsonian dysphonia, namely Lee Silverman Voice Therapy (Ramig, Countryman, Thompson, & Horii, 1995; Ramig et al., 2001), is a phonation-based therapy that indirectly improves articulation as a by-product (Dromey, Ramig, & Johnson, 1995; Sapir, Spielman, Ramig, Story, & Fox, 2007). The role of the putamen in phonation and articulation is in need of further exploration. For the time being, we put it in the category of “secondary” areas. We do the same for the ventral thalamus. 
The co-occurrence of the ventral thalamus with the putamen (i.e., both were absent in the fMRI monotone task, both were present in the syllable-singing meta-analysis, and the thalamus was only present in articles that reported putamen activation in the syllable-singing meta-analysis) probably reflects the connectivity of the basal ganglia, which send their output from the internal segment of the globus pallidus to anterior parts of the ventral thalamus. The cerebellum’s projection to the cerebral cortex also passes through a part of the ventral thalamus (posterior to the basal ganglia projection), and so it is unclear why there should be an absence of ventral thalamus activation in the presence of strong cerebellar activity. The thalamus showed relatively low concordance across studies in the meta-analysis.
One interesting point of reference with regard to the basal ganglia comes from the studies of Riecker et al. (2005) and Riecker, Kassubek, Groschel, Grodd, and Ackermann (2006), which were included in the meta-analysis. These studies examined the tempo of vocalization, looking at repetitions of the monosyllable /pa/ over the range of 2–6 Hz. Activity in lobules VI and VIIIA of the cerebellum showed positive correlations with syllable rate, whereas activity in the putamen and caudate nucleus showed negative correlations. Putamen activity decreased monotonically for speaking rates ranging from 2 to 6 Hz (Riecker, Kassubek, Groschel, Grodd, & Ackermann, 2006); the two cerebellar regions showed the reverse pattern. Our profile of high cerebellum and low putamen does not follow from the assumption that these patterns would extend to 1 Hz, the suggested production rate for our fMRI monotone task. Again, the absence of articulatory changes in our singing task may be a more important factor than tempo per se in explaining the absence of putamen activity.
The cingulate motor area gave low concordance in the meta-analysis, and was not found to be active in the speech or monotone-phonation fMRI tasks. Unlike the larynx motor cortex, the CMA is the only cortical part of the monkey brain which, when lesioned, disrupts vocalization (Sutton, Larson, & Lindeman, 1974, but see Kirzinger & Jürgens, 1982). The projection from the cingulate cortex to the periaqueductal gray is thought to represent an ancestral vocalization pathway in primates that is perhaps more important for involuntary vocalizations than voluntary ones like speech. This area may indeed be more involved in emotive vocalizations than learned vocalizations such as speech and song in humans. It is interesting to note that almost all of the studies in the meta-analysis that showed CMA activation employed monotone tasks rather than melodic singing tasks. Hence, the CMA may have some preference for simple vocal tasks, as shown by its activation in monotone (Brown et al., 2004; Perry et al., 1999), monovowel (Sörös et al., 2006), and monosyllable (Bohland & Guenther, 2006; Riecker et al., 2006) tasks. This hypothesis is consistent with the reading study of Barrett et al. (2004), in which subjects had to read semantically-neutral passages under conditions of either happy or sad mood induction. Regressions with affect-induced pitch range showed that the more monotonous the speech became during sad speech, the greater the activity in the CMA. The major Talairach coordinate for this regression was at −8, 18, 34, which corresponds quite well with one of the CMA coordinates from the syllable-singing meta-analysis at −2, 16, 32. CMA activity may thus be sensitive to melodic complexity, showing a preference for low-complexity vocal tasks having minimal pitch variation, which may reflect its evolution from a system involved in simple, stereotyped vocalizations. Might the CMA be the brain’s “chant” center? 
Further work is needed to clarify the role of the cingulate cortex in vocalization.
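As a rough check on how closely the two CMA coordinates quoted above agree, the following minimal sketch computes the straight-line distance between them. It assumes only that both triplets are Talairach (x, y, z) values in millimetres, as given in the text; the variable and function names are illustrative and do not come from the original analyses.

```python
import math

# Talairach coordinates (x, y, z) in mm, copied from the text above.
# The names below are illustrative labels, not identifiers from the studies.
barrett_regression_peak = (-8, 18, 34)  # pitch-range regression, Barrett et al. (2004)
meta_analysis_cma_focus = (-2, 16, 32)  # CMA focus, syllable-singing meta-analysis


def euclidean_distance_mm(a, b):
    """Return the straight-line distance between two 3-D coordinates."""
    return math.dist(a, b)


separation = euclidean_distance_mm(barrett_regression_peak, meta_analysis_cma_focus)
print(f"Separation between the two CMA coordinates: {separation:.1f} mm")
```

The two coordinates differ by 6 mm in x and by 2 mm in each of y and z, giving a separation of about 6.6 mm. Whether that counts as close depends on the spatial smoothing and resolution of the contributing studies, which this sketch does not model.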
The frontal operculum and medial-adjacent anterior insula represent yet another difficult case for our model. As with the putamen, activity in this region was much stronger in the meta-analysis than in the fMRI monotone task. We again speculate that this area encodes generalized orofacial functions and thus might be equally involved in articulation and phonation. The fMRI study showed comparable activity in the frontal operculum for lip movement and tongue movement as for vocalization. This casts doubt on a phonation-specific role for this region. In addition, the most typical symptom associated with damage to the anterior insula is apraxia of speech and not dysphonia alone (Jordan & Hillis, 2006; Ogar, Slama, Dronkers, Amici, & Gorno-Tempini, 2005). Hence, damage to this region is much more likely to result in articulatory deficits than phonatory ones, although both seem to co-occur. As Ogar et al. (2005) point out: “Prosodic deficits, however, are thought to be a secondary effect of poor articulation” (p. 428). It is for these reasons that we put the frontal operculum and adjacent anterior insula into the category of “secondary” areas for vocalization. Several models of vocal production have ascribed an important role to the anterior insula in phonological processing (Ackermann & Riecker, 2004; Bohland & Guenther, 2006; Indefrey & Levelt, 2004; Riecker et al., 2005, 2006). In their meta-analysis, Indefrey and Levelt (2004) associated the anterior insula most strongly with “phonological code retrieval”, which is a process of searching for phonological words that match a lexically selected item. They found less evidence for a role of the anterior insula in actual speech production, a result counter to the perspective of Ackermann and Riecker (2004). Riecker et al. (2006) found that activity in the insula increased monotonically with syllable rate, hence showing a similar profile to the cerebellum (as well as larynx motor cortex and SMA).
So the frontal operculum/anterior insula is almost certainly a vocal-motor area, but its exact role is in need of further analysis.
The model in Fig. 3 shows striking similarities with the “basic speech production network” proposed by Bohland and Guenther (2006), which includes all of the areas mentioned here. In fact, there is no region of disagreement between our model and theirs. Perhaps the only motivational difference relates to our goal of defining a network of vocal production based on phonation, leading to our distinction between primary and secondary areas for vocalization. Their model was based on a series of syllable tasks, ranging from simple to complex trisyllables. Hence, articulation was an important component of all of their tasks. It is possible that a task based on vowels alone would yield different results. For example, the vowel production task of Perry et al. (1999) failed to show activity in some of the areas that we have speculated to be associated with articulation (e.g., putamen) but did show activity in others (Rolandic operculum, frontal operculum), whereas the vowel production task of Sörös et al. (2006) failed to show activity in the Rolandic operculum but did show activity in the putamen and frontal operculum. Further work is clearly needed to verify the phonation network postulated in our primary areas.
Using two complementary comparisons between speech and non-speech oral tasks (fMRI and meta-analysis), we have attempted to disentangle phonation and articulation in speech, and have shown that motor-control models like the “source-filter” model can be represented somatotopically in the motor cortex. A principal site of activation for speech is the larynx representation in the motor cortex, in keeping with the overwhelmingly voiced nature of speech. Additional activity in the Rolandic operculum for tongue movement, and in other parts of the motor cortex, contributes to an overall sense of additivity of phonation and articulation during speech production.
This work was supported by a grant to SB from the Grammy Foundation. ARL and SMT were supported by the Human Brain Project of the NIMH (R01-MH074457-01A1), and PQP by NSF grant 0642592. We thank Trudy Harris, Jennifer McCord, and Burkhard Mädler at the MRI Research Centre of the University of British Columbia for expert technical assistance. We thank Roger Ingham (University of California at Santa Barbara) for critical reading of a previous version of the manuscript.