Sensory cortices can be activated without any external stimuli. Yet, it is still unclear how this perceptual reactivation occurs and which neural structures mediate this reconstruction process. In this study, we employed fMRI with mental imagery paradigms to investigate the neural networks involved in perceptual reactivation. Subjects performed two speech imagery tasks: articulation imagery (AI) and hearing imagery (HI). We found that AI induced greater activity in frontal-parietal sensorimotor systems, including sensorimotor cortex, subcentral (BA 43), middle frontal cortex (BA 46) and parietal operculum (PO), whereas HI showed stronger activation in regions that have been implicated in memory retrieval: middle frontal (BA 8), inferior parietal cortex and intraparietal sulcus. Moreover, posterior superior temporal sulcus (pSTS) and anterior superior temporal gyrus (aSTG) were activated more in AI compared with HI, suggesting that covert motor processes induced stronger perceptual reactivation in the auditory cortices. These results suggest that motor-to-perceptual transformation and memory retrieval act as two complementary mechanisms to internally reconstruct corresponding perceptual outcomes. These two mechanisms can serve as a neurocomputational foundation for predicting perceptual changes, either via a previously learned relationship between actions and their perceptual consequences or via stored perceptual experiences of stimulus and episodic or contextual regularity.
Sensory cortices can be activated without any external stimulation (e.g., Ji & Wilson, 2006; Wheeler, Petersen, & Buckner, 2000). That is, perceptual neural representations can be reconstructed without perceptual processing (referred to as perceptual reactivation). Mental imagery, defined as an internally generated quasi-perceptual experience, is one such example (e.g., Kosslyn et al., 1999; Kraemer, Macrae, Green, & Kelley, 2005). The ability to form mental images has been hypothesized as a vehicle for generating and representing thoughts. This argument can be found as early as Plato’s Theaetetus [427–347 BC] (1987) and Aristotle’s De Anima [384–322 BC] (1986). In the Age of Enlightenment, mental imagery was considered analogous to perception by philosophers such as Descartes (1642/1984), Hobbes (1651/1968), Berkeley (1734/1965a, 1734/1965b) and Hume (1969). Early experimental psychologists such as Wundt (1913) and James (1890) proposed that ideas were represented as mental images in both visual and auditory domains. Modern research in mental imagery has yielded insight into how thought is represented in cognitive systems (Kosslyn, 1994; Kosslyn, Ganis, & Thompson, 2001; Paivio, 1971, 1986; Pylyshyn, 1981, 2003).
Recently, an additional computational role of mental imagery has been proposed: a mechanism to plan possible future contingencies. That is, mental imagery has been modeled as a process in which perceptual consequences can be predicted to gain advantages in various aspects of perception, memory, decision making and motor control (Albright, 2012; Moulton & Kosslyn, 2009; Schacter et al., 2012; Tian & Poeppel, 2012). The reactivation of perceptual neural representations without any external stimulation is the key mechanism mediating this predictive ability (Moulton & Kosslyn, 2009). Internally induced neural representations, which are highly similar to the ones established in corresponding perceptual processing, have been observed in modality-specific areas, such as in visual (e.g., Kosslyn et al., 1999; O’Craven & Kanwisher, 2000), auditory (e.g., Kraemer et al., 2005; Shergill et al., 2001; Zatorre, Halpern, Perry, Meyer, & Evans, 1996), somatosensory (e.g., Yoo, Freeman, McCarthy III, & Jolesz, 2003; Zhang, Weisser, Stilla, Prather, & Sathian, 2004) and olfactory (e.g., Bensafi et al., 2003; Djordjevic, Zatorre, Petrides, Boyle, & Jones-Gotman, 2005) domains.
It is not clear how these neural representations are reconstructed. Preliminary evidence from an MEG study (Tian & Poeppel, 2013) suggests that imagining speaking (articulation imagery, AI) and imagining hearing (hearing imagery, HI) differentially modulated neural responses to subsequent auditory stimuli. These distinct modulation effects by different types of imagery suggest that similar auditory neural representations may be internally formed via different neural pathways. A dual stream prediction model (DSPM, Fig. 1) was proposed in which two distinct processes in parallel neural pathways can internally induce the corresponding perceptual neural representation (Tian & Poeppel, 2012, 2013).
In the simulation-estimation prediction stream (Fig. 1), the perceptual consequences of actions are predicted by simulating the movement trajectory, followed by estimating the perceptual changes that would be associated with this movement. AI has been hypothesized to implement the motor-to-sensory transformation for the simulation-estimation mechanism (Tian & Poeppel, 2013). Specifically, during AI, a motor simulation process similar to speech motor preparation is carried out, but without execution and output (Palmer et al., 2001; Tian & Poeppel, 2010, 2012). Therefore, neural networks that mediate motor simulation should be similar to the ones implicated in motor preparation, including supplementary motor area (SMA), inferior frontal gyrus (IFG), premotor cortex and insula (Bohland & Guenther, 2006; Palmer et al., 2001; Shuster & Lemieux, 2005). After motor simulation, a copy of the planned motor commands – known as the efference copy (Von Holst & Mittelstaedt, 1950/1973; for a review see Wolpert & Ghahramani, 2000) – is sent to the somatosensory areas and is used in a forward model to estimate the somatosensory consequences (Blakemore & Decety, 2001). This somatosensory estimation is hypothesized to be governed by the networks underlying somatosensory perception (Blakemore, Wolpert, & Frith, 1998; Tian & Poeppel, 2010, 2012), including primary and secondary somatosensory regions, parietal operculum (PO) and the supramarginal gyrus (SMG). Moreover, in the context of speech, we hypothesize that auditory consequences are predicted on the basis of somatosensory estimation, and this auditory estimation will recruit neural structures in temporal auditory cortices (Tian & Poeppel, 2010, 2012, 2013, 2015).
In the memory-retrieval prediction stream (Fig. 1), the internally induced neural representations are the result of memory retrieval processes – reconstructing stored perceptual information in modality-specific cortices (Kosslyn, 1994, 2005; Wheeler et al., 2000). In particular, the retrieved object properties from long-term memory reactivate the sensory cortices that originally processed the object features (Kosslyn, 1994). In this experiment, we employed HI to probe this memory-retrieval stream. Auditory representations can be retrieved from various memory sources such as episodic memory, which presumably relies on hippocampal structures (Carr, Jadhav, & Frank, 2011; Eichenbaum, Sauvage, Fortin, Komorowski, & Lipton, 2012) with a possible buffer site in parietal cortex (Vilberg & Rugg, 2008; Wagner, Shannon, Kahn, & Buckner, 2005). Auditory representations can also be transformed from lexical and semantic information stored in semantic networks, including frontal (e.g., dorsomedial prefrontal cortex, IFG, ventromedial prefrontal cortex), parietal (e.g., posterior inferior parietal lobe) and temporal (e.g., middle temporal gyrus) regions (Binder, Desai, Graves, & Conant, 2009; Lau, Phillips, & Poeppel, 2008; Price, 2012). Regardless of the divergent functional roles (episodic or semantic networks), frontal and parietal regions are reliably activated during memory retrieval processes. Therefore, neural activation in a frontal-parietal distributed network – the proposed memory-retrieval prediction stream – should be observed during HI.
This study uses fMRI to investigate three neuroanatomical/functional hypotheses that are generated from the DSPM. First, if pure simulation-estimation and memory-retrieval tasks could be carried out, the two distinct processing streams would be revealed separately. However, because speech imagery could involve both production and perception, we predict that both types of imagery will activate the simulation-estimation stream for simulating speech motor action (Tian & Poeppel, 2013). More importantly, we hypothesize that each type of imagery will recruit each prediction stream to a different extent. Specifically, we predict that AI will induce stronger activation in the simulation-estimation prediction stream, including SMA, IFG, premotor cortex and insula for motor simulation, as well as primary and/or secondary somatosensory areas, PO and SMG, for subsequent estimation of somatosensory consequences. On the other hand, we predict that HI will have more activation in the memory-retrieval prediction stream, which comprises frontal, superior and inferior parietal cortices that are associated with memory retrieval (Binder et al., 2009; Lau et al., 2008; Price, 2012; Vilberg & Rugg, 2008; Wagner et al., 2005).
Second, we suggest that a more precise, detailed auditory prediction can be induced through simulation-estimation mechanisms, compared with that obtained via the memory-retrieval route (Hickok, 2012; Oppenheim & Dell, 2010; Tian & Poeppel, 2012, 2013). We propose that there is a one-to-one mapping between motor simulation and perceptual estimation via a bridge of somatosensory estimation in the simulation-estimation stream. Such a deterministic prediction mechanism, contrasted with the memory-retrieval prediction stream’s probabilistic prediction mechanism (narrowing down the target features in distributions of stored memory), presumably suffers less interference and lateral inhibition from similar features and yields a stronger and more robust representation (Tian & Poeppel, 2012, 2013). Based on this hypothesis of enriched auditory representations via simulation and estimation processes, we predict that auditory cortices will be more strongly activated in AI than in HI.
Finally, we hypothesize that the neural networks governing simulation within the simulation-estimation stream overlap with cortical regions underlying motor preparation during speech production (Tian & Poeppel, 2012). That is, the initial motor processes are the same during articulation (A) and AI up to the point at which the processes diverge, namely when the motor signals are not executed during imagery. Therefore, we predict that enhanced activity in SMA, IFG, premotor areas and insula, which has been observed during preparation of overt speech production (Brendel et al., 2010; Riecker et al., 2005), will be observed in both AI and A. The observation of overlapping neural networks will provide evidence for potentially shared neural mechanisms between overt and covert speech production, and will furthermore suggest that mental imagery of speech is a valid paradigm for investigating these shared motor processes.
Eighteen volunteers gave informed consent and participated in the experiment (10 males, mean age 28.2 years, range 20–44 years). All participants were right-handed, with no history of neurological disorders. The experimental protocol was approved by the New York University Institutional Review Board (IRB).
Two 600-msec duration consonant-vowel syllables (/ba/, /ki/) were used as auditory stimuli (female voice; sampling rate of 48 kHz). All sounds were normalized to 70 dB SPL and delivered through MR-compatible headphones (MR confon Silenta, MR confon GmbH, Magdeburg, Germany). Four images were used as visual cues to indicate different trial types. Each image was presented foveally, against a black background, and subtended less than 10° visual angle. A label, either ‘/ba/’ or ‘/ki/’, was superimposed on the center of each picture (<4° visual angle) to indicate the syllable that participants would produce in the following tasks.
We employed an experimental design similar to that of Tian and Poeppel (2013) (see Fig. S1). The experiment comprised four conditions: articulation (A), hearing (H), articulation imagery (AI), and hearing imagery (HI). In A, participants were asked to overtly generate the cued syllable (gently, to minimize head movement). In AI, participants were required to imagine saying a syllable without any overt movement of the articulators. In H, participants passively listened to one of the syllables. In HI, participants were asked to imagine hearing the cued syllable.
The timing of trials was consistent across conditions (Fig. S1). First, a visual cue appeared in the center of the screen at the beginning of each trial and stayed on for 1000 msec. During the following 2400 msec, participants actively formed a syllable in three of the task conditions (A, AI, and HI) or passively perceived an auditory syllable in H, in which a syllable was presented 1200 msec after the offset of the visual cue, followed by a 600 msec interval. Note that the 2400 msec period was the total time that participants were allowed to finish the tasks (indicated by the curly bracket, Fig. S1). The actual time needed to perform the task was much shorter, presumably comparable to the syllable duration. Finally, participants were presented with one of the syllables, which always followed the task phase. The inter-trial interval was randomly chosen from 4440 to 6660 msec (2–3 TRs, see MRI scanning for details), temporally jittered by 46.25-msec increments (length of TR divided by 48, the number of task trials in a run). Twelve trials for each of the four tasks were presented in each run. Six resting trials (length: 9550 msec), which were visually cued with the word ‘rest’, were also included in each run. In total, the experiment included five runs with 54 trials each, encompassing all four tasks and the rest condition, which were pseudo-randomly presented in each run.
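The arithmetic behind the jitter step and the trial counts can be checked with a short Python sketch. It only restates the numbers given in the text; the assumption that every point on the 46.25-msec grid between 2 and 3 TRs was a candidate inter-trial interval is ours, and the variable names are illustrative.

```python
from fractions import Fraction

TR_MS = 2220               # repetition time reported in the scanning parameters
TASK_TRIALS_PER_RUN = 48   # 12 trials x 4 conditions (A, AI, H, HI)
REST_TRIALS_PER_RUN = 6
N_RUNS = 5

# Jitter step: one TR divided by the number of task trials in a run.
jitter_step_ms = Fraction(TR_MS, TASK_TRIALS_PER_RUN)

# Candidate inter-trial intervals from 2 TRs to 3 TRs in jitter-step increments.
# Assumption: every grid point was eligible; the text does not say how the
# intervals were sampled from this range.
iti_min_ms = 2 * TR_MS
iti_max_ms = 3 * TR_MS
n_steps = int((iti_max_ms - iti_min_ms) / jitter_step_ms)
iti_grid_ms = [float(iti_min_ms + k * jitter_step_ms) for k in range(n_steps + 1)]

trials_per_run = TASK_TRIALS_PER_RUN + REST_TRIALS_PER_RUN
total_trials = trials_per_run * N_RUNS

print(float(jitter_step_ms))             # 46.25
print(iti_grid_ms[0], iti_grid_ms[-1])   # 4440.0 6660.0
print(trials_per_run, total_trials)      # 54 270
```

Using `Fraction` keeps the grid exact, so the last candidate lands on 3 TRs without floating-point drift.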
The goal of our earlier MEG study (Tian & Poeppel, 2013), on which this current study builds, was to assess cross-modal repetition adaptation. In contrast, the goal of this study is to assess the neural networks mediating internal perceptual reactivation by testing the main effects among different tasks, which are independent of the adaptation effects. Because the different syllables were used equally often, we only compared between overt and covert tasks, rather than between different syllables.
Each participant received 15–20 min of training before the experiment, focusing on the timing and vividness of imagery. First, only the H trials were presented to introduce the relative timing among the visual cue, the first auditory stimulus (the same period for active tasks in other conditions), and the subsequent auditory stimuli. After participants were familiar with the timing, they were instructed to use similar timing for the other conditions. This was to prevent any overlap between the internally generated neural responses during tasks and the subsequent responses to the external auditory stimuli. Next, they practiced on A trials while the experimenter observed the overt articulation and provided feedback if needed; participants executed the task with similar timing and without overlaps between their voice and subsequent auditory stimuli before they moved on to the imagery conditions. For the AI condition, they were told to imagine speaking the syllables “in their mind” without moving any articulators or producing any sounds. They were instructed to feel the movement of the specific articulators that would be associated with actual pronunciation. For the HI condition, they were asked to recreate the female voice from the H condition in their minds, but to minimize any feeling of movement in their articulators. If needed, the recorded female voice was presented again to form a better memory. We tried to selectively elicit the motor-induced auditory representation in imagined speaking, while we aimed to target auditory memory retrieval in imagined hearing. Participants were asked to generate a movement intention and kinesthetic feeling of articulation in the AI condition; in the HI condition, such motor-related imagery activity was strongly discouraged.
After verbally confirming that they could successfully distinguish between these types of imagery formation, participants further practiced the AI and HI tasks to reinforce the vividness of imagery and the timing requirement in the trials. Lastly, they trained on a practice block in which all four conditions were presented. Timing of the A condition was monitored by the experimenter, and verbal confirmation of the distinction between imagery tasks was obtained from each participant before proceeding to the main experiment.
Scanning was performed with a 3T Siemens Allegra MRI system using a single-channel, whole-head coil. Functional data were acquired using a gradient-echo, echo-planar pulse sequence (TR = 2220 msec; TE = 30 msec; 38 slices oriented about 30° rotated counter-clockwise from the AC-PC line, which was adjusted individually to maximize coverage; 3 × 3 × 3 mm³ voxel size, .6 mm interslice gap; 244 volume acquisitions in each of five runs). High resolution T1-weighted (MP-RAGE) images were collected from each participant for anatomical visualization.
Data were analyzed using SPM8 (http://www.fil.ion.ucl.ac.uk/spm/). The first two volumes of a run (dummy images) were discarded from all analyses. All functional volumes were spatially realigned after motion correction. Structural images were coregistered to the functional images and spatially normalized to a T1-ICBM152 template provided in SPM8. The resulting normalization parameters were applied to the functional images, followed by spatial smoothing with an isotropic Gaussian kernel of 8 mm full width at half maximum (FWHM).
Voxel-wise statistical parametric maps of brain activation were generated by estimating the parameters of a general linear model (GLM). For each of the four conditions (A, AI, H, HI) in each participant, neural activity was modeled as boxcar events spanning the entire 4 sec trial period (from onset of visual cue to the offset of auditory stimuli), convolved with a canonical hemodynamic response function, and entered as regressors into a fixed effect GLM. The time series were high-pass filtered with a cut-off at 128 sec.
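How one such regressor is formed can be illustrated with a minimal Python sketch: a 4-sec boxcar at each trial onset is convolved with a double-gamma haemodynamic response and sampled once per TR. This is not the SPM8 implementation. The double-gamma shape parameters (6 and 16, undershoot ratio 1/6), the 0.1-sec fine grid, the trial onsets, and all function names are assumptions chosen for illustration.

```python
import math

TR_S = 2.22      # repetition time from the scanning parameters
N_SCANS = 242    # 244 volumes per run minus the 2 discarded dummy volumes
TRIAL_S = 4.0    # boxcar length: visual cue onset to offset of the auditory stimulus

def gamma_pdf(t, shape, scale=1.0):
    """Gamma probability density, zero for non-positive t."""
    if t <= 0:
        return 0.0
    log_pdf = (shape - 1) * math.log(t / scale) - t / scale - math.lgamma(shape)
    return math.exp(log_pdf) / scale

def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=1 / 6):
    """Assumed double-gamma response: early peak minus a later, smaller undershoot."""
    return gamma_pdf(t, peak_shape) - ratio * gamma_pdf(t, undershoot_shape)

def boxcar_regressor(onsets_s, duration_s, n_scans, tr_s, dt=0.1):
    """Convolve a boxcar with the HRF on a fine grid, then sample at scan times."""
    n_fine = int(round(n_scans * tr_s / dt))
    box = [0.0] * n_fine
    for onset in onsets_s:
        start = int(round(onset / dt))
        stop = int(round((onset + duration_s) / dt))
        for i in range(start, min(stop, n_fine)):
            box[i] = 1.0
    # 32 s of HRF, weighted by dt so the sum approximates the convolution integral.
    hrf = [double_gamma_hrf(k * dt) * dt for k in range(int(round(32.0 / dt)))]
    conv = [0.0] * n_fine
    for i, b in enumerate(box):
        if b:
            for k, h in enumerate(hrf):
                if i + k < n_fine:
                    conv[i + k] += b * h
    return [conv[min(int(round(s * tr_s / dt)), n_fine - 1)] for s in range(n_scans)]

onsets = [10.0, 60.0]   # hypothetical trial onsets in seconds
regressor = boxcar_regressor(onsets, TRIAL_S, N_SCANS, TR_S)
```

Each condition would contribute one such column to the design matrix, and the fitted weight on that column is the parameter estimate used in the contrasts.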
For each comparison of interest in each participant, a contrast of parameter estimates (β weights) was calculated in a voxel-wise manner to produce a contrast image. Two groups of contrasts were defined. The first group of contrasts was the main effects of AI, HI and A: (1) A > H; (2) AI > H; (3) HI > H. Because this study was designed to assess the neural networks that mediate AI, HI, and A, H was used as a baseline to account for neural responses to the auditory stimuli that cannot be temporally separated from the responses of interest (the tasks) in the experimental design. The second group of contrasts contained direct comparisons between imagery tasks: (4) AI > HI; (5) HI > AI. These direct contrasts revealed the possible differential involvement of neural pathways in different types of speech imagery tasks.
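The five contrasts can be written as weight vectors over the four condition regressors. The Python sketch below is illustrative only: the regressor order (A, AI, H, HI), the helper names, and the example β values are assumptions, not values from the study.

```python
# Assumed order of the condition regressors in the design matrix.
CONDITIONS = ("A", "AI", "H", "HI")

def contrast(plus, minus):
    """Weight vector with +1 on `plus`, -1 on `minus`, and 0 elsewhere."""
    return [1 if c == plus else -1 if c == minus else 0 for c in CONDITIONS]

CONTRASTS = {
    "A > H": contrast("A", "H"),      # (1) main effect of articulation
    "AI > H": contrast("AI", "H"),    # (2) main effect of articulation imagery
    "HI > H": contrast("HI", "H"),    # (3) main effect of hearing imagery
    "AI > HI": contrast("AI", "HI"),  # (4) direct comparison
    "HI > AI": contrast("HI", "AI"),  # (5) reverse comparison
}

def apply_contrast(weights, betas):
    """Contrast value at one voxel: weighted sum of the parameter estimates."""
    return sum(w * b for w, b in zip(weights, betas))

# Hypothetical parameter estimates at a single voxel, in CONDITIONS order.
example_betas = [2.0, 1.5, 0.5, 1.0]
print(apply_contrast(CONTRASTS["AI > H"], example_betas))   # 1.0
print(apply_contrast(CONTRASTS["AI > HI"], example_betas))  # 0.5
```

Contrasts (4) and (5) are sign-flipped versions of each other, so a voxel can survive a one-sided threshold in at most one of them.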
The parameter estimates from these first-level analyses were then entered into a random (between-subject) effect analysis, and linear contrasts were used to identify responsive regions. Thresholded t-maps were obtained for all contrasts, with a cluster threshold of 25 contiguous voxels whose test statistic exceeded an uncorrected p value of .001 (Lieberman & Cunningham, 2009). Because the effect size of mental imagery in each voxel is weak compared to overt hearing and production, we chose this approach to balance the type I and II errors. The AlphaSim Monte Carlo simulation in the original method paper (Lieberman & Cunningham, 2009) shows that using the statistical threshold of p < .005 and cluster size of 10 voxels can achieve a desirable balance between type I and II errors, while using a 20 voxel extent threshold produces an actual false discovery rate (FDR) of .05. For this study, an AlphaSim Monte Carlo simulation with our particular scanning and analysis parameters – a smoothing kernel of 8 mm and voxel resolution of 3 mm – combined with a more conservative criterion with the magnitude statistical threshold of .001 and cluster threshold of 25 voxels yielded an FDR of .022. To examine regions that showed significant common neural responses to AI and HI as well as to A and AI, conjunction analyses were performed with the contrasts of interest [AI > H]∩[HI > H] and [A > H]∩[AI > H] (Friston, Penny, & Glaser, 2005; Nichols, Brett, Andersson, Wager, & Poline, 2005).
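The thresholding and conjunction logic can be sketched in Python on a toy example. The sketch takes the voxel-wise minimum of two t-maps (the minimum-statistic conjunction described by Nichols et al., 2005) and then keeps only clusters of at least 25 contiguous suprathreshold voxels. The sparse dictionary representation, the face-connected (6-neighbour) definition of contiguity, the toy t-values, and the function names are all assumptions; SPM8 may define clusters differently.

```python
from collections import deque

T_THRESHOLD = 3.65   # t value reported for p < .001 uncorrected
MIN_EXTENT = 25      # minimum number of contiguous voxels

# Assumption: voxels are contiguous when they share a face.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def conjunction(tmap_a, tmap_b):
    """Minimum statistic at every voxel present in both maps."""
    return {v: min(tmap_a[v], tmap_b[v]) for v in tmap_a.keys() & tmap_b.keys()}

def clusters(voxels):
    """Group voxel coordinates into connected components by breadth-first search."""
    remaining = set(voxels)
    found = []
    while remaining:
        seed = remaining.pop()
        cluster = {seed}
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                n = (x + dx, y + dy, z + dz)
                if n in remaining:
                    remaining.remove(n)
                    cluster.add(n)
                    queue.append(n)
        found.append(cluster)
    return found

def cluster_threshold(tmap, t_thr=T_THRESHOLD, min_extent=MIN_EXTENT):
    """Voxels above t_thr that belong to a cluster of at least min_extent voxels."""
    supra = {v for v, t in tmap.items() if t > t_thr}
    kept = set()
    for c in clusters(supra):
        if len(c) >= min_extent:
            kept |= c
    return kept

# Toy maps: a 3 x 3 x 3 block (27 voxels) plus one isolated strong voxel.
block = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
isolated = (10, 10, 10)
t_first = {**{v: 5.0 for v in block}, isolated: 6.0}
t_second = {**{v: 4.0 for v in block}, isolated: 6.0}

significant = cluster_threshold(conjunction(t_first, t_second))
print(len(significant))          # 27
print(isolated in significant)   # False
```

In the toy example the isolated voxel has the highest t value in both maps but is discarded by the extent criterion, which is the intended trade-off between voxel-level magnitude and spatial extent.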
For visualization purposes, thresholded maps were superimposed on an average, spatially normalized anatomical image obtained from the 18 participants. The locations of neural activity were first classified using the Automated Anatomical Labeling (AAL) map (Tzourio-Mazoyer et al., 2002), and then were further refined with: 1) neuroanatomical atlases (Duvernoy, 1991; Schmahmann et al., 1999); 2) probabilistic maps or profiles for primary auditory cortex (Penhune, Zatorre, MacDonald, & Evans, 1996), planum temporale (Westbury, Zatorre, & Evans, 1999), pars opercularis of IFG (Tomaiuolo et al., 1999), and mouth region of primary motor cortex (Fox et al., 2001); and 3) locations defined by previous reports or reviews on the medial frontal and cingulate areas (Picard & Strick, 1996, 2001) and subdivisions of the premotor cortex (Chen, Penhune, & Zatorre, 2008).
Speech production networks were observed during A, which included bilateral anterior cingulate cortex (ACC), pre-SMA/SMA complex, sensorimotor cortex, middle frontal cortex (BA 46) and right posterior cingulate cortex. Cerebellum and subcortical regions, including thalamus and basal ganglia, were also activated (see Table S1 for a complete activation list). Significant activations, here and in all following analyses, surpassed a threshold of t > 3.65 (p < .001 uncorrected) with an extent of at least 25 voxels, which is equivalent to an FDR smaller than .05.
The neural networks that mediated AI were comprised of frontal and motor-related regions, including bilateral pre-SMA, inferior frontal pars opercularis (BA 44), frontal operculum, anterior insula, mid premotor cortex (BA 6), middle frontal cortex (BA 46); this activity extended to left primary motor cortex (near the mouth region). Parietal activation was observed in the left parietal operculum. Moreover, activity in bilateral cerebellum VI (declive) and globus pallidus was also observed (see Table S2 for a complete list of activation). Activity in auditory cortices was not observed in the contrast of AI-H (Fig. S2).
Similar neural networks were observed during the HI task, including bilateral pre-SMA/SMA, inferior frontal pars opercularis (BA 44), frontal operculum, mid premotor cortex (BA 6) in the frontal lobe, and left parietal operculum in the parietal lobe. Bilateral cerebellum VI (declive) was also engaged (see Table S3 for a complete list of activation). Activity in auditory cortices was not observed in the contrast of HI-H (Fig. S2).
The conjunction analysis between AI and HI revealed the shared neural networks for these imagery tasks (Fig. 2a): bilateral pre-SMA/SMA, inferior frontal pars opercularis (BA 44), and left anterior insula, mid premotor cortex (BA 6) extending to primary motor cortex (near mouth region), and bilateral cerebellum VI (declive). Moreover, activity from both tasks overlapped in left parietal operculum, a somatosensory-related area (see Table 1 for a complete list of activation peaks).
In the direct contrast between HI and AI, stronger activity was observed in HI compared with AI in left middle frontal (BA 8) and in left inferior parietal cortex and intraparietal sulcus (Fig. 2b, also see Table 2 for a complete list). Activity in auditory cortices was not observed in the contrast of HI-AI (Fig. S2).
The direct comparison between AI and HI revealed stronger activation during AI over frontal and parietal areas, including bilateral sensorimotor cortex, left subcentral gyrus (Rolandic operculum, encompassing vocalization areas of primary motor and somatosensory cortex) and middle frontal cortex (BA 46), as well as left parietal operculum (Fig. 2c, also see Table 3). The same direct comparison between AI and HI also revealed stronger activation during AI over temporal cortices, including left anterior superior temporal gyrus (aSTG) and right posterior superior temporal sulcus (pSTS) (Fig. 2c, also see Table 3).
The conjunction analysis between A and AI revealed the shared networks between overt and covert speech production tasks (Fig. 3). These overlapping areas included bilateral ACC, inferior frontal pars opercularis (BA 44), pre-SMA/SMA, left mid-dorsal insula, right frontal operculum and left parietal operculum, as well as bilateral cerebellum VI (declive), globus pallidus and left putamen (Table 4).
We investigated the neural networks that mediate perceptual reactivation using fMRI with speech imagery paradigms. Whereas the neural networks that mediate AI and HI largely overlapped in frontal-parietal motor-sensory areas, different subsets of frontal and parietal regions were involved in each task. This differential involvement of neural networks suggests two possible mechanisms for reactivating perceptual neural representation.
Frontal-parietal neural networks were observed during both AI and HI. The overlapping frontal areas included bilateral pre-SMA/SMA, inferior frontal pars opercularis (BA 44), left anterior insula and mid premotor cortex (BA 6) (see Fig. 2a). Interestingly, most of the networks that overlapped between AI and HI (BA 44, pre-SMA/SMA, insula) were also found in the conjunction analysis between AI and actual articulation (A) (see Fig. 3). These frontal/insular regions (SMA, IFG, premotor and insula) have been implicated in motor preparation during overt speech production (Bohland & Guenther, 2006; Palmer et al., 2001; Shuster & Lemieux, 2005). Therefore, perceptual reactivation processes during AI and HI may recruit these regions to simulate motor preparation without motor execution (Palmer et al., 2001; Tian & Poeppel, 2010, 2012); motor simulation may then induce activity in sensory cortices. In fact, aside from the observed shared frontal activity between AI and HI, we also observed overlapping activation in PO (see Fig. 2a), an area that relates to somatosensory perception (e.g., Blakemore et al., 1998). These results are consistent with the internal forward model theory, which hypothesizes that a copy of the planned motor commands – known as the efference copy (Von Holst & Mittelstaedt, 1950/1973; for a review see Wolpert & Ghahramani, 2000) – is sent to somatosensory areas and is used to estimate the somatosensory consequences of an action (Blakemore & Decety, 2001). Therefore, the observed activation in frontal-parietal sensorimotor regions during both AI and HI suggests that the motor-sensory interaction via the simulation-estimation process is a potential top-down mechanism to reactivate sensory cortices without external stimuli or output.
Multiple functions such as auditory working memory have been associated with the SMG (e.g., Paulesu, Frith, & Frackowiak, 1993). In this study, activation of the PO, an area close to SMG, was observed in both the articulation and AI conditions. In the articulation condition, participants were required to say only one syllable after the visual cue, so the working memory demand was minimal. As such, the observed parietal opercular activity may not have been elicited by working memory, but rather by perception of somatosensory feedback. Together with the conjunction results, the observed parietal opercular activation in AI may be involved in the estimation of somatosensory consequences in a process similar to that seen during somatosensory perception.
The direct contrast between AI and HI reveals that frontal-parietal sensorimotor regions were more strongly activated during AI (see Fig. 2c). The observed greater activation in frontal and motor areas during AI, including bilateral sensorimotor cortex, left subcentral gyrus (Rolandic operculum) and middle frontal cortex (BA 46), is similar to activation patterns indicative of articulation preparation (Brendel et al., 2010; Price, 2012; Riecker et al., 2005), which suggests that AI relies more on internally simulating articulatory preparation. Additionally, greater parietal opercular activity during AI may represent stronger somatosensory estimation during motor simulation.
On the other hand, the reverse comparison between HI and AI revealed that activity increased in left middle frontal, inferior parietal cortex and intraparietal sulcus during HI, which may form a subset of proposed distributed memory systems. For example, parietal cortex has been hypothesized as a buffer site for episodic memory (Vilberg & Rugg, 2008; Wagner et al., 2005). Lexical and semantic information may be stored in semantic networks, including frontal (e.g., dorsomedial prefrontal cortex, IFG, ventromedial prefrontal cortex) and parietal (e.g., posterior inferior parietal lobe) regions (Binder et al., 2009; Price, 2012). The greater activity seen in middle frontal, inferior parietal cortex and intraparietal sulcus during HI may reflect memory retrieval during perceptual reactivation. That is, HI may also rely on two complementary processes: a memory retrieval operation and motor simulation. This combined contribution from motor and memory systems in reactivated perceptual representation may be due to the nature of the HI task: both speech perception and production are related to this particular process of perceptual reactivation, and hence need memory retrieval of information related to speech perception and motor processes to simulate speech production. Therefore, this dissociation of neural pathways between AI and HI tasks implies that (1) two functional pathways exist for perceptual reactivation: one underlies motor-to-perceptual transformation and another mediates memory retrieval; and (2) these two pathways are differentially recruited during perceptual reactivation for different imagery tasks.
Stronger activity in bilateral temporal auditory regions and frontal-parietal sensorimotor systems was observed during AI, compared to activity recruited for both sensorimotor activation and memory retrieval during HI. This supports the hypothesis that detailed auditory representations can be reactivated by the one-to-one ‘deterministic’ mapping between motor and perceptual systems (Tian & Poeppel, 2012, 2013). This mapping structure provides motor-to-sensory transformation dynamics that enrich the details of the representation, leading perhaps from phonemic to phonetic levels of detail (Hickok, 2012), which may not be available during memory retrieval. This result is also consistent with the behavioral observation that motor engagement enriches phonetic details, which can then influence speech-error rates at lexical-phonological and phonemic-articulatory levels during a covert tongue twister task (Oppenheim & Dell, 2010).
STS recruitment is commonly observed in speech and song production studies that manipulate auditory feedback (e.g., Niziolek & Guenther, 2013; Tourville, Reilly, & Guenther, 2008; Zarate, Wood, & Zatorre, 2010; Zarate & Zatorre, 2008; Zheng, Munhall, & Johnsrude, 2010). The observation of increased STS activity during AI in our study suggests similar computations between AI and self-monitoring during speech production. Whereas the auditory feedback manipulation during speech production actually creates the discrepancy between expectation and auditory input, the lack of auditory feedback during AI can also be considered as a similar violation of a sensory expectation generated after motor preparation and simulation. To support this, the location of our STS activation in the AI task [54, −26, 2] resembles the locations of STS activity reported in feedback perturbation studies: [52.8, −32.1, 4.4] in (Niziolek & Guenther, 2013), [58, −28, 6] in (Tourville et al., 2008), and [54, −18, −6] in (Zheng et al., 2010). Therefore, we suggest that similar mechanisms for generating auditory predictions and subsequent comparisons with incoming auditory feedback may be carried out in STS during both speech imagery and speech monitoring.
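The spatial correspondence between these peaks can be quantified as a Euclidean distance. The Python sketch below uses only the coordinates quoted above and assumes, without confirmation from the text, that all four peaks are reported in the same stereotaxic space and in millimetres; the variable names are illustrative.

```python
import math

# Peak of STS activity in the AI task, as quoted in the text.
AI_PEAK = (54, -26, 2)

# STS peaks quoted from the feedback perturbation studies.
REPORTED_PEAKS = {
    "Niziolek & Guenther (2013)": (52.8, -32.1, 4.4),
    "Tourville et al. (2008)": (58, -28, 6),
    "Zheng et al. (2010)": (54, -18, -6),
}

# Straight-line distance from the AI peak to each reported peak.
# Assumption: one common coordinate space with units of millimetres.
distances_mm = {
    study: math.dist(AI_PEAK, peak) for study, peak in REPORTED_PEAKS.items()
}

for study, d in distances_mm.items():
    print(f"{study}: {d:.1f} mm")
# Niziolek & Guenther (2013): 6.7 mm
# Tourville et al. (2008): 6.0 mm
# Zheng et al. (2010): 11.3 mm
```

Under that assumption, all three reported peaks lie within about 11 mm of the AI peak, which is on the order of the 8-mm smoothing kernel applied in this study.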
Price (2012) implicates the aSTG in the early auditory processing of complex sounds. This anterior region of temporal gyrus has been found to be sensitive to rapid frequency transition (Belin et al., 1998) and spectral variation (Zatorre & Belin, 2001). Rapid frequency modulations are a key feature in words that might need to be internally reconstructed and parsed to distinguish between particular phonemes or syllables in speech. Thus, the observation of increased aSTG activity during AI might suggest that auditory representations of spectral transitions, similar to those seen during speech perception, can be internally induced without any external stimulation.
Our observed increase in activity within associative auditory cortices aSTG and pSTS (but not within primary auditory cortex) during speech imagery is consistent with findings in earlier auditory imagery studies (e.g., Bunzeck, Wuestenberg, Lutz, Heinze, & Jancke, 2005; Halpern & Zatorre, 1999; Herholz, Halpern, & Zatorre, 2012; Shergill et al., 2001; Zatorre et al., 1996). It should be noted, however, that some auditory imagery work has reported primary auditory cortex activation (e.g., Kraemer et al., 2005; see Zatorre & Halpern, 2005 for a review), but we speculate that imagining different levels of content complexity may require multiple levels of auditory processing, which could result in the recruitment of different stages along the auditory perceptual hierarchy. In the current study, we use spoken syllables as stimuli. Given the complex nature of these stimuli, the internal reconstruction of syllabic representation may occur beyond the computations and representations that are mediated by primary auditory cortex. Our MEG studies support this view – the response latencies were modulated by the content of stimuli, for example at 200 msec for syllables (Tian & Poeppel, 2013) and 100 msec for pitch (Tian & Poeppel, 2015), suggesting that simpler stimuli may only recruit lower (and therefore faster) level areas for auditory processing, whereas more complex stimuli recruit higher-order auditory regions within the auditory perceptual hierarchy and thus require additional processing time. Additional studies will need to be conducted to determine whether an auditory processing hierarchy can be accessed by increasingly complex imagined stimuli, as has been reported in the visual domain (Kosslyn & Thompson, 2003).
In summary, this study complements and extends beyond our earlier MEG study (Tian & Poeppel, 2013) by offering neuroanatomical evidence that supports the existence of two complementary neural pathways for perceptual reactivation. Two speech imagery tasks differentially recruit a motor-to-sensory transformation pathway and a memory-retrieval pathway. Moreover, stronger auditory responses in AI suggest that motor system involvement leads to stronger perceptual reactivation.
We thank Keith Sanzenbach for his technical assistance with fMRI recording, Tobias Overath and Thomas Schofield for their discussion and guidance with fMRI analyses, and Jess Rowland for her comments on this manuscript. This study was supported by MURI ARO #54228-LS-MUR, NIH 2R01DC 05660, a grant from the GRAMMY Foundation®, Major Projects Program of the Shanghai Municipal Science and Technology Commission (STCSM) 15JC1400104 and National Natural Science Foundation of China 31500914.