Twenty-two subjects consented to participate in the study. All subjects reported themselves to be in good general health, right-handed, with no history of neurological disease or other contraindications for MR imaging, and to have learned English as their first language. They were recruited from the University of California, Irvine (UCI) community and remunerated for their participation, in accordance with the human subjects procedures approved by the Institutional Review Board of UCI. The data from two subjects were excluded from all analyses because they had fewer than twelve trials in the ‘miss’ experimental condition (Ns of 7 and 8, respectively). Data are reported from the remaining 20 subjects (nine females), ranging in age from 19 to 29 years (mean = 20 years).
Three hundred and six stimulus triplets were used in the experiment. Each triplet consisted of a picture of a commonly encountered object and both a visually and an auditorily presented word naming the object represented in the picture. The colored pictures were drawn from Hemera Photo Objects 50,000 Volume III (http://www.hemera.com/index.html). The names of the pictures were between three and 10 letters long, with a mean written frequency between 1 and 100 counts per million (Kucera & Francis, 1967). Visually presented words were displayed in black uppercase 30 point Helvetica font on a gray background. Auditory words were recorded by a male voice in the laboratory, edited to a constant sound pressure level and filtered to remove ambient noise (http://audacity.sourcefourge.net). Auditory stimuli were presented binaurally via MR compatible headphones and did not exceed 1000 ms in duration (mean duration 650 ms). Presentation volume was adjusted in the scanner to a comfortable listening level for each volunteer prior to scanning.
Of the 306 item triplets, ten served as buffers (two at the beginning and two at the end of each study list, and two at the beginning of the test list), and 16 additional triplets were used in the practice phases preceding the study and test sessions (see below). For every subject, 160 pictures and their corresponding names were randomly selected from the remaining 280 stimulus triplets to serve as the critical stimuli in the study session. These stimuli were further randomly divided into two groups: 80 pictures associated with their corresponding visual name, and 80 pictures associated with their corresponding auditory name (congruent trials).
Twenty additional pictures were randomly selected for the study phase. These pictures were not paired with their names; instead, names from unused pictures in the pool were selected, such that for these 20 pictures the subsequently presented name did not correspond to the object (incongruent trials). For 10 of these 20 pictures the mismatching name was presented visually, and for the other 10 it was presented auditorily.
Two study lists were created from the 180 study pictures (160 pictures with corresponding names, 20 pictures with mismatching names) for each subject. Each list contained a pseudo-random ordering of 40 pictures associated with matching visual words, 40 pictures associated with matching auditory words, five pictures associated with mismatching visual words, and five pictures associated with mismatching auditory words. The study task required a decision as to whether or not the words corresponded to the pictured objects.
Test items consisted of the 160 pictures from the congruent study trials and 80 new pictures. The test requirement was to judge whether the item had been presented at study and, if so, to indicate whether the associated word had been presented visually or auditorily. Both study and test items were presented in a subject-unique pseudo-random order, such that there were no more than three consecutive presentations of items belonging to any one experimental condition.
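The allocation and ordering constraints described above can be illustrated with a minimal Python sketch. The randomization procedure actually used is not specified in the text; the rejection-sampling approach, the four condition labels, the function names, and the seed are all assumptions made for illustration, and buffer items are omitted.

```python
import random

POOL_SIZE = 280  # triplets remaining after buffers and practice items
CONDITIONS = {   # per-subject totals across both study lists
    "congruent_visual": 80,
    "congruent_auditory": 80,
    "incongruent_visual": 10,
    "incongruent_auditory": 10,
}

def longest_run(labels):
    """Length of the longest run of identical consecutive labels."""
    longest = current = 0
    previous = None
    for label in labels:
        current = current + 1 if label == previous else 1
        longest = max(longest, current)
        previous = label
    return longest

def constrained_shuffle(trials, rng, max_run=3, max_tries=100_000):
    """Reshuffle until no condition occurs more than max_run times in a row.

    Assumption: simple rejection sampling; the source does not say how
    the pseudo-random orders were generated.
    """
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if longest_run([condition for condition, _ in order]) <= max_run:
            return order
    raise RuntimeError("no valid ordering found; raise max_tries or max_run")

def make_study_lists(seed):
    """Draw 180 pictures from the pool and split them into two study lists."""
    rng = random.Random(seed)
    picture_ids = rng.sample(range(POOL_SIZE), sum(CONDITIONS.values()))
    lists = ([], [])
    start = 0
    for condition, n in CONDITIONS.items():
        chosen = picture_ids[start:start + n]
        start += n
        half = n // 2  # each list receives half of every condition
        lists[0].extend((condition, pid) for pid in chosen[:half])
        lists[1].extend((condition, pid) for pid in chosen[half:])
    return [constrained_shuffle(trials, rng) for trials in lists]
```

Calling `make_study_lists(seed=1)` returns two lists of 90 `(condition, picture_id)` pairs, each containing 40 + 40 + 5 + 5 items with no condition repeated more than three times in succession.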
Study items were back-projected onto a screen and viewed via a mirror mounted on the scanner headcoil. Pictures were presented in central vision within a continuously displayed solid gray frame, and subtended maximum visual angles of 5.7° × 5.7°. Words were centered 1° below the pictures and subtended a maximum horizontal visual angle of 5.7° and a maximum vertical visual angle of 1.15°. Pictures and words displayed together subtended a maximum visual angle of 5.7° × 8° (width by height).
Test items were presented outside of the scanner on a computer screen. The pictures and associated cues were presented in central vision within a solid gray frame (subtending 6.8° × 6.8° visual angle at the 1 m viewing distance) that was continuously presented.
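The relation between visual angle and physical stimulus size can be sketched as follows. This is a minimal illustration that assumes a flat display viewed head-on and centered on the line of sight; the function name is hypothetical and the physical dimensions of the stimuli are not reported in the text.

```python
import math

def extent_from_visual_angle(angle_deg, distance_m):
    """Physical extent (m) that subtends angle_deg at distance_m.

    Uses size = 2 * d * tan(theta / 2), which assumes the stimulus is
    centered on, and perpendicular to, the line of sight.
    """
    return 2.0 * distance_m * math.tan(math.radians(angle_deg) / 2.0)
```

Under these assumptions, `extent_from_visual_angle(6.8, 1.0)` gives approximately 0.12 m for each side of the test-phase frame.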
Experimental Tasks and Procedures
The experiment comprised a single study-test cycle.
Instructions and practice were administered outside the scanner. The study phase of the experiment proper consisted of the presentation of two blocks of items, separated by a brief rest period (approx. 1 minute). Each study trial began with the presentation of a red fixation character in the center of the display frame for 500 ms. This was replaced by a picture that was presented for 1500 ms. Five hundred ms after the onset of the picture, a second item was presented concurrently. This was either a visual word (visual condition) or an auditory word (auditory condition). Following picture offset, a centrally presented black fixation character was displayed for a further 1500 ms, completing the trial. Subjects were informed they would receive no warning as to the modality of the word on each trial. Congruency judgments were signaled by button presses with the left or right index finger, and response mapping was counterbalanced across subjects. Instructions placed equal emphasis on speed and accuracy.
The stimulus onset asynchrony (SOA) of study trials was stochastically distributed with a minimum of 3500 ms modulated by the addition of 40 randomly intermixed null trials (Josephs and Henson, 1999). Trials were presented in pseudo-random order, with no more than three trials of one item-type (visual condition, auditory condition, or null) occurring consecutively. Each block consisted of 94 trials, comprising 80 critical study items (congruent trials), 10 noncritical items (incongruent trials), and four buffer items for a total of 160 critical study items across two blocks.
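The way intermixed null trials lengthen the SOA can be illustrated with a short Python sketch. It assumes that each null trial occupies the same 3500 ms slot as a stimulus trial (500 ms fixation, 1500 ms picture, 1500 ms fixation); this duration is not stated explicitly in the text, and the function names and example sequence are hypothetical.

```python
TRIAL_MS = 3500     # 500 ms fixation + 1500 ms picture + 1500 ms fixation
FIXATION_MS = 500   # red fixation preceding picture onset

def picture_onsets(sequence):
    """Picture onset times (ms) for a sequence of trial types.

    Assumption: a 'null' entry occupies one full trial slot without
    presenting a picture.
    """
    return [slot * TRIAL_MS + FIXATION_MS
            for slot, kind in enumerate(sequence)
            if kind != "null"]

def soas(onsets):
    """Stimulus onset asynchronies between successive pictures (ms)."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]
```

For the sequence `["visual", "null", "auditory", "auditory", "null", "null", "visual"]` the SOAs are 7000, 3500, and 10500 ms: the minimum SOA is one trial slot, and each intervening null trial adds a further 3500 ms.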
Following the completion of the second study block, volunteers were removed from the scanner and taken to a neighboring testing room. Only then were they informed of the source memory test and given instructions and a short practice test. Approximately 30 minutes elapsed between the completion of the second study block and the beginning of the memory test. Each test trial began with a red fixation presented in the center of a gray frame for 500 ms, followed by the presentation for 500ms of a centrally presented picture within a solid gray frame.
The test items consisted of the 160 critical study items (i.e. from the congruent trials) and 80 randomly interspersed unstudied (new) pictures (no more than three items of one type were presented consecutively). Instructions were to judge whether each picture was old or new, and to indicate the decision with the right index (old) or left index (new) finger. If uncertain whether an item was old or new, volunteers were instructed to indicate ‘new’ so as to maximize the likelihood that subsequent source memory judgments (see below) would be confined to confidently recognized items. If a picture was judged ‘new,’ the test advanced to the next trial (with a 1 s inter-trial interval, during which a black fixation character was presented). If the picture was judged ‘old’ the prompt “Heard, Seen, Unsure?” appeared in black uppercase letters. Subjects were required to recall the modality of the name associated with the picture at study. The modality judgment was signaled by button press: right index finger for ‘heard’, right middle finger for ‘seen’. Subjects were further instructed that if they were unable to retrieve the modality they should respond ‘unsure’ with their right ring finger. The test was self-paced, with instructions to complete the test as quickly as possible without sacrificing accuracy for speed. The test was presented as a single block, lasting approximately 20 minutes.
fMRI Data Acquisition
A Philips Achieva 3T MR scanner (Philips Medical Systems, Andover, MA, USA) was used to acquire both T1-weighted anatomical volume images (240 × 240 matrix, 1 mm³ voxels, 160 slices, sagittal acquisition, 3D MP-RAGE sequence) and T2*-weighted echoplanar images (EPI) [80 × 79 matrix, 3 × 3 mm in-plane resolution, axial acquisition, flip angle 70°, echo time (TE) 30 ms] with blood-oxygenation level dependent (BOLD) contrast. The data were acquired using a sensitivity encoding (SENSE) reduction factor of 2 on an eight-channel parallel imaging headcoil. Each EPI volume comprised 30 3 mm-thick axial slices separated by 1 mm, oriented parallel to the AC-PC plane, and positioned to give full coverage of the cerebrum and most of the cerebellum. Data were acquired in two sessions of 260 volumes each, with a repetition time (TR) of 2 s/volume. Volumes within sessions were acquired continuously in an ascending sequential order. The first five volumes of each session were discarded to allow equilibration of tissue magnetization.
fMRI Data Analysis
Data were analyzed with Statistical Parametric Mapping (SPM5, Wellcome Department of Cognitive Neurology, London, UK; Friston et al., 1995) implemented under Matlab2006a (The Mathworks Inc., USA). Functional images were subjected to a two-pass spatial realignment. Images were realigned to the first image, generating a mean image of the sessions. In the second pass the raw images were realigned to the generated mean image. The images were then subjected to reorientation, spatial normalization to a standard EPI template (based on the Montreal Neurological Institute (MNI) reference brain; Cocosco et al., 1997) and smoothing with an 8 mm FWHM Gaussian kernel. Functional time series were concatenated across sessions.
Statistical analyses were performed on the study phase data in two stages of a mixed effects model. In the first stage, neural activity elicited by the study pictures was modeled by delta functions (impulse events) that coincided with the onset of each picture. The ensuing BOLD response was modeled by convolving the neural functions with a canonical hemodynamic response function (HRF) and its temporal and dispersion derivatives (Friston et al., 1998) to yield regressors in a General Linear Model (GLM) that modeled the BOLD response to each event-type.
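The construction of a regressor from impulse events can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the SPM5 implementation: the double-gamma HRF uses commonly cited parameters (response peaking near 5 s, undershoot near 15 s, ratio 1:6), the temporal derivative is a 1 s finite difference, the dispersion derivative is omitted, and all function names are hypothetical.

```python
import math

TR = 2.0  # seconds per volume, as in the acquisition described above

def gamma_pdf(t, shape):
    """Gamma density with unit scale; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)

def canonical_hrf(t):
    """Double-gamma HRF (assumed parameters: shapes 6 and 16, ratio 1/6)."""
    return gamma_pdf(t, 6.0) - gamma_pdf(t, 16.0) / 6.0

def temporal_derivative(t, dt=1.0):
    """Finite-difference approximation to the HRF's temporal derivative."""
    return (canonical_hrf(t) - canonical_hrf(t - dt)) / dt

def make_regressor(onsets_s, n_volumes, basis=canonical_hrf, tr=TR):
    """Convolve delta functions at onsets_s with a basis function.

    Convolving an impulse train with a kernel is the same as summing
    copies of the kernel shifted to each onset; the sum is sampled once
    per volume.
    """
    return [sum(basis(v * tr - onset) for onset in onsets_s)
            for v in range(n_volumes)]
```

One regressor per basis function would be built for each event-type, for example `make_regressor(onsets, 255)` and `make_regressor(onsets, 255, basis=temporal_derivative)`, where `onsets` holds the picture onset times in seconds for that event-type.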
For the reasons discussed in the results section, the principal analyses were confined to four events of interest: studied pictures that were later recognized and correctly endorsed as having been paired with a visual or auditory word (visual source hits and auditory source hits, respectively), and studied pictures that, on the later memory tests, were associated either with inaccurate source judgments (incorrect or unsure) or which were misclassified as new. A fifth category of trials comprised events of no interest, namely, incongruent trials, buffer trials, and trials associated with incorrect or omitted study responses. Six regressors modeling concatenated movement-related variance (three rigid-body translations and three rotations determined from the realignment stage) and session-specific constant terms modeling the mean over scans in each session were also entered into the design matrix.
For each voxel, the functional timeseries was highpass-filtered to 1/128 Hz and scaled within-session to yield a grand mean of 100 across voxels and scans. Parameter estimates for events of interest were estimated using a General Linear Model. Nonsphericity of the error covariance was accommodated by an AR(1) model, in which the temporal autocorrelation was estimated by pooling over suprathreshold voxels (Friston et al., 2002). The parameters for each covariate and the hyperparameters governing the error covariance were estimated using Restricted Maximum Likelihood (ReML). Effects of interest were tested using linear contrasts of the parameter estimates. These contrasts were carried forward to a second stage in which subjects were treated as a random effect. Unless otherwise specified, only effects surviving an uncorrected threshold of p < .001 and including nine or more contiguous voxels were interpreted. The peak voxels of clusters exhibiting reliable effects are reported in MNI co-ordinates.
Regions of overlap between the outcomes of two contrasts were identified by inclusive masking of the relevant SPMs. When the two contrasts were independent, the statistical significance of the resulting SPM was computed using Fisher's method for estimating the conjoint significance of independent tests (Fisher, 1950; Lazar et al., 2002). In all cases, the SPM to be masked was thresholded at p < 0.01 (again with a 9 voxel extent threshold) and the SPM that constituted the mask was thresholded at p < 0.001, giving a conjoint significance level of p < 10⁻⁴. Exclusive masking was used to identify voxels where effects were not shared between two contrasts. Contrasts to be masked were thresholded at p < 0.001 and the SPM constituting the exclusive mask was thresholded at p < 0.05 (p < 0.1 for bi-directional F contrasts). Note that the more liberal the threshold of an exclusive mask, the more conservative is the masking procedure.
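Fisher's method for two independent tests can be sketched as follows. The statistic X = −2(ln p₁ + ln p₂) follows a chi-square distribution with four degrees of freedom under the null hypothesis, and for four degrees of freedom the upper-tail probability has the closed form exp(−X/2)(1 + X/2). The function name is hypothetical, and the sketch illustrates the calculation only; it is not the code used in the analysis.

```python
import math

def fisher_conjoint_p(p1, p2):
    """Conjoint p-value of two independent tests by Fisher's method.

    X = -2 * (ln p1 + ln p2) is chi-square distributed with 4 degrees of
    freedom under the null; its survival function is exp(-X/2) * (1 + X/2).
    """
    if not (0.0 < p1 <= 1.0 and 0.0 < p2 <= 1.0):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -(math.log(p1) + math.log(p2))  # X / 2
    return math.exp(-half_stat) * (1.0 + half_stat)
```

For p-values lying exactly at the two inclusive-masking thresholds, `fisher_conjoint_p(0.01, 0.001)` evaluates to a value of the order of 10⁻⁴; voxels with smaller p-values in either contrast yield correspondingly smaller conjoint values.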