Using Blood Oxygen Level Dependent (BOLD) functional MRI (fMRI) to detect deception is feasible in simple laboratory paradigms. A mock sabotage scenario was used to test whether this technology would also be effective in a scenario closer to a real-world situation. Healthy, nonmedicated adults were recruited from the community, screened, and randomized to either a Mock-crime group or a No-crime group. The Mock-crime group damaged and stole compact discs (CDs), which contained incriminating video footage, while the No-crime group did not perform a task. The Mock-crime group also picked up an envelope from a researcher, while the No-crime group did not perform this task. Both groups were instructed to report that they picked up an envelope, but did not sabotage any video evidence. Participants later went to the imaging center and were scanned while being asked questions regarding the mock crime. Participants also performed a simple laboratory based fMRI deception testing (Ring-Watch testing). The Ring-Watch testing consisted of “stealing” either a watch or a ring. The participants were instructed to report that they stole neither object. We correctly identified deception during the Ring-Watch testing in 25 of 36 participants (Validated Group). In this Validated Group for whom a determination was made, computer-based scoring correctly identified nine of nine Mock-crime participants (100% sensitivity) and five of 15 No-crime participants (33% specificity). BOLD fMRI presently can be used to detect deception concerning past events with high sensitivity, but low specificity.
Intense interest exists in the scientific community and lay press concerning the possibility of using Blood Oxygen Level Dependent (BOLD) functional magnetic resonance imaging (fMRI) to measure brain activation during deception. A number of studies using fMRI to investigate the neural correlates of deception have been published (1–14). The design and analysis methods across these studies vary considerably, making it difficult to integrate the results. At the group analysis level, however, these studies have consistently found significant brain activation in deception versus telling the truth. There has been variability in the specific brain regions activated during deception in these studies. One explanation for this array of findings is the diversity in tasks and questioning paradigms. To date, successful individual analysis has only been achieved in two studies (2,5). The University of Pennsylvania group (Langleben et al.) also reported on using a different analytic approach to the same imaging data to improve their accuracy (12).
Although these studies reported reasonably high individual accuracy rates, there are several concerns which must be addressed prior to moving this technology to real-world application. One concern is that relatively simple deception paradigms (theft of a watch or ring, deception about which playing card one is holding) were used with perceived financial compensation for successful deception. The impact of testing for deception on more elaborate scenarios that are closer to a real-world situation is unknown. Importantly, the robustness of our methodology using a priori defined regions of interest (Kozel et al., 2005) is untested for detecting deception when performing different tasks and providing different types of lies. Another concern is the time between the act in question and testing (3). The ring/watch and playing card studies used a laboratory-based scenario performed immediately before the MR scanning to test for deception. To determine how robust our prior published method (5) was in detecting deception in a more real-life setup with extended time from the event, we worked with the Defense Academy for Credibility Assessment (DACA) to develop a mock sabotage (i.e., the destruction of property or obstruction of normal operations) crime paradigm. This paradigm was closer to a real-world situation in both the nature of the criminal act and the time between the act and testing. We hypothesized that we would be able to determine which participants committed the crime versus those who did not commit the crime using our automated computerized analysis methods on the fMRI data to detect deception. In addition, we hypothesized that using a laboratory screening deception test would improve our predictive power for the mock crime. We reasoned that we would have a higher predictive power in individuals for whom we were able to properly identify lies versus truth on an embedded laboratory study of deception (i.e., Ring-Watch task). 
That is, using the lab-based identification as a within subject assay might improve our ability to detect deception in the more elaborate Mock-crime task if we only made Mock-crime calls in the subjects that we correctly identified in the simple laboratory study.
Participants were healthy, literate adults between the ages of 18–50 years who were recruited by written advertisements from within the Medical University of South Carolina (MUSC) and the Charleston, SC community. (See Fig. 1 for outline of study flow.) Informed consent approved by the MUSC Office of Research Integrity was obtained in writing. Participants were screened with a Structured Clinical Interview for DSM-IV Axis I Disorders (SCID-I) (15), a pre-MRI screening form, and a medical history, and were given a brief physical exam. Exclusion criteria included taking any medications within five half-lives of imaging, presently using illegal drugs, history of a psychiatric disorder except simple phobia, history of a significant central nervous system disease (e.g., stroke, seizures, severe head injury, etc.), currently unstable medical condition, pregnancy or lactation, caffeinism (i.e., headaches or other withdrawal symptoms with cessation of caffeine for 3 days), nicotine use, claustrophobia, previous inability to tolerate an MRI, any metal implants making imaging unsafe, or prior knowledge of the paradigm. All participants were also evaluated with the Annett Handedness Scale (16), the State-Trait Anxiety Inventory (STAI) (17), and the Temperament and Character Inventory (18). Prior to MRI scanning, a urine sample was obtained to test for drugs of abuse, and for pregnancy in women of childbearing potential. Participants received $20 for completing the screening phase. Participants who met the screening criteria were randomly assigned to either the Mock-crime group or the No-crime group. Participants were then scheduled for a second visit in which they were given instructions and performed the assigned tasks.
Subject randomization was performed at DACA by Dean Pollina (DP) using an urn randomization controlling for sex and age (<35 or ≥35). The group assignment was placed in sequential envelopes, sealed, and sent to Emily Grenesko (ELG) at MUSC. The entire study team, except ELG, was unaware of the randomization scheme used or the percentage committing the mock crime (i.e., base rate). Once participants were screened and enrolled, ELG would open the next envelope in the appropriate category (e.g., female and >35 years) to determine group assignment.
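As an illustration only, the following is a minimal sketch of one common form of urn randomization (a two-arm urn kept separately for each sex-by-age stratum). The text does not state which urn variant or parameters were used at DACA, so the class name `Urn`, the starting ball count `alpha`, the replacement count `beta`, and the per-stratum layout are all assumptions.

```python
import random


class Urn:
    """Two-arm urn: drawing one arm adds balls for the other arm (assumed variant)."""

    def __init__(self, arms=("Mock-crime", "No-crime"), alpha=1, beta=1, rng=None):
        # alpha = starting balls per arm, beta = balls added after each draw (assumptions)
        self.balls = {arm: alpha for arm in arms}
        self.beta = beta
        self.rng = rng or random.Random()

    def draw(self):
        arms = list(self.balls)
        weights = [self.balls[arm] for arm in arms]
        chosen = self.rng.choices(arms, weights=weights, k=1)[0]
        # Push the urn back toward balance by adding balls for the arm not chosen.
        for arm in arms:
            if arm != chosen:
                self.balls[arm] += self.beta
        return chosen


def stratum(sex, age):
    """Stratify by sex and by the age split given in the text (<35 or >=35)."""
    return (sex, "<35" if age < 35 else ">=35")


def assign(urns, sex, age, rng):
    """Assign one participant using the urn that belongs to their stratum."""
    key = stratum(sex, age)
    if key not in urns:
        urns[key] = Urn(rng=rng)
    return urns[key].draw()


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the illustration is repeatable
    urns = {}
    for sex, age in [("F", 41), ("M", 23), ("F", 36), ("F", 52)]:
        print(sex, age, assign(urns, sex, age, rng))
```

Because each draw makes the opposite arm more likely on the next draw within the same stratum, group sizes tend to stay balanced while individual assignments remain unpredictable.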
The mock-crime scenario used in the present study was adapted from procedures that were used to develop computerized evaluations of polygraph data (19). Based on internal studies performed at DACA and the constraints of performing an fMRI experiment, a number of modifications to the paradigm were made. Very similar paradigms have been used by DACA to test other modalities designed to detect deception (personal communication A. Ryan).
Participants received instructions from ELG regarding how to carry out the mock crime. They were told to first pick up an envelope marked “Confidential” from a research scientist in room 370 in the Main Hospital (different building from both the location of screening and where the mock crime was performed). They were instructed not to open it and bring it back to ELG before their MRI scan. The subject was then told that their next task would be to destroy evidence of a crime. The explanation was that a surveillance camera captured someone else committing a crime and that person wanted the evidence destroyed to avoid prosecution. In order to do this, they needed to go to room 112 in the Forensic Psychiatry Building where the evidence was being kept for processing. They were instructed to spend as little time in the room as possible—30 min at the most. They were told to use a specific entrance to the building across from the law enforcement parking spaces. Once in the room, they were to find a compact disk (CD) in a specific folder, view the CD to ensure that they saw a robbery of a convenience store taking place, and then destroy the CD by any means possible before leaving the room. A shredder in the room was suggested for use, but it purposely did not work. Thus, they were required to deal with an unexpected situation and determine how to break the CD using another method. The purpose was to add some unpredictability to the scenario and increase the sense of jeopardy and stress. Participants were to retain a small piece of the CD and bring it to ELG just before their MRI scan. The subjects were then told by ELG that she needed them to do something in addition “just for her” while they were in that room. She said that she had taken a laptop from one of her professors some time ago but had recently noticed security cameras in his office. She told them a CD containing this incident was in another separate folder in the room. 
They were instructed to view the CD to ensure that they saw ELG committing the crime and to bring it back to her intact before their scan. The purpose of this additional task was to add more of a sense that the subject was actually committing a crime. The room containing the CDs had hidden video surveillance monitoring equipment and a webcam to confirm and document that the participants accurately performed the task. The room also had a telephone that ELG called while the participants were performing the task in the room. The phone call was intended to add to the sense of possibly getting caught in the act. Participants were informed that they would be contacted by ELG to set up Visit 3 if the task was performed properly. Participants received $50 compensation for completing the task portion of the study. If the task was not performed properly, the subject was compensated for Visits 1 and 2 and excluded from the study. Throughout the remainder of the study, the participants were to report that they went to pick up the envelope but did nothing else (i.e., they were to lie about the Mock-crime task). Any deviation from that would result in the subject being excluded from the study and compensated for Visit 1 and 2.
ELG ensured that the participants understood that someone destroyed evidence on a CD of a crime being committed. They were told they would need an alibi as to where they were during that time. For their alibi, they were instructed to say that they were in room 370 in the Main Hospital picking up an envelope marked “Confidential” from a research scientist. The participants were informed that they would not actually do this task. Throughout the remainder of the study, they were to answer as if they picked up the envelope but did not damage the CD or visit the crime room. The purpose of having the participants lie about picking up the envelope was to have questions during the fMRI evaluation in which the participants in this group were lying. Participants were allowed to leave after the instructions and were subsequently contacted for their MRI scan (i.e., Visit 3). They were instructed that they would receive $50 for performing Visit 2.
Participants met ELG prior to going to the scanner. If the participant had been assigned to the Mock-crime group, they were expected to produce the envelope, a piece of the CD which contained video of the convenience store robbery, and the intact CD which contained video of ELG stealing the laptop. ELG confirmed that all participants were abiding by the story of picking up the envelope but not damaging the CD. The questions asked in the scanner were reviewed to confirm that the subject was answering per protocol.
The subject was escorted to the scanner by ELG who then left for the remainder of the scanning. They were given the STAI to fill out and were trained by Kevin Johnson (KAJ) on answering the questions using the IFIS system (Intermagnetics General Corp., Latham, NY) outside of the scanner. Subsequently, the participants were made safe to enter the scanning room and were positioned in the MRI scanner with the IFIS system signaling gloves. In addition, physiologic measuring devices were attached to collect data on pulse, breathing rate, and electrodermal response (EDA, galvanic skin response) every 0.01 sec during their examination (data not included). The participants were given foam ear protection and had their head secured using foam pads. All scanning was performed using a Philips 3.0 Tesla Intera MRI scanner equipped with an actively shielded magnet. The BOLD fMRI Echo Planar Imaging (TR 1867 ms, TE 30 ms, Flip Angle 90 deg, FOV 208 mm, matrix 64 × 64, SENSE factor 2, 36 slices, 3 mm with 0 mm gap, giving a voxel size of 3.25 × 3.25 × 3.00 mm³) covered the entire brain and was positioned with reference to the Anterior Commissure-Posterior Commissure line using a sagittal scout image. A quick motor task (3 min, 97 volumes) was performed to orient participants to the fMRI environment and to give them practice responding with the IFIS glove (yes: right thumb; no: right index finger). For the Mock-crime testing, BOLD fMRI scans were acquired (12 min, 386 volumes) while questions were visually displayed. Using an event-related design, three types of questions were asked: whether the subject picked up the envelope, whether the subject committed the sabotage, and general neutral questions. Questions were presented visually for 3.5 sec. After a question, there was a visual prompt for the subject to answer “Yes or No” for 2 sec followed by a “+” for 0.5 sec. 
Participants were instructed to not answer until they saw the visual prompt of “Yes or No.” The delayed response was to reduce the variability of response timing due to differences in reading speeds across questions and across participants. Because of this forced delay to respond, the reaction times cannot be interpreted as indicating the time it required to answer the question. Thus, each question with answer epoch took 6 sec. There were 20 questions for each category (60 unique questions total): “disk” regarding the Mock-crime; “envelope” regarding picking up the envelope; and “neutral” regarding easily answered and verifiable personal information. The order of the type of question (i.e., disk, envelope, and neutral) was pseudorandomized using a web-based randomization generator (http://www.randomization.com) but consistent across participants. The IFIS system pseudorandomly chose the actual question for the appropriate category. Thus, the order of the type of questions was consistent across participants, but the order of the actual questions asked was varied. The 60 questions were presented once for practice outside of the scanner and in two separate sets (total 120 questions—each question with answer epoch lasting 6 sec) that were administered consecutively for the scanning. Participants were then taken out of the scanner to participate in the task of taking the ring or watch.
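The timing described above can be checked with a short calculation. This is a sketch using only the durations, TR, and volume count reported in the text; the variable names are illustrative, and it assumes the first question appears at time zero of the run.

```python
# Durations reported in the text (seconds)
QUESTION_S = 3.5   # question shown
PROMPT_S = 2.0     # "Yes or No" response prompt
FIXATION_S = 0.5   # "+" fixation
TR_S = 1.867       # repetition time (TR 1867 ms)
N_VOLUMES = 386    # volumes per Mock-crime run
N_TRIALS = 120     # 60 questions presented in two sets

# One question-with-answer epoch
epoch_s = QUESTION_S + PROMPT_S + FIXATION_S

# Total time needed for all trials versus time actually scanned
run_s = N_TRIALS * epoch_s
scan_s = N_VOLUMES * TR_S

# Onset of each response prompt, assuming the first question starts at t = 0
prompt_onsets = [QUESTION_S + i * epoch_s for i in range(N_TRIALS)]

if __name__ == "__main__":
    print(f"epoch length: {epoch_s} s")
    print(f"trial time:   {run_s} s ({run_s / 60:.1f} min)")
    print(f"scan time:    {scan_s:.3f} s")
    print(f"first and last prompt onsets: {prompt_onsets[0]} s, {prompt_onsets[-1]} s")
```

Under these assumptions the 386 volumes cover slightly more than the 720 sec needed for 120 epochs, which is consistent with the reported 12 min acquisition.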
Upon completion of the Mock-crime paradigm questions, participants were given a brief break before performing the Ring-Watch paradigm (similar to the paradigm in the prior study). For the Ring-Watch paradigm, the subject was instructed to go into a room where there were two objects hidden in a drawer. The instructions were to steal one of the objects and leave the other behind. A research assistant confirmed that only one object was taken. The participant was instructed to respond to questions in the scanner as if he/she had not taken either object. Once in the scanner, BOLD fMRI scans (same scanning parameters as in the Mock-crime task, 386 volumes) were acquired while questions were visually displayed in the same manner as the Mock-crime testing. The number of questions presented during the Ring-Watch testing was equal to the number of questions presented for the Mock-crime testing. The difference was that the questions consisted of different neutral questions and questions about whether the subject took the ring or took the watch. All questions were pseudorandomized and presented twice. The IFIS gloves were used to respond “yes” or “no” to the questions. Unlike in the Kozel et al. 2005 study, participants in the current study were not encouraged or given monetary incentives to “beat” the test, and control questions were eliminated. A structural scan was subsequently acquired and reviewed for any gross pathology. When the task was completed, participants were removed from the scanner and debriefed. Scanning time in total was approximately 40 min, but the entire task took about 1.5–2.0 h for completion. KAJ documented which object the subject took. KAJ and ELG were the only investigators in the study who knew which object was taken.
After participants were removed from the MRI, they were greeted by ELG who asked if they had any questions about the protocol or what just occurred. They were asked if they experienced any difficulties during the study or scanning. Participants were given a postscanning questionnaire and paid $50 for completing the scanning. One of the questions in the postscanning questionnaire specifically asked the participants whether they felt that the sabotage mock-crime scenario was believable. Once again, only ELG knew which task the subject actually performed; and only KAJ and ELG knew whether the watch or ring was taken for each subject.
All initial imaging data analysis was performed blind to subject group. After initial data analysis, datasets were locked, and the blind broken. There was no possibility of any data analysis by anyone with knowledge of the group assignment. For the Ring-Watch task, KAJ revealed the blind to SJL, MSG, and FAK. Subsequently, the assignment of Mock-crime versus No-Crime task was unblinded at a later date with all of the above parties with the addition of DP.
Testing of demographic and clinical variables was performed using t-test for continuous data and Chi-square for discrete data to determine if there were any significant differences between the two groups.
Participant responses were inspected for participation and irregularities with respect to the correct answer per the protocol. Response data from the IFIS were converted to Excel files using E-Prime 18.104.22.168 and subsequently imported into Matlab (The Mathworks, Natick, MA). Using a Matlab script, response data from the IFIS files were converted into the onsets for the event-related SPM2 analysis. Responses that were not consistent for both times the question was asked, not answered, or not answered as specified in the protocol were identified and modeled as separate “nonprotocol” events. Although the reaction times do not indicate the time required to answer different question types, differences in reaction times between the Mock-crime group and No-crime group could indicate differences in the groups’ testing performance. Response times as measured by the IFIS were compared using a two-sided t-test for the two groups using NCSS (20).
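The rule for flagging “nonprotocol” responses can be sketched as follows. This is not the authors' Matlab script; the function name `classify_responses` and the dictionary layout are hypothetical, and the sketch assumes that a question is labeled nonprotocol as a whole when either presentation fails any of the three checks.

```python
def classify_responses(first, second, expected):
    """Label each question 'protocol' or 'nonprotocol'.

    first, second: dicts mapping question id -> "yes", "no", or None (no answer)
                   for the two presentations of each question
    expected:      dict mapping question id -> answer required by the protocol
    """
    labels = {}
    for qid, required in expected.items():
        a = first.get(qid)
        b = second.get(qid)
        answered = a is not None and b is not None   # both presentations answered
        consistent = a == b                          # same answer both times
        per_protocol = a == required                 # answer matches the instructions
        ok = answered and consistent and per_protocol
        labels[qid] = "protocol" if ok else "nonprotocol"
    return labels


if __name__ == "__main__":
    # Made-up example responses, for illustration only
    expected = {"disk_01": "no", "envelope_01": "yes", "neutral_01": "yes"}
    first = {"disk_01": "no", "envelope_01": "yes", "neutral_01": None}
    second = {"disk_01": "yes", "envelope_01": "yes", "neutral_01": "yes"}
    print(classify_responses(first, second, expected))
```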
Functional MRI analysis was carried out using Statistical Parametric Mapping software (SPM 2; Wellcome Department of Cognitive Neurology, London, UK) by Steven J. Laken (SJL) and F. Andrew Kozel (FAK) independently. The two analyses differed in the following ways: FAK performed SPM2 analysis using Red Hat Linux Enterprise Edition (Linux kernel 2.6) and Matlab version 22.214.171.124 (Release 14 with service pack 3) and SJL performed SPM2 analysis using Windows XP x64 and Matlab version 2006a. All analyses, Mock-crime and Ring-Watch were carried out with SJL and FAK blind to the actual tasks performed by the participants. Data analysis was carried out in the same manner as in Kozel et al. 2005 except statistical estimation was modeled to correct for temporal coherence using AR(1) (21) and the hemodynamic response function included a temporal derivative. We felt that the use of the AR(1) decreased the impact of temporal dependence, and the use of a temporal derivative enabled a broader hemodynamic response to be fitted to our model. Further all data from Kozel et al. 2005 was tested using these changes and did not change the results (data not shown). The first image was reoriented using the display function in SPM2 so that the 0,0,0 coordinate corresponded to the medial anterior commissure. Each scan was then adjusted to set these coordinates using the reorient function in SPM2. SJL and FAK set these points independently. Preprocessing was performed using an automated script as described in Kozel et al. 2005. The reoriented images were realigned and unwarped to correct head movement and resulting EPI distortions. Participants with movement of >3 mm were eliminated. Slice timing was performed to correct for differences in the point in time that each slice was acquired. The functional images were then spatially normalized to the SPM EPI template and re-sampled to 3 × 3 × 3 mm voxels (22). 
The data were spatially smoothed using a Gaussian kernel with 8 mm full width at half maximum based on the suggested standard of 2 to 3 times the output spatially normalized voxel size. This was performed to adjust for inter-subject variability as well as making the errors more normal in their distribution to help ensure the validity of inferences based on parametric tests (23).
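The kernel size can be checked against the stated rule of thumb with a short calculation. This is a sketch using the reported 8 mm full width at half maximum and the 3 mm re-sampled voxel size; the variable names are illustrative, and the conversion to a standard deviation uses the standard Gaussian relation FWHM = 2·sqrt(2·ln 2)·sigma.

```python
import math

FWHM_MM = 8.0    # smoothing kernel reported in the text
VOXEL_MM = 3.0   # re-sampled voxel edge after spatial normalization

# Kernel width expressed in voxels (the suggested standard is 2 to 3)
ratio = FWHM_MM / VOXEL_MM

# Equivalent Gaussian standard deviation in millimetres
sigma_mm = FWHM_MM / (2.0 * math.sqrt(2.0 * math.log(2.0)))

if __name__ == "__main__":
    print(f"FWHM / voxel size = {ratio:.2f}")
    print(f"sigma = {sigma_mm:.2f} mm")
```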
Using a Matlab script, the statistical portion of the analysis was performed. A general linear model within SPM2 was specified and estimated for the Mock-crime and the Ring-Watch paradigms. Events and their temporal derivative were defined as occurring when the cue to answer “Yes” or “No” was presented to the participants (starting at 3.5 sec and occurring every 6 sec thereafter). Effects at each and every voxel were estimated using the general linear model at the first statistical level. The motion-recorded parameters generated during the “Realign” process were included as six user-specified regressors. The nonprotocol events were also included as conditions and modeled with the hemodynamic response function. A high pass filter (cut-off period = 128 s) was used to remove possible effects of low-frequency changes and AR(1) was used to decrease the impact of temporal dependence. Individual t-statistics activation maps were defined based on the contrasts of interest. For the Mock-crime analysis, the contrasts were disk minus neutral and envelope minus neutral. For the Ring-Watch analysis, the contrasts were ring minus neutral and watch minus neutral. Using a Matlab script, the number of significant (p < 0.05) voxels was determined in each of the three Regions of Interest (ROI) defined in Kozel et al. 2005 (clusters 1, 2, and 4 which roughly correspond to right anterior cingulate region, right orbitofrontal/inferior frontal region, and right middle frontal region). For the Mock-crime analysis, the number of significantly activated voxels for the envelope minus neutral contrast was subtracted from the number for the disk minus neutral contrast using ROIs 1, 2, and 4. If the resulting value was positive, then the call was made that the mock crime was committed (i.e., greater brain activation corresponding to lying about performing the mock crime). If the resulting value was zero, then it was called indeterminate. 
If the resulting value was negative, then the call was that the mock crime was not committed (i.e., greater brain activation corresponding to lying about the envelope task). Similarly, for the Ring-Watch analysis, the number of significantly activated voxels for the watch minus neutral contrast was subtracted from the number for the ring minus neutral contrast using ROIs 1, 2, and 4. If the resulting value was positive, then the call was made that the ring was taken (i.e., greater brain activation corresponding to lying about the ring questions). If the resulting value was zero, then it was called indeterminate. If the resulting value was negative, then the call was that the watch was taken (i.e., greater brain activation corresponding to lying about the watch questions).
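The calling rule can be sketched as below. This is not the authors' Matlab script: the function names are hypothetical, the voxel counts in the example are made up, and summing the counts across ROIs 1, 2, and 4 before taking the difference is an assumption. The sign convention follows the parenthetical explanations in the text, so a positive difference means more activation for the disk (or ring) questions.

```python
def voxel_difference(counts_a, counts_b, rois=(1, 2, 4)):
    """Sum significant-voxel counts over the ROIs and return A minus B.

    counts_a, counts_b: dicts mapping ROI number -> count of voxels with p < 0.05
    Summing across ROIs is an assumption made for this sketch.
    """
    total_a = sum(counts_a[r] for r in rois)
    total_b = sum(counts_b[r] for r in rois)
    return total_a - total_b


def make_call(difference, positive_call, negative_call):
    """Map the sign of the difference to a call; zero is indeterminate."""
    if difference > 0:
        return positive_call
    if difference < 0:
        return negative_call
    return "indeterminate"


def mock_crime_call(disk_counts, envelope_counts):
    """(disk - neutral) count minus (envelope - neutral) count."""
    diff = voxel_difference(disk_counts, envelope_counts)
    return make_call(diff, "mock crime committed", "mock crime not committed")


def ring_watch_call(ring_counts, watch_counts):
    """(ring - neutral) count minus (watch - neutral) count."""
    diff = voxel_difference(ring_counts, watch_counts)
    return make_call(diff, "ring taken", "watch taken")


if __name__ == "__main__":
    # Made-up voxel counts, for illustration only
    disk = {1: 40, 2: 12, 4: 30}
    envelope = {1: 10, 2: 5, 4: 8}
    print(mock_crime_call(disk, envelope))
```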
The functional imaging data was checked for gross artifacts that would preclude analysis and motion >3 mm in any plane. The behavioral data was checked to ensure that <10 responses per group (e.g., disk) were “not per protocol” (i.e., incorrectly answered or not answered). Also, participants that did not adhere strictly to the protocol were eliminated. Once FAK and SJL had completed their blinded analyses, the results were compared. If each reached a different conclusion, then that subject was analyzed again from the beginning by both investigators. If the conclusions drawn were still different, then that subject was eliminated. The participants that were not eliminated due to excessive motion, gross artifacts, inadequate number of correct responses, not performing the protocol correctly, or lack of agreement in analysis are referred to as the “Quality Group.”
The a priori defined primary analysis of correctly identifying deception in the Mock-crime group was based on the subset of the Quality Group for which the Ring-Watch analysis correctly identified the object taken. This group is referred to as the “Validated Group” since the technology was validated to work on a known condition for this individual during this scanning session (i.e., the lie and truth are known depending on whether the ring or watch was taken). Finally, all participants were included into secondary analyses using the Quality Group and then the entire group without restrictions to determine which, if any, of these factors influenced the ability to detect mock-crime deception.
Receiver operator characteristic curves were generated using NCSS (20) by using the number of voxels with significantly activated t-values greater than or equal to 1.645 from the contrasts ([Disk – Neutral] – [Envelope – Neutral]) for the clusters 1, 2, and 4 as described in Kozel et al. 2005. Participants were segregated into Mock-crime groups for sensitivity values and No-crime groups for specificity values. Sensitivity and specificity were calculated for each group by stepping through each of these values for the Complete Group, Quality Group, and Validated Group.
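Stepping through thresholds to build such a curve can be sketched as follows. This is an empirical illustration, not a reproduction of the NCSS procedure, which may estimate the curve and its area differently. The function names and the example scores are made up; a score here stands for one participant's voxel number difference.

```python
def roc_points(scores_crime, scores_no_crime):
    """Return (threshold, sensitivity, specificity) for every observed score.

    A participant is called 'crime' when their score is >= the threshold.
    """
    thresholds = sorted(set(scores_crime) | set(scores_no_crime))
    points = []
    for t in thresholds:
        sensitivity = sum(s >= t for s in scores_crime) / len(scores_crime)
        specificity = sum(s < t for s in scores_no_crime) / len(scores_no_crime)
        points.append((t, sensitivity, specificity))
    return points


def empirical_auc(scores_crime, scores_no_crime):
    """Area under the empirical ROC curve.

    Computed as the fraction of (crime, no-crime) pairs in which the crime
    score is higher, counting ties as one half.
    """
    wins = 0.0
    for c in scores_crime:
        for n in scores_no_crime:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(scores_crime) * len(scores_no_crime))


if __name__ == "__main__":
    # Made-up difference scores, for illustration only
    crime = [5, 3, 1]
    no_crime = [2, 0, -4]
    for threshold, sens, spec in roc_points(crime, no_crime):
        print(threshold, round(sens, 2), round(spec, 2))
    print("AUC:", round(empirical_auc(crime, no_crime), 3))
```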
A number of group analyses were performed to increase our understanding of the relationship of this study to prior work. Using the individual Lie-minus-True and True-minus-Lie contrast images produced at the first statistical level analysis for the Mock-crime paradigm and Ring-Watch paradigm, group t-maps were generated at the second level using a random effects model (24). Only participants who met the criteria for the Quality Group were included in the analysis. Significance was defined as False Discovery Rate < 0.05 with a cluster size of ≥25 voxels, which was the threshold level used in Kozel et al. 2005. For the Mock-crime paradigm, group analyses were performed for the entire Quality Group as a whole, those in the subset of the Mock-crime group, and those in the subset of the No-crime group. For the Ring-Watch paradigm, group analyses were performed on the entire Quality Group.
Seventy participants were enrolled of which 13 were eliminated at screening (see Fig. 2). This left 57 participants who were randomized to either the Mock-crime group (n = 27) or No-crime group (n = 30). After randomization, there were five participants in the Mock-crime group and four participants in the No-crime group who dropped out. The remaining 48 (n = 22 Mock-crime, n = 26 No-crime) participants that were scanned will be referred to as the “Complete Group.”
To address the question of how believable was the study paradigm for the participants, the postscanning questionnaire asked the subjects, “Did you feel this scenario was believable?” For the Complete Group scanned, 43 of 48 (90%) participants answered, “Yes.” Thirty-two of thirty-six (89%) in the Quality Group and 22 of 25 (88%) in the Validated Group answered, “Yes.” Thus, the participants generally felt that the scenario was believable.
The average time for the Mock-crime group from performing the mock crime (Visit 2) to fMRI scanning was 105.0 h (90.6 SD, range 5.5–312.0) in the Validated Group. The average time for the No-crime group from getting instructions (Visit 2) to fMRI scanning was 56.8 h (66.4 SD, range 0.5–216.0) in the Validated Group (see Table 1). The time from performing the task (or receiving instructions about what occurred) to the scan differed significantly between the Complete Mock-crime and Complete No-crime groups (two-sample t-test, two-sided, t = 2.12, p-value = 0.04) and between the Quality Mock-crime and Quality No-crime groups (two-sample t-test, two-sided, t = 2.44, p-value = 0.02). There was, however, no significant time difference between the Mock-crime and No-crime Validated Groups (two-sample t-test, two-sided, t = 1.60, p-value = 0.12). Although time to scan differed between the two groups, there was no significant difference (two-sample t-test, two-sided, t = 0.52, p-value = 0.61) in time to scan between participants who were accurately called and those who were inaccurately called.
No significant differences were found in comparisons of ethnicity, object chosen to steal (i.e., ring or watch), gender, handedness, and years of education between the Mock-crime and No-crime groups for the Complete Group, the Quality Group, and the Validated Group (see Table 1). There was, however, a difference between the two groups with respect to reaction times for all question types. The No-crime group responded significantly faster to all question types than the Mock-crime group (p < 0.001 for Complete, Quality, and Validated groups).
Of the 48 participants in the Complete Group, twelve (n = 8 Mock-crime group, n = 4 No-crime group) did not meet criteria to be in the Quality Group. The reasons for exclusion from the Quality Group were varied. Three participants in the Mock-crime group did not complete the protocol properly and one of these also had inadequate number of per protocol responses recorded. Three participants had motion >3 mm on the Ring-Watch testing with one of these also having >3 mm motion on the Mock-crime testing. Three participants had significant artifacts on their EPI scans for both the Mock-crime and Ring-Watch testing. In addition to the artifacts, one of these participants had an inadequate number of per protocol responses recorded. Another subject had an inadequate number of per protocol responses recorded. There were two participants in which SJL and FAK did not agree on their calls. Both participants had 0 or 1 voxels activated as determined by SJL or FAK which most likely represented spurious results. The Quality Group was comprised of 14 participants in the Mock-crime group and 22 participants in the No-crime group.
To determine the Validated Group, those participants in the Quality Group who were not correctly identified or indeterminate on the Ring-Watch testing were eliminated from the Validated Group (n = 5 for the Mock-crime group and n = 6 for the No-crime group). For the Validated Group, nine of the nine assigned to the Mock-crime were correctly identified, and five of the 16 who were in the No-crime group were correctly identified with one of the calls being an indeterminate. For those in the Validated Group for whom a call was made, this resulted in a sensitivity of 100% (9/9 correct classification, 95% confidence limit = 0.68–1.00) and a specificity of 33% (5/15 correct classification, 95% confidence limit = 0.15–0.58) for detecting the mock crime (see Table 2).
For the Quality Group, 13 of the 14 Mock-crime participants were correctly identified and eight of the 22 No-crime participants were correctly identified (one call was indeterminate). For those in the Quality Group for whom a call was made, this resulted in a sensitivity of 93% (13/14, 95% confidence limit = 0.69–0.99) and a specificity of 38% (8/21, 95% confidence limit = 0.20–0.59). Thus, restricting judgments to only those participants identified correctly in the Ring-Watch laboratory testing did not significantly (p > 0.4) improve the overall detection rates.
For the Complete Group, 20 of the 22 Mock-crime group participants were correctly classified. Ten of the 26 No-crime group participants were correctly identified, with two calls in that group being indeterminate. For those in the Complete Group for whom a call was made, the sensitivity was 91% (20/22, 95% confidence limit = 0.72–0.97) and the specificity was 42% (10/24, 95% confidence limit = 0.22–0.61).
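The sensitivity and specificity figures for the three groups follow directly from the classification counts given above. As a minimal sketch, the Python below recomputes them. The function names are illustrative, and because the text does not state which method produced the 95% confidence limits, a Wilson score interval is assumed here; the limits it returns may therefore differ slightly from the reported ones.

```python
from math import sqrt

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (assumed method, not stated in the text)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width

# (correct calls, participants with a determinate call), taken from the text above
counts = {
    "Validated": {"sensitivity": (9, 9), "specificity": (5, 15)},
    "Quality": {"sensitivity": (13, 14), "specificity": (8, 21)},
    "Complete": {"sensitivity": (20, 22), "specificity": (10, 24)},
}

def summarize(counts):
    """Return {(group, measure): (rate, lower limit, upper limit)}."""
    rows = {}
    for group, measures in counts.items():
        for name, (correct, total) in measures.items():
            low, high = wilson_interval(correct, total)
            rows[(group, name)] = (correct / total, low, high)
    return rows

if __name__ == "__main__":
    for (group, name), (rate, low, high) in summarize(counts).items():
        print(f"{group:9s} {name:11s} {rate:.2f} ({low:.2f} to {high:.2f})")
```

Indeterminate calls are excluded from the denominators, mirroring the phrase "for whom a call was made" in the text.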
For the Quality Group, the Ring-Watch paradigm correctly identified the deception regarding which item was taken 25 out of 36 times with one subject having an indeterminate call. This results in an accuracy of 71% (25/35, 95% confidence limit = 0.55–0.84). For the Complete Group, Ring-Watch paradigm deception was correctly identified 72% (31/43, 95% confidence limit = 0.57–0.83) of the time with 5 individuals having indeterminate results.
We used receiver operating characteristic (ROC) curves to systematically test the trade-off between sensitivity and specificity. For our a priori analysis, the threshold was a difference greater than zero in the number of significant (t > 1.645) voxels, which was taken to indicate the lie. The area under the curve (AUC) summarizes the overall accuracy of the test and can be compared with that of a test that is no better than chance (AUC = 0.5). For each group, the BOLD fMRI test discriminated better than chance (see Table 3 and Fig. 3). The AUC for the Complete Group was 0.77 (one-sided z-test for AUC = 0.5, z-value = 3.75, p = 0.0001), for the Quality Group was 0.77 (one-sided z-test for AUC = 0.5, z-value = 3.3, p = 0.0005), and for the Validated Group was 0.78 (one-sided z-test for AUC = 0.5, z-value = 2.85, p = 0.0022). A Q–Q plot demonstrated that the voxel number difference scores approximated a Gaussian distribution. In addition, a plot of the distribution of the voxel number difference scores demonstrated that the Mock-crime group (for which a correct call would be greater than zero) was largely to the right of zero, while the No-crime group (for which a correct call would be less than zero) was more diffusely distributed (see Fig. 4).
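To illustrate how a cut-off on the voxel number difference trades sensitivity against specificity, the sketch below sweeps a few cut-offs over a small set of scores and computes an AUC. The scores are invented purely for illustration and are not the study data. The text does not state how the AUC or its standard error were computed, so the pairwise-comparison AUC used here is an assumption. The last helper converts a z-value into a one-sided p-value from the standard normal distribution.

```python
from math import erfc, sqrt

def one_sided_p(z):
    """Upper-tail p-value of a standard normal test statistic."""
    return 0.5 * erfc(z / sqrt(2))

def auc(positive_scores, negative_scores):
    """Probability that a random positive case outscores a random negative case (ties count half)."""
    wins = sum((a > b) + 0.5 * (a == b)
               for a in positive_scores for b in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))

def sens_spec(mock_scores, no_crime_scores, cutoff):
    """Call 'mock crime' when the score exceeds the cutoff.

    Simplification: a score equal to the cutoff counts as a no-crime call here,
    whereas the a priori rule treats a zero difference as indeterminate.
    """
    sensitivity = sum(s > cutoff for s in mock_scores) / len(mock_scores)
    specificity = sum(s <= cutoff for s in no_crime_scores) / len(no_crime_scores)
    return sensitivity, specificity

# Hypothetical voxel number difference scores, invented for illustration only
mock_scores = [12, 30, 5, 44, 18, 2]
no_crime_scores = [-20, 8, 15, -3, 25, 1]

if __name__ == "__main__":
    print(f"AUC = {auc(mock_scores, no_crime_scores):.2f}")
    for cutoff in (0, 10, 20):
        sens, spec = sens_spec(mock_scores, no_crime_scores, cutoff)
        print(f"cutoff {cutoff:2d}: sensitivity {sens:.2f}, specificity {spec:.2f}")
    print(f"z = 3.75 -> one-sided p = {one_sided_p(3.75):.4f}")
    print(f"z = 2.85 -> one-sided p = {one_sided_p(2.85):.4f}")
```

With these made-up scores, raising the cut-off lowers sensitivity and raises specificity, which is the qualitative pattern an ROC curve summarizes.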
The Lie-minus-True contrast analysis for the Quality Group failed to find significant (FDR < 0.05, k > 25) activation for the entire group. When participants were split into the Mock-crime and No-crime groups, however, both groups produced areas of significant activation (see Table 4, Fig. 5). The Mock-crime group had a lateral and prefrontal pattern of activation similar to the Model-building group in our prior study, while the No-crime group had a predominantly medial prefrontal activation pattern. A two-sample t-test using SPM2 confirmed that there were significantly (FDR < 0.05, k > 25) different regions of activation between the Mock-crime and No-crime groups for the Lie-minus-True contrast.
The True-minus-Lie contrast group analysis of the Quality Group, whether for the entire group or split by whether the mock crime was committed, failed to find significant activations.
For the Ring-Watch testing, the Lie-minus-True and True-minus-Lie contrasts failed to find significant (FDR < 0.05, k > 25) activation.
Using a unique scenario that more closely approximated a real-world situation and performing testing in the MR scanner after a period of delay, our a priori defined methodology was found to have high sensitivity (100%) but low specificity (33%). While the overall results are less accurate than in our prior study, we again found that our BOLD fMRI method is significantly better than chance at detecting deception at the individual level. Also, the ROC curves revealed that if the cut-off values are changed (i.e., the value of the difference in the number of activated voxels defined to indicate deception), considerable improvements in specificity can be gained with only modest reductions in sensitivity. Interestingly, in contrast to our prestudy hypothesis, the use of the Ring-Watch paradigm as an internal study screen to validate whether our technique would work in a particular individual did not substantially improve our diagnostic results in this study.
Our present methodology compares brain activation correlated with responses regarding two conditions (committed mock crime vs. picked up envelope, or took the watch vs. took the ring) that are mutually exclusive. One condition (e.g., mock crime) is determined to be the lie because it caused more activation in the previously identified brain regions of interest than the other condition (e.g., envelope). This requires the two conditions to be mutually exclusive and the questions to be specific and equally counterbalanced. One explanation for our low specificity is that the cognitive work load for lying about the envelope task was not adequate to obtain a reliable activation. Because of the nature of the testing and concern for subject confusion, the envelope questions all required a “yes” response, whereas half of the neutral and disk questions were answered with “no” responses. This could have changed the cognitive demand, resulting in a lack of activation in our regions of interest for the No-crime group. The significantly faster reaction times of the No-crime group versus the Mock-crime group provide some support for the testing being less cognitively demanding for the No-crime group.
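The comparison logic described above can be sketched in a few lines. This is an illustration only: the function and argument names are hypothetical, each input stands for the number of significantly activated voxels in the regions of interest for one condition, and treating a zero difference as indeterminate is an assumption based on the indeterminate calls reported in this study.

```python
def call_lie(crime_voxels, envelope_voxels):
    """Decide which of two mutually exclusive conditions is the lie.

    The condition with more activated ROI voxels is called the lie;
    a difference of zero yields no determination.
    """
    difference = crime_voxels - envelope_voxels
    if difference > 0:
        return "committed mock crime"
    if difference < 0:
        return "did not commit mock crime"
    return "indeterminate"

if __name__ == "__main__":
    # Voxel counts below are made up for illustration
    for crime, envelope in [(40, 12), (3, 27), (0, 0)]:
        print(crime, envelope, "->", call_lie(crime, envelope))
```

Because the rule relies only on the sign of the difference, it presupposes that exactly one of the two conditions is a lie, which is why the two conditions must be mutually exclusive.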
Another concern was that the No-crime group did not perform any task and thus was lying about performing a task, whereas the Mock-crime group was lying about not performing a task. This was instituted so that a lie condition could be introduced and all operators and investigators doing the analysis would still be blind to the randomization group. If the No-crime subject had picked up the envelope and then reported that he/she did not pick up the envelope, then the researchers performing the scanning would know to which group the participant was randomized. The significantly different group analysis results of Lie-minus-True for the Mock-crime and No-crime groups support, but do not prove, the idea that the brain activation for lying about doing a task versus lying about not doing the task may be different. Significantly, the Mock-crime group, whose deception was reporting that they did not do something that they actually did, had very accurate individual results using our methodology and a group analysis revealing considerable consistency with our prior study, in which subjects also lied about not doing a task. Conversely, the No-crime group, whose deception was reporting that they performed a task that they did not perform, had much lower accuracy rates and a group analysis that was not very consistent with either the Mock-crime group or the group from our prior study. Future work needs to address these concerns in order to move this technology into practice.
An unexpected finding was the lower accuracy of our Ring-Watch paradigm in this study (71%) compared with our prior two groups (93% and 90%), which used a similar paradigm (5). In addition, unlike for the prior two groups that we studied, the group imaging analysis map failed to find a significant result for the Lie-minus-True contrast. We suspect that there are a number of reasons for the discrepancy between the results from our prior study and this one. In the prior study, the testing in the scanner for the Ring-Watch paradigm was after a brief motor paradigm scan, and participants were given a $50 incentive if they were able to “fool” a researcher observing the scan. In this study, the Ring-Watch testing in the scanner was performed after participants had already been tested on the main focus of the study, which was the Mock-crime paradigm. Also, there was no incentive given for “fooling” a researcher watching the scan. The potential fatigue of being in the scanner for almost an hour, the reduced salience of the Ring-Watch paradigm, and the lack of a motivating factor may have impacted the results of the methodology. Another concern is that the questioning paradigm was shortened by removing one category of the questions from the previous study. The comparison questions that involved admitting or denying minor offenses were removed because they were not used in the analysis and added 4 min to each scan. The removal of these questions could have reduced the power of the paradigm to detect differences in deceptive versus truthful answers. Successful replications of the original results at two independent sites support the idea that the differences in accuracy were the result of paradigm design differences (data presented as a poster at the American College of Neuropsychopharmacology, December 2007). Both replications used all the questions and gave subjects a monetary incentive to successfully deceive the investigators.
Further study will be required to determine the degree of impact that fatigue, salience, motivation, and question format might have, if any, on fMRI detection of deception.
We also compared those participants for whom we made correct classifications and those for whom we made incorrect classifications in their respective groups (Mock-crime and No-crime) to determine if there were demographic or imaging results differences. There were no significant differences in years of education, age, handedness, employment, race, sex, ring or watch taken, time from group assignment to scanning, or number of significantly activated voxels in the ROIs for the ring-minus-neutral or watch-minus-neutral contrasts (data not shown). These factors do not appear to exert an important influence on this methodology. This study was not, however, designed specifically to test these factors.
This study has several factors that must be considered for adequate interpretation of the results. Although this study attempted to approximate a scenario that was closer to a real-world situation than prior fMRI detection studies, it still did not equal the level of jeopardy that exists in real-world testing. The reality of a research setting involves balancing ethical concerns, the need to know accurately the participant’s truth and deception, and producing realistic scenarios that have adequate jeopardy. In addition, this study only involved healthy adults who were not taking any medications. Thus, whether fMRI deception testing would work is unknown for participants who are taking a medication, who have a significant psychiatric or medical condition, or who are outside the 18–50 year age range. Future studies will need to be performed involving these populations.
This study has a number of strengths that should be highlighted in the context of how fMRI technology can be moved towards a practical application. The paradigm resembled a real-world scenario for sabotage. The behavior of the participants was closely monitored with sophisticated video equipment, and any deviation from the required tasks resulted in participants being excluded. This ensured that we could accurately assign each participant to the Mock-crime or No-crime groups. Data analysis was carried out blind to the participant assignment in a largely automated manner. The investigators, who performed the analysis blind to group assignment and to the base rate of how many participants would be asked to commit the mock crime, used a priori defined analysis methods run with Matlab scripts that ensured repeatability of the methods. The only operator judgment required was the assignment of the 0,0,0 coordinate to the location of the anterior commissure on the BOLD fMRI image. Despite two operators performing this independently and with differing results, the final analysis outcomes were remarkably similar. In 94 of 96 cases both researchers drew the same conclusions. In the two discordant cases, the researchers could not draw a conclusion because their results disagreed. The results of the two analyses, however, were close. One of these two participants had a result of 0 activated voxels (indeterminate call) for one investigator and 1 activated voxel (committed mock-crime call) for the other investigator, while the other participant had the inverse results (1 activated voxel, a committed mock-crime call, for the first investigator and 0 activated voxels, an indeterminate call, for the second). The close agreement of the two independent analyses highlights the robustness of the data analysis methods. Finally, though our a priori hypothesis resulted in low specificity and high sensitivity, the ROC analysis shows that the test may be tailored to the situation being investigated.
Our methodology of using functional MRI to detect deception was found to be sensitive but to suffer from low specificity on this task for whether a subject committed a mock crime. This indicates that the test would be helpful to “rule out” a potential suspect (i.e., a person who is found to be not lying about being innocent and so did not commit the crime) but not very helpful in “ruling in” a suspect (i.e., a person who is found to be lying about being innocent and so did commit the crime). Comparing these results with other testing modalities is problematic due to the variability in testing and groups of people tested. The most comprehensive report on the literature surrounding the use of the polygraph to detect deception simply concluded that although there was considerable variability in results, the polygraph was probably better than chance (25). More work with direct comparisons of paradigms and participant samples is needed to understand how the various technologies compare in detecting deception. Although the diagnostic ability of our method was greater than chance, future work is focused on improving specificity and using more realistic testing in order to enhance the utility of this technology in real-world applications.
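The “rule out” versus “rule in” reading can be made concrete with predictive values computed from the Validated Group counts reported above (nine of nine Mock-crime and five of 15 No-crime participants correctly classified). The sketch below is illustrative only: the function name is hypothetical, and the predictive values depend on the proportion of Mock-crime participants in this sample, so they would not carry over to a population with a different base rate.

```python
def diagnostic_summary(true_pos, false_neg, true_neg, false_pos):
    """Standard 2x2 diagnostic measures; assumes 0 < specificity < 1."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Probability the crime was committed given a "lying" call
        "ppv": true_pos / (true_pos + false_pos),
        # Probability the crime was not committed given a "not lying" call
        "npv": true_neg / (true_neg + false_neg),
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }

# Validated Group, determinate calls only: 9/9 Mock-crime and 5/15 No-crime correct
validated = diagnostic_summary(true_pos=9, false_neg=0, true_neg=5, false_pos=10)

if __name__ == "__main__":
    for name, value in validated.items():
        print(f"{name:12s} {value:.2f}")
```

In this sample every “not lying” call was correct (negative predictive value of 1.0), whereas fewer than half of the “lying” calls were (positive predictive value of 9/19), which matches the interpretation given in the text.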
We wish to thank Dr. Margaret Melikian, Director of Forensic Psychiatry at MUSC, for the use of the specially equipped room in the Forensic Psychiatry Building; Mr. Dave Ramsey and Mr. John Dornisch of the South Carolina Research Authority for computing support; Drs. Paul Morgan and Chris Rorden for help with imaging parameters; and Ms. Minnie Dobbins and Ms. Kimberly Mapes for administrative support.
Funding provided by Department of Defense Polygraph Institute (W74V8H-04-1-0010) and Cephos Corp. Dr. Kozel is supported by a K23 from the NIMH (5K23MH070897-01). The views expressed in this article are those of the authors and do not necessarily reflect the official policy or position of the Department of Defense, the National Institute of Mental Health, the National Institutes of Health, or the U.S. Government.
*This work was presented as a poster at the 13th Annual Meeting of the Organization for Human Brain Mapping in Chicago, June 2007.