It is conventional wisdom that it is more important for fMRI researchers to correct for multiple comparisons than for behavioral researchers to do so because fMRI researchers conduct so many more tests than behavioral researchers. Although there is no question that fMRI researchers conduct more tests than behavioral researchers in any given study, it is not clear why this is a criterion for who should be correcting and under what circumstances. How many behavioral science papers report dozens of statistical tests without any correction for multiple comparisons? If we are going to be serious about Type I errors, should not the number of tests reported in any paper be corrected for? While lip service is always given to P
< 0.05 corrected for multiple comparisons, this in no way reflects the actual conventions of behavioral scientists. A randomly selected issue of the Journal of Personality and Social Psychology (JPSP, August 2000), a high-profile journal of the American Psychological Association, contained an average of 93 statistical tests per paper (range: 32–145 tests), excluding one paper that reported no statistical tests at all.
If the goal of the push to avoid Type I errors in fMRI is to achieve an FDR similar to that observed in behavioral research, then we should be trying to match actual behavioral science research practices. As a first approximation, we used AlphaSim (Cox, 1996
) to estimate the FDR for papers with 93 tests and then determined what cluster size, used in conjunction with an intensity threshold of P
< 0.005, would produce the same FDR.
In the fMRI simulation, we assumed a 64 × 64 × 25 matrix with a mask applied to include only voxels inside the brain (total voxels included: 39,838). We assumed voxel dimensions of 3.5 mm × 3.5 mm × 5 mm and a smoothing kernel of 6 mm full-width at half-maximum. Based on one million simulations, we observed that P
< 0.005 with a cluster size of 8 voxels achieves the same FDR as a 93-test study, and a cluster size of 18 voxels achieves an FDR of 0.05.
We also compared the fMRI simulations to the JPSP
paper with the fewest tests in the selected issue. To achieve the same FDR as a behavioral study with 32 tests, a P
-value of 0.005 with a 9 voxel extent threshold is needed.
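The Monte Carlo logic behind this kind of estimate can be sketched in a few lines. The following is not the AlphaSim configuration used above (AlphaSim is a compiled AFNI tool, and the paper's run used a 64 × 64 × 25 masked volume and one million iterations); it is a minimal 2-D toy with an arbitrary grid size, smoothing width, and simulation count, illustrating how an intensity threshold and a cluster-extent threshold combine to control false positives:

```python
# Minimal 2-D sketch of the Monte Carlo logic behind AlphaSim
# (Cox, 1996). Every parameter here is a toy assumption, not the
# paper's setup: a 32 x 32 grid instead of 64 x 64 x 25, 200
# simulations instead of one million, an arbitrary smoothing sigma.
import numpy as np
from scipy import ndimage
from scipy.stats import norm

rng = np.random.default_rng(0)
shape = (32, 32)            # toy grid
n_sims = 200                # toy simulation count
z_thresh = norm.isf(0.005)  # voxelwise intensity threshold, P < 0.005

def max_cluster_size(field, z):
    """Size of the largest contiguous suprathreshold cluster."""
    above = field > z
    labels, n = ndimage.label(above)
    if n == 0:
        return 0
    return int(ndimage.sum(above, labels, range(1, n + 1)).max())

# Smooth white noise to mimic spatial correlation (the analogue of a
# FWHM smoothing kernel), re-standardize, and record the largest
# suprathreshold cluster produced by noise alone in each simulation.
maxima = []
for _ in range(n_sims):
    noise = rng.standard_normal(shape)
    smooth = ndimage.gaussian_filter(noise, sigma=1.5)
    smooth /= smooth.std()
    maxima.append(max_cluster_size(smooth, z_thresh))

# Estimated probability that pure noise yields at least one cluster of
# k or more voxels anywhere in the volume, for candidate extents k.
for k in (2, 4, 8):
    alpha = float(np.mean([m >= k for m in maxima]))
    print(f"extent >= {k} voxels: cluster-wise alpha ~ {alpha:.3f}")
```

Raising the extent threshold k drives the cluster-wise false-positive rate down, which is how a joint intensity-and-extent criterion such as P < 0.005 with an 8 or 18 voxel extent can be tuned to hit a target rate.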
Whole-brain analyses using P < 0.005 with a 10 voxel extent threshold may not be equivalent to an FDR of 0.05 (though P < 0.005 with a 20 voxel extent, under the other scanning parameters we described, does), but it is quite consistent with the FDR conventions used in the actual behavioral research that fMRI researchers aim to emulate. If neuroimaging were already using inferential procedures as conservative with respect to Type I errors as actual practice in the behavioral sciences, why go further and further with methods that ensure an increasing number of Type II errors? We are not trying to suggest that P < 0.005 with a 10 voxel extent should be reified as a ‘gold standard’ criterion. The complexity of neuroimaging analyses suggests that a variety of standards might be appropriate in different contexts. We focus on this standard because it has been used so frequently in the past and is now being treated as an unacceptable criterion; instead, we think it is one of many reasonable criteria for significance.
One might respond that we should instead be correcting behavioral papers for the number of tests they report (i.e. we should change statistical practices in the behavioral sciences to match the Type I error focus of the neuroimaging community). In that case, the articles in JPSP should have been using a per-test P-value threshold of roughly 0.0005, and the only article left standing would have been the one that reported no tests at all. But if we are serious about correction, there is no logical reason to stop there. If the goal is to prevent Type I errors, perhaps each journal should require a correction for the number of tests reported in an entire issue (P < 0.00005). The selected issue of JPSP reported 932 tests; perhaps that should be the basis of correction, because surely some of those 932 tests will be significant as a result of chance alone. Perhaps there should be a correction for all the tests reported in a year of a journal (P < 0.000005), or for all the journals covering the same area (social psychology, emotion, memory) in a given month. This might be too difficult to decide on, so perhaps, instead of focusing on journals, we should focus on investigators and their labs. Perhaps a lab should have to correct for the total number of published results in a given year, or maybe investigators should have to continually update their correction factor based on the total number of results they have published in their careers, with early-career submissions enjoying a more liberal correction factor and members of the National Academy of Sciences, who have run countless tests, needing the most impressive results to qualify as significant (they would, of course, have to retract papers as their careers progressed, finding that tests in old papers are no longer significant in light of their success and, ironically, their contribution to the field).
That escalating series of corrections was meant to be hyperbolic in order to make an important point. As with just about everything else in statistical analysis, corrections for multiple comparisons are a matter of convention, and conventions are arbitrary as long as they do not seriously offend our intuitions. There is no correct way to correct for multiple comparisons that actually prevents Type I errors from being made. That is not how statistics operate, and our attempt to treat statistics this way leads to serious underreporting of true effects that are likely to replicate. In any event, behavioral scientists have settled on a convention that works: they do not correct for multiple comparisons in actual practice, which assuredly produces Type I errors, but Type I errors that are unlikely to replicate across multiple studies.
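The escalating correction factors imagined above reduce to simple Bonferroni-style division of the conventional alpha by ever larger test counts. In the sketch below, the counts 93 and 932 are the per-paper average and issue total from the JPSP issue discussed earlier; the per-year count of 9,320 is purely illustrative, chosen only to reproduce the order of magnitude of the quoted P < 0.000005 threshold:

```python
# Bonferroni-style division of the conventional alpha by the number of
# tests being corrected for. 93 and 932 come from the JPSP issue
# discussed in the text; 9320 is an illustrative per-year count
# matching the order of magnitude of the quoted threshold.
alpha = 0.05
scopes = [
    ("average JPSP paper", 93),
    ("entire JPSP issue", 932),
    ("a year of the journal (illustrative)", 9320),
]
for scope, n_tests in scopes:
    print(f"{scope}: per-test threshold P < {alpha / n_tests:.1g}")
```

Each tenfold increase in the number of tests corrected for shaves another order of magnitude off the per-test threshold, which is the engine of the reductio.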