Despite the large body of fMRI literature, most published studies have sample sizes that would be considered small by conventional standards. Nevertheless, the number of foci claimed to be discovered by small studies is relatively large, and we found no correlation between the sample size of a study and the number of foci that it claims. This runs counter to power considerations and suggests that biases that inflate the number of claimed foci may disproportionately affect the smaller studies in the literature. Consistent with this picture, meta-analyses identified only slightly more total foci than single studies, despite having sample sizes that were almost 20 times larger; thus studies with n<45 identify far more foci per subject than meta-analyses do. This picture persisted when we compared only single studies and meta-analyses that used the same study-wide corrected p-value threshold.
This evidence is consistent with the presence of selective reporting biases causing an excess of significance in the published literature. We cannot exclude the possibility that larger studies and even large meta-analyses are also affected by such biases. Different mechanisms may contribute to this excess significance.
First, smaller fMRI studies may be underpowered and inflate the number of reported foci. The average sample size of the retrieved studies was 13 subjects and the vast majority (94%) of individual studies included in the meta-analyses we examined had fewer than 30 patients. Functional MRI is a powerful technique for investigating subtle neurophysiological brain changes, and the adequate sample size depends on the nature of the statistical inference sought. An average sample size of 13 subjects is probably well below the optimal sample size for an fMRI study, especially when variability across measurements and patients is considered. Some authors have proposed optimal sample sizes of 16-32 subjects per group, suggesting that between-subject comparison studies of n<30 are too small even by liberal estimates. A recent commentary by Friston posed “Ten Ironic Rules” and argued that applying classical inference to sample size and power in functional neuroimaging involves several fallacies, such that studies with smaller sample sizes generate more reliable data because they are less likely to report findings with trivial effect sizes. Friston’s assertions have been challenged on the basis that weak statistical power comes with lower positive predictive value and increases the likelihood of false positive results. Moreover, a recent paper by authors of the present paper (MM & JPAI) reported power analyses of single studies included in 49 meta-analyses and found that almost 50% of studies had an average power lower than 20%.
On the other hand, the presence of many underpowered studies in the available literature may be due to the technical and logistic complexity of fMRI and the cost involved. In theory the field would thus benefit from collaborative studies in which many centers join forces and generate large sample sizes with standardized data collection, analysis and reporting plans. However, the use of different scanners in multicenter fMRI studies may introduce additional significant heterogeneity.
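The link between low power and positive predictive value noted above can be made concrete with the standard expression PPV = (1 - β)R / ((1 - β)R + α), where 1 - β is the power, α the type I error rate, and R the pre-study odds that a tested effect is real. The sketch below is illustrative only: the function name and the input values (in particular R = 0.25) are assumptions chosen for the example, not estimates taken from the studies analyzed here.

```python
def positive_predictive_value(power, alpha, prior_odds):
    """Probability that a claimed (significant) finding is a true effect.

    power      -- 1 - beta, the probability of detecting a true effect
    alpha      -- type I error rate of the test
    prior_odds -- R, pre-study odds that a tested effect is real
                  (a hypothetical value in this example)
    """
    true_positives = power * prior_odds
    return true_positives / (true_positives + alpha)


# Hypothetical comparison: a study with 20% power versus one with 80% power,
# both tested at alpha = 0.05 with assumed pre-study odds of 0.25.
low_power_ppv = positive_predictive_value(0.20, 0.05, 0.25)
high_power_ppv = positive_predictive_value(0.80, 0.05, 0.25)
print(f"PPV at 20% power: {low_power_ppv:.2f}")
print(f"PPV at 80% power: {high_power_ppv:.2f}")
```

Under these assumed inputs, the lower-powered study yields a smaller PPV, which is the direction of the argument made against the claim that small samples produce more reliable findings.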
Second, small fMRI studies with inconclusive, null, or not very promising results (e.g. very few foci identified) may not be published at all. For example, recent voxel-based meta-analyses of structural studies in early psychosis uncovered only one study reporting no significant between-group results. Peer review should be as strict when assessing the methods of a study reporting abnormalities in expected brain regions as when assessing the methods of a study that finds none of the expected abnormalities. Similarly, acceptance or rejection of a manuscript should not depend on whether abnormalities are detected or not, or on the specific brain regions found to be abnormal. Publication bias is very difficult to detect in meta-analyses done after the fact, especially when all the published studies are small. Bias is sometimes further exacerbated by the fact that some available voxel-based packages can only analyze sets of x, y, z spatial coordinates, excluding studies reporting null results.
Third, small studies may be analyzed and reported in ways that generate a larger number of claimed foci. One analysis option that would cause such an increase is the use of more lenient statistical thresholds for claiming a discovery in smaller studies. In fact, the meta-analytical approach adopted by most packages does not correct the foci entered for their statistical significance or for the sample size of the single study. In other words, foci reported in small studies with liberal statistical thresholds are directly compared with foci reported in large studies adopting more stringent thresholds. Only in recent years has this problem been recognized and partially overcome in the current versions of the two most widely used voxel-based packages (i.e. ALE and SDM). Additionally, it is not uncommon in neuroimaging studies for the statistical threshold in some regions of interest to be more liberal than in the rest of the brain. The use of inconsistent thresholds within the same study can affect the number of foci detected. Although we were unable to test this at the level of the individual study, this problem is well recognized in the imaging literature. Recent voxel-based packages such as SDM require the user to carefully check that the same statistical threshold is used throughout the whole brain to avoid biases toward liberally thresholded brain regions. At the level of the meta-analyses, our empirical analysis showed that there is heterogeneity in the selection of statistical thresholds, but this heterogeneity is unlikely to explain the whole bias. In fact, we did not find that meta-analyses with more stringent thresholds (p<0.01 or p<0.001) reported significantly fewer foci than those with more lenient thresholds (p<0.05). The results of our simulation suggest that, for a given set of analysis choices for a given task (degree of spatial smoothing, statistical thresholds, etc.), increased sample size should increase the number of significant clusters, each containing one or more local maxima (foci). If, however, arbitrary analysis choices account for the larger number of reported foci in smaller studies, this would be consistent with our conclusion that biases that inflate the number of claimed foci may disproportionately affect the smaller studies in the literature.
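The expectation that, with analysis choices held fixed, larger samples should yield more significant foci can be illustrated with a toy power calculation. This is not the simulation reported in this study: it is a minimal sketch that assumes a fixed number of true foci, a common standardized effect size, independent tests, and a one-sided z-test approximation. All names and numerical inputs are hypothetical.

```python
from statistics import NormalDist


def approx_power(n, effect_size, alpha):
    """Approximate power of a one-sided, one-sample z-test with n subjects."""
    standard_normal = NormalDist()
    z_crit = standard_normal.inv_cdf(1 - alpha)
    # Under the alternative, the test statistic is shifted by effect_size * sqrt(n).
    return 1 - standard_normal.cdf(z_crit - effect_size * n ** 0.5)


def expected_detected_foci(n, n_true_foci, effect_size, alpha):
    """Expected number of true foci passing the threshold (toy model)."""
    return n_true_foci * approx_power(n, effect_size, alpha)


# Hypothetical inputs: 20 true foci, effect size 0.5, threshold p < 0.001.
for n in (13, 30, 100):
    detected = expected_detected_foci(n, n_true_foci=20, effect_size=0.5, alpha=0.001)
    print(f"n = {n:3d}: expected detected foci = {detected:.1f}")
```

In this toy model the expected count rises monotonically with sample size at a fixed threshold, so a flat relationship between sample size and reported foci would have to arise from something other than power alone.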
Fourth, analyses can be built post hoc on the basis of the researchers’ hypotheses or on the basis of the most significant result. Although whole-brain analyses are considered less affected by this potential bias than region of interest (ROI) analyses, this has not been explicitly tested. Small Volume Correction (SVC) techniques, used in addition to standard whole-brain analyses, can alter the statistical threshold in selected ROIs and thus affect the number of foci reported. For example, results that are not significant at the standard whole-brain level may be published as significant after SVC. Most of the meta-analyses included in this study did not clarify whether the individual studies performed whole-brain analyses only or whether they also included SVC coordinates. Recent recommendations in the field (http://sdmproject.com/) clearly indicate that, prior to conducting a voxel-based meta-analysis, the reported peaks should be strictly selected by including only those that are statistically significant at the whole-brain level (no SVCs). To overcome these problems, authors may also be encouraged to blind the statistical analyses of the imaging datasets, so that analyses are not built post hoc on the basis of the results. Similarly, all individual studies should explicitly report the number of analyses performed, giving a clear rationale for each, to guard against conducting exploratory analyses and reporting only the most significant result.
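How restricting the search volume relaxes the per-voxel threshold can be sketched with a simple Bonferroni correction. This is a stand-in chosen for simplicity: SVC as implemented in imaging packages is typically based on random field theory rather than Bonferroni, and the voxel counts below are hypothetical.

```python
def bonferroni_threshold(alpha, n_voxels):
    """Per-voxel p-value threshold that controls family-wise error at alpha."""
    return alpha / n_voxels


# Hypothetical search volumes: a whole brain versus a small ROI.
whole_brain = bonferroni_threshold(0.05, 200_000)
small_volume = bonferroni_threshold(0.05, 500)

print(f"whole-brain per-voxel threshold:  {whole_brain:.2e}")
print(f"small-volume per-voxel threshold: {small_volume:.2e}")
print(f"relaxation factor: {small_volume / whole_brain:.0f}")
```

Under these assumptions, a peak can fail the whole-brain threshold yet pass within the ROI, which is why mixing whole-brain and SVC coordinates changes the number of foci entering a meta-analysis.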
Some limitations should be acknowledged. First, we used the data presented in already available meta-analyses, and some of these may contain data or analysis errors. However, it would have been prohibitively resource intensive to repeat 94 meta-analyses from scratch and to extract the requisite data in detail from almost two thousand papers. The availability of large-scale evidence from such a large number of meta-analyses offers an excellent launching point for assessing the big picture in this important research field. Second, it is possible that some large studies are of poor quality and thus find fewer foci than smaller studies simply because of poor study design and poor measurements. However, this is unlikely to be a systematic problem with large studies, and if anything one would expect higher quality criteria in larger investigations, which are typically performed by more experienced teams. Third, the number of foci identified in meta-analyses may also be biased. For example, meta-analyses may report only activation and not deactivation clusters, and they are not an absolute gold standard, since they carry with them many of the biases of the studies that they combine. However, it is generally agreed that the findings from meta-analyses rank higher in the hierarchy of evidence than findings from isolated individual studies. Fourth, a number of other factors reflecting potentially arbitrary analysis choices could affect the number of activation foci generated (e.g., inter-study variation in the degree of spatial smoothing, use of fixed-effects instead of mixed-effects analyses, use of cluster extent-based correction, etc.), but there is no reason to expect that these and other analytic methods would vary systematically with sample size.
Fifth, because some studies incorporate repeated-measures designs and/or multiple between-subject groups and others do not, the numbers included in our analyses reflect some heterogeneity in study designs, with different implications for statistical power. However, the sample sizes included in our analyses represented the number of subjects per group for first-level analyses (within-subject task/stimulus contrasts) that were incorporated into meta-analyses. While we cannot definitively exclude the possibility that some repeated-measures analyses were included in the foci counts, we have no reason to expect that larger sample sizes would not improve positive predictive value for group comparisons or higher-level, repeated-measures analyses.
Finally, while it is possible that our finding is partially driven by the likelihood that larger studies identify fewer clusters due to smoothing, and are thus likely to report fewer foci, this would still support our finding that smaller studies are likely to report more foci. Whether this is the result of smoothing or of a number of analysis choices that may be influenced by sample size is difficult to determine.
Acknowledging these caveats, the emerging picture is consistent with the presence of excess significance biases in the literature of whole brain fMRI affecting predominantly smaller studies. Similar biases have been identified in a large number of other fields with different methods. Improvements in standardization of research in this field with delineation of acceptable and optimal practices may be useful. Efforts at generating large-scale systematic evidence may be instrumental in improving the yield of information in this important research field.