We studied the accuracy of the reporting of statistical results in a random selection of high- and low-impact psychology journals (Study 1), and in a fully random sample of recent psychology articles, in which the researchers had employed NHST (Study 2). We found that between 17% (Study 1) and 19% (Study 2) of the exactly reported statistical results and between 7% (Study 1) and 8% (Study 2) of the inexactly reported statistical results reported in psychological articles are incongruent. These results reveal that the problem of incongruent statistical results is greater in psychology journals than in the other fields that have been studied thus far. In the studies of Garcia-Berthou and Alcaraz (2004) and Berle and Starcevic (2007) of the prevalence of congruence errors in Nature, the British Medical Journal, and two psychiatry journals, between 11% and 14% of published statistical results with exactly reported p values were reported incorrectly. Furthermore, we found that 55% of the articles in the first study and 35% of the articles in the second study contained at least one such error. Moreover, around 1% of the examined statistical conclusions were not supported by the reported test statistic and df. More important, we came across at least one unsupported statistical conclusion in 39 of the 257 articles (15%) that we scrutinized in our two studies. In other words, despite passing the peer reviewers, in roughly 1 out of 7 articles in psychology, at least one statistical conclusion appears to have been unfounded on the basis of the presented test results alone.
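To make the notion of a congruence error concrete, the following Python sketch recomputes a two-tailed p value from a reported t statistic and its df and compares it with the reported p value. It is an illustration only and not necessarily the procedure used in the two studies: the function names (`two_tailed_p`, `is_congruent`), the restriction to t tests, the numerical integration, and the rounding tolerance are all assumptions, and the sketch ignores rounding in the reported test statistic itself.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value for a t statistic, via Simpson's rule on [0, |t|].

    `steps` must be even; 2000 is an arbitrary choice for this sketch.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        # Simpson weights: 4 for odd interior points, 2 for even ones.
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def is_congruent(t, df, reported_p, relation="=", decimals=2):
    """Check a reported p value against the one implied by t and df.

    relation "=" : exactly reported p, allowed to differ by rounding
                   to `decimals` places (an assumed tolerance).
    relation "<" or ">" : inexactly reported p, e.g. p < .05.
    """
    p = two_tailed_p(t, df)
    if relation == "<":
        return p < reported_p
    if relation == ">":
        return p > reported_p
    return abs(p - reported_p) <= 0.5 * 10 ** (-decimals)
```

Under these assumptions, `is_congruent(2.0, 20, 0.06)` treats a report of t(20) = 2.00, p = .06 as congruent, whereas the same statistic reported as p < .05 (`is_congruent(2.0, 20, 0.05, relation="<")`) would be flagged as incongruent.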
Moreover, 4% of the statistical results in the first study and 21% of the statistical results in the second study were not completely reported, which goes against the guidelines of the APA Publication Manual (American Psychological Association, 2010). The percentage of incompletely reported results in the psychological literature is even larger, because in our representative sample of psychology articles, we came across 29 articles (17%) in which the statistical results were reported only by a p value.
In addition, the results of the first study showed that articles published in low-impact journals contained relatively more congruence errors than articles published in high-impact journals. However, we found no difference between high- and low-impact journals in the prevalence of gross errors. Although the number of statistical results in the first study is large, we examined only three high-impact and three low-impact journals. Therefore, the conclusions about differences between high- and low-impact journals may depend on the specific journals included in our study. Despite this potential limitation on the generalizability to other journals, we have no reason to believe that the journals we selected are unrepresentative of psychology journals with high and low impact factors, respectively. In fact, the findings of the second study, which are based on a random (and hence representative) sample of psychological articles, do attest to the generality of reporting error frequencies.
Because statistical results from articles can be used for meta-analyses, it is important that results are correctly reported or, at least, that the magnitude of these errors is small. We operationalized the magnitude of reporting errors on the basis of results from p values that may feature in meta-analyses with Cohen’s d and found the average magnitude of these errors to be substantial (average d = 0.17). Reporting results with effect sizes would decrease the unhealthy focus on the significance boundary. However, the second study showed that effect sizes are reported only in around 20% of the articles. Despite many efforts to change reporting practices in psychology (see, e.g., Wilkinson and Task Force on Statistical Inference, 1999), the preponderance of published articles still lack effect sizes. So if p values are used, the common misreporting of these p values could bias meta-analytic results considerably. The practice of only reporting p values, as we documented in 17% of the empirical articles in Study 2, should therefore be avoided.
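How a misreported p value can propagate into a meta-analytic effect size may be illustrated with a short Python sketch. It is not the computation used in the studies above: it assumes a two-group design with equal group sizes and uses a large-sample normal approximation (d ≈ 2z/√N) to convert a two-tailed p value into Cohen’s d. The function names `d_from_two_tailed_p` and `d_discrepancy` and the parameter `n_total` are hypothetical.

```python
import math
from statistics import NormalDist


def d_from_two_tailed_p(p, n_total):
    """Approximate Cohen's d implied by a two-tailed p value.

    Assumes two equally sized groups with n_total participants in all and
    a normal approximation to the test statistic: d is taken as 2z / sqrt(N).
    """
    z = NormalDist().inv_cdf(1 - p / 2)  # z score matching the two-tailed p
    return 2 * z / math.sqrt(n_total)


def d_discrepancy(reported_p, recomputed_p, n_total):
    """Absolute difference in d between a reported and a recomputed p value."""
    return abs(d_from_two_tailed_p(reported_p, n_total)
               - d_from_two_tailed_p(recomputed_p, n_total))
```

Under these assumptions, a result reported as p = .05 that should have been p = .10 in a study with 100 participants shifts the implied d by roughly 0.06; smaller samples yield larger shifts for the same discrepancy in p.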
In the second study, we found a similar prevalence of congruence errors as in the first study, although in the second study we came across fewer articles with at least one congruence error than in the first study. This may be due to the fact that the articles in the second study contained fewer statistical results, on average. In particular, the high-impact journals in the first study contained many statistical results per article, mostly because of the common practice of including more than one study per article. Furthermore, we found substantially more incompletely reported statistical results in our second study. Twenty-two percent of the statistical results were not reported according to the guidelines of the APA Publication Manual (American Psychological Association, 2010). This difference between Studies 1 and 2 was probably caused by the overrepresentation of high-impact journals in the first study. For instance, only 3 out of 1,882 statistical results in JPSP were reported incompletely. This suggests that journal policies can make a difference. The second study involved a fully random sample of articles published in 2008 in peer-reviewed psychology journals, and although the sample of articles may not be large, our results are based on a large number of statistical test results and show clear consistency. Therefore, it is safe to conclude that the prevalence of misreporting (both congruence errors and incomplete results) within psychological articles with statistical results is indeed close to 30%, and that more than half of these articles contain at least one such error in the reporting of statistical results.
The present work gives some insight into the types of errors made. To begin with, incompletely reported results are quite common. Most often, a test statistic was given without the mention of one or more dfs. Another common error is the confusion of “<” with “=”. In several articles, we found that inequality and equality signs were used as if they were interchangeable. Furthermore, we came across the wrong use of tests (e.g., F and χ² tests from which the p values are divided by two without sound argumentation), problems with the reporting of the smallest p value (e.g., reporting a p value as < .000 or reporting p = .001 when p < .001 would have been correct), and rounding errors, and we found evidence of a substantial occurrence of copy–paste errors. We propose recommendations to avoid the misreporting of p values below.
Many congruence errors could not be classified on the basis of the reported information, and so the source of these errors remained unclear. We simply do not know whether the test statistic, the df, and/or the p value were misreported. In addition, since we focused only on incongruently reported results, we did not consider other errors in the reporting of statistical results that did not result in incongruency. Thus, it is possible that additional errors may be present, which would surface only following a complete reanalysis of the raw data.
To obtain a better understanding of the origins of the errors made in the reporting of statistics, we contacted the authors of the articles with errors in the second study and asked them to send us the raw data. Regrettably, only 24% of the authors shared their data, despite our request being quite specific and our assurances that the authors would remain anonymous. The degree of nonresponse was in line with the previous results of Wicherts et al. (2006). They requested data from 141 authors of articles in APA journals and observed a response rate of 27%. Nevertheless, some authors, who appeared willing to share their data with us, conducted a reanalysis themselves and informed us of their results. Both the raw data and the results of the reanalyses revealed some additional sources of error. Especially, incongruencies caused by reporting the wrong test statistic or df were revealed. Furthermore, several contacted authors gave us more background information on the causes of the incongruencies. For example, one author told us that the reported p value was based on one dataset and that the test statistic and df were based on a different, former dataset, which contained an incorrect value. This can be seen as a special case of the copy–paste error with a former result that is only partly edited. Nevertheless, even with access to the raw data, the causes of some errors remained unknown. Given access to the raw data, it is at least possible to determine the correct statistical results, which can be used, for instance, in meta-analyses.
Of special interest was the direction of the congruence errors. Researchers often have specific preferences regarding their results, which may affect the extent to which researchers scrutinize errors in line with or contradicting their preferred results. We hypothesized that congruence errors would more often be in favor of the researchers’ expectations. The direction of the gross errors in the first study revealed that 46 of the 50 congruence errors resulted in a significant result. Furthermore, the rounding errors with a p value of .05 were all in favor of the researchers’ hypotheses, that is, the alternative rather than the null hypotheses. These errors may have been the result of sloppiness, so they should not be taken to mean that researchers were trying to present a more convincing story than the data could support (Friedlander, 1964). Nonetheless, these results point to the importance of studying further the potential influence of researchers’ expectations on the outcome and reporting of their data analyses.