Epidemiological investigations almost universally highlight significant associations between risk factors and outcomes. The vast majority of the 389 articles that we analyzed reported some significant results. Fewer than half of these articles presented at least one nonsignificant relative risk in their abstracts. This pattern suggests a strong predilection for highlighting “positive” results and avoiding “negative” ones. The preponderance of significant findings was less prominent in the full texts of these articles. However, even in the full texts, the average article reported at least as many significant relative risks as nonsignificant ones, and often more. Despite some variability by country, tested risk factor, and outcome, most important fields of epidemiological investigation seem to have little room for “negative” findings.
Given the largely exploratory nature of most epidemiological analyses, one would expect most analyzed hypotheses to be “negative.” A counterargument, however, is that epidemiologists do not select null hypotheses at random, but rather because there is some reason to believe they are false. For most of the relative risks considered in the present paper, pertinent results were already available, perhaps needing replication and refinement. However, even when studies were not the first to report on a tested hypothesis, there is no guarantee that previous studies had found “positive” results, and the fact that a hypothesis is tested repeatedly does not guarantee its credibility [19].
The exact expected proportion of statistically significant associations across the entire field of epidemiological research is, by default, unknown. However, we can gain some insight by examining how often epidemiological findings are replicated when similar hypotheses are tested in very large, well-conducted studies, preferably with the most robust designs, such as randomized trials.
Empirical evidence shows that even among the most cited, confirmatory epidemiological studies, five out of six have been refuted or found to be exaggerated within a few years of their publication in major journals [14]. In modern epidemiology, we also have evidence that most proposed associations are rejected when large-scale evidence accumulates. For example, of 32 candidate gene associations proposing that common gene variants were associated with breast cancer, large-scale evidence eventually indicated that none remained formally significant after correcting for 32 comparisons, and only a few maintained an uncorrected p-value of less than 0.05 [21].
Another argument comes from the sheer number of epidemiological factors under study. For example, we can currently test millions of genetic variants and a vast number of exposures. Even considering only independent variants and independent exposures, the claimed associations could already explain several-fold more than 100% of the attributable fraction for each outcome. This was already an issue almost three decades ago, when Doll and Peto tried to estimate attributable fractions for cancer risk factors [23], and the scale of the problem has escalated in modern epidemiology.
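To make the multiple-comparison arithmetic mentioned above concrete, the sketch below (in Python, with invented p-values) shows how a Bonferroni correction for 32 tested associations shrinks the per-test significance threshold, so that nominally significant results no longer remain formally significant:

```python
# Hypothetical illustration (all p-values invented): a Bonferroni
# correction for 32 tested associations lowers the per-test
# significance threshold from 0.05 to 0.05/32.

alpha = 0.05
n_tests = 32
threshold = alpha / n_tests  # 0.0015625

# A few of the 32 hypothetical per-association p-values:
p_values = [0.003, 0.012, 0.04, 0.2, 0.6]

nominally_significant = [p for p in p_values if p < alpha]
formally_significant = [p for p in p_values if p < threshold]

print(f"corrected threshold: {threshold:.5f}")
print(f"nominal hits: {len(nominally_significant)}")          # 3
print(f"surviving correction: {len(formally_significant)}")   # 0
```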
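The point about attributable fractions can also be illustrated numerically. Using Levin's standard formula for the population attributable fraction, the following sketch (all prevalences and relative risks invented) shows how even modest claimed associations, accumulated over many factors, sum to several-fold more than 100%:

```python
# Hypothetical illustration of why claimed associations can "explain"
# more than 100% of an outcome: population attributable fractions (PAFs)
# for many non-mutually-exclusive risk factors are not constrained to
# sum to 1. Levin's formula: PAF = p*(RR - 1) / (1 + p*(RR - 1)),
# where p is exposure prevalence and RR the relative risk.

def paf(prevalence, rr):
    excess = prevalence * (rr - 1)
    return excess / (1 + excess)

# 50 modest claimed associations (numbers invented): each exposure
# with 20% prevalence and a relative risk of 1.5.
claimed = [paf(0.20, 1.5) for _ in range(50)]
total = sum(claimed)

print(f"single-factor PAF: {claimed[0]:.3f}")   # ~0.091
print(f"sum over 50 factors: {total:.1f}")      # ~4.5, i.e., about 450%
```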
Our results extend the observation of a previous survey in which 63 of 73 epidemiological studies published in leading journals had statistically significant results [12]. Three quarters of the analyzed risk relationships reflected effect sizes in which the compared groups differed less than 2.5-fold in their risk of the outcome of interest. Relative risks exceeding five were very rare. On the whole, the current literature presents modest associations, half of which cluster in the relatively narrow relative risk range of 1.4–2.5. For some fields, typical relative risks may be even lower. The strength of an observed epidemiological association is one of the classic criteria for causality.
We observed a lower frequency of significant results and a higher frequency of nonsignificant results in US studies. It has been previously reported [24] that studies from non-English-speaking countries may report significant results more frequently in the English literature, with nonsignificant results reserved for the local non-English literature that is typically not indexed in PubMed. However, the direction of “language bias” may vary across different fields [25]. In our sample, relatively few articles were from English-speaking countries other than the United States. The association between significant results and structured abstracts may reflect the possibility that structured abstracts encourage the reporting of exact numbers. Finally, the association of fewer significant results with cancer outcomes may reflect a larger prevalence of “negative” findings in cancer epidemiology than in other fields, such as cardiovascular disease. Alternatively, one may speculate that the larger number of journals specializing in cancer epidemiology leaves more room for publishing such findings.
Furthermore, we noted that when the compared groups were further apart in the distribution of the risk factor values, the presented relative risks were lower. The contrast of extreme quintiles, the most extreme contrast evaluated here, was used to present what were, on average, the smallest relative risks. Investigators presented more extreme contrasts when the risks were inherently lower. In fact, studies with extreme contrasts apparently had been designed upfront to have more power to detect small effects than studies that reported more proximal contrasts. It could be argued that there is nothing wrong with epidemiologists designing contrasts that are more likely to reveal the relationship being sought, provided the contrasts are transparently reported. However, most non-methodologist readers may still be misled. For example, by comparing the extreme quintiles, a relative risk of 1.5 may be calculated, while a contrast of people with above- versus below-median values might have given a relative risk of 1.2, or even 1.1, for the same dataset. The non-methodologist reader or the general public would then be informed about a 50% relative increase in disease risk, rather than 20% or 10%: a more impressive result that nevertheless pertains only to the fewer people in the extreme groups. Most readers, and even physicians, may not understand that when extreme groups are compared, the presented risk pertains to only a minority of people in the population. The use of relative risk metrics, rather than measures of absolute risk, may cause further misinterpretation and has been characterized as a main source of confusion in understanding medical statistics [27]. The problem is heightened when relative risks seem even larger, because many apparently sizeable relative risks eventually translate to negligible absolute risks [28].
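A minimal numerical sketch (with a hypothetical dose-response curve and invented numbers, for illustration only) shows how the same data yield a noticeably larger relative risk under an extreme-quintile contrast than under an above- versus below-median contrast:

```python
import math

# Hypothetical smooth dose-response: baseline 5% risk with a modest
# log-linear increase over the exposure range (all numbers invented).
def risk(x):
    return 0.05 * math.exp(0.5 * x)

# Exposure values for a population, evenly spread over [0, 1].
population = [i / 999 for i in range(1000)]

def mean_risk(group):
    return sum(risk(x) for x in group) / len(group)

bottom_quintile = population[:200]   # lowest 20% of exposure
top_quintile = population[800:]      # highest 20% of exposure
below_median = population[:500]
above_median = population[500:]

rr_extreme = mean_risk(top_quintile) / mean_risk(bottom_quintile)
rr_median = mean_risk(above_median) / mean_risk(below_median)

print(f"RR, extreme quintiles:      {rr_extreme:.2f}")   # ~1.5
print(f"RR, above vs. below median: {rr_median:.2f}")    # ~1.3
```

The more extreme the contrast, the larger the headline relative risk, even though the underlying dose-response curve is identical.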
We should also acknowledge that researchers and editors may try to select and present what they deem to be the most interesting and important work. The window on the world offered by the published scientific literature is not comprehensive, but instead offers a particular view reflecting a host of complicated desires, abilities, and interests of the scientific community. Whether statistical significance should be one of the criteria used to select work for presentation has been a point of endless debate. However, at a minimum, if data are selected based on significance thresholds, it is important to know the underlying multiplicity of the conducted analyses. A significant risk (for example, p < 0.05) that arises out of a single hypothesis and a single analysis is very different from one that arose out of a massive screening of potential risk factors, particularly when the screening of these many other risk factors is not disclosed.
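The multiplicity point can be quantified with a simple simulation (hypothetical, assuming purely null factors): screening many candidate risk factors at the conventional 0.05 threshold is expected to produce about one “significant” result per 20 null factors tested.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Screen 1,000 purely null "risk factors": under a true null hypothesis,
# a p-value is uniformly distributed on (0, 1), so about 5% of null
# factors will cross the conventional 0.05 threshold by chance alone.
p_values = [random.random() for _ in range(1000)]
chance_hits = sum(p < 0.05 for p in p_values)

print(f"'significant' null factors: {chance_hits}")  # roughly 50
```

Without knowing that 1,000 factors were screened, any one of those chance hits is indistinguishable from a result produced by a single prespecified hypothesis.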
Some additional limitations should be discussed. First, we focused on articles that used specific percentile group contrasts; this was dictated by our aim to investigate selection biases based on these contrasts. We encourage assessments of other designs (e.g., binary risk factors and non-percentile contrasts); preliminary evidence suggests that the pursuit of statistical significance exists across all epidemiological studies [12]. Second, no data existed with which to compare relative risks presented with different types of percentile contrasts in the same study because, with very rare exceptions, each study used only one type of contrast.
Selection in primary studies could also affect the findings and inferences of secondary analyses, leading to spurious conclusions. A meta-analysis may ameliorate this defect by using the raw (individual-level) data without regard to the selected contrasts presented in each paper. However, this would require full access to all the data, which is currently the exception. Selection of outcomes and contrasts in the primary studies may lead to similar selective choices in meta-analyses that have to depend on published data, further perpetuating these biases. Meta-analyses may try to detect and address these selective reporting problems using a variety of diagnostic tools, such as asymmetry tests. However, having an unbiased body of evidence is certainly preferable to trying to detect and eliminate bias after the fact.
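As a sketch of one such diagnostic, the following (with invented study data) implements the core of Egger's regression asymmetry test: each study's standardized effect is regressed on its precision, and an intercept far from zero suggests that small, imprecise studies report systematically larger effects.

```python
# Core of Egger's regression asymmetry test (study data invented):
# regress each study's standardized effect (effect / SE) on its
# precision (1 / SE); an intercept far from zero suggests that small,
# imprecise studies report systematically larger effects.

def egger_intercept(effects, std_errors):
    z = [e / s for e, s in zip(effects, std_errors)]  # standardized effects
    prec = [1 / s for s in std_errors]                # precisions
    n = len(z)
    mean_z, mean_p = sum(z) / n, sum(prec) / n
    slope = (sum((p - mean_p) * (zi - mean_z) for p, zi in zip(prec, z))
             / sum((p - mean_p) ** 2 for p in prec))
    return mean_z - slope * mean_p                    # OLS intercept

# Hypothetical meta-analysis: small studies (large SEs) show inflated
# log relative risks, while the largest studies sit near the null.
log_rr = [0.60, 0.55, 0.40, 0.15, 0.05]
se = [0.40, 0.35, 0.25, 0.10, 0.05]

intercept = egger_intercept(log_rr, se)
print(f"Egger intercept: {intercept:.2f}")  # well above zero: asymmetry
```

In practice, a formal test would also compute a standard error and p-value for the intercept; this sketch shows only the point estimate that drives the diagnosis.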
Our empirical findings may lead to some recommendations on how to improve the situation. Epidemiological research is very important [30], but the reporting of epidemiological studies needs standardization [12], as has been proposed for clinical trials and other study designs [34]. The “STrengthening the Reporting of OBservational studies in Epidemiology” (STROBE) statement and similar efforts in genetic epidemiology are working in this direction.
Investigators should avoid the selective presentation and dissemination of high-risk estimates and significant results. They should clearly explain how quantitative exposures were analyzed, e.g., which groupings were chosen, whether a continuous analysis was performed, and the rationale behind the choice (continuous, trend test, or comparison to a reference group) [37]. In particular, they should avoid estimating results with various categorical contrasts and then selecting what to report simply on the basis of the seemingly largest effect. Readers should also be advised to interpret cautiously apparently large effects that are based on extreme contrasts and to place them properly in the population context. Study reports should also convey the exact breadth of the analyses performed. While it may not be possible to present all “negative” results in detail, readers should be made aware that analyses leading to “negative” results exist. This is fairly challenging because, in contrast with randomized trials [38], upfront registration of epidemiological protocols may be very difficult or unrealistic. Some epidemiological research will unavoidably remain exploratory and post hoc in nature. Even so, this exploratory nature should be clarified, and selective reporting minimized, so that epidemiological findings can be interpreted in the most appropriate perspective.