Fisher, who designed studies for agricultural field experiments, insisted that “a scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance” [5]. There are three issues that a researcher should consider when conducting a study or when assessing the report of one (Table ).
First, the relevance of the hypothesis tested is paramount to the solidity of the conclusion inferred. The proportion of false null hypotheses tested has a strong effect on the predictive value of significant results. For instance, if we shift from a presumed 10% of null hypotheses tested being false to a reasonable 33% (ie, from 10% of treatments tested being effective to 1/3 of treatments tested being effective), with the Type I error rate held at 5% and power at 80%, then the positive predictive value of significant results improves from 64% to 89% (Fig. , Point C). Just as a building cannot be expected to have more resistance to environmental challenges than its own foundation, a study will fail regardless of its design, materials, and statistical analysis if the hypothesis tested is not sound. The danger of testing irrelevant or trivial hypotheses is that, owing to chance alone, a small proportion of the tests will eventually reject the null hypothesis wrongly and lead to the conclusion that Treatment A is superior to Treatment B, or that a variable is associated with an outcome, when it is not. Given that positive results are more likely to be reported than negative ones, the literature may give the misleading impression that a given treatment is effective when it is not, and it may take numerous studies and a long time to invalidate this incorrect evidence. The requirement to register trials before the first patient is included may prove to be an important means of mitigating this problem. As an illustration of the danger, by 1981, 246 factors had been reported [12] as potentially predictive of cardiovascular disease, with many having little or no relevance at all, such as certain fingerprint patterns, slow beard growth, decreased sense of enjoyment, and garlic consumption. More than 25 years later, only the following few are considered clinically relevant in assessing individual risk: age, gender, smoking status, systolic blood pressure, ratio of total cholesterol to high-density lipoprotein, body mass index, family history of coronary heart disease in first-degree relatives younger than 60 years, area measure of deprivation, and existing treatment with an antihypertensive agent [19]. Therefore, it is of prime importance that researchers provide the a priori scientific background for testing a hypothesis, both when planning the study and when reporting the findings, so that peers may adequately assess the relevance of the research. For instance, with respect to the first example given, we may hypothesize that the presence of a radiolucent line observed in Zone 1 on the postoperative radiograph is a sign of a gap between cement and bone that will favor micromotion and facilitate the passage of polyethylene wear particles, both of which will favor eventual bone resorption and loosening [16, 18]. The hypothesis gains important support when other studies also have reported the association [8, 11, 14].
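The dependence of the positive predictive value on the proportion of false null hypotheses can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a Type I error rate of 5% and a power of 80% as defaults; the function name and its parameters are illustrative choices, not part of any statistical package.

```python
def positive_predictive_value(prior_true_effect, alpha=0.05, power=0.80):
    """Proportion of significant results that reflect a real effect.

    prior_true_effect: proportion of tested null hypotheses that are false
    alpha: Type I error rate (assumed 5% by default)
    power: 1 minus the Type II error rate (assumed 80% by default)
    """
    if not 0 < prior_true_effect <= 1:
        raise ValueError("prior_true_effect must be in (0, 1]")
    true_positives = prior_true_effect * power          # real effects detected
    false_positives = (1 - prior_true_effect) * alpha   # true nulls wrongly rejected
    return true_positives / (true_positives + false_positives)


# Compare a presumed 10% of effective treatments with a presumed 1/3.
for prior in (0.10, 1 / 3):
    ppv = positive_predictive_value(prior)
    print(f"{prior:.0%} of null hypotheses false -> PPV = {ppv:.0%}")
```

Under these assumptions the sketch reproduces the shift from 64% to 89% quoted above; with other values of alpha or power the figures change accordingly.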
Second, it is essential to plan and conduct studies in a way that limits bias so that the outcome may rightfully be attributed to the effect under study. The difference observed at the end of an experiment between two treatments is the sum of the effect of chance, of the treatment or characteristic studied, and of other confounding factors or biases. Therefore, it is essential to minimize the effect of confounding factors by adequately planning and conducting the study, so that the difference observed can be attributed to the treatment studied once we are willing to reject the effect of chance (when the p value or, equivalently, the test statistic leads us to do so). Randomization, when appropriate (for example, when comparing the 1-month HHS after miniincision and standard incision hip arthroplasty), is the preferred experimental design because it controls, on average, for known and unknown confounding factors. The same principles should apply to other experimental designs. For instance, owing to the rare and late occurrence of certain events, a retrospective study rather than a prospective study is preferable to judge the association between the existence of a radiolucent line in Zone 1 on the postoperative radiograph in cemented cups and the risk of acetabular loosening. Nonetheless, researchers should ensure there is no systematic difference in any known confounding factor between patients who have a radiolucent line in Zone 1 and those who do not. For instance, they should select both groups from the same period of time, and the acetabular components used and the types of patients studied should be the same in both groups. If the types of acetabular components differ markedly between groups, the researcher will not be able to say whether the difference observed in aseptic loosening between groups is attributable to the existence of a radiolucent line in Zone 1 or to differences in design between acetabular components.
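The way a confounding factor adds to the observed difference, and the way randomization removes it on average, can be illustrated with a small simulation. This is only a sketch under invented assumptions: the treatment has no effect at all, a single hypothetical prognostic factor drives the outcome, and in the nonrandomized scenario patients with a better prognosis are more likely to receive the new treatment. All names and numeric values are illustrative.

```python
import random
from statistics import mean


def simulate_difference(randomized, n_patients=20000, seed=1):
    """Mean outcome difference (treated minus control) when the true effect is zero."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_patients):
        # Hypothetical prognostic factor (eg, preoperative function), standardized.
        confounder = rng.gauss(0, 1)
        if randomized:
            # Allocation by chance alone, unrelated to prognosis.
            gets_treatment = rng.random() < 0.5
        else:
            # Allocation depends on prognosis: the source of confounding.
            gets_treatment = confounder + rng.gauss(0, 1) > 0
        # The outcome depends on the confounder and chance only; no treatment term.
        outcome = 2.0 * confounder + rng.gauss(0, 1)
        (treated if gets_treatment else control).append(outcome)
    return mean(treated) - mean(control)


print(f"Nonrandomized difference: {simulate_difference(randomized=False):.2f}")
print(f"Randomized difference:    {simulate_difference(randomized=True):.2f}")
```

In this setup the nonrandomized comparison shows a large spurious difference, whereas the randomized comparison stays close to zero, leaving only the effect of chance to be judged by the test.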
Last, choosing adequate levels of Type I and Type II errors, or alternatively the level of significance for the p value, may improve the confidence we can place in purported significant results (Figs. , ). Decreasing the α level will proportionally decrease the number of false-positive findings and subsequently improve the positive predictive value of significant results. Increasing the power of studies will correspondingly increase the number of true-positive findings and also improve the positive predictive value of significant results. For example, if we shift from a Type I error rate of 5% and power of 80% to a Type I error rate of 1% and power of 90%, the positive predictive value of a significant result increases from 64% to 91% (Fig. , Scenario 2; Fig. , Point D). Sample size can be used as a lever to control the Type I and Type II error levels [2]. However, a strong statistical association, a p value, or a test statistic never implies a causal effect. Evidence for a causal effect is built up little by little, study after study. Therefore, replication of the experiment by others is crucial before accepting any hypothesis. For this to be possible, the methods used must be described in sufficient detail that other informed investigators can repeat the study.
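The trade-off between error levels and sample size described in this paragraph can be sketched numerically. The example below assumes that 10% of the null hypotheses tested are false (the prior consistent with the 64% baseline), a two-sided comparison of two means with equal group sizes, the usual normal approximation for the sample size, and a hypothetical standardized effect size of 0.5 (difference divided by standard deviation). The function names are illustrative, and the formula is a textbook approximation rather than the method of any particular study.

```python
from math import ceil
from statistics import NormalDist


def ppv(prior, alpha, power):
    """Positive predictive value of a significant result."""
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


def patients_per_group(alpha, power, effect_size):
    """Approximate patients per group for a two-sided comparison of two means.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / effect_size)^2,
    where effect_size is the difference divided by the standard deviation.
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)


PRIOR = 0.10        # assumed proportion of false null hypotheses
EFFECT_SIZE = 0.5   # hypothetical standardized difference

for alpha, power in ((0.05, 0.80), (0.01, 0.90)):
    n = patients_per_group(alpha, power, EFFECT_SIZE)
    print(f"alpha={alpha:.0%}, power={power:.0%}: "
          f"PPV={ppv(PRIOR, alpha, power):.0%}, about {n} patients per group")
```

Under these assumptions, raising the positive predictive value from 64% to 91% roughly doubles the required sample, from about 63 to about 120 patients per group; a different effect size or test would give different numbers.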
The p value and the theory of hypothesis testing are useful tools that help doctors conduct research. They are helpful for planning an experiment, interpreting the results observed, and reporting findings to peers. However, it is paramount that researchers understand precisely what these tools mean and do not mean, so that a statistical value of little relevance does not blind them to important medical reasoning.