Armed with our understanding of the anatomy and physiology of Bayes' rule, we are prepared for pathophysiology. In Part II we explore common misinterpretations and misuses of elementary medical statistics that occur in the application of significance testing, and how these can be effectively treated by applying our understanding of Bayes' rule.
Before one can appreciate the problems with significance testing, one needs a clear understanding of a few concepts from 'classical statistics', namely binary hypothesis testing and P-values. We now proceed to review these concepts.
Binary hypothesis testing
Binary hypothesis testing is familiar to most physicians as the central concept involved in judging the results of clinical trials. The basic setup was encountered in the quiz that began the paper. For any proposition A, we set up two hypotheses: H0 = 'A is not true', called the null hypothesis; and H1 = 'A is true', called the alternative hypothesis. In our quiz, the effect of a new drug was being investigated and we had H0 = 'the drug has no effect' vs. H1 = 'the drug has some effect'. One of these statements must be true as a matter of logical necessity. To find out which one, an experiment is carried out (for example a clinical trial), resulting in data D. We then conclude, through a procedure described below, that the data favor either H0, called 'affirming the null hypothesis,' or H1, called 'rejecting the null hypothesis.' We will denote our conclusions as either D0 = 'the data favor the null hypothesis', or D1 = 'the data favor the alternative hypothesis'.
Our conclusions can be right or wrong in four ways (see Table ). Correct results include 'true positives' (concluding D1 when H1 is true), and 'true negatives' (concluding D0 when H0 is true); the corresponding probabilities Pr(D1|H1) and Pr(D0|H0) are called the 'power' and 'specificity' of the study, respectively. Incorrect results include Type I errors (concluding D1 when H0 is true), and Type II errors (concluding D0 when H1 is true); the corresponding probabilities Pr(D1|H0) and Pr(D0|H1) are called the 'Type I error rate' and 'Type II error rate', respectively. There is a perfect analogy (and mathematically, no difference) between these probabilities and the 'four fundamental forward probabilities' well known to physicians in the context of diagnostic testing, namely, the true and false positive rates, and true and false negative rates. Similarly, corresponding to the 'four fundamental inverse probabilities' of diagnostic testing, namely positive and negative predictive values and the false detection rate and false omission rate, there are exactly analogous quantities for the hypothesis testing scenario, that is Pr(H0|D0), Pr(H1|D1), Pr(H0|D1), and Pr(H1|D0). (See Additional file 1 for a brief review of the fundamental forward and inverse probabilities of diagnostic testing.) This analogy is summarized in Table 3 and has been expounded beautifully in a classic paper by Browner and Newman [63]. We will return to this analogy near the end of the paper.
Table 3. The analogy between diagnostic tests and clinical trials.
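The forward and inverse probabilities described above can all be read off a single 2x2 table of outcomes. The following is a minimal sketch, not taken from the paper: the function name, the dictionary keys, and the counts (1000 imagined studies, 100 of them with H1 true) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def forward_and_inverse(tp, fp, fn, tn):
    """Forward and inverse probabilities from a 2x2 table of counts.

    tp: H1 true, conclusion D1     fp: H0 true, conclusion D1
    fn: H1 true, conclusion D0     tn: H0 true, conclusion D0
    """
    h1, h0 = tp + fn, fp + tn  # totals by truth
    d1, d0 = tp + fp, fn + tn  # totals by conclusion
    forward = {  # condition on the truth: Pr(D | H)
        "power": Fraction(tp, h1),          # Pr(D1|H1)
        "specificity": Fraction(tn, h0),    # Pr(D0|H0)
        "type_I_rate": Fraction(fp, h0),    # Pr(D1|H0)
        "type_II_rate": Fraction(fn, h1),   # Pr(D0|H1)
    }
    inverse = {  # condition on the conclusion: Pr(H | D)
        "Pr(H1|D1)": Fraction(tp, d1),
        "Pr(H0|D0)": Fraction(tn, d0),
        "Pr(H0|D1)": Fraction(fp, d1),
        "Pr(H1|D0)": Fraction(fn, d0),
    }
    return forward, inverse


# Hypothetical counts: 100 true effects (80 detected), 900 null effects (45 'detected').
forward, inverse = forward_and_inverse(tp=80, fp=45, fn=20, tn=855)
for name, value in {**forward, **inverse}.items():
    print(f"{name:>12}: {float(value):.3f}")
```

With these illustrative counts the forward probabilities look reassuring (power 0.80, Type I error rate 0.05), yet the inverse probability Pr(H1|D1) is lower, because it also depends on how many of the hypotheses tested were true to begin with.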
The null hypothesis significance testing procedure
Let us now consider the conventional statistical reasoning process followed in drawing conclusions about experiments. This reasoning is prescribed by a standardized statistical procedure, the 'null hypothesis significance testing procedure' (NHSTP), or simply 'significance testing', consisting of the following steps.
1. Specify mutually exclusive and jointly exhaustive hypotheses H0 and H1.
2. Design an experiment to obtain data D, and define a test statistic, that is a number or series of numbers that summarize the data, T = T (D) (for example the mean or variance).
3. Choose a maximum acceptable Type I error rate, called the 'significance level', denoted α.
4. Do the experiment, yielding data D, and compute the test statistic, T = T (D).
5. Compute the P-value of the data from the test statistic, P = P (T (D)).
6. Compare the P-value to the chosen significance level. If P ≤ α, conclude that H1 is true. If P >α, conclude that H0 is true.
In the customary statistical jargon, when P ≤ α, we say that the experimental results are 'statistically significant', otherwise, they 'do not reach significance.' Also, note that the P-value itself is a statistic, that is a number computed from the data, so in effect we compute a test statistic T = T (D), from which we compute a second test statistic P = P (T(D)).
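Steps 4 to 6 of the procedure can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function `nhstp` and the toy helpers `count_heads` and `upper_tail_p` (a one-sided test of six coin tosses against a fair-coin null) are hypothetical names introduced here, not part of the paper.

```python
from math import comb


def nhstp(data, test_statistic, p_value, alpha=0.05):
    """Steps 4-6: compute T(D), then P = P(T(D)), then compare P with alpha."""
    t = test_statistic(data)
    p = p_value(t)
    conclusion = "D1 (reject H0)" if p <= alpha else "D0 (accept H0)"
    return t, p, conclusion


# Toy example (hypothetical): six tosses, H0 = 'the coin is fair',
# test statistic = number of heads, P = Pr(at least that many heads | H0).
def count_heads(tosses):
    return tosses.count("H")


def upper_tail_p(k, n=6):
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


print(nhstp("HHHHHH", count_heads, upper_tail_p))
print(nhstp("HHTHTH", count_heads, upper_tail_p))
```

Note that the P-value enters only through the final comparison with α; nothing in the procedure refers to the prior plausibility of H0 or H1.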
P-values
We now review what P-values mean. The technical definition that we will use differs in important ways from the informal definitions more familiar to physicians, and the difference turns out to be consequential, as witnessed by the existence of a large critical literature dealing with practical and philosophical problems arising from definitions in common use [5, 7, 13-17, 26, 28, 50, 64-83]. As an overview to our own discussion of the conceptual issues at stake, we note that the literature critical of P-values can be roughly divided into two dominant themes [75]. First, there are problems of interpretation. For example, consider the commonly encountered informal definition of the P-value as
the probability that the observed result could have been produced by chance alone.
This definition is vague, and tempts many users into confusing the probability of the hypothesis given the data with the probability of the data given the hypothesis [13, 17]; that is, it is unclear whether this definition refers to a conditional probability with the hypothesis H0 before the conditioning line, Pr(H0|·), or after the conditioning line, Pr(·|H0), which have very different meanings. Another common complaint is that the conventional cutoff value for 'significance' of P < 0.05 is arbitrary. Finally, many have argued that real-world null hypotheses of 'no difference' are essentially never literally true, hence with enough data a null hypothesis can essentially always be rejected with an arbitrarily small P-value, casting doubt on the intrinsic meaningfulness of any isolated statement that 'P < x'.
A second, entirely different class of P-value criticisms concerns problems of construction [7, 26, 28, 75, 83]. This critique maintains that P-values as commonly conceived are in fact conceptually incoherent and meaningless, rather than simply being subject to misinterpretation. The charges revolve around a more explicit yet still mathematically informal type of definition of the P-value such as
the probability that the data (that is the value of the summary statistic for the data), or more extreme results, could have occurred if the intended experiment was replicated many, many times, assuming the null hypothesis is true.
The potential morass created by this definition can be illustrated by imagining that an experimenter submits a set of data, consisting, say, of 23 data samples, to a statistical computer program, which automatically computes a P-value. According to the definition above, to produce the P-value, the computer must implicitly make several assumptions, often violated in actual practice, about the experimenter's intentions, such as the assumption that there was no intention to: collect more or less data based on an analysis of the initial results (the 'optional stopping problem'); replace any lost data by collecting additional data; run various conditions again; or compare the data with other data collected under different conditions [26, 28, 75]. Any of these alternative intentions would leave the actual data in hand unaltered, while implicitly altering the null hypothesis, either trivially by changing the number of data points that would be collected in repeated experiments, or by more profound alterations of the precise mathematical form of the probability distribution describing the null hypothesis. Consequently, the P-value apparently varies with the unstated intentions of the experimentalist, which in turn means that, short of making unjustified assumptions about those intentions, the P-value is mathematically ill defined.
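The dependence on intentions can be made concrete with a standard textbook illustration, which is not taken from the paper: the same record of coin tosses (nine heads and three tails, ending on a tail) gives different one-sided P-values depending on whether the experimenter intended to toss exactly twelve times or to toss until the third tail. The data, the function names, and the one-sided alternative are all assumptions of this sketch.

```python
from fractions import Fraction
from math import comb


def p_fixed_n(heads, n):
    """Pr(at least `heads` heads in exactly n tosses | fair coin)."""
    return Fraction(sum(comb(n, k) for k in range(heads, n + 1)), 2 ** n)


def p_stop_at_tails(heads, tails):
    """Pr(at least `heads` heads occur before the `tails`-th tail | fair coin).

    Equivalent to seeing at most tails - 1 tails among the first
    heads + tails - 1 tosses.
    """
    m = heads + tails - 1
    return Fraction(sum(comb(m, j) for j in range(tails)), 2 ** m)


# Same data in hand (9 heads, 3 tails), two different stopping intentions.
print(float(p_fixed_n(9, 12)))       # intention: toss exactly 12 times
print(float(p_stop_at_tails(9, 3)))  # intention: toss until the 3rd tail
```

Under these assumptions the two intentions place the identical data on opposite sides of the conventional 0.05 cutoff, which is the difficulty the 'constructional' critique points to.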
In what follows, we will avoid the 'constructional' objections raised above by using a mathematically explicit definition for the P-value. Problems with interpretation will still remain, and the following section will focus in detail on what we believe are the most serious of the common modes of misinterpreting P-values. The generally accepted mathematical definition for the P-value is [84]:
the probability under the null hypothesis of obtaining the same or even less likely data than that which is actually observed, that is the probability of obtaining values of the test statistic that are equal to or more extreme than the value of the statistic actually computed from the data, assuming that the null hypothesis is true.
Note that this definition does not include any reference to the 'intentions' under which the data were collected. To avoid any possible confusion, we emphasize that this definition requires that the null hypothesis, H0, be fully specified. This means, for example, that the number of data samples n constituting the data D, the chosen data summary statistic T(D), and more generally a mathematical formula for the probability distribution of values for the data summary statistic under the null hypothesis, Pr(T(D)|H0), must be explicitly stated. In some cases, this specification is straightforward. For example, if the data are assumed to follow a normal distribution, then the null hypothesis can be fully specified by simply stating values for two parameters, the mean and standard deviation. In other cases the distribution can have a mathematically complicated form. Methods for specifying and computing complex null hypotheses are beyond the scope of this essay, but have been well worked out in a wide variety of practically important cases, and are in wide use in the field of statistics. The important point to grasp here is that once the null hypothesis H0 is specified, or more precisely, the relevant probability distribution Pr(T(D)|H0), then computing the P-value can in principle proceed in a straightforward, uncontroversial manner, according to its mathematical definition given above. As mentioned above, without specifying the null hypothesis distribution explicitly, the P-value is ill-defined, because any raw data are generally consistent with multiple different possible sample-generation processes, each of which may entail a different P-value [25, 26].
We now turn to explaining our final, technical definition of the P-value. We will do this by exploring the definition from the vantage point of three different examples. The third example presents an additional, alternative definition of P-values which provides novel insights into the true meaning of P-values by viewing them from the medically familiar perspective of sensitivity and specificity considerations, in the context of ROC curves. This final definition will be mathematically equivalent, though not in an immediately obvious way, to the definition just given.
Angle 1. P-values as tail area(s)
Graphically, a P-value can be depicted as the area under one or two tails of the null-hypothesis probability distribution for the test statistic, depending on the details of the hypothesis being tested. For example, consider the classification of a patient as either chronically hypertensive, H1, or not chronically hypertensive, H0, on the basis of a single systolic blood pressure measurement. Let us assume that blood pressures for normotensive patients obey a normal distribution, as shown in Figure . If for a particular patient we obtain a systolic blood pressure of SBP = 138.6, then the P-value for this result is the probability in a non-hypertensive patient of finding a blood pressure equal to or greater than this value, or the area under the right-sided tail of this distribution, starting from SBP = 138.6.
If instead the null hypothesis states that the patient is chronically normotensive, H0, so that the alternative H1 includes the possibility of either hypertension or hypotension, then the P-value would be 'two-sided', since values under an equally-sized left sided tail of the distribution would be equally contrary to the hypothesis H0 and hence would have caused us to reject H0 according to the null hypothesis significance testing procedure (NHSTP).
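As a rough sketch of the tail-area idea, the one- and two-sided P-values for a normal null distribution can be computed with the complementary error function. The mean of 120 and standard deviation of 10 used below are hypothetical values assumed only for illustration (the parameters of the distribution are not given in the text above), as are the function names.

```python
import math


def normal_upper_tail(x, mean, sd):
    """Area under a normal density to the right of x."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))


def p_one_sided(x, mean, sd):
    """P-value when only large values count against H0."""
    return normal_upper_tail(x, mean, sd)


def p_two_sided(x, mean, sd):
    """P-value when deviations in either direction count against H0."""
    return 2 * normal_upper_tail(mean + abs(x - mean), mean, sd)


# Hypothetical null distribution for normotensive patients: mean 120, sd 10.
MEAN, SD = 120.0, 10.0
print(round(p_one_sided(138.6, MEAN, SD), 4))
print(round(p_two_sided(138.6, MEAN, SD), 4))
```

For a symmetric null distribution the two-sided value is simply twice the one-sided value, so with these assumed parameters the same measurement can fall below 0.05 one-sided and above it two-sided.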
Angle 2. P-values for coin flipping experiments
Let us carry out the P-value calculation in detail for a simple coin flipping experiment, where we wish to decide whether a coin is fair (equal probability of heads or tails) or biased (unequal probabilities). Note that the P-value in this case is 'two-sided'. Following the NHSTP:
1. Let H0 = 'the probability of heads is 1/2', H1 = 'probability of heads ≠ 1/2'.
2. The experiment will consist of flipping a coin a number of times n, and the data D will thus be a series of heads or tails. For our test statistic T , let us compute the difference between 1/2 and the fraction of heads, that is if k of the n coin tosses land as heads, then T (D) = |1/2-k/n|. For this example, let us put n = 10.
3. We set the significance level to the conventional value α = 0.05 = 5%.
4. Having done the experiment suppose we get data D = (H, H, H, H, H, H, T, H, H, T). This sequence contains eight heads, so T (D) =|1/2-8/10| = 0.3.
5. To calculate the P-value, we must consider all the ways in which the data could have been as extreme or more extreme than observed, assuming that the null hypothesis is true. That is, we need to consider all possible outcomes for the data D such that T(D) ≥ 0.3, and calculate the total probability of these outcomes, assuming that the coin is fair. Clearly, observing eight, nine, or ten heads would be 'as extreme or more extreme' than our result of eight heads. Since the null hypothesis assumes equal probability for heads and tails, symmetry dictates that observing zero, one, or two heads would also qualify. Hence, the P-value is:
$$P = \Pr(T(D) \ge 0.3 \mid H_0) = \Pr(k \in \{0, 1, 2, 8, 9, 10\} \mid H_0) = 2 \cdot \frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = \frac{112}{1024} \approx 10.94\%.$$
(See Additional file 1 for details of this and the next two calculations.)
6. Since P = 10.94% > α = 5%, the NHSTP tells us to accept the null hypothesis, concluding that the coin is fair.
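The calculation in step 5 can be checked by direct enumeration. The sketch below assumes only what the example states (n = 10 tosses, eight heads, fair-coin null, test statistic |1/2 - k/n|); the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def two_sided_p_value(k, n):
    """Pr(|1/2 - K/n| >= |1/2 - k/n|) for K heads in n tosses of a fair coin."""
    observed = abs(Fraction(1, 2) - Fraction(k, n))
    as_extreme = [j for j in range(n + 1)
                  if abs(Fraction(1, 2) - Fraction(j, n)) >= observed]
    # Under H0 every sequence of n tosses has probability 1 / 2**n.
    return Fraction(sum(comb(n, j) for j in as_extreme), 2 ** n)


p = two_sided_p_value(8, 10)
print(p, float(p))  # 7/64 0.109375
```

Using exact fractions avoids rounding questions: 112/1024 reduces to 7/64, or about 10.94%.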
Before leaving this example, it is instructive to examine its associated Type I and II error rates. The Type I error rate (false positive rate) in this case is the probability of incorrectly declaring the coin unfair (H1) when in fact it is fair (H0), that is, the probability of getting P ≤ α when in fact H0 is true. It turns out that had we observed just one more head then the NHSTP would have declared a positive result. That is, suppose k = 9, or T (D) = |1/2 - 9/10| = 0.4.
Then:
$$P = \Pr(T(D) \ge 0.4 \mid H_0) = \Pr(k \in \{0, 1, 9, 10\} \mid H_0) = 2 \cdot \frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = \frac{22}{1024} \approx 2.15\%.$$
Thus, we see that P ≤ α whenever T(D) ≥ 0.4, hence the Type I error rate or false positive rate is:
$$\mathrm{FPR} = \Pr(D_1 \mid H_0) = \Pr(T(D) \ge 0.4 \mid H_0) = \frac{22}{1024} \approx 2.15\%.$$
Calculation of the false negative rate requires additional assumptions, because a coin can be biased in many (in fact, infinitely many) ways. Perhaps the least committed alternative hypothesis H1 is that for biased coins any heads probability different from 1/2 is equally likely. In this case the false negative rate turns out to be FNR = Pr(D0|H1) = 72.73%
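The Type I error rate for this example can be verified by listing every possible number of heads, finding those for which the NHSTP would reject H0, and adding up their probabilities under the fair-coin null. This is a minimal sketch with hypothetical function names; it does not attempt the false negative rate, which depends on how the alternative hypothesis is specified.

```python
from fractions import Fraction
from math import comb


def two_sided_p_value(k, n):
    """Two-sided P-value for k heads in n tosses under a fair-coin null."""
    observed = abs(Fraction(1, 2) - Fraction(k, n))
    return Fraction(
        sum(comb(n, j) for j in range(n + 1)
            if abs(Fraction(1, 2) - Fraction(j, n)) >= observed),
        2 ** n,
    )


def type_one_error_rate(n, alpha):
    """Outcomes k that lead to rejection, and their total probability under H0."""
    rejected = [k for k in range(n + 1) if two_sided_p_value(k, n) <= alpha]
    rate = Fraction(sum(comb(n, k) for k in rejected), 2 ** n)
    return rejected, rate


rejected, rate = type_one_error_rate(10, Fraction(5, 100))
print(rejected, float(rate))  # [0, 1, 9, 10] 0.021484375
```

Because the number of heads is discrete, the achieved Type I error rate (about 2.15%) is smaller than the nominal significance level of 5%.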
Angle 3. P-values from ROC curves
To take a third angle, we consider an alternative definition for the P-value [84]. The P-value is
the minimum false positive rate (Type I error rate) at which the NHSTP will reject the null hypothesis.
Though not obvious at first glance, this definition is mathematically equivalent to our previous definition of the P-value as the probability of a result at least as extreme as the one we observe. The effort required to see why this is the case affords additional insight into the nature of P-values.
Let us step back and consider the null hypothesis testing procedure from an abstract point of view. The NHSTP is one instance of a threshold-decision procedure, that is, a procedure that chooses between two alternatives by comparing a test statistic computed from the data, T(D), with a threshold γ (in the case of the NHSTP, the statistic is the P-value, and the threshold is the significance level α). The procedure declares a 'positive' result when the test statistic lies on the extreme side of the threshold, and a 'negative' result otherwise; for the NHSTP, a positive result is declared when P ≤ α. In general any such threshold-based decision procedure must have a certain false positive and false negative rate, determined by the chosen threshold. More explicitly, let us denote the positive and negative alternatives as H1 and H0, respectively. A false positive occurs if a positive result is declared when in fact H0 is true, and the probability of this event is denoted FPR(γ). Similarly, a true positive occurs if a positive result is declared when H1 is true, and the probability of this event is denoted TPR(γ). If we allow the threshold to vary, we can generate a curve of the true positive rate versus the false positive rate; such a curve is called a ROC curve. To make this discussion concrete, let us return to our coin flipping example. In that case, we set the 'positive' alternative to H1 = 'the coin is biased' (that is Pr(Heads|H1) ≠ 1/2), and set the negative alternative to H0 = 'the coin is fair' (Pr(Heads|H0) = 1/2). We set the test statistic as before to T(D) = d = |1/2 - k/n|. Since large values of d count as evidence of bias, we declare a positive result whenever d ≥ γ, and a negative result whenever d < γ. We then have:
$$\mathrm{FPR}(\gamma) = \Pr(d \ge \gamma \mid H_0) = \sum_{k:\,|1/2 - k/10| \ge \gamma} \binom{10}{k} \left(\tfrac{1}{2}\right)^{10}, \qquad \mathrm{TPR}(\gamma) = \Pr(d \ge \gamma \mid H_1).$$
The resulting ROC curve ROC(γ) = (FPR(γ), TPR(γ)) is plotted in Figure . (On a technical note, the way we have set up our decision procedure, there are really only seven achievable values of (FPR(γ), TPR(γ)) on this ROC curve, marked by the circles: the first five values correspond to the values d = 0.5, 0.4, 0.3, 0.2, 0.1, which correspond in turn to the following pairs of possible values k for the number of heads in ten coin tosses: (0, 10), (1, 9), (2, 8), (3, 7), (4, 6) (each member of the pair gives the same value for d); the sixth value corresponds to d = 0, that is, a result of five heads; and the seventh value corresponds to setting the threshold to any value beyond what is obtainable, that is to γ > 0.5. We have connected these seven points with straight lines to create a more aesthetically pleasing plot.)
Key points on the ROC curve are marked by circles, and the corresponding value for γ is noted. Points on the ROC curve 'down and to the left' (low false positive rate, low true positive rate) correspond to setting the threshold high; whereas values 'up and to the right' (high false positive rate, high true positive rate) correspond to setting the threshold low. Clearly, if we wished to avoid all false positive conclusions, we could set the threshold to +∞, since all results will then be declared negative (Pr(d ≥ ∞|H0) = 0), but this comes at the expense of rejecting all true positive results as well (since Pr(d ≥ ∞|H1) = 0). Conversely, we can avoid missing any true positive results by setting the threshold to γ ≤ 0, since it is true for all possible results that d ≥ 0 (hence Pr(d ≥ 0|H1) = 1), but this simultaneously results in a maximal false positive rate (since Pr(d ≥ 0|H0) = 1 also). Clearly, positive results are only meaningful when obtained with the threshold γ set to some value intermediate between these extremes. Now, suppose that after conducting our coin flipping experiment we decide to 'cheat' as follows. As before let the outcome be that we get eight heads, or d = |1/2 - 8/10| = 0.3. Rather than choosing the decision threshold beforehand, we instead choose the threshold after seeing this result, to ensure that the result is declared positive. Our results will look best if we choose the threshold γ as large as we can, to let through as few false positives as possible, while still letting our result pass. This special choice of the threshold γ is clearly the value of our actual result, so we set γ = d = 0.3, and voilà, our result is positive. We cannot make the false positive rate any smaller without making our result negative.
Now for the point of this whole exercise: If we drop a vertical line from the point on the ROC curve ROC(0.3) = (FPR(0.3), TPR(0.3)) down to the x-axis to see where it intersects, we see that the false positive rate is FPR(0.3) = 10.94%, which is the result we calculated previously as the P-value. Thus the condition for declaring a positive result (d ≥ γ) is equivalent to the condition in the NHSTP (P ≤ α), hence, as claimed, the P-value is the minimum false positive rate at which the NHSTP will reject the null hypothesis. As an immediate corollary we also see that the false positive rate of the NHSTP is simply the significance level, that is:
$$\mathrm{FPR} = \Pr(P \le \alpha \mid H_0) = \alpha.$$
(For a discrete test statistic such as ours, this equality holds exactly only when α is one of the achievable false positive rates; otherwise Pr(P ≤ α|H0) < α, as with the 2.15% found above for α = 5%.)
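The link between the ROC curve and the P-value can be checked numerically for the coin example. The sketch below makes two assumptions that should be read as such: a positive result is declared when the deviation d = |1/2 - k/n| is at least γ (the convention under which FPR(0.3) reproduces the 10.94% quoted above), and the true positive rate is computed for one hypothetical biased coin with heads probability 0.8, since TPR depends on which alternative is assumed. All function names are illustrative.

```python
from fractions import Fraction
from math import comb


def deviation(k, n):
    """Test statistic d = |1/2 - k/n|."""
    return abs(Fraction(1, 2) - Fraction(k, n))


def positive_rate(gamma, n, p_heads):
    """Pr(d >= gamma) when each toss lands heads with probability p_heads."""
    return sum(
        comb(n, k) * p_heads ** k * (1 - p_heads) ** (n - k)
        for k in range(n + 1)
        if deviation(k, n) >= gamma
    )


def roc_points(n, p_alt):
    """(gamma, FPR, TPR) for the achievable thresholds, from strictest to laxest."""
    gammas = [Fraction(j, n) for j in range(n // 2 + 1, -1, -1)]
    return [(g, positive_rate(g, n, Fraction(1, 2)), positive_rate(g, n, p_alt))
            for g in gammas]


# Hypothetical alternative: a coin that lands heads 80% of the time.
for gamma, fpr, tpr in roc_points(10, Fraction(4, 5)):
    print(f"gamma={float(gamma):.1f}  FPR={float(fpr):.4f}  TPR={float(tpr):.4f}")
```

Choosing γ equal to the observed deviation of 0.3 gives the smallest false positive rate at which eight heads is still declared positive, and that rate equals the P-value computed earlier.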
Is significance testing rational?
The null hypothesis significance test (NHST) should not even exist, much less thrive as the dominant method for presenting statistical evidence... It is intellectually bankrupt and deeply flawed on logical and practical grounds. - Jeff Gill [85]
We are now in a position to answer the question: Is the null hypothesis significance testing procedure a rational method of inference? We will show momentarily that the answer is a resounding 'NO!', but first we briefly consider why, despite its faults, many find it intuitively plausible. Several books explore the reasons in detail [59, 86-88], and a full account is well beyond the scope of this paper. We will focus on one particularly instructive explanation, called 'the illusion of probabilistic proof by contradiction' [13]. Consider once again the valid logical argument form:
If A is true, then B is true.
B is false.
Therefore, A is false.
This argument is called 'proof by contradiction': A is disproved by 'contradicting' B, that is the falsehood of A follows from the fact that B is false. It is tempting to adapt this argument for use in uncertain circumstances, like so:
If A is true, then B is probably true.
B is false.
Therefore, A is false.
By analogy, this argument could be called 'probabilistic proof by contradiction'. However, this analogy quickly dissolves after a little reflection: The premise (that is the 'if, then' statement) leaves open the possibility that A may be true while B is nonetheless false. More concretely, consider the statement 'If a woman does not have breast cancer, then her mammogram will probably be negative.' (This example is discussed more extensively in an excellent online tutorial by Eliezer Yudkowsky [61].) This statement is true. However, given a positive mammogram, one cannot invariably pronounce a diagnosis of breast cancer, because false positives do sometimes occur. This simple example makes plain that 'probabilistic proof by contradiction' is an illusion - it is not a valid argument. And yet, this is literally the form of argument made by the NHSTP. To see this, simply make the following substitutions: A = 'H0 is true', and B = 'P > α', to get:
If H0 is true, then probably P > α.
P ≤ α.
Therefore, H0 is false.
Again, we have just seen that this is an invalid argument. One obvious 'fix' is to try softening the argument by making the conclusion probabilistic:
If H0 is true, then probably P > α.
P ≤ α.
Therefore, H0 is probably false.
Unfortunately, any apparent validity this has is still an illusion. To see the problem with this argument, let us return to the mammography example. Is it rational to conclude that a positive mammogram implies that a woman probably has breast cancer? The correct answer, obvious to most physicians at an intuitive if not at a formal statistical level is, 'it depends on the patient's clinical characteristics, and on the quality of the test'. Very well, then let us give a bit more information: Suppose that mammography has a false positive rate of 9.6%, and sensitivity of 80%. Can we now assign a probable diagnosis of breast cancer? Interestingly, most physicians answer this question affirmatively, giving a probability of cancer of 80%, a conclusion apparently reached by erroneously replacing the positive predictive value Pr(H1|D1) with the sensitivity Pr(D1|H1) [9, 11, 12]. The fallacy here has been satirized thus:
It is like the experiment in which you ask a second-grader: 'If eighteen people get on a bus, and then seven more people get on the bus, how old is the bus driver?' Many second-graders will respond: 'Twenty-five.'....Similarly, to find the probability that a woman with a positive mammography has breast cancer, it makes no sense whatsoever to replace the original probability that the woman has cancer with the probability that a woman with breast cancer gets a positive mammography. - Eliezer Yudkowsky [61]
To calculate the desired probability Pr(H1|D1) correctly, Bayes' rule requires that we also know the prior probability of disease. Suppose that our patient is a healthy young woman, from a population in which the prevalence of breast cancer is 1%. Then, given her positive mammogram the probability that she has breast cancer is:
$$\Pr(H_1 \mid D_1) = \frac{\Pr(D_1 \mid H_1)\Pr(H_1)}{\Pr(D_1 \mid H_1)\Pr(H_1) + \Pr(D_1 \mid H_0)\Pr(H_0)} = \frac{0.80 \times 0.01}{0.80 \times 0.01 + 0.096 \times 0.99} \approx 7.8\%.$$
To put it as alarmingly as possible, the probability that she has breast cancer has increased almost eightfold! Nevertheless, she probably does not have cancer (7.8% is far short of 50%); the odds are better than nine to one against it, despite the positive mammogram. Thus, a rational response is reassurance and perhaps further investigation, rather than pronouncement of a cancer diagnosis. This and other examples familiar from everyday clinical experience make clear that the null hypothesis significance testing procedure cannot 'substitute' for Bayes' rule as a method of rational inference.
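The mammography calculation is short enough to check directly. In the sketch below the prevalence of 1% and sensitivity of 80% come from the text; the false positive rate of 9.6% is an assumption, taken because it is the value that reproduces the 7.8% quoted above. The function name is hypothetical.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """Bayes' rule for Pr(H1 | D1), the positive predictive value."""
    true_positives = sensitivity * prior                 # Pr(D1|H1) Pr(H1)
    false_positives = false_positive_rate * (1 - prior)  # Pr(D1|H0) Pr(H0)
    return true_positives / (true_positives + false_positives)


ppv = posterior_probability(prior=0.01, sensitivity=0.80, false_positive_rate=0.096)
print(f"Pr(cancer | positive mammogram) = {ppv:.3f}")
print(f"Increase over the 1% prior: {ppv / 0.01:.1f}-fold")

# The same test applied where disease is common gives a very different answer.
print(f"With a 50% prior: {posterior_probability(0.50, 0.80, 0.096):.3f}")
```

The last line is included only to show how strongly the answer depends on the prior: the sensitivity and false positive rate are unchanged, yet the posterior moves from under 8% to nearly 90%.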
We have focused our criticism on what we consider to be the most fundamental and most common error in the interpretation of P-values, namely, the error of mistaking 'significant' P-values as proof that a hypothesis is 'probably true'. There are many other well documented conceptual problems with P-values as commonly employed which we have not discussed. The interested reader is referred to the excellent discussions in the following references [7, 28].
Do prior probabilities exist in science?
Though most physicians are comfortable with the concept of prior probability in the context of diagnostic test interpretation, many are less comfortable thinking about prior probabilities in the context of interpreting medical research data. As one respondent to our quiz thoughtfully objected,
The big difference between a study and a clinical test is that there is no real way of knowing how likely or unlikely a hypothesis is a priori. In order to have a predictive value in a clinical test, you need a prevalence or pre-test probability. This does not exist in science. It is the job of the scientist to convince us that the pre-test probability is reasonably high so that a result will be accepted. They do this by laying the scientific groundwork (introduction), laying out careful methods, particularly avoiding bias and confounders (methods), and describing the results carefully. Thereafter, they use the discussion section to outright and unabashedly try to convince us their results are right. But in the end, we do the positive predictive value calculation in our head as we read a paper... As an example, one person reads the SPARCL study and says, 'I do not CARE that the P-value shows statistical significance, it is hooey to say that statins cause intracranial hemorrhage.'... They have set a very low pre-test probability in their head. Another person reads the same study and says, 'I have wondered about this because I have seen lots of bleeds in people on statins.' They have set a much higher pre-test probability.
This response actually makes our point, perhaps inadvertently, about the necessity of prior probabilities. Nevertheless, several important points raised by this response warrant comment.
Do prior probabilities 'exist' in science?
First, to the philosophical question of whether prior probabilities 'exist' in science, the answer is 'yes and no'. On the one hand, probability theory is always used as a simplifying model rather than a literal description of reality, whether in science or clinical testing (with the possible exception of probabilities in quantum mechanics). Thus, when one speaks of the probability that a coin flip will result in heads, that a drug will have the intended effect, or that a scientific theory is correct, one is not necessarily committing to the view that nature is truly random. In these cases, the underlying reality may be deterministic (for example a theory is either true or false), in which case probabilities represent merely a convenient simplification, and do not really 'exist' in the sense that they would not be needed in a detailed, fundamental description of reality. However, simplification is essentially always necessary in dealing with any sufficiently complex phenomena. For example, while it might be possible to conceive of a supercomputer capable of predicting the effects of a drug using detailed modeling of the molecular interactions between the drug and the astronomical number of cells and molecules in an individual patient's body, in practice we must make predictions with much less complete information, hence we use probabilities. The use of such simplifications is no less important in scientific thinking than in medical diagnostic testing. Thus, insofar as probabilities 'exist' at all, they are not limited to the arena of diagnostic testing.
Are prior probabilities in science arbitrary?
Given that prior probabilities for hypotheses in science and medicine are often difficult to specify explicitly in precise numerical terms, does this mean that any prior probability for a hypothesis is as good as any other? There are at least two reasons that this is not the case. First, pragmatically, people do not treat prior probabilities regarding scientific or medical hypotheses as arbitrary. To the contrary, they go to great lengths to bring their probabilities into line with existing evidence, usually by integrating multiple information sources, including direct empirical experience, relevant theory (for example an understanding of physiology), and literature concerning prior work on the hypothesis or related hypotheses. These prior probability assignments help scientists and physicians choose which hypotheses deserve further investment of time and resources. Moreover, while these probability estimates are individualized, this does not imply that each person's 'subjective' estimate is equally valid. Generally, experts with greater knowledge and judgement can be expected to arrive at more intelligent prior probability assignments, that is their assignments can be expected to more closely approximate the probability an 'ideal observer' would arrive at based on optimally processing all of the existing evidence. Second, in a more technical vein, methods for estimating accurate prior probabilities from existing data are an active topic of research, and are likely to lead to increased and more explicit use of 'Bayesian statistics' in the medical literature [29, 31-36, 83, 89].
Taking responsibility for prior probabilities
Finally, regarding the responsibility of scientific authors and readers to take prior probabilities seriously: We emphatically agree that authors should strive to place their results in context, so as to give the firmest idea possible of how much plausibility one should afford a hypothesis, prior to seeing the new data being presented. Without this context, there is no way to appraise how likely a hypothesis is to actually be true, or how strong the evidence needs to be to be truly persuasive. The neglect of thorough introductory and discussion sections in scientific papers is decried by many as a natural side effect of reliance on significance testing arguments [7, 90, 91], is blamed for the too-common phenomenon of unreproducible results in clinical trials [92-97], and has even led some authors to suggest that the majority of published medical research results may be false [5, 98-100]. Similarly, it is a central thesis of this paper that in reading the medical literature physicians should strive to take prior probabilities into account. Indeed, taking prior probabilities into account can be viewed as a good summary of what it means to read the medical literature critically.
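One way to picture how two readers with different pre-test probabilities can read the same 'significant' result differently is the odds form of Bayes' rule, in which posterior odds equal prior odds multiplied by a likelihood ratio. The sketch below is purely illustrative: it treats a significant result from a study with an assumed power of 80% and significance level of 5% as a positive test with likelihood ratio power/α, and the three prior probabilities are hypothetical readers, not estimates for any real study.

```python
def posterior_from_odds(prior, power, alpha):
    """Probability that H1 is true after a 'significant' result, via the odds form.

    posterior odds = prior odds * likelihood ratio, with
    likelihood ratio = Pr(D1|H1) / Pr(D1|H0) = power / alpha.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * (power / alpha)
    return posterior_odds / (1 + posterior_odds)


# Hypothetical readers with different pre-test probabilities for the same claim.
for label, prior in [("skeptical", 0.01), ("undecided", 0.10), ("receptive", 0.50)]:
    post = posterior_from_odds(prior, power=0.80, alpha=0.05)
    print(f"{label:>9} reader: prior {prior:.2f} -> posterior {post:.2f}")
```

Under these assumed numbers the same result leaves the skeptical reader still doubtful and the receptive reader nearly convinced, which is the sense in which the prior cannot be left out of the appraisal.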
Pro-Bayes
[T]he theory of probability is at bottom nothing more than good sense reduced to a calculus which evaluates that which good minds know by a sort of instinct, without being able to explain how with precision. - Laplace [115]
The heuristics and biases movement notwithstanding, the probabilistic theory of cognition has been resurrected in recent years in the fields of neuroscience, artificial intelligence, and human cognitive science. As mentioned earlier, Bayesian theories have provided successful explanations of sub- or pre-conscious mental phenomena, such as learning [40], visual object and pattern recognition [45, 116], language learning and speech recognition [38, 41], and memory [42]. In the artificial intelligence community, there is a general consensus that many difficult engineering problems are best formulated and solved within a probabilistic framework, including computer vision, speech recognition, search engine technology, and pattern recognition [43, 44, 46-48, 50, 51, 53]. Similarly, Bayesian inference has become the generally accepted framework for understanding how the nervous system achieves its feats, as yet unmatched by engineering technology, of visual and auditory perception, among other tasks [104, 108, 117-119]. The thread tying these various problems and fields together is the need to draw rich inferences from sparse data, that is, to reason under uncertain conditions where the required conclusions are underdetermined by the available evidence.
There is also a growing consensus that many higher-level human cognitive processes operate on Bayesian principles [20, 39, 40]. Specific examples include studies of human symbolic reasoning [120], reasoning about and predicting the actions of other people [121], and estimating various everyday quantities [122]. Taking this last example as a case in point, Tenenbaum et al. recently studied the abilities of subjects to predict the values of uncertain quantities that arise in everyday reasoning situations. Subjects were told how long a particular everyday process had been going on so far (for example how long a cake had been baking, or how long a man had lived so far), and were asked to predict the final value of the process (for example how much longer before the cake will be done baking, or when the man will die). The scenarios tested included total final profits for movies, total runtimes of movies, the lengths of poems, term lengths for US representatives, and cake baking times. In these tasks, people's judgements were remarkably close to optimal Bayesian estimates. These findings suggest that in many everyday tasks at which people are 'experts', people implicitly use the appropriate statistical distributions and, albeit unawares, carry out optimal probabilistic calculations.