The testing of scientific hypotheses is typically associated with two types of statistical errors. A test may confirm a hypothesis that is actually false. This type of error is commonly referred to as a type I error or ‘false positive’. The probability α of obtaining a positive result although the hypothesis is false corresponds to the significance level of the test. Conversely, a test may fail to confirm a true hypothesis. This type of error is referred to as a type II error or ‘false negative’. The probability β of missing a true relation determines the power of the test, 1−β. The probability that a hypothesis is true after a test result has been obtained, i.e. the posterior probability, depends not only on the test statistics but also on the probability of the hypothesis before the test, i.e. the prior probability. For example, a positive result on a very improbable hypothesis is likely a false positive, while a positive result on a more probable hypothesis is more likely to be correct. For a given prior probability, test result, and test statistics, the posterior probability of a hypothesis can be calculated using Bayes' Theorem.
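This calculation can be sketched in a few lines. The function below is our own minimal illustration (the name and the example numbers are not from the study): the likelihood of a positive result is 1 − β if the hypothesis is true and α if it is false.

```python
def posterior(prior, alpha, beta, positive=True):
    """Posterior probability that a hypothesis is true, given one test result.

    alpha: type I error rate (false positive), beta: type II error rate
    (false negative); the power of the test is 1 - beta.
    """
    if positive:
        # P(H | +) = P(+ | H) P(H) / P(+)
        return (1 - beta) * prior / ((1 - beta) * prior + alpha * (1 - prior))
    # P(H | -) = P(- | H) P(H) / P(-)
    return beta * prior / (beta * prior + (1 - alpha) * (1 - prior))

# A positive result on a very improbable hypothesis is likely a false positive,
# while the same result on a plausible hypothesis is probably true:
print(round(posterior(0.01, 0.05, 0.2), 3))  # → 0.139
print(round(posterior(0.50, 0.05, 0.2), 3))  # → 0.941
```

Even with conventional error rates (α = 0.05, β = 0.2), a positive result lifts a 1% prior to a posterior of only about 14%.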
In a recent controversial essay, J.P.A. Ioannidis argued that, at least in some research fields, most published findings are false. This is because findings tend to be evaluated by p-value rather than posterior probability, and because positive results are more likely to be published than negative results. Small effect sizes, error-prone tests, low priors of the tested hypotheses, and biases in the interpretation of research findings can lead to a large fraction of published false positives. Moreover, competition has been argued to have a negative effect on the reliability of research, because the same hypotheses are tested independently by competing research groups. The more often a hypothesis is tested independently, the more likely it is that a positive result is obtained and published, even if the hypothesis is false. These findings raise concerns about the reliability of published research in those fields of the life sciences that are characterized by low priors, error-prone tests, and considerable competition.
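The effect of repeated independent testing can be made concrete: if a false hypothesis is tested n times at significance level α, the probability that at least one test yields a positive result is 1 − (1 − α)^n. A quick numerical sketch (the value α = 0.05 is illustrative, not taken from the study):

```python
alpha = 0.05  # per-test false-positive rate

# Probability that at least one of n independent tests of a FALSE hypothesis
# comes out positive -- and is therefore available for publication.
for n in (1, 5, 10, 20):
    p_any_positive = 1 - (1 - alpha) ** n
    print(f"{n:2d} tests: {p_any_positive:.3f}")  # 1 → 0.050, 20 → 0.642
```

With twenty competing groups testing the same false hypothesis, the odds are nearly two in three that someone obtains a publishable positive result.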
Scientific research is, however, typically more complex than accounted for by the approach outlined in Ioannidis' essay, where single tests are used to evaluate single hypotheses. Research programs involve larger sets of hypotheses that are evaluated by different tests and complementary technical approaches. In many research fields, evidence from several tests and experiments has to be combined in order to reach a conclusion about a hypothesis. In such situations, it is often advantageous to evaluate hypotheses in a step-by-step manner, choosing each test based on previous findings. Such sequential testing is typically more cost-efficient than parallel testing, because previous knowledge often allows one to design experiments in a more informative way.
Sequential testing gives rise to temporal dynamics in the reliability of research. These dynamics are additionally affected by the fact that in scientific research, not all results are published or receive equal attention. Competition for limited space in scientific journals implies that some findings are not published at all, or are published in journals with low visibility. In particular, studies that do not achieve formal statistical significance are less likely to be published because they are perceived as less valuable. To study the reliability of published findings in such scenarios, the methods outlined in Ioannidis' essay must be extended. In this study, we use computer simulations and experimental approaches to analyze the impact of statistical errors on research programs that involve sequential testing. We investigate simple scenarios of sequential testing as well as scenarios in which not all results can be published and used for subsequent rounds of testing. To this end, we use simple research tasks that can be studied both in computer simulations and in experimental settings.
For our experiments, these research tasks are framed within the context of molecular biology. This framing gives participants a concrete picture of what they are investigating, and avoids situations where they have prior expectations or preferences for the hypotheses under investigation. However, our findings are not specific to molecular biology and may be generalized across fields that engage in hypothesis testing. Suppose that three genes (A, B, and C) are known to interact in a linear biochemical pathway: the first gene activates the second, which in turn activates the third. However, the order of the sequence is unknown. The task is to identify the correct sequence. There are six possible pathways (ABC, ACB, BAC, BCA, CAB, and CBA) that form the set of possible hypotheses. Knowledge of the pathway can be characterized by six probabilities p(h1), … , p(h6) associated with these hypotheses. In order to increase their knowledge about the hypotheses, researchers can test whether a specific gene activates another, i.e. they can test whether A activates B, A activates C, etc. Thus there are six different tests (AB, AC, BA, BC, CA, and CB). Note that each test supports two of the hypotheses and each hypothesis is supported by two tests. A positive result on test AB, for example, supports the sequences ABC and CAB, while sequence ABC is supported by positive results on tests AB and BC.
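This structure is small enough to enumerate directly. A minimal sketch (representing each pathway as a string is our own encoding, not the study's):

```python
from itertools import permutations

# The six candidate pathways and the six pairwise activation tests.
hypotheses = ["".join(p) for p in permutations("ABC")]   # ABC, ACB, BAC, ...
tests = [a + b for a in "ABC" for b in "ABC" if a != b]  # AB, AC, BA, ...

# Test XY asks whether X directly activates Y, so it supports exactly the
# pathways in which X immediately precedes Y (i.e. "XY" is a substring).
supports = {t: [h for h in hypotheses if t in h] for t in tests}

print(supports["AB"])  # → ['ABC', 'CAB']
```

The enumeration confirms the symmetry noted in the text: every test supports exactly two hypotheses, and every hypothesis is supported by exactly two tests.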
All of the tests are equally prone to type I and type II errors. We use α = 0.12 and β = 0.3 in all our computer simulations and experiments. These values are higher than the values of α < 0.05 and β < 0.2 that researchers traditionally aim to achieve in the life sciences. We use these error probabilities to ensure that in the experiments, participants are exposed to errors at a considerable frequency. After a test has been performed, the probabilities associated with the hypotheses can be updated according to Bayes' Theorem. The research task is to identify the correct sequence after a limited number of tests. We use seven rounds of testing in all our simulations and experiments.
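The updating step can be sketched as follows. This is our own minimal illustration, assuming a uniform prior over the six pathways; the likelihood of a positive result is 1 − β for the two supported hypotheses and α for the rest.

```python
from itertools import permutations

ALPHA, BETA = 0.12, 0.30  # error probabilities used in simulations and experiments

hypotheses = ["".join(p) for p in permutations("ABC")]  # six candidate pathways

def update(probs, test, positive):
    """One round of Bayesian updating after observing a test result.

    A test XY supports exactly the pathways in which X immediately precedes Y.
    """
    def lik(h):
        if test in h:                        # test probes a link present in h
            return (1 - BETA) if positive else BETA
        return ALPHA if positive else (1 - ALPHA)
    unnorm = {h: lik(h) * p for h, p in probs.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

probs = {h: 1 / 6 for h in hypotheses}      # uniform prior
probs = update(probs, "AB", positive=True)  # positive result on test AB
print(round(probs["ABC"], 3))  # → 0.372: the two supported pathways gain mass
```

Iterating this update over seven rounds of test results reproduces the sequential-testing dynamics studied in the simulations.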
In these scenarios, individuals can choose tests depending on results obtained earlier. If, for example, the interaction AB is tested in the first round and the result is positive, an efficient strategy is to test either BC or CA in the next round. These tests are the most informative, because they distinguish between the two hypotheses supported by the first result (CAB and ABC). In scenarios where several tests can be performed but not all results can be used for subsequent test rounds, some of the results have to be selected for publication. This requires comparing and evaluating different results: in contrast to quantifying the informativity of a test, here the informativity of a result has to be determined. If, for example, an individual obtains a positive result on test AB and a negative result on test BA, and can publish only one of them, it might be best to choose the positive result on AB because it is more informative. The informativity of tests and results can be formally quantified using methods from information theory. Details about the informativity measures used in this study are given in the Methods.
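One standard information-theoretic measure of a test's informativity is its expected information gain: the expected reduction in the Shannon entropy of the hypothesis probabilities. The sketch below is our own illustration (the study's exact measures are defined in its Methods); it reproduces the example above, where after a positive result on AB, tests BC and CA are more informative than repeating AB.

```python
from itertools import permutations
from math import log2

ALPHA, BETA = 0.12, 0.30
hypotheses = ["".join(p) for p in permutations("ABC")]

def likelihood(h, test, positive):
    # Test XY probes a link that is present in pathway h iff "XY" occurs in h.
    if test in h:
        return (1 - BETA) if positive else BETA
    return ALPHA if positive else (1 - ALPHA)

def update(probs, test, positive):
    """Bayesian update of the hypothesis probabilities after one result."""
    unnorm = {h: likelihood(h, test, positive) * p for h, p in probs.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def entropy(probs):
    """Shannon entropy (in bits) of the hypothesis probabilities."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)

def expected_gain(probs, test):
    """Expected entropy reduction from performing `test` once."""
    p_pos = sum(likelihood(h, test, True) * p for h, p in probs.items())
    h_after = (p_pos * entropy(update(probs, test, True))
               + (1 - p_pos) * entropy(update(probs, test, False)))
    return entropy(probs) - h_after

# After a positive result on AB, tests BC and CA discriminate between the two
# supported pathways (CAB and ABC); repeating AB does not.
probs = update({h: 1 / 6 for h in hypotheses}, "AB", positive=True)
print(expected_gain(probs, "BC") > expected_gain(probs, "AB"))  # → True
```

The same machinery applies to the informativity of a result: compare the entropy of the probabilities updated on that single result with the entropy before it.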
We perform computer simulations for three scenarios of sequential testing. First, we analyze a scenario with random test choice (SIM-R), in which results from previous rounds are not used to choose the next test. Second, we study a simple scenario with informative test choice (SIM-1). Third, we study a scenario with informative test choice in which only a subset of results can be used for further test choice (SIM-2). To test predictions from the two scenarios with informative test choice, we use four experimental settings: two (EXP-1S and EXP-1G) are analogous to the simple scenario (SIM-1), and the other two (EXP-2G and EXP-2E) are analogous to the complex scenario (SIM-2).