“Show me the data,” we say. But we don’t mean it. Instead of the numbers generated by measurement – which can be billions for a single experiment – we wish to see results. This frequent confusion illustrates an important point: We think of the results as reflecting the data – so closely that we can disregard the distinction. However, interposed between data and results is analysis; and analysis is often complex and always based on assumptions (, top).
Intuitive diagrams for understanding circular analysis
Ideally, the results reflect some aspect of the data without any distortion caused by assumptions or hypotheses (, bottom left). Consider the hypothesis that neuronal responses in a particular region reflect the difference between two experimental stimuli. We might measure the neuronal responses, average across repetitions, and present the results in a bar graph with one bar for the response to each stimulus. The set of stimuli (or, more generally, experimental conditions) is decided on the basis of assumptions and hypotheses, thus determining what bars are shown. But the results themselves, i.e. the heights of the two bars, are supposed to reflect the data without any effect of assumptions or hypotheses.
Untangling how data and assumptions influence neuroscientific analyses sometimes reveals that assumptions predetermine results to some extent.1,2,3,4,5 When the data are altogether lost in the process, the analysis is completely circular (, bottom center). More frequently, in practice, the results do reflect the data, but are distorted – to varying degrees – by the assumptions (, bottom right). Such distortions can arise when the data are first analyzed to select a subset, and then the subset is reanalyzed to obtain the results. In this context, assumptions and hypotheses determine the selection criterion, and selection, in turn, can distort the results.
In neuroimaging, an example of selection is the definition of a region of interest (ROI) by means of a statistical mapping that highlights voxels more strongly active during one condition than another. In single-cell recording, an example of selection is the restriction of in-depth analysis to neurons with certain response properties. In electro- and magnetoencephalography, an example of selection is the restriction to a subset of sensors or sources that show expected responses. In gene microarray studies, an example of selection is inferential analysis performed for a statistically selected subset of genes.6 In behavioral studies, an example of selection is the division of a group of subjects into subgroups based on task performance. Weighting and sorting of data can be construed as variants of selection; and we will use the latter term in a general sense to refer to all three ().
Selection can entail two distinct forms of bias: (1) selective reporting of accurate results and (2) distortion of estimates and invalidation of statistical tests. Both forms deserve a wider debate, but this paper focuses on the latter.
If selection were determined only by true effects in the data, there would be no distortion of the results of the selective analysis. However, data are always a composite of true effects and noise. Selection, thus, is affected by noise. In neuroimaging, for example, the voxels included at the fringe of an ROI tend to reflect the noise to some extent – even if the ROI highlights a truly active brain region (as in Example 2, below). When the selection process is based on the design matrix, it creates spurious dependencies between the noise in the selected data and the experimental design, thus violating the assumption of random sampling. This can bias selective analysis.
Selective analysis is a powerful tool and perfectly justified whenever the results are statistically independent of the selection criterion under the null hypothesis. However, “double dipping” – the use of the same data for selection and selective analysis – will result in distorted descriptive statistics and invalid statistical inference whenever the test statistics are not inherently independent of the selection criteria under the null hypothesis. Nonindependent selective analysis is incorrect and should not be acceptable in neuroscientific publications.
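The effect of double dipping can be illustrated with a minimal simulation. The sketch below is not one of the analyses reported in this paper; the function name, the number of voxels and trials, and the selection fraction are all assumptions chosen for illustration. It generates pure-noise responses to two conditions A and B, selects the voxels with the largest A-B contrast, and then estimates the same contrast in the selected voxels from the same data.

```python
import numpy as np

def double_dip_demo(n_voxels=10000, n_trials=20, top_fraction=0.01, seed=0):
    """Select voxels by the A-B contrast and re-estimate it on the same noise data."""
    rng = np.random.default_rng(seed)
    # Pure-noise responses (hypothetical data): no voxel truly prefers A over B.
    resp_a = rng.standard_normal((n_voxels, n_trials))
    resp_b = rng.standard_normal((n_voxels, n_trials))
    # Per-voxel contrast estimate A-B, averaged across trials.
    contrast = resp_a.mean(axis=1) - resp_b.mean(axis=1)
    # Selection: keep the voxels with the largest contrast estimates.
    n_selected = max(1, int(top_fraction * n_voxels))
    selected = np.argsort(contrast)[-n_selected:]
    # Circular step: the contrast is reported for the selected voxels
    # using the very data that determined the selection.
    return contrast.mean(), contrast[selected].mean()

all_voxels_mean, selected_mean = double_dip_demo()
print(f"mean A-B contrast, all voxels:      {all_voxels_mean:.3f}")
print(f"mean A-B contrast, selected voxels: {selected_mean:.3f}")
```

Under these assumptions the contrast averaged over all voxels stays close to zero, whereas the selected voxels show a clearly positive contrast although the data contain no effect at all.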
Although the dangers of double dipping in the pool of data are well understood in statistics and computer science, the practice is common in systems neuroscience, and in particular in neuroimaging and electrophysiology. To assess how widespread nonindependent selective analyses are in the literature, we examined all functional-magnetic-resonance-imaging (fMRI) studies published in five prestigious journals (Nature, Science, Nature Neuroscience, Neuron, Journal of Neuroscience) in 2008. Of these 134 fMRI papers, 42% (57 papers) contained at least one nonindependent selective analysis (not considering supplementary materials). Another 14% (20 papers) may contain nonindependent selective analyses, but the methodological information given was insufficient to reach a judgment.
Are all these studies incorrect in their main claims? We do not think so. First, we counted any study containing at least one nonindependent selective analysis. For a given paper, the overall claim may not depend on the distorted result. Second, we have no way of assessing the severity of the distortions. They might be small in many cases.
If circularity consistently caused only slight distortions, one could argue that it is a statistical quibble. However, the distortions can be very large (Example 1, below) or smaller, but significant (Example 2); and they can affect the qualitative results of significance tests. In order to decide which neuroscientific claims hold, the community needs to carefully consider each particular case – guided by neuroscientific as well as statistical expertise. Reanalyses and replications may also be required.
The problem arises so frequently because the desired selection criterion is often identical to, or – however subtly – related to, the desired results statistics for the selective analysis. In neuroimaging, for example, we may hypothesize that there is a region responding more strongly to stimulus A than B, select voxels showing this effect to define an ROI, and then selectively analyze that ROI to test our hypothesis. One way to ensure statistical independence of the results under the null hypothesis is to use an independent data set for the final analysis of the selected channels (e.g. neurons or voxels).
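The use of an independent data set can be sketched as a split-half scheme. This is one simple way of obtaining independent data, assumed here for illustration; the function name and all parameter values are hypothetical. Voxels are selected by the A-B contrast in the first half of the trials, and the contrast is then estimated for the selected voxels both in the selection half (circular) and in the held-out half (independent).

```python
import numpy as np

def split_half_demo(n_voxels=10000, n_trials=40, top_fraction=0.01, seed=0):
    """Compare circular and independent estimates of the A-B contrast on noise data."""
    rng = np.random.default_rng(seed)
    # Pure-noise responses (hypothetical data) to conditions A and B.
    resp_a = rng.standard_normal((n_voxels, n_trials))
    resp_b = rng.standard_normal((n_voxels, n_trials))
    half = n_trials // 2
    # Contrast estimates from the selection half and from the held-out half.
    selection_contrast = resp_a[:, :half].mean(axis=1) - resp_b[:, :half].mean(axis=1)
    heldout_contrast = resp_a[:, half:].mean(axis=1) - resp_b[:, half:].mean(axis=1)
    # Voxels are selected using the first half only.
    n_selected = max(1, int(top_fraction * n_voxels))
    selected = np.argsort(selection_contrast)[-n_selected:]
    circular = selection_contrast[selected].mean()    # same data as selection
    independent = heldout_contrast[selected].mean()   # data not used for selection
    return circular, independent

circular_estimate, independent_estimate = split_half_demo()
print(f"circular estimate:    {circular_estimate:.3f}")
print(f"independent estimate: {independent_estimate:.3f}")
```

With noise that is independent across trials, as assumed here, the held-out estimate scatters around zero, whereas the circular estimate is inflated. The price is that each half contains fewer trials than the full data set.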
Another way to ensure independence is to use inherently independent statistics for selection and selective analysis. For example, we may select channels with a large average response to stimuli A and B (contrast A+B) and test for a difference between the conditions (contrast A-B). The contrast vectors ([1 1]^T and [1 -1]^T) are orthogonal. Unfortunately, contrast-vector orthogonality, by itself, is not sufficient to ensure independence (see Supplementary Information: A policy for noncircular analysis, Fig. S3). In practice, the same data are frequently used for selection and selective analysis, even when the selection criteria are not inherently independent of the results statistics. In that case, the results are questionable.
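One scenario in which orthogonal contrast vectors fail to yield independent statistics can be sketched as follows. It is an illustrative assumption and not necessarily the case shown in Fig. S3. If the estimates of the responses to A and B are independent but have unequal variances (here because of unequal numbers of trials), then Cov(A+B, A-B) = Var(A) - Var(B), which is nonzero. The function name and trial counts below are hypothetical.

```python
import numpy as np

def sum_diff_correlation(n_trials_a, n_trials_b, n_voxels=20000, seed=0):
    """Correlation across noise voxels between the A+B and A-B contrast estimates."""
    rng = np.random.default_rng(seed)
    # Condition means estimated from independent unit-variance noise trials;
    # fewer trials give a noisier (higher-variance) estimate.
    mean_a = rng.standard_normal((n_voxels, n_trials_a)).mean(axis=1)
    mean_b = rng.standard_normal((n_voxels, n_trials_b)).mean(axis=1)
    # Selection contrast (A+B) versus test contrast (A-B).
    return np.corrcoef(mean_a + mean_b, mean_a - mean_b)[0, 1]

balanced_r = sum_diff_correlation(n_trials_a=20, n_trials_b=20)
unbalanced_r = sum_diff_correlation(n_trials_a=5, n_trials_b=20)
print(f"balanced design:   r = {balanced_r:.3f}")
print(f"unbalanced design: r = {unbalanced_r:.3f}")
```

In the balanced case the two statistics are approximately uncorrelated. In the unbalanced case, selecting voxels with a large A+B estimate also favors voxels with a large A-B estimate, even though the contrast vectors are orthogonal and the data are pure noise.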
Distortions arising from selection tend to make results look more consistent with the selection criteria, which often reflect the hypothesis being tested. Circularity therefore is the error that beautifies results – rendering them more attractive to authors, reviewers, and editors, and thus more competitive for publication. These implicit incentives may create a preference for circular practices, as long as the community condones them.
Analyzing multiple channels and reporting results for a statistically selected subset is essential in electrophysiology and neuroimaging. Neuroimaging is faced with even more parallel sites than electrophysiology – typically on the order of 100,000 voxels within the measured volume. However, selection is also an issue in electrophysiology and will gain importance as multi-electrode arrays become more widely used. To its great credit, neuroimaging has developed rigorous methods for statistical mapping from the beginning.7,8,9,10,11 Note that mapping the whole measurement volume avoids selection altogether: We can analyze and report results for all locations equally, while accounting for the multiple tests performed across locations.12 The sense of discovery associated with brain mapping derives from this data-driven approach, which avoids both the bias of selective reporting of accurate results and the circularity that can invalidate nonindependent selective analyses. Despite the beauty and completeness of a nonselective mapping analysis, selective in-depth analysis of ROIs can yield additional insights.13
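Accounting for multiple tests in a nonselective mapping analysis can be sketched with a Bonferroni correction. Bonferroni is chosen here only because it is simple; it is an assumption of this sketch and not the method of the mapping literature cited above. The function name and parameter values are hypothetical. A pure-noise z-map of 100,000 voxels is thresholded with and without the correction.

```python
from statistics import NormalDist

import numpy as np

def null_map_exceedances(n_voxels=100000, alpha=0.05, seed=0):
    """Count suprathreshold voxels in a pure-noise z-map, uncorrected vs. Bonferroni."""
    rng = np.random.default_rng(seed)
    # Hypothetical null map: independent standard-normal z-values, no true effect.
    z_map = rng.standard_normal(n_voxels)
    # One-sided z-thresholds for the per-voxel and the Bonferroni-corrected level.
    z_uncorrected = NormalDist().inv_cdf(1 - alpha)
    z_bonferroni = NormalDist().inv_cdf(1 - alpha / n_voxels)
    n_uncorrected = int((z_map > z_uncorrected).sum())
    n_bonferroni = int((z_map > z_bonferroni).sum())
    return n_uncorrected, n_bonferroni

n_uncorrected, n_bonferroni = null_map_exceedances()
print(f"uncorrected:  {n_uncorrected} voxels above threshold")
print(f"Bonferroni:   {n_bonferroni} voxels above threshold")
```

Without correction, roughly 5% of the noise voxels exceed the threshold. With the corrected threshold, false positives anywhere in the volume become rare, so results can be reported for all locations without a prior selection step.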
In this paper, we demonstrate the problem using two examples from neuroimaging (Figs. , ). In each example, a widely accepted practice is applied to random data known not to contain the experimental effect in question. This exercise reveals the distortion and spurious significance that can arise in circular analysis. We view the problem from three perspectives: as ‘selection bias’, as ‘exploration and confirmation using the same data’, and as ‘overfitting’. These perspectives are elaborated on in the Supplementary Information, which also contains further analyses and simulations (Figures S1-S4), and a comprehensive set of questions and answers about circular analysis. Finally, we suggest a policy for noncircular analysis of brain-activity data (, Supplementary Discussion).
Example 1: Data selection can bias pattern-information analysis
Example 2: ROI definition can bias activation analysis
A policy for noncircular analysis