summarizes the effect of assuming a continuous relation between exposure and outcome in these 2 case studies. In each case the assumption leads to a qualitatively different finding, one that may exaggerate the strength of the relation.
Our critique should not be misconstrued as suggesting that multivariate analysis is inherently misleading. On the contrary, we view the technique as a major advance that, by adjusting for confounding, can help uncover the true effect of an exposure. At the same time, the research community should acknowledge that the technique's inherent complexity means it is widely seen as a black box, one that tends to distance readers, reviewers, editors and even researchers from the underlying data. Nor are we suggesting that modelling a continuous relation is always inappropriate. We do believe, however, that such modelling is likely overused, that it adds further to the complexity of multivariate analysis and that it thus further distances people from the data.
We believe the assumption of a continuous relation is less the result of a considered decision than a practice born out of convention and convenience. The convention is that biologic relations ought to be smooth — which, in turn, engenders a strong desire to present them as such. The convenience is evident in the effort to summarize the relation between multiple levels of exposure and the outcome in a parsimonious manner, ideally using a single number. This, in turn, requires modelling a fairly simple, generally linear, relation.
The strategy for researchers to avoid misleading findings due to such modelling is straightforward: report categorical findings. Examining categorical data before modelling is, of course, standard statistical practice — all we are suggesting is reporting this important step. A simple column graph showing the risk for discrete exposure categories is our recommendation for quickly communicating the crude shape of the relation.
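A minimal sketch of such a column-graph summary, using entirely hypothetical exposure categories, event counts and group sizes (none of these numbers come from the case studies discussed here):

```python
# Hypothetical example: crude risk (events per person) for discrete
# exposure categories, suitable for a simple column graph.
# All category labels and counts below are illustrative assumptions.
categories = ["0", "1-2", "3-5", ">5"]   # e.g., units of exposure per week
events = [12, 15, 9, 14]                 # outcomes observed in each group
people = [400, 500, 150, 100]            # size of each exposure group

risks = [e / n for e, n in zip(events, people)]
for cat, risk, n in zip(categories, risks, people):
    print(f"{cat:>4}: n={n:<4} crude risk={risk:.1%}")

# To draw the column graph itself with matplotlib (if available):
# import matplotlib.pyplot as plt
# plt.bar(categories, risks)
# plt.xlabel("Exposure category")
# plt.ylabel("Crude risk of outcome")
# plt.show()
```

Note that reporting the group sizes alongside the risks, as above, already communicates two of the basic facts readers need: which exposure levels are common, and what happened at each level.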
Reporting some intermediate steps of the analytical process before presenting a model that assumes a continuous relation will help everyone reconnect with the underlying data. One vision of how this might be done involves four steps. The first step is simply to report the crude rates of the outcome for discrete exposure categories; in other words, report what is actually observed. This step is also the time to communicate another piece of basic information that readers and editors require to understand the data more fully: the size of the various exposure groups. The second step is to report adjusted rates for the discrete categories. Some may choose to report a hierarchy of adjusted results, starting with the most fundamental confounder, age, before adjusting for other, less obvious confounders. This approach has the advantage of illustrating the effect of adjustment and showing where it really matters. The third step is to report the proposed continuous relation, either with a summary measure (e.g., slope) or an illustration (e.g., graph). Finally, the proposed continuous relation should be superimposed on the adjusted categorical results to help judge its validity. Some investigators may need to present results with variables modelled as both categorical and continuous, and to interpret any discrepancy between the two. Others may choose not to assume a continuous relation at all and instead simply report the categorical data.
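The four reporting steps can be sketched in miniature. The example below uses made-up counts, two age strata, and direct standardization against an arbitrary 50/50 standard population; the exposure midpoints and the least-squares slope are likewise illustrative assumptions, not the article's actual analysis:

```python
# Illustrative sketch of the four reporting steps (all numbers invented).
categories = ["low", "medium", "high"]
midpoints = [1.0, 3.0, 6.0]  # assumed exposure midpoint for each category

# events / person counts by category, split into (young, old) age strata
events = {"low": (8, 12), "medium": (10, 20), "high": (6, 18)}
people = {"low": (400, 200), "medium": (300, 200), "high": (100, 100)}

# Step 1: crude rates and group sizes -- what was actually observed.
crude = {c: sum(events[c]) / sum(people[c]) for c in categories}
sizes = {c: sum(people[c]) for c in categories}

# Step 2: age-adjusted rates by direct standardization against a
# standard population (here an arbitrary 50/50 young/old split).
weights = (0.5, 0.5)
adjusted = {
    c: sum(w * e / n for w, e, n in zip(weights, events[c], people[c]))
    for c in categories
}

# Step 3: a continuous summary -- the least-squares slope of the
# adjusted rate against the exposure midpoint.
n = len(midpoints)
xbar = sum(midpoints) / n
ys = [adjusted[c] for c in categories]
ybar = sum(ys) / n
slope = (
    sum((x - xbar) * (y - ybar) for x, y in zip(midpoints, ys))
    / sum((x - xbar) ** 2 for x in midpoints)
)
intercept = ybar - slope * xbar

# Step 4: superimpose -- compare each fitted value with the adjusted
# categorical rate to judge whether a straight line is really adequate.
for c, x in zip(categories, midpoints):
    print(f"{c:>6}: n={sizes[c]:<4} crude={crude[c]:.3f} "
          f"adjusted={adjusted[c]:.3f} fitted={intercept + slope * x:.3f}")
```

A large gap between the adjusted categorical rate and the fitted value at any category is exactly the kind of discrepancy that should prompt reporting both forms rather than the continuous summary alone.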
Such straightforward data reporting creates a difficult challenge for researchers: how best to categorize exposure data. The answer rests on balancing two needs: using readily understandable cutoff points and reasonably reflecting the underlying distribution of exposure. Readily understandable cutoff points often involve digit preference (i.e., round or whole numbers) or a connection to some external standard (e.g., a regulation, standard definition or common practice). Reflecting the underlying distribution requires avoiding categories that contain few observations, a discipline that can only improve communication of what is actually observed and what is most relevant. Striking this balance requires real work, but researchers must work hard to communicate the actual data in front of them.
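One way to operationalize this balance: bin the exposure at round cutoffs, then check category sizes and flag any bin too sparse to support a stable rate. The cutoffs, sample size and minimum-count threshold below are illustrative assumptions:

```python
# Hypothetical sketch: categorize a continuous exposure at round
# cutoffs, then flag sparse categories. All numbers are invented.
import random

random.seed(1)
# simulate a right-skewed exposure distribution, as is common
exposures = [random.lognormvariate(1.0, 0.6) for _ in range(500)]

cutoffs = [0, 2, 5, 10]                    # round, understandable edges
labels = ["0-<2", "2-<5", "5-<10", ">=10"]

def categorize(x, cutoffs, labels):
    """Return the label of the half-open interval containing x."""
    for hi, label in zip(cutoffs[1:], labels):
        if x < hi:
            return label
    return labels[-1]  # at or above the top cutoff

counts = {label: 0 for label in labels}
for x in exposures:
    counts[categorize(x, cutoffs, labels)] += 1

# Categories with too few observations to yield a stable rate should
# usually be merged with a neighbouring category before reporting.
MIN_PER_CATEGORY = 25  # arbitrary threshold for this sketch
sparse = [label for label, n in counts.items() if n < MIN_PER_CATEGORY]
print(counts, "sparse:", sparse)
```

The same check could equally be done with `pandas.cut`; the point is that category sizes are inspected, and reported, before any modelling.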
The strategy for readers (as well as reporters and editors) trying to interpret multivariate analyses is to ask 3 basic questions. First, can I understand the levels of exposure? As a test, imagine communicating them to a patient (e.g., explain what "moderate adherence" means). Second, do I have some sense of the common categories of exposure? In other words, determine the levels of exposure that most people have. Finally, is the rate of the outcome for these categories available? That simply means knowing what happened to people in the common exposure categories. If one cannot confidently answer "yes" to these 3 questions, it is hard to imagine how the results could be useful to our patients or the public.
Although multivariate analysis is an important tool to minimize the influence of confounding variables, it may also tempt researchers to assume that the relation between an exposure and a health outcome is continuous. Researchers should examine categorical data before modelling a continuous relation and report these results. Reporting outcome data for discrete categories of exposure may help readers more accurately understand the benefits and harms of various health behaviours.