Epidemiology is often introduced using examples in which both exposure and outcome are binary: research participants are classified as having, say, lung cancer or not, and as being smokers or not, and the proportion of smokers is then compared between cases and controls. Many exposures, however, are inherently continuous. Indeed, in the classic case-control study on smoking and lung cancer[1], Doll and Bradford Hill report results for cases and controls both as the proportion of smokers and by "amount of tobacco consumed", grouped into several categories such as 1-4, 15-24 or 50+ cigarettes per day. In contemporary epidemiologic practice, it is more customary to group continuous variables into quantiles (most often tertiles, quartiles or quintiles) based on the exposure's distribution. In one recent study, for example, researchers examining the link between dietary fat and breast cancer grouped fat intake into quintiles. They reported that women in the highest quintile of fat intake were 11% more likely to get breast cancer than women in the lowest quintile[2]. As another example, physician annual caseload was found to be significantly associated with the survival of patients after an acute myocardial infarction[3]. The authors reported that the 30-day mortality rate was 13.5% for patients of physicians in the lowest quartile of volume (5 or fewer cases per year) compared to 11.8% for patients of physicians in the highest quartile (more than 24 cases annually).
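To make the quantile approach described above concrete, the following minimal Python sketch groups a continuous exposure into quintiles using cut points taken from the sample itself, and then compares the risk in the highest and lowest groups. The data are simulated purely for illustration and do not come from any of the cited studies; the function names `quantile_groups` and `risk_by_group`, and the assumed linear relationship between exposure and risk, are hypothetical choices made for this example.

```python
import random
import statistics


def quantile_groups(values, n_groups=4):
    """Assign each value to a group 1..n_groups using sample-based cut points."""
    # statistics.quantiles returns the n_groups - 1 cut points of the sample.
    cuts = statistics.quantiles(values, n=n_groups)
    # A value's group is 1 plus the number of cut points it exceeds.
    groups = [1 + sum(v > c for c in cuts) for v in values]
    return cuts, groups


def risk_by_group(groups, outcomes):
    """Return the proportion of participants with the outcome in each group."""
    totals, events = {}, {}
    for g, y in zip(groups, outcomes):
        totals[g] = totals.get(g, 0) + 1
        events[g] = events.get(g, 0) + y
    return {g: events[g] / totals[g] for g in sorted(totals)}


# Simulated (not real) cohort: a continuous exposure and a binary outcome
# whose probability is ASSUMED to rise linearly with the exposure.
rng = random.Random(0)
exposure = [rng.gauss(70, 15) for _ in range(5000)]
outcome = [
    1 if rng.random() < min(1.0, max(0.0, 0.05 + 0.001 * (x - 70))) else 0
    for x in exposure
]

cuts, groups = quantile_groups(exposure, n_groups=5)
risks = risk_by_group(groups, outcome)

# Relative risk comparing the highest with the lowest quintile.
relative_risk = risks[5] / risks[1]
print("cut points:", [round(c, 1) for c in cuts])
print("risk by quintile:", {g: round(r, 3) for g, r in risks.items()})
print("relative risk (Q5 vs Q1):", round(relative_risk, 2))
```

Note that the cut points are computed from the sample at hand, so a different sample drawn from the same population would generally yield somewhat different group boundaries.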
A number of researchers have commented on the disadvantages of categorization in epidemiologic studies[4]. Many associations can be tested using linear models, and practicable alternative methods for handling non-linear relationships have been developed and validated in recent years. Yet despite these methodological advances and calls to abandon percentile-based categorization[4], the epidemiologic community continues to rely heavily on quantiles as a primary means of analyzing and presenting results. For example, in a recent issue of The American Journal of Epidemiology (October 2009, volume 170, number 8), four of six papers with a continuous exposure used some form of percentile-based categorization; only two kept the variable continuous.
Quantiles appear intuitively appealing to epidemiologists because they can be thought of in terms of low, medium and high risk groups. Moreover, the association between exposure and outcome can be described in terms of a relative risk between these groups. However, these perceived benefits are outweighed by several important problems that arise when a continuous variable is categorized, particularly if data-dependent quantiles are used to form the categories. Here we summarize previous research on the topic and address possible concerns about the use of alternative statistical approaches.