Increasingly, as response rates to other forms of investigation have fallen [6], researchers have looked to exploit routinely collected datasets, some of which are amenable to case-control analysis. In the UK, one of these, the General Practice Research Database (GPRD), offers a log of all consultation episodes associated with significant events, illnesses, or medical activity (diagnosis, referral, prescription etc) among patients from some 370 participating general practices (an estimated 3,000,000 episodes of care covering 6% of all residents of England and Wales) [7]. This resource with its large sample size and its wealth of routinely collected health and prescription data has been successfully exploited in numerous pharmaco-epidemiological studies of case-control design [8]. However, some variables of interest are typically missing, including occupational history.
We identified a study question of high policy relevance that we wished to address using the GPRD. The populations of westernised countries are aging. In future, therefore, the frequency of common age-related health conditions is likely to rise among the workforce, as is the proportion of workers taking prescribed medicines. Potentially, certain widely used medicines that impair arousal, concentration, cognition, and psychomotor performance, and some common illnesses that result in sudden incapacity, impaired judgment, or sensory deficit, could increase the risk of accidental injury at work. But which drugs and diseases, by how much, in what circumstances, and with what consequences? The British Government has announced strategic plans to maximise job retention rates among experienced older workers, but in delivering these plans employers require an evidence base to manage injury risks, the aim being to ensure safe job placement while at the same time avoiding needless restriction of job opportunities. However, when we conducted a systematic review on the topic [9] we found few relevant data, both overall and by type of injury (e.g. fractured femur) and external cause (e.g. fall). We also identified a need to improve upon cross-sectional studies with self-reported exposures and self-reported outcomes, by mounting investigations with objective measures of outcome and documented timing of exposures (to counter worries about common instrument reporting bias and reverse causation) [9].
The GPRD overcame some of these limitations and fulfilled several requirements for a case-control analysis of occupational injury risk, co-morbidity and medication. It allowed an operational case definition (namely, male patients with a consultation episode for an injury coded as occupational, or involving plant or off-road vehicles or machinery or tools likely to be used only at work); and for each case, plentiful controls could be identified who were well matched by age, sex, and general practice. A preliminary scoping exercise suggested that we would find some 1,700 cases, to whom we could match 8,500 controls. For each injury we could establish relevant exposure parameters, including the diagnostic Read code and date of first consultation; all prescriptions, with dates, within the 24 months preceding the event; and all diagnoses, with dates, preceding the event. We thus envisaged an analysis to establish the frequency of, and main reasons for, consultation in the 24 months before injury consultation; the frequency of prescribing over this time; the main prescribed drugs; and the relative exposure odds of various illnesses and treatments in cases versus controls. Because risks could vary according to time since first prescription of a drug or first onset of a new illness, the analysis could encompass various exposure time windows. Several aspects of confounding could be addressed through the matching algorithm (age, sex, geographical area) or via proxy measures available within the health-rich dataset (e.g. alcoholic liver disease as a proxy for alcohol misuse).
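To make the envisaged comparison of exposure odds concrete, the following minimal sketch computes a crude odds ratio with a Woolf (log-based) 95% confidence interval from a 2×2 table. The counts are hypothetical, chosen only to match the anticipated 1,700 cases and 8,500 controls, and the function name is ours. The sketch ignores the matching; a matched analysis would more properly use conditional logistic regression, which is not attempted here.

```python
import math


def exposure_odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls, z=1.96):
    """Crude exposure odds ratio with a Woolf confidence interval.

    Ignores matching, so it is an illustration rather than the
    analysis a matched case-control design would call for.
    """
    odds_ratio = (exp_cases * unexp_controls) / (unexp_cases * exp_controls)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / exp_cases + 1 / unexp_cases + 1 / exp_controls + 1 / unexp_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts (assumption): 10% of 1,700 cases and
# 5% of 8,500 controls exposed to the drug or illness of interest.
or_hat, ci_low, ci_high = exposure_odds_ratio(170, 1530, 425, 8075)
print(f"OR = {or_hat:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The same function could be applied separately within each exposure time window, using the counts of cases and controls exposed in that window.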
Unfortunately, occupation was poorly recorded in the database, which raised concerns of the kind outlined in our introduction. Specifically, cases of occupational injury must necessarily come from the employed subfraction of the study population, whereas controls - in the absence of employment information - would be drawn from the whole population, among whom a proportion would be unemployed and not at risk of occupational injury. Also, cases would be more likely than employed controls to come from manual occupations, as the potential for occupational injury is greater in blue-collar work. Bias could arise if controls over-represented the prevalence of diseases and treatments that prevent work and are more common in the unemployed; or if they under-represented the (generally worse) health characteristics of manual workers. Finally, although practical experience suggests that such selection applies to only a few high-risk jobs, in theory people with health problems could be excluded from jobs with higher injury potential, and if these jobs were less common in controls then any risks of injury from ill health would tend to be underestimated. It should be noted that these potential biases, which relate to representativeness of exposure information among controls, do not all operate in the same direction.
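The first of these biases can be explored with simple arithmetic. The sketch below is illustrative only. It assumes that controls are a mixture of employed and non-employed people, that employed controls are the correct reference group, and that exposure prevalence differs between the two groups; all prevalences shown are hypothetical and the function names are ours.

```python
def odds(p):
    """Convert a prevalence (proportion) to odds."""
    return p / (1 - p)


def mixed_control_prevalence(p_employed, p_non_employed, frac_non_employed):
    """Exposure prevalence among controls drawn from the whole population."""
    return (1 - frac_non_employed) * p_employed + frac_non_employed * p_non_employed


def bias_factor(p_employed, p_non_employed, frac_non_employed):
    """Ratio of the observed odds ratio to the odds ratio that would be
    obtained with employed-only controls. Values below 1 mean the
    observed odds ratio is biased towards underestimation."""
    p_mix = mixed_control_prevalence(p_employed, p_non_employed, frac_non_employed)
    return odds(p_employed) / odds(p_mix)


# Hypothetical inputs (assumptions): exposure prevalence of 5% in employed
# and 10% in non-employed people, with 20% of controls not in employment.
factor = bias_factor(p_employed=0.05, p_non_employed=0.10, frac_non_employed=0.20)
print(f"Observed OR = {factor:.3f} x OR with employed-only controls")
```

Under these assumed values the factor falls below 1, so an exposure that is more common among those not in work would have its odds ratio attenuated; when prevalence is equal in the two groups the factor is exactly 1 and no bias arises from this source.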
The missing information could only be obtained at a cost. Contacting study subjects to ascertain their employment status by questionnaire or interview was feasible, but would entail significantly higher administrative costs and effort, more elaborate ethical permissions, suitably anonymised third-party mailings by collaborators with data control, and the potential for one bias (related to non-response) to be substituted for another. Some of the economic advantages of a routine publicly available dataset would be lost.
The case series method of analysis [10], which compares the relative incidence of events of interest only among cases (in time windows of exposure and non-exposure), might seem to offer an attractive alternative. Each case would provide his or her own reference information. Since the technique is based solely on the experience of cases, this would circumvent any concern about differences in work and employment experience that arose from differences in case and referent sampling frames. However, the method is only suited to short-term exposures that impact on risk for a limited time period, such as acute intercurrent illnesses, exacerbations of pre-existing disease and newly prescribed treatments (for which purposes we intend using it). Over the much longer time frames of chronic illness and long-term treatment, potential exists for employment conditions to alter markedly within individuals. For such long-term exposures, the case-control design is still the preferred choice.
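The within-case comparison can be sketched as follows. This is a deliberate simplification and not the full method of [10]: it assumes that each case contributes a single event and that every case has the same lengths of exposed and unexposed observation time, so that the relative incidence can be taken as a ratio of event rates. The counts and window lengths are hypothetical and the function name is ours.

```python
def relative_incidence(events_exposed, events_unexposed, days_exposed, days_unexposed):
    """Ratio of the event rate in the exposed window to the rate in the
    unexposed window, using cases only.

    days_exposed and days_unexposed are the per-case window lengths,
    assumed identical for every case (a simplifying assumption).
    """
    rate_exposed = events_exposed / days_exposed
    rate_unexposed = events_unexposed / days_unexposed
    return rate_exposed / rate_unexposed


# Hypothetical example (assumption): a 28-day risk window after a newly
# prescribed treatment within a 730-day observation period per case.
ri = relative_incidence(
    events_exposed=12, events_unexposed=150, days_exposed=28, days_unexposed=702
)
print(f"Relative incidence = {ri:.2f}")
```

Because only cases enter the calculation, employment status cancels out of the comparison, provided it does not change within an individual over the observation period.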
Faced with this dilemma, we decided to assess quantitatively the potential bias arising if controls were selected without employment information. How much would it matter that cases came from a subfraction of the population from which controls were sampled, breaking the rule on control selection often repeated in standard textbooks? We addressed this practical question focusing on four common exposures that would be of interest in our hypothetical case-control study – namely, diabetes, anxiety-depression, asthma, and coronary heart disease.