This review identified 212 publications in which 183 different diagnoses were validated. Given the breadth of our search strategy, we feel that we are likely to have captured the majority of validations of GPRD diagnostic data published in the specified time period.
The majority of validations were external, and most frequently were requests to GPs to provide additional information. Relatively few publications documented use of internal validations. Overall, quantitative estimates of validity were high (median 89% of cases confirmed) and qualitative evidence from external rate comparisons and sensitivity analyses supported the validity of diagnoses. However, we are reluctant to draw conclusions regarding the overall validity of diagnoses in the database, for three reasons. First, despite their strengths, the methods presented here have limitations: questionnaires to GPs, record requests, algorithms and manual reviews predominantly examine PPV, whereas sensitivity analyses and comparisons of rates cannot provide quantitative estimates of validity. Even where quantitative validations are carried out, it may only be possible to categorize some coded cases as ‘possible cases’ based on the extra information given in the case notes. Second, the quality of reporting of many validations was insufficient to assess the possibility of bias and generalizability of validity estimates across the GPRD. Finally, it is possible that validation studies that found low validity of diagnoses were not published and that this publication bias could have affected our results.
The most robust method of validation may be to request additional information from the GP, since this method uses information external to the database to verify disease status of individual cases. Most such validations were restricted to establishing the proportion of cases with specific diagnostic codes that were confirmed by medical record review or responses to questionnaires, thus providing an estimate of the PPV of that set of codes. Although a useful measure, PPV varies with disease prevalence, so use of historical validations may not be justified if disease incidence has changed over time.
Measures of validity of categorical data. Sensitivity: A/(A+C); specificity: D/(B+D); positive predictive value: A/(A+B); negative predictive value: D/(C+D)
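The four measures in the caption can be made concrete with a short sketch. It assumes the conventional 2×2 layout implied by the formulas, which is not spelled out here: A = code present and disease present, B = code present and disease absent, C = code absent and disease present, D = code absent and disease absent. The class name and the counts are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts of a 2x2 validation table (layout is an assumption)."""
    a: int  # code present, disease present (true positives)
    b: int  # code present, disease absent (false positives)
    c: int  # code absent, disease present (false negatives)
    d: int  # code absent, disease absent (true negatives)

    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        return self.d / (self.b + self.d)

    def ppv(self) -> float:
        return self.a / (self.a + self.b)

    def npv(self) -> float:
        return self.d / (self.c + self.d)


# Invented counts: most coded cases are confirmed, but many true
# cases carry no code at all.
table = TwoByTwo(a=90, b=10, c=60, d=9840)

print(f"sensitivity = {table.sensitivity():.3f}")
print(f"specificity = {table.specificity():.3f}")
print(f"PPV         = {table.ppv():.3f}")
print(f"NPV         = {table.npv():.3f}")
```

With these invented counts the PPV is high while the sensitivity is much lower, which is the situation that validating coded cases alone cannot detect.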
Information for cases alone does not allow calculation of sensitivity (the proportion of true cases correctly identified in the GPRD data), specificity (the proportion of individuals without the disease identified as such in the database), or negative predictive value. Even if PPV is high, other measures of validity could be low. These other measures require additional sampling of individuals without the diagnostic codes of interest. In most validations, sensitivity, specificity and negative predictive value are not assessed; this may be partly because, for rare diseases, sampling from the vast number of individuals without the code of interest is particularly daunting. A handful of publications have successfully investigated sensitivity and specificity of diagnoses, demonstrating high validity for certain GPRD diagnoses [20]. For example, Nazareth [8] estimated sensitivity and PPV of schizophrenia and psychosis diagnoses.
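The dependence of PPV on prevalence can be illustrated with a small worked example. The sketch below uses the standard relationship between PPV, sensitivity, specificity and prevalence (Bayes' theorem); the function name and the numerical values are assumptions chosen for illustration and are not taken from any GPRD validation.

```python
def ppv_from_prevalence(sensitivity: float, specificity: float,
                        prevalence: float) -> float:
    """PPV implied by a given sensitivity, specificity and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hold the accuracy of coding fixed and vary only how common the disease is.
for prevalence in (0.001, 0.01, 0.1):
    ppv = ppv_from_prevalence(sensitivity=0.90, specificity=0.99,
                              prevalence=prevalence)
    print(f"prevalence = {prevalence:.3f} -> PPV = {ppv:.3f}")
```

Under these assumed values, the same coding accuracy yields a much lower PPV when the disease is rare, which is why a PPV estimated in one period may not carry over to another in which the frequency of disease differs.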
The proportion of identified cases that underwent validation was highly variable. Where this proportion is low, the precision of the validity estimate is reduced; most studies did not report confidence intervals around the PPV. Where only a proportion of total cases have been validated, it would be useful to compare those cases found to be valid with all other cases in terms of age, sex and other descriptive variables to look for systematic differences between them. One reason for small sample sizes in many validations is the high financial cost of record retrieval from GPs (currently averaging £70 per single set of notes).
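The loss of precision with small validation samples can be shown by attaching a confidence interval to the PPV. The sketch below uses the Wilson score interval, which is one common choice for a proportion and is an assumption here; the validations reviewed do not prescribe a method. The counts are invented so that both samples have the same PPV of 89%.

```python
import math


def wilson_interval(confirmed: int, validated: int, z: float = 1.96):
    """Wilson score interval for the proportion confirmed/validated.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if validated <= 0:
        raise ValueError("validated must be a positive count")
    p = confirmed / validated
    denominator = 1.0 + z * z / validated
    centre = (p + z * z / (2.0 * validated)) / denominator
    half_width = (z * math.sqrt(p * (1.0 - p) / validated
                                + z * z / (4.0 * validated ** 2))
                  / denominator)
    return centre - half_width, centre + half_width


# Same observed PPV, different numbers of validated cases (invented).
small = wilson_interval(confirmed=89, validated=100)
large = wilson_interval(confirmed=890, validated=1000)

print(f"n = 100:  {small[0]:.3f} to {small[1]:.3f}")
print(f"n = 1000: {large[0]:.3f} to {large[1]:.3f}")
```

The interval from the smaller sample is wider, so reporting it alongside the PPV lets readers judge how much weight the estimate can bear.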
Some GPRD practices do not participate in research studies, raising the question of generalizability of validation findings. For example, in a study by Van Staa [13], 719 practices contributed to the database during the study period but only 295 (41%) were known to provide additional information. Thus, even if compliance in providing records is high, the observed PPV may be applicable only to cases from a subgroup of practices. Practices that do participate in validation studies may send information only for certain cases, e.g. refusing to copy very large case files [22]; this may result in selection bias. Many publications did not report response rates clearly (with a complete lack of reporting in 16% of validations), making it impossible to assess whether selection bias could have affected their validation results. Where practices did respond to requests there were three possible outcomes: (i) notes were unavailable due to patient transfer or death, (ii) notes were returned with incomplete and/or inconclusive details of disease diagnosis, (iii) notes were returned with sufficient detail to verify the diagnosis. Since nonresponse, inadequate notes and exclusion because of patient death/transfer could bias assessment of validity in different ways, it would be useful to report them separately.
Given the high cost of record retrieval and GP questionnaires, manual review of the computerized records is cost-effective, but it is also time-consuming and takes away much of the advantage of having automated data. Fewer than half of the validations using this method specified the criteria used to determine ‘true’ cases. Without prespecified case criteria, there is scope for bias arising from judgements by individual physicians, which may vary over time and between physicians. Furthermore, recording of symptoms, results of diagnostic procedures and feedback from secondary care may not be complete in computerized records, thereby limiting the usefulness of this approach.
Many investigators develop internal diagnostic algorithms to identify cases, but few use these to validate specific diagnostic codes. In one example, a medical code for acute myocardial infarction was validated by the presence of supportive evidence, such as codes for chest pain, fibrinolytic therapy, coronary intervention, troponin test results or hospitalization [23]. This method is quick and incurs no extra cost, so could be used more widely to validate diseases for which specific treatments are given universally. However, use of such algorithms may exclude less severe cases that do not require treatment, and the inclusion of test results in these algorithms is problematic since not all test results are recorded in the GPRD.
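An internal algorithm of this kind can be sketched in a few lines. The example below is a minimal illustration only: the code labels are hypothetical stand-ins rather than real OXMIS or Read codes, the rule (index code plus at least one item of supportive evidence) is an assumption, and the patient records are invented.

```python
# Hypothetical labels standing in for real diagnostic and therapy codes.
SUPPORTIVE_EVIDENCE = frozenset({
    "chest_pain",
    "fibrinolytic_therapy",
    "coronary_intervention",
    "troponin_test",
    "hospitalization",
})


def is_supported(patient_codes, index_code="acute_mi",
                 supportive=SUPPORTIVE_EVIDENCE):
    """True if the index code is present with at least one supportive code."""
    codes = set(patient_codes)
    return index_code in codes and not codes.isdisjoint(supportive)


# Invented records keyed by a hypothetical patient identifier.
records = {
    "patient_1": ["acute_mi", "chest_pain", "troponin_test"],
    "patient_2": ["acute_mi"],    # coded, but no supportive evidence recorded
    "patient_3": ["chest_pain"],  # supportive evidence without the index code
}

coded = [pid for pid, codes in records.items() if "acute_mi" in codes]
supported = [pid for pid in coded if is_supported(records[pid])]
proportion_supported = len(supported) / len(coded)

print(f"{len(supported)} of {len(coded)} coded cases supported "
      f"({proportion_supported:.0%})")
```

The second invented patient shows the limitation described above: a genuine case whose test results or treatment were never recorded would be classed as unsupported.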
Comparison of rates gives a quick indication of the validity of the GPRD without the effort of individual case review. These comparisons do not validate individual cases or provide a measurable estimate of validity. Where prevalence rates are being compared, the GPRD may have a lower prevalence because GPs are not required to code prevalent conditions in each consultation [18]. Although results are reassuring for descriptive purposes, comparable rates of disease cannot identify potential balanced misclassifications between different diagnoses (i.e. the situation sometimes seen in death certification where the loss of deaths from cause A because of misclassification is balanced by the inclusion of people dying of cause B but misclassified to cause A). Reliance on this method to establish the validity of a diagnosis in the GPRD should be approached with caution and is not appropriate in analytic studies where individual validity is required. Similarly, sensitivity analysis is not a true validation of the data but does give an indication of the quality of diagnoses.
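Balanced misclassification is easy to demonstrate with a little arithmetic. In the sketch below all counts are invented: equal numbers of people with diagnosis A are recorded as B and vice versa, so the recorded totals match the true totals even though a fifth of the A-coded records are wrong.

```python
# Invented counts for two diagnoses, A and B.
true_a, true_b = 100, 100
a_recorded_as_b = 20  # true A cases given a B code
b_recorded_as_a = 20  # true B cases given an A code

# Totals that a comparison of rates would see.
recorded_a = true_a - a_recorded_as_b + b_recorded_as_a
recorded_b = true_b - b_recorded_as_a + a_recorded_as_b

# Validity of an individual A code.
ppv_a = (true_a - a_recorded_as_b) / recorded_a

print(f"recorded A = {recorded_a}, true A = {true_a}")
print(f"recorded B = {recorded_b}, true B = {true_b}")
print(f"PPV of an A code = {ppv_a:.2f}")
```

A rate comparison would report perfect agreement for both diagnoses here, while an analytic study relying on the A code would include misclassified individuals.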
Most studies carried out using GPRD data are nested case–control studies. When conducting such a study, it is important to apply the same inclusion and exclusion criteria to cases and controls. However, validation studies which focus solely on cases may produce more detailed criteria for cases than for controls. For example, Garcia Rodriguez [24] investigated the relation between exposure to nonsteroidal anti-inflammatory drugs and acute liver injury. The investigators retrieved medical records of acute liver injury cases to verify their computerized diagnosis and excluded 16 of 166 potential cases (10%) from further analyses due to alcoholism. No further details on alcohol consumption by controls were retrieved, which may have led to bias. Validating a sample of noncases should ensure that control patients are subject to the same criteria as cases, although this would increase the financial cost of the research.
An alternative approach is the method that we recently applied to validate GPRD diagnoses of rheumatoid arthritis (RA) [22]. We used external medical records to validate RA diagnoses, but did not simply assess the overall PPV of an RA code. Instead, we identified characteristics in the computerized records of RA-coded patients that were associated with a valid diagnosis (e.g. specific prescriptions), and carried out multivariable analyses of these characteristics (using a valid RA diagnosis as the outcome) to develop a data-derived diagnostic algorithm of characteristics that could be used to identify valid cases in the database [25]. This method could be adapted to develop algorithms for a wide range of GPRD diagnoses.
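The general idea of a data-derived algorithm can be sketched as follows. This is not the published analysis: it assumes logistic regression as the multivariable method, fitted by plain gradient descent, and it uses two hypothetical binary characteristics (a specific prescription and a repeated diagnostic code) with invented data in which the outcome is whether external records confirmed the diagnosis.

```python
import math


def predict(weights, features):
    """Predicted probability that a coded case is valid."""
    score = weights[0] + sum(w * v for w, v in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-score))


def fit_logistic(rows, outcomes, steps=2000, rate=0.1):
    """Fit an intercept and one weight per characteristic by gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)  # weights[0] is the intercept
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for features, valid in zip(rows, outcomes):
            error = predict(weights, features) - valid
            gradient[0] += error
            for j, value in enumerate(features):
                gradient[j + 1] += error * value
        weights = [w - rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights


# Invented data. Each row is (specific_prescription, repeated_code);
# each outcome is 1 if external records confirmed the diagnosis, else 0.
rows = [(1, 1)] * 4 + [(1, 0)] * 3 + [(0, 1)] * 3 + [(0, 0)] * 4
outcomes = [1, 1, 1, 0] + [1, 1, 0] + [1, 0, 0] + [1, 0, 0, 0]

weights = fit_logistic(rows, outcomes)

# Characteristics with larger weights contribute more to identifying valid cases.
for name, weight in zip(("intercept", "specific_prescription", "repeated_code"),
                        weights):
    print(f"{name}: {weight:+.2f}")
```

In practice the fitted weights would be turned into a rule (for example, a probability threshold) and checked against a separate set of validated cases; those steps are omitted here.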
Although considerable effort is often made to validate cases, the lack of detailed description of validation methods hinders interpretation of results. In some publications, lack of reporting was due to space constraints, which could be overcome by providing the relevant data as a web supplement. It is also helpful to make accessible a table of the medical OXMIS and Read codes used for diagnosis (or the mapping of these codes to specific ICD codes), so that others studying the disease can replicate case identification criteria. The accompanying stream diagram summarizes other information that could be made available to aid interpretation of validations.
Stream diagram showing the information from General Practice Research Database (GPRD) validation studies that could be made available to researchers