Summary of main findings
A systematic review of the literature was carried out to assess the accuracy and completeness of the UK GPRD. The studies included in this review considered the accuracy of diagnostic codes in the GPRD and the completeness of data compared with other databases and national statistics.
Most of the diagnoses coded in the GPRD electronic record were well recorded when compared against GP questionnaire responses, medical records held at the GP practice, or hospital letters. However, it seems that acute diagnoses were not as well recorded. The studies in this review used a variety of ‘gold standard’ references to ascertain the accuracy of diagnoses, which may explain some of the differences in accuracy of diagnosis recording, especially in the validation of acute conditions.
When a questionnaire is sent to a GP to confirm a diagnosis, the GP has several options for verifying it, including checking through the computerised medical record or free-text information, looking for supporting evidence from test results or hospital discharge letters, or relying on memory alone. Investigators conducting independent validation of a diagnosis can also request copies of patient medical records, hospital discharge letters, correspondence, or biochemical test results. This extra information can be used to find key words relating to a diagnosis, or to conduct expert validation of a diagnosis.
Several studies assessed the accuracy of recording against an objective standard as defined by an external body; for instance, acute liver injury defined by international consensus statement as an increase in alanine aminotransferase of more than two times the upper limit of normal, or evidence of specific behavioural or cognitive symptoms as described in the Diagnostic and Statistical Manual of Mental Disorders.15,23,34
The number of patient records validated may depend on the level of evidence required to assess the accuracy of recording.
Smoking status is an important risk factor and confounder in many epidemiological studies, and although current smoking may be recorded well enough for the purposes of epidemiological research, data on former smoking may need to be independently validated.45
Only three papers looked specifically at the differences between the date of onset of disease in the GPRD electronic medical record and the GP-reported date.19,28,35
Although there were some inconsistencies in date recording, the differences were small. Investigators who require precise dates of onset of disease should be aware that the date recorded in the electronic record may differ slightly from the GP-reported date and, if necessary, conduct further validation.
Generally, there is good agreement in disease prevalence rates between the GPRD and other national databases and statistics; however, some differences were identified in this review. There is no ‘gold standard’ against which data from one database can be compared, and nothing to indicate which database contains the most accurate measure. Discrepancies between two data sources do not necessarily mean that one database is right and one is wrong. Geographical differences or variability in disease coding systems may lead to systematic differences in disease prevalence or data recording.48
It is important to consider these differences when conducting research using these datasets.
There are two reasons why the GPRD may be systematically different from other datasets. First, not all consultations for chronic diseases need to be recorded in the GPRD; the GPRD recording guidelines state that the GP should make at least one entry in the medical history for each episode of illness or new occurrence of a symptom.4
The requirement to record only the first instance of disease may partially explain why consultation rates and prevalence of diabetes and chronic musculoskeletal conditions were underestimated in the GPRD compared with the MSGP4.49,50
Second, it is important to consider that practices supplying early years of data to the GPRD provided OXMIS-coded data. Other databases may use different coding systems; for instance DIN practices have always used Read Codes for recording diagnoses and prescriptions under a problem-orientated medical record, which presents each medical record as a set of intertwined but separate problems.51,52
Investigators attributed many of the differences between DIN and the GPRD to the Read and OXMIS coding systems used in the respective databases.53
Strengths and limitations of the study
This is the first study to systematically search for and combine studies that consider the accuracy and completeness of the GPRD. By using broad search terms it was possible to find a wide range of literature covering a range of diagnoses. This review provides vital information to aid researchers and clinicians who are planning to conduct research using the GPRD. However, very few of the papers in this review gave results that were directly comparable. A wide range of diagnoses was considered and many of the investigators used different criteria to assess the validity of diagnoses, making it difficult to compare positive predictive values (PPVs) directly across studies even when diagnoses were the same. There was often no objective standard for comparison of data recording, and the papers in this review used a variety of methods to judge the accuracy of clinical diagnoses in the patient electronic records. Finally, many of the studies included in this review validated only a small number of patient records, owing to the expense of validating diagnoses via GP questionnaire or independent evaluation of hospital letters or medical records.
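To illustrate why validating only a small number of records limits comparability, the following minimal sketch computes a PPV (confirmed diagnoses divided by records validated) with an approximate 95% confidence interval. The Wilson score interval is an assumption chosen here for illustration; the reviewed studies may have used other methods or reported no interval at all. The function name and the example counts are hypothetical.

```python
import math


def ppv_with_wilson_ci(confirmed, validated, z=1.96):
    """Return (ppv, lower, upper) for a validation sample.

    confirmed: number of coded diagnoses confirmed by the reference source
    validated: number of coded diagnoses that were checked
    z: normal quantile (1.96 gives an approximate 95% interval)

    The Wilson score interval is an illustrative choice, not a method
    taken from the review.
    """
    if validated <= 0 or not 0 <= confirmed <= validated:
        raise ValueError("need 0 <= confirmed <= validated and validated > 0")
    p = confirmed / validated
    denom = 1 + z * z / validated
    centre = (p + z * z / (2 * validated)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / validated + z * z / (4 * validated ** 2))
    ) / denom
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: the same PPV from a small and a large sample.
small = ppv_with_wilson_ci(45, 50)
large = ppv_with_wilson_ci(450, 500)
print("small sample:", small)
print("large sample:", large)
```

Both samples give the same point estimate, but the interval from the smaller sample is wider, which is one reason PPVs from small validation studies are hard to compare.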
Comparison with existing literature
Several UK-based studies consider the quality of morbidity coding in general practice, and a systematic review of these studies shows that morbidity coding in general practice is variable. However, the investigators suggest that conditions with clear diagnostic features are better recorded than conditions with more subjective criteria.58
In their paper, Jordan et al include eight GPRD studies which are also assessed in the current review. The sensitive search strategy used in this study, which specifically considered the GPRD, made it possible to find and consolidate information from a larger number of papers validating a diagnosis in the GPRD. Thiru et al investigated the quality of data in primary care; however, their review focused more on how well GPs record the outcome of a consultation on electronic patient records.59
Their review also found that studies report consistently high PPVs, indicating that data on the patient record were valid. As noted in the present review, the authors point out that variability in the assessment of data quality made it difficult to compare results directly between studies.
Implications for future research
Investigators conducting research using the GPRD need to consider carefully how information is recorded in primary care, and how GPs may use different Read/OXMIS codes to represent the same diagnosis. Some diagnoses may be recorded differently from others. This review suggests that researchers can be confident about case validity when using the GPRD for research into most chronic conditions. However, research into acute conditions may need additional validation. It may not be feasible to conduct validation studies of diagnoses in the GPRD for every project, as this can be expensive; current prices start at £60 per patient for a questionnaire or request for additional information from a practice.
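As a rough illustration of the cost constraint, the sketch below multiplies the number of records by the starting price of £60 per patient quoted above. It gives a lower bound only, since actual prices may be higher; the function names and the example figures are hypothetical.

```python
PRICE_PER_PATIENT_GBP = 60  # starting price quoted in the text; actual prices may be higher


def minimum_validation_cost(n_patients, price_per_patient=PRICE_PER_PATIENT_GBP):
    """Lower bound on the cost (in GBP) of validating n_patients records."""
    if n_patients < 0:
        raise ValueError("n_patients must be non-negative")
    return n_patients * price_per_patient


def records_within_budget(budget_gbp, price_per_patient=PRICE_PER_PATIENT_GBP):
    """Largest number of records that can be validated for a given budget."""
    if budget_gbp < 0:
        raise ValueError("budget_gbp must be non-negative")
    return budget_gbp // price_per_patient


# Hypothetical example: 100 records, or a budget of 5000 GBP.
print(minimum_validation_cost(100))
print(records_within_budget(5000))
```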
One approach to ensure better identification of cases is to construct Read/OXMIS code diagnostic algorithms comprising several codes to identify events and diagnoses in the GPRD. Often, these diagnoses can then be internally validated using evidence within the GPRD to support the diagnosis; for instance, a Read/OXMIS code for an acute myocardial infarction may be followed by a referral to cardiology, details of a discharge letter from hospital, and relevant medication.
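The internal validation idea above can be sketched as follows. This is a minimal illustration and not an algorithm taken from any of the reviewed studies: the record layout, the 90-day window, and every code value are hypothetical placeholders rather than real Read or OXMIS codes or real GPRD field names.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Event:
    patient_id: str
    event_date: date
    kind: str  # "diagnosis", "referral", "letter" or "prescription"
    code: str


# Hypothetical placeholder code lists (NOT real Read/OXMIS codes).
MI_DIAGNOSIS_CODES = {"DX_MI_1", "DX_MI_2"}
SUPPORTING_CODES = {
    "referral": {"REF_CARDIOLOGY"},
    "letter": {"HOSP_DISCHARGE"},
    "prescription": {"RX_CARDIAC"},
}


def supporting_evidence(events, patient_id, window_days=90):
    """Return the kinds of supporting evidence recorded after a diagnosis.

    Looks for the patient's earliest diagnosis code, then collects the kinds
    of supporting records dated within window_days after it. Returns None if
    the patient has no diagnosis code at all. The window length is an
    arbitrary illustrative assumption.
    """
    patient_events = [e for e in events if e.patient_id == patient_id]
    diagnosis_dates = [
        e.event_date
        for e in patient_events
        if e.kind == "diagnosis" and e.code in MI_DIAGNOSIS_CODES
    ]
    if not diagnosis_dates:
        return None
    index_date = min(diagnosis_dates)
    window_end = index_date + timedelta(days=window_days)
    found = set()
    for e in patient_events:
        codes = SUPPORTING_CODES.get(e.kind)
        if codes and e.code in codes and index_date <= e.event_date <= window_end:
            found.add(e.kind)
    return found
```

A study could then, for example, treat a case as internally supported when at least one kind of evidence is found, and reserve GP questionnaires for cases with none. That decision rule is itself an assumption that would need to be justified for each diagnosis.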
A study validating the recording of neural tube defects found that in some cases, the diagnosis represented a condition in the mother and not in the child. Birth defect researchers using the GPRD may wish to search the mother's medical history to determine whether the code relates to a diagnosis in the mother or the child. This supplemental information can be obtained from within the GPRD to improve the reliability of diagnostic codes.
Prescription data are well recorded in the GPRD because prescriptions for patients are generated directly from the computer, and details on drug type and dosage are digitally recorded in this automated process. Therefore, prescribing data can be used to verify clinical diagnoses, or to capture additional cases. For instance, inhaler use has been used as a proxy for asthma diagnoses.48
However, investigators should be cautious about using drug prescribing as a proxy for disease, and ensure that the prescribed drug is specific to the diagnosis of interest.
One of the future strengths of the GPRD lies in planned linkages with other national databases, including Hospital Episode Statistics, the Office for National Statistics databases, and the National Cancer Intelligence Network. These linkages will allow investigators to access more detailed clinical information relating to inpatient and outpatient hospital attendances and diagnoses, death registration, cause of death, and cancer diagnoses and treatment. This additional information will be a source of accurate and complete information on many of the clinical outcomes occurring outside of primary care.