Sci Rep. 2017; 7: 44702.
Published online 2017 March 17. doi:  10.1038/srep44702
PMCID: PMC5356333

External validation of a COPD prediction model using population-based primary care data: a nested case-control study


Emerging models for predicting risk of chronic obstructive pulmonary disease (COPD) require external validation in order to assess their clinical value. We validated a previous model for predicting new onset COPD in a different database. We randomly drew 38,597 case-control pairs (total N = 77,194) of individuals aged ≥35 years and matched for sex, age, and general practice from the United Kingdom Clinical Practice Research Datalink database. We assessed accuracy of the model to discriminate between COPD cases and non-cases by calculating the area under the receiver operator characteristic curve (ROCAUC) for the prediction scores. Analogous to the development model, ever smoking (OR 6.70; 95%CI 6.41–6.99), prior asthma (OR 6.43; 95%CI 5.85–7.07), and higher socioeconomic deprivation (OR 2.90; 95%CI 2.72–3.09 for highest vs. lowest quintile) increased the risk of COPD. The validated prediction scores ranged from 0–5.71 (ROCAUC 0.66; 95%CI 0.65–0.66) for males and 0–5.95 (ROCAUC 0.71; 95%CI 0.70–0.71) for females. We have confirmed that smoking, prior asthma, and socioeconomic deprivation are key risk factors for new onset COPD. Our model seems externally valid at identifying patients at risk of developing COPD. An impact assessment now needs to be undertaken to assess whether this prediction model can be applied in clinical care settings.

The burden of chronic obstructive pulmonary disease (COPD) has been rising, now representing one of the leading causes of morbidity and mortality worldwide1. Whilst estimates of disease burden have primarily come from developed countries, the prevalence appears to be rising in developing countries as well, and the resultant mortality is projected to rise by 30% in the next decade2. In the United States (US), over 12 million adults have COPD, representing the third leading cause of death3; it is the second most common cause of emergency hospitalisation in the United Kingdom (UK)4,5,6.

Whilst some studies have evaluated algorithms to identify individuals with established COPD7,8,9, the rate of undiagnosed disease remains high10, occurring in one out of eight people over the age of 35 years6. There is a paucity of studies that have developed hands-on tools that enable early identification of individuals9 at risk of future COPD. The ability to construct models that identify individuals at risk well before disease onset would provide the opportunity to develop key strategies for prevention11. Using the Primary Care Clinical Informatics Unit (PCCIU) general practice (GP) database, we developed and internally validated the first risk prediction model for early detection of incident COPD, which simultaneously took into account a range of known risk factors, including smoking, age, sex, prior asthma, and socio-economic status11. A more recent study utilised the UK Clinical Practice Research Datalink (CPRD) database to similarly develop and validate a COPD prediction model, deriving predictive values comparable to those of our previous study12. However, before these risk scoring systems are used in clinical practice, it is important that they are externally validated in entirely different datasets of comparable populations.

In the current study, we therefore aimed to externally validate our previously developed prediction model (developed using the PCCIU GP database) using a different database (i.e. the CPRD GP database). The current work is the first to externally validate a COPD prediction algorithm for early detection of individuals at risk of future COPD in an entirely different database but a comparable population.


Background characteristics of the study population

Overall, the majority of patients had smoked at some point, but the proportion of ever smokers was higher in COPD cases (n = 33,269; 86%) than in controls (n = 19,908; 52%), regardless of whether the CPRD smoking codes or the PCCIU codes were used (Table 1). The proportion of those with prior asthma was up to n = 13,161, 17% (cases n = 10,210, 27%; controls n = 2,951, 7%). Cases were more deprived than controls, as measured by the Index of Multiple Deprivation (IMD) quintile (Table 1).

Table 1
Participants’ baseline characteristics by COPD cases and controls.

Associations between risk factors and COPD

In unadjusted and adjusted (i.e. simultaneous adjustment for all factors) models, ever smoking was associated with seven-fold increased odds of COPD (Table 2). Prior asthma was associated with an increased risk of COPD: adjusted OR for prior asthma was 5.04 (95% CI 4.77–5.33). Compared to the least deprived IMD quintile, those in the more deprived IMD quintiles were at increasingly higher risk of COPD (Table 2). The estimates for smoking and prior asthma were similar in magnitude and direction whether the codes from the CPRD or the PCCIU data were used in the analyses (data not shown).

Table 2
Unadjusted and adjusted associations between risk factors and diagnosis of COPD: Odds ratio (OR), 95% confidence interval (95% CI).

Validation of COPD prognostic index

Table 3 presents the validated decile prognostic scores when the scores derived from the development models based on the PCCIU data were applied to the CPRD data. The scores were derived for males and females separately and ranged from 0 (lowest) to 5.71 (highest) for males and 0 to 5.95 for females. Some deciles did not have corresponding prognostic scores, an indication that the scores were not normally distributed. The accuracy of the validated prediction model in discriminating between COPD and non-COPD patients was ROCAUC = 0.66 (95% CI 0.65–0.66) for males and ROCAUC = 0.71 (95% CI 0.70–0.71) for females (Table 3). The ROC curves for the validated scores are shown in Fig. 1 and the sensitivity and specificity values for the various cut points on the prognostic scores are shown in Supplementary File 4.

Figure 1
ROC curves for the validated prognostic scores, for males (top) and females (bottom): the prediction model developed using the PCCIU data was applied to the CPRD data.
Table 3
Validation of COPD prognostic scores derived from the PCCIU data using the CPRD data.
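
The sensitivity and specificity at a given cut point on the prognostic score can be illustrated with a short Python sketch. The scores below are invented toy values, not study data, and treating a score at or above the cut point as test-positive is an assumption; the supplementary file may use a different convention.

```python
def sensitivity_specificity(case_scores, control_scores, cutoff):
    """Return (sensitivity, specificity) for one cut point.

    Assumed convention: a score >= cutoff counts as test-positive.
    """
    true_pos = sum(1 for s in case_scores if s >= cutoff)
    true_neg = sum(1 for s in control_scores if s < cutoff)
    return true_pos / len(case_scores), true_neg / len(control_scores)


# Toy prognostic scores (hypothetical, for illustration only).
case_scores = [3.0, 1.5, 1.5, 0.0]
control_scores = [1.5, 0.0, 0.0, 0.0]

for cutoff in (1.5, 3.0):
    sens, spec = sensitivity_specificity(case_scores, control_scores, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Raising the cut point trades sensitivity for specificity, which is the trade-off traced out by the ROC curves.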

COPD prognostic scores derived from the CPRD data

Table 4 presents the prognostic scores derived solely from the CPRD data. The scores ranged from 0 to 4.37. The corresponding ROCAUC for assessing the accuracy of the model was 0.74 (95% CI 0.73–0.74). The ROC curves are shown in Figure S1 of Supplementary File 5. The prognostic scores were similar whether the smoking and prior asthma codes from the CPRD or the PCCIU data were used in the analyses (data not shown).

Table 4
Prognostic scores derived from the CPRD data.


In this first ever external validation exercise of a COPD prediction model using the CPRD database, we have found prediction estimates similar to those derived from the PCCIU-based model, in both the magnitude and direction of effect. Whilst the prognostic scores based on the development model ranged from 0 to 7.50, similarly for males and females, the scores from the current validation study ranged from 0 to 5.71 for males and 0 to 5.95 for females. Similarly, whilst we observed some small differences in the associations between smoking, prior asthma, and deprivation quintiles and the risk of COPD, these were largely in the expected direction. As expected, the accuracy of the models as measured by the ROCAUC was somewhat lower in this validation study (males 0.66, females 0.71) compared to the development study (males 0.83, females 0.85); nonetheless, these estimates still have the potential to be very useful in clinical practice.

The CPRD database is a well-characterised population-based primary care database and is one of the best validated large primary care research databases in the world13,14. With sufficient sample size and power, we have, for the first time, validated a COPD prediction score in a different dataset, obtaining similar risk factors and analogous accuracy estimates to discriminate COPD cases from non-cases compared to the measures derived from the prediction development study11. In comparison to the prediction development study using the PCCIU database11, the coding for smoking and asthma was more complete in the current study using the CPRD data. However, no differences were seen between analyses using the CPRD-derived codes and those using the PCCIU-derived codes.

A limitation of this study is that it was based on a matched case-control design nested in a longitudinal population cohort (with estimation of the conditional odds ratios of the influence of the risk factors on the development of COPD), whereas the prediction development study was based on a follow-up cohort study (with estimation of the hazard ratios of the influence of the risk factors on a 10-year risk of COPD). Nevertheless, the estimates of associations between the various risk factors and the development of COPD were comparable in both direction and magnitude between the validation and prediction development studies. Although the measures of accuracy (using the ROCAUC) of the prediction model in discriminating between COPD cases and controls were slightly lower in the current study compared to the prediction development study, the differences were as expected given previous evidence: accuracy measures of prediction models are more often lower in an external validation dataset than in the dataset used to develop the model15. This work has therefore served to confirm the importance of undertaking external validation studies. A further limitation of our work is that spirometry measures, which are the gold standard for diagnosing COPD, are not routinely recorded in the GP databases, hence we could not utilise them in this study. Similarly, pack-years of smoking, a desired exposure indicator to assess the causal impact of smoking, is not routinely collected in GP databases, hence we did not consider it in this study.

Because this is the first validation of a risk score for predicting the development of COPD in a different external database, there is no applicable previous study with which to compare the results. Overall, the risk estimates for the studied risk factors (smoking, prior asthma, and socioeconomic status) were comparable to the estimates from the prediction model development study. Whilst a recent study using CPRD undertook an external validation of the prediction scores, the validation work was done within the same database: the authors split the original data into two datasets (development and validation samples), using one dataset to develop the prediction scores and the second dataset, comprising 20 CPRD general practices, to validate the scores12. External validation of prediction models aims to assess the generalisability of the derived model in an appropriate similar patient population, but in a different context; this work therefore needs to be undertaken in a new dataset15,16,17.

In comparison to the recent study using the CPRD database to develop and validate a prediction model12, the risk estimates with regard to smoking and asthma, although in the same direction as observed in the current study, differed somewhat in magnitude, possibly because different definitions were used in the two studies. In the current study we defined smoking as “ever smoking” while the previous study differentiated between former and current smoking12. Another difference between the current study and the previous CPRD study was that the variables included in the final model differed, which may explain the differences observed: whilst the final model in the current study included smoking status, prior asthma, and IMD, the previous study included smoking status, prior asthma, salbutamol prescription, and lower respiratory tract infections. A further difference relates to the time coverage of participants’ enrolment into the two studies: the current study included participants between 1992 and 2012, whereas the previous study covered 2000 to 2006. The UK Quality and Outcomes Framework (QOF) for COPD started in 2004, at which point the coding of COPD improved5,6. However, since both studies contained data both prior to and after the start of the QOF, we believe the time coverage of studies may not have substantially influenced the observed variations in results.

Other previous studies from the US7,8 and Denmark9 developed prediction algorithms aimed at identifying COPD patients with already established disease. They were also based on secondary care data, such as administrative claims data, outpatient pharmacy data, and hospital admissions data, and hence contrast with the current study, which used population-based primary care data to validate prediction scores aimed at identifying at-risk individuals before the onset of disease.

As the incidence of COPD and its associated mortality continue to rise globally, our ability to detect cases before they manifest is crucial6,11,12. The current COPD prediction model, now externally validated in a different dataset but a comparable population, provides a convincing opportunity for evaluating its usefulness and applicability in clinical care settings for identifying individuals at high risk of developing the disease18. The current validation confirms the importance of smoking history, prior asthma, and socioeconomic status19, which in combination provide a composite prediction score to accurately identify individuals in different risk categories of developing COPD (in our prediction model, the low-risk category had risk scores ≤6, the medium-risk category a risk score of 7, and the high-risk category risk scores of 8–10)11. The current study is the first to externally validate a COPD prediction score. Whilst emerging COPD prediction models from other contexts also need to be externally validated in different datasets, we believe that the risk scores derived from the current external validation exercise can now be assessed for their applicability in clinical settings, with a concurrent impact assessment of their performance.

In conclusion, using a large internationally respected database, we have, for the first time, validated a COPD prediction model for identifying at-risk individuals prior to the onset of disease in a different database but a comparable population, with acceptable accuracy at discriminating between COPD cases and non-COPD cases. Key predictors of onset of COPD include smoking, prior asthma, and socioeconomic status. Consideration now needs to be given, with a concurrent impact assessment on performance, to whether the validated risk score is useful and can be readily applied in clinical practice for identifying those at risk of developing COPD.


Study population

The CPRD database is a validated, computerised, anonymised and longitudinal primary care database, considered by many as the gold standard of routine clinical research data13,14. It is jointly funded by the UK National Health Service (NHS) and the Medicines and Healthcare products Regulatory Agency (MHRA). Data are linkable to other healthcare and social care data sources, and are regularly used to conduct both observational and interventional research in the UK. Presently the database comprises around 14 million patients derived from 660 primary care practices across the UK13,14. We received access to CPRD under licence from the Medical Research Council (MRC). The CPRD Group has obtained ethical approval from a Multi-centre Research Ethics Committee for all purely observational research using the CPRD database. The protocol of the current study was approved by the Independent Scientific Advisory Committee of CPRD (protocol number 10_084 R). All the study methods were performed in accordance with the relevant guidelines and regulations and in accordance with best scientific practices.

We extracted a random sample from the CPRD of COPD cases and controls aged ≥35 years: index cases and their corresponding controls were matched at a ratio of 1:1 on GP practice, sex, and year of birth (within two years). Cases comprised individuals with a first recorded COPD diagnosis who had been followed for at least five years prior to the index date (i.e., date of drawing the sample). Identification of cases and controls and definition of other study variables were based on the Read Clinical Classification System, a standard coding system produced for clinicians in primary care, which is used for most primary care electronic patient records in the UK (a complete list of Read codes used for this study is given in Supplementary File 1). A control must have been registered at the same practice at the time of the index date of the corresponding case and should have had at least five years of follow-up in the same practice prior to the index date of the case. The index date for the controls was the date of the recording of COPD for the corresponding matched case.
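
The 1:1 matching rule described above can be sketched in Python as follows. This is a minimal illustration and not the extraction procedure used with CPRD: the record fields (`practice`, `sex`, `birth_year`), the greedy first-available strategy, and the toy records are all assumptions, and the five-year follow-up requirement is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    pid: int
    practice: str
    sex: str
    birth_year: int


def match_controls(cases, candidates, max_year_gap=2):
    """Greedily pair each case with one unused control from the same
    practice, of the same sex, born within max_year_gap years."""
    used = set()
    pairs = []
    for case in cases:
        for ctrl in candidates:
            if ctrl.pid in used:
                continue
            if (ctrl.practice == case.practice
                    and ctrl.sex == case.sex
                    and abs(ctrl.birth_year - case.birth_year) <= max_year_gap):
                used.add(ctrl.pid)
                pairs.append((case.pid, ctrl.pid))
                break  # 1:1 matching: stop after the first eligible control
    return pairs


# Hypothetical records for illustration only.
cases = [Person(1, "A", "M", 1950), Person(2, "B", "F", 1960)]
candidates = [
    Person(10, "A", "M", 1953),  # born 3 years apart from case 1: not eligible
    Person(11, "A", "M", 1951),
    Person(12, "B", "F", 1962),
]
pairs = match_controls(cases, candidates)
print(pairs)
```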

In total, we sampled 38,597 case-control pairs (total N = 77,194). This was the maximum number of patients that could be extracted under the Medical Research Council licence. Of these, 188 (84 [0.22%] cases and 104 [0.27%] controls) had missing data for IMD, hence were excluded from analyses. The IMD is the UK government’s measure of a household’s socio-economic status based on the level of deprivation of an area. Two controls without ID numbers and their corresponding cases were further excluded, resulting in a total of 77,002 (38,511 cases and 38,491 controls) as the final sample for analyses.
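
The sample flow above can be checked with a few lines of Python. The only assumption is that the two controls without ID numbers and their two matched cases were not already among those excluded for missing IMD, which is what the reported totals imply.

```python
pairs_sampled = 38_597
total_sampled = 2 * pairs_sampled  # cases plus controls

missing_imd_cases = 84
missing_imd_controls = 104
controls_without_id = 2  # their matched cases were excluded as well

cases = pairs_sampled - missing_imd_cases - controls_without_id
controls = pairs_sampled - missing_imd_controls - controls_without_id
final_total = cases + controls

print(total_sampled, cases, controls, final_total)
```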

Assessment and definition of risk factors

From CPRD, we extracted the same variables used in the PCCIU data to develop the prediction model (i.e., smoking, prior asthma, and IMD); the current study sample was matched for age and sex. Smoking status was categorised into “never smoker” (i.e., patients recorded as “non-smoker” at any time and no coding as “smoker” or “ex-smoker” at any other time) or “ever smoker” (i.e., patients recorded as “smoker” or “ex-smoker” at any time); “never smoker” was the reference category. Prior asthma was categorised as “no” or “yes”; “no” was the reference category. As the study was nested within a longitudinal cohort, the timing of occurrence of asthma was determined and we ensured that only asthma cases occurring prior to COPD were defined as positive. IMD was categorised into quintiles (quintile 1 as least deprived and quintile 5 as most deprived); quintile 1 was the reference category.
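
The categorisation rules above can be sketched in Python. The status strings and function names are assumptions for illustration (the study itself worked from Read codes in Stata), and the text does not say how patients with no smoking record were handled, so the sketch returns `None` in that case.

```python
def smoking_category(recorded_statuses):
    """Collapse all smoking statuses ever recorded for a patient.

    Any 'smoker' or 'ex-smoker' record makes the patient an ever smoker;
    a patient with only 'non-smoker' records is a never smoker.
    """
    statuses = {s.strip().lower() for s in recorded_statuses}
    if statuses & {"smoker", "ex-smoker"}:
        return "ever smoker"
    if "non-smoker" in statuses:
        return "never smoker"
    return None  # no smoking record: handling is not specified in the text


def imd_dummies(quintile):
    """Indicator variables for IMD quintiles 2-5; quintile 1 is the reference."""
    if quintile not in (1, 2, 3, 4, 5):
        raise ValueError("IMD quintile must be an integer from 1 to 5")
    return {f"IMD{q}": int(quintile == q) for q in range(2, 6)}


print(smoking_category(["non-smoker", "ex-smoker"]))
print(imd_dummies(3))
```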

We observed some differences in coding of smoking and asthma between the CPRD and PCCIU data: in comparison to the codes in PCCIU, CPRD included four extra codes for smoking status and 15 extra codes for asthma (see Supplementary File 1). This resulted in about 0.5% extra ever smokers and about 12% extra prior asthma cases by using CPRD codes compared with PCCIU codes. The Read code H33.00 (Asthma) contributed most (85%) of the extra asthma cases in CPRD. Whilst we used CPRD-derived codes throughout the validation exercise, we also separately analysed PCCIU-derived codes to assess whether there were any differences between the two sets of codes in terms of the direction and magnitude of the prediction scores.

Statistical analysis

The Stata code for data preparation and analyses is presented in Supplementary Files 2 and 3, respectively. For descriptive analysis we calculated the frequencies of participants’ background characteristics by cases and controls. Assessment of the associations between the risk factors and the risk of COPD was performed by calculating the unadjusted and adjusted risk estimates using conditional logistic regression. These estimates are reported as odds ratios (OR) accompanied by their 95% confidence intervals (95% CI). It should be noted that while the development model was based on longitudinal data (hence modelling using Cox regression), the current validation work is based on a matched case-control study (hence the calculation using conditional logistic regression).
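
To illustrate what conditional logistic regression estimates in a 1:1 matched design, the sketch below uses the standard closed-form result for a single binary exposure: the conditional maximum-likelihood odds ratio is the ratio of the two kinds of discordant pairs, and concordant pairs carry no information. This covers only the unadjusted case; the adjusted models in the study require fitting the full conditional likelihood (as done in Stata). The pair counts are invented for illustration.

```python
import math


def matched_pair_or(pairs, z=1.96):
    """Unadjusted conditional OR and Wald 95% CI for 1:1 matched pairs.

    pairs: iterable of (case_exposed, control_exposed) flags.
    """
    b = sum(1 for case_exp, ctrl_exp in pairs if case_exp and not ctrl_exp)
    c = sum(1 for case_exp, ctrl_exp in pairs if ctrl_exp and not case_exp)
    if b == 0 or c == 0:
        raise ValueError("both kinds of discordant pair are needed")
    odds_ratio = b / c
    se_log_or = math.sqrt(1 / b + 1 / c)
    lower = odds_ratio * math.exp(-z * se_log_or)
    upper = odds_ratio * math.exp(z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical pair counts: 30 and 10 discordant pairs, 60 concordant pairs.
toy_pairs = [(1, 0)] * 30 + [(0, 1)] * 10 + [(1, 1)] * 40 + [(0, 0)] * 20
or_hat, ci_low, ci_high = matched_pair_or(toy_pairs)
print(f"OR {or_hat:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```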

A two-part approach was used to undertake the external validation of the prediction model. First, the risk scores derived from the prediction model (based on coefficients across categories of the variables included in the fully adjusted Cox regression model)11 were applied to the corresponding categories of each predictor in the validation model, thus deriving the prognostic scores for each individual in the validation model of the CPRD data; this was done separately for males and females. The resulting risk scores were then divided into 10 risk categories (deciles). The approach is described in Table 5.

Table 5
Calculation of prediction risk scores for males and females derived from the PCCIU dataset.

For example, a 60-year-old male ever smoker, with IMD level 3 and no previous history of asthma would yield a prognostic index (PI) of 5.0 (risk category 9 = high risk of future COPD); a 40-year-old female never smoker, with IMD 3 and a previous history of asthma would yield a PI of 2.2 (risk category 3 = low risk of future COPD).
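
The step of dividing individual scores into 10 risk categories can be sketched in Python. The nearest-rank cut points and the tie-handling rule below are assumptions, since the exact decile rule is not stated in the text, and the scores are invented. The example also shows why a score built from a few categorical predictors can leave some deciles empty: many individuals share the same score.

```python
from bisect import bisect_left


def decile_cutpoints(scores):
    """Nine cut points: the k-th is the value at rank ceil(k * n / 10)."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[-(-k * n // 10) - 1] for k in range(1, 10)]


def risk_category(score, cutpoints):
    """Risk category 1-10: one plus the number of cut points below the score."""
    return bisect_left(cutpoints, score) + 1


# Ten hypothetical individuals who share only three distinct scores.
scores = [0.0] * 4 + [1.5] * 3 + [3.0] * 3
cutpoints = decile_cutpoints(scores)
categories = sorted({risk_category(s, cutpoints) for s in scores})
print(cutpoints)
print(categories)
```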

In the second part of the validation model, the prognostic scores for each individual were derived solely from the CPRD data using the method applied in the model development: i.e., calculation of the risk scores as the sum of the regression coefficients across the different categories of the variables in the fully adjusted conditional logistic regression model. Because the CPRD validation dataset was matched for sex and age, the scores were calculated for males and females combined and not separated by sex. The resultant risk scores were then divided into deciles. Illustratively, given the beta regression coefficients for the categories of the different factors as β, calculation of the risk scores was given as follows:

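\[
\text{Risk score} = \beta(\text{smoking}) \cdot \text{Smoking} + \beta(\text{IMD2}) \cdot \text{IMD2} + \beta(\text{IMD3}) \cdot \text{IMD3} + \beta(\text{IMD4}) \cdot \text{IMD4} + \beta(\text{IMD5}) \cdot \text{IMD5} + \beta(\text{asthma}) \cdot \text{Asthma}
\]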

where β(smoking)*Smoking is the coefficient for an ever smoker, with never smoker as the reference; β(IMD2)*IMD2 to β(IMD5)*IMD5 are the coefficients for an individual in the second to the fifth (most deprived) quintile of IMD, with the first quintile as the reference; and β(asthma)*Asthma is the coefficient for an individual with prior asthma, with no asthma as the reference.
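
The sum of coefficients described above can be sketched in Python. The β values below are hypothetical placeholders chosen for illustration; they are not the coefficients estimated in this study, which are reported in the tables.

```python
# Hypothetical coefficients (log odds ratios); NOT the study's estimates.
BETAS = {
    "smoking": 1.9,
    "IMD2": 0.2,
    "IMD3": 0.4,
    "IMD4": 0.7,
    "IMD5": 1.0,
    "asthma": 1.6,
}


def risk_score(ever_smoker, imd_quintile, prior_asthma, betas=BETAS):
    """Sum the coefficients of the non-reference categories that apply."""
    if imd_quintile not in (1, 2, 3, 4, 5):
        raise ValueError("IMD quintile must be an integer from 1 to 5")
    score = 0.0
    if ever_smoker:
        score += betas["smoking"]
    if imd_quintile > 1:  # quintile 1 is the reference and adds nothing
        score += betas[f"IMD{imd_quintile}"]
    if prior_asthma:
        score += betas["asthma"]
    return score


# A never smoker in IMD quintile 1 without asthma scores 0 by construction.
print(risk_score(False, 1, False))
print(risk_score(True, 5, True))
```

With reference categories contributing nothing, the minimum possible score is 0, which agrees with the lower bound of the score ranges reported in the results.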

For each validation model, the accuracy of the prognostic scores in discriminating between COPD and non-COPD patients was estimated by calculating the area under the receiver operator characteristic curve (ROCAUC) for all values of the scores.
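
The ROCAUC can be read as the probability that a randomly chosen case has a higher prognostic score than a randomly chosen non-case, with ties counted as one half. The Python sketch below computes it directly from that definition. It is an illustration on invented scores, not the Stata routine used in the study, and it compares every case with every control rather than only matched pairs.

```python
def roc_auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy prognostic scores (hypothetical, for illustration only).
case_scores = [3.0, 1.5, 1.5, 0.0]
control_scores = [1.5, 0.0, 0.0, 0.0]
auc = roc_auc(case_scores, control_scores)
print(auc)
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination. The pairwise loop is quadratic in sample size, so a rank-based formula would be preferable for data of the size used here.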

Additional Information

How to cite this article: Nwaru, B. I. et al. External validation of a COPD prediction model using population-based primary care data: a nested case-control study. Sci. Rep. 7, 44702; doi: 10.1038/srep44702 (2017).

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Access to the CPRD database was funded through the Medical Research Council’s licence agreement with the Medicines and Healthcare products Regulatory Agency. BN, CS and AS were supported by the Farr Institute and Asthma UK Centre for Applied Research. DK received additional support from the Ministry for Innovation, Science and Research of the German Federal State of North Rhine-Westphalia (“NRW-Rückkehrprogramm”).


The authors declare no competing financial interests.

Author Contributions

B.N. undertook data analysis and drafting of the manuscript with several rounds of critical comments from D.K. A.S. and C.S. contributed to the development of the plans for this work, were involved in drafting the protocol and commented critically on earlier drafts of the manuscript.


  1. Lozano R. et al. Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: a systematic analysis for the Global Burden of Disease Study 2010. Lancet. 380, 2095–2128 (2012).
  2. World Health Organization. Geneva: World health report: chronic respiratory diseases. Available from: (Accessed July 10, 2015).
  3. American Lung Association. Trends in COPD (Chronic Bronchitis and Emphysema): morbidity and mortality. March 2013. Available from: (Accessed July 10, 2015).
  4. Pride N. B. & Soriano J. B. Chronic obstructive pulmonary disease in the United Kingdom: trends in mortality, morbidity, and smoking. Curr Opin Pulm Med. 8, 95–101 (2002).
  5. Calderón-Larraňaga A. et al. Association of population and primary healthcare factors with hospital admission rates for chronic obstructive pulmonary disease in England: national cross-sectional study. Thorax 66, 191–196 (2011).
  6. Quint J. F. Are clinical risk scores for COPD useful? BMJ Open Resp Res. 2, e000072 (2015).
  7. Mapel D. W. et al. An algorithm for the identification of undiagnosed COPD cases using administrative claims data. J Manag Care Pharm. 12, 458–465 (2006).
  8. Mapel W. E., Petersen H., Roberts M. H. et al. Can outpatient pharmacy data identify persons with undiagnosed COPD? Am J Manag Care. 16, 505–512 (2010).
  9. Smidth M., Sokolowski I., Kaersvang L. & Vedsted P. Developing an algorithm to identify people with chronic obstructive pulmonary disease (COPD) using administrative data. BMC Med Inform Decis. 12, 38 (2012).
  10. Hill K. et al. Prevalence and underdiagnosis of chronic obstructive pulmonary disease among patients at risk in primary care. CMAJ. 182, 673–678 (2010).
  11. Kotz D., Simpson C. R., Viechtbauer W., van Schayck O. C. & Sheikh A. Development and validation of model to predict the 10-year risk of general practitioner-recorded COPD. npj Prim Care Respir Med. 24, 14011 (2014).
  12. Haroon S. et al. Predicting the risk of COPD in primary care: development and validation of a clinical risk score. BMJ Open Resp Res. 1, e000060 (2014).
  13. Williams T., van Staa T., Puri S. & Eaton S. Recent advances and use of the General Practice Research Database as an example of a UK Primary Care Data resource. Ther Adv Drug Saf. 3, 89–99 (2012).
  14. Quint J. K. et al. Validation of chronic obstructive pulmonary disease recording in the Clinical Practice Research Datalink (CPRD-GOLD). BMJ Open. 4, e005540 (2014).
  15. Siontis G. C. M., Tzoulaki I., Castaldi P. J. & Ioannidis J. P. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 68, 25–34 (2015).
  16. Altman D. G. & Royston P. What do we mean by validating a prognostic model? Statist Med. 19, 453–473 (2000).
  17. Altman D. G., Vergouwe Y., Royston P. & Moons K. G. Prognosis and prognostic research: validating a prognostic model. BMJ. 338, b605 (2009).
  18. Robson J. et al. The NHS Health Check programme: implementation in east London 2009–2011. BMJ Open. 5, e007578 (2015).
  19. Gershon A. S., Warner L., Cascagnette P., Victor J. C. & To T. Lifetime risk of developing chronic obstructive pulmonary disease: a longitudinal population study. Lancet 378, 991–996 (2011).

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group