Eur J Epidemiol. Apr 2011; 26(4): 261–264.
Published online Mar 24, 2011. doi:  10.1007/s10654-011-9567-4
PMCID: PMC3088798
PredictABEL: an R package for the assessment of risk prediction models
Suman Kundu, Yurii S. Aulchenko, Cornelia M. van Duijn, and A. Cecile J. W. Janssens
Department of Epidemiology, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam, The Netherlands
Corresponding author: A. Cecile J. W. Janssens, Phone: +31-10-7044214, Fax: +31-10-7044657, a.janssens/at/erasmusmc.nl.
Received February 8, 2011; Accepted March 8, 2011.
The rapid identification of genetic markers for multifactorial diseases from genome-wide association studies is fuelling interest in investigating the predictive ability and health care utility of genetic risk models. Various measures are available for the assessment of risk prediction models, each addressing a different aspect of performance and utility. We developed PredictABEL, a package in R that covers the descriptive tables, measures and figures used in the analysis of risk prediction studies, such as measures of model fit, predictive ability and clinical utility, as well as risk distributions, the calibration plot and the receiver operating characteristic plot. Tables and figures are saved as separate files in a user-specified format, including publication-quality EPS and TIFF formats. All figures are available in a ready-made layout, but they can be customized to the preferences of the user. The package has been developed for the analysis of genetic risk prediction studies, but can also be used for studies that only include non-genetic risk factors. PredictABEL is freely available at the websites of GenABEL (http://www.genabel.org) and CRAN (http://cran.r-project.org/).
Keywords: Risk prediction, Genetic, Assessment, Measures, Software
The rapid identification of genetic markers for multifactorial diseases from genome-wide association studies is fuelling interest in investigating the predictive ability and health care utility of genetic risk models. Genetic risk models are investigated for their potential to target diagnostic, preventive and therapeutic interventions for multifactorial diseases. Implementation of these models in health care requires a series of studies that encompass all phases of translational research [1, 2], starting with a comprehensive evaluation of genetic risk prediction.
Various measures are available for the assessment of risk prediction models, each addressing a different aspect of performance and utility [3, 4]. The GRIPS Statement recommends that transparent and complete reporting should provide a description of the risk factors and the risk model by reporting univariate and multivariate odds ratios for the predictors, present risk distributions for individuals with and without the outcome of interest, and report measures of model fit, predictive ability and others, if pertinent [5, 6]. Examples of measures include the Hosmer–Lemeshow statistic [7] and Nagelkerke’s R² [8] for model fit, the area under the receiver operating characteristic (ROC) curve (AUC) [9] and integrated discrimination improvement (IDI) [10] for predictive ability, and percentages of total reclassification [11] and net reclassification improvement (NRI) [10] for clinical utility.
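For orientation, three of these measures can be written compactly. The formulas below follow the standard definitions in the cited literature [9, 10]; the notation is ours and is not taken from the package: \hat p denotes a predicted risk, "up" and "down" denote a move to a higher or lower risk category under the updated model, and \bar p denotes a mean predicted risk.

```latex
\begin{align*}
\mathrm{AUC} &= P(\hat p_i > \hat p_j) + \tfrac{1}{2}\, P(\hat p_i = \hat p_j),
  \quad i \text{ with the outcome},\ j \text{ without} \\
\mathrm{NRI} &= \bigl[ P(\text{up} \mid \text{event}) - P(\text{down} \mid \text{event}) \bigr]
  + \bigl[ P(\text{down} \mid \text{nonevent}) - P(\text{up} \mid \text{nonevent}) \bigr] \\
\mathrm{IDI} &= \bigl( \bar p_{\text{new, events}} - \bar p_{\text{old, events}} \bigr)
  - \bigl( \bar p_{\text{new, nonevents}} - \bar p_{\text{old, nonevents}} \bigr)
\end{align*}
```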
Even though the assessment of risk prediction models is relatively standard, there is no single statistical package that would allow for the computation and production of all these measures and plots. Therefore, we developed PredictABEL, a freely available R package, which contains functions to obtain all descriptive tables, measures and plots that are used in genetic risk prediction studies.
The core part of PredictABEL comprises functions for the assessment of risk prediction models. The measures and plots covered in PredictABEL are listed in Table 1. Most functions can be applied to predicted risks, risk scores or any other continuous predictor variable, but some can be applied to predicted risks (probabilities) only. Predicted risks and genetic risk scores can be obtained using functions in the package, but they can be imported from other programs as well. The functions to obtain predicted risks using logistic regression analysis are specifically written for models that include genetic variables, possibly in addition to non-genetic factors, but they can also be applied to construct models based on non-genetic risk factors only. Genetic risk scores can be computed as unweighted and weighted risk scores, where weights are obtained from uploaded data or imported from meta-analyses, e.g., as beta coefficients.
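The difference between unweighted and weighted genetic risk scores can be illustrated with a minimal sketch. It is written in Python rather than R and does not reproduce the PredictABEL functions; the function names, the coding of genotypes as risk-allele counts (0, 1 or 2) and the beta values are assumptions made for illustration only.

```python
from typing import Sequence


def unweighted_risk_score(allele_counts: Sequence[int]) -> int:
    """Unweighted score: the total number of risk alleles carried."""
    return sum(allele_counts)


def weighted_risk_score(allele_counts: Sequence[int],
                        betas: Sequence[float]) -> float:
    """Weighted score: risk-allele counts multiplied by per-variant weights,
    e.g. beta coefficients (log odds ratios) from a meta-analysis."""
    if len(allele_counts) != len(betas):
        raise ValueError("one weight is required per genetic variant")
    return sum(b * g for g, b in zip(allele_counts, betas))


# Hypothetical individual genotyped at three variants, with made-up weights.
genotypes = [2, 0, 1]
betas = [0.5, 0.25, 1.0]

print(unweighted_risk_score(genotypes))       # 3
print(weighted_risk_score(genotypes, betas))  # 2.0
```

The unweighted score treats all variants as having equal effects, whereas the weighted score lets variants with larger effects contribute more.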
Table 1
Measures and plots covered in PredictABEL (version 1.1)
The tables and plots generated using PredictABEL are saved as separate files in the working directory. Tables can be saved as Excel or tab-delimited text files and figures can be saved as publication-quality EPS or TIFF files or as JPEG files for insertion in manuscripts. All figures are available in a ready-made layout, but they can be customized to the journal style or preferences of the user. A hypothetical dataset and examples of use are included in the package to demonstrate all functions.
The hypothetical dataset included in the package was reconstructed from an empirical study on age-related macular degeneration (AMD) [12], using a simulation method that has been described in detail elsewhere [13]. Based on published frequencies and odds ratios of the genetic variants and non-genetic risk factors implicated in AMD and on published population disease risks, we created a dataset that contains genotype data and disease status for 10,000 individuals. Predicted risks were obtained using logistic regression analysis, for which the codes are provided in the package. Two risk models were constructed: a model based on non-genetic risk factors only and a model based on genetic and non-genetic predictors.
Figure 1 presents three examples of plots that are produced by PredictABEL. Figure 1a shows distributions of predicted risks based on genetic and non-genetic factors for individuals with and without AMD. The degree of overlap between the two histograms is indicative of the discriminative accuracy of the risk model. This discriminative accuracy is assessed by the AUC and visualized in a ROC plot. Figure 1b presents the ROC curves for the two risk models. The figure shows that the model with genetic factors had a higher AUC than the model without. The same function quantified the AUC values as 0.80 for the model with genetic factors and 0.74 for the model without. Finally, Fig. 1c presents the calibration plot for the risk model based on the genetic and non-genetic variables as predictors, which shows how well predicted risks match observed risks. The calibration plot suggests that the model was well calibrated, which was supported by the non-significance of the Hosmer–Lemeshow test (P = 0.65).
Fig. 1
Example graphs produced by PredictABEL. a Distributions of predicted risks in individuals with and without age-related macular degeneration (AMD); b ROC plot presenting risk models without and with genetic variants; and c Calibration plot comparing predicted ...
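The quantities behind the ROC and calibration plots can be sketched in a few lines. The Python code below is illustrative and is not the PredictABEL implementation: the function names and the toy data are hypothetical, the AUC is computed by direct pairwise comparison of individuals with and without the outcome, and the calibration summary simply compares mean predicted risk with the observed outcome proportion in groups of increasing predicted risk.

```python
from typing import List, Sequence, Tuple


def auc(outcomes: Sequence[int], risks: Sequence[float]) -> float:
    """Probability that a random individual with the outcome has a higher
    predicted risk than a random individual without it (ties count 0.5)."""
    cases = [r for y, r in zip(outcomes, risks) if y == 1]
    controls = [r for y, r in zip(outcomes, risks) if y == 0]
    if not cases or not controls:
        raise ValueError("both outcome groups must be present")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def calibration_groups(outcomes: Sequence[int], risks: Sequence[float],
                       groups: int = 10) -> List[Tuple[float, float]]:
    """(mean predicted risk, observed proportion) per group, with groups
    formed by sorting individuals on predicted risk."""
    pairs = sorted(zip(risks, outcomes))
    n = len(pairs)
    result = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if chunk:  # skip empty groups when there are fewer people than groups
            mean_pred = sum(r for r, _ in chunk) / len(chunk)
            observed = sum(y for _, y in chunk) / len(chunk)
            result.append((mean_pred, observed))
    return result


# Toy data: six hypothetical individuals.
y = [0, 0, 1, 0, 1, 1]
p = [0.1, 0.3, 0.35, 0.4, 0.7, 0.8]

print(auc(y, p))  # 8 of 9 case-control pairs are ranked correctly
for mean_pred, observed in calibration_groups(y, p, groups=2):
    print(round(mean_pred, 3), round(observed, 3))
```

In a well-calibrated model the two numbers in each group are close to each other, which is what a calibration plot displays graphically.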
Table 2 presents an example of the reclassification table and statistics that are produced by PredictABEL. The reclassification table presents the categorization into risk groups according to the initial and updated risk models. The table provides information about the total number of individuals who change between risk categories and about correct and incorrect reclassification. The percentage of total reclassification and NRI are calculated from the reclassification table. The table indicates that net 8.8% of the individuals without AMD and 9.6% of those with AMD would be correctly reclassified when the clinical model was updated by the addition of genetic factors.
Table 2
Reclassification table comparing clinical risk models without and with genetic factors
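How the net percentages of correct reclassification follow from a reclassification table can be sketched as below. This Python example is a hedged illustration, not the PredictABEL code: the function names, the risk cutoffs and the eight hypothetical individuals are assumptions. The two returned components correspond to the net proportions of correctly reclassified individuals with and without the outcome; their sum is the NRI [10].

```python
from bisect import bisect_right
from typing import Dict, Sequence, Tuple


def risk_category(risk: float, cutoffs: Sequence[float]) -> int:
    """Index of the risk category; cutoffs are ascending and each cutoff
    is the lower bound of the next category."""
    return bisect_right(cutoffs, risk)


def reclassification(outcomes: Sequence[int], risk_old: Sequence[float],
                     risk_new: Sequence[float], cutoffs: Sequence[float]
                     ) -> Tuple[Dict[int, Dict[Tuple[int, int], int]], float, float]:
    """Cross-tabulate old versus new risk categories per outcome group and
    return (tables, net correct reclassification among individuals with the
    outcome, net correct reclassification among individuals without it)."""
    tables: Dict[int, Dict[Tuple[int, int], int]] = {0: {}, 1: {}}
    up = {0: 0, 1: 0}
    down = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for y, r_old, r_new in zip(outcomes, risk_old, risk_new):
        c_old = risk_category(r_old, cutoffs)
        c_new = risk_category(r_new, cutoffs)
        tables[y][(c_old, c_new)] = tables[y].get((c_old, c_new), 0) + 1
        total[y] += 1
        if c_new > c_old:
            up[y] += 1
        elif c_new < c_old:
            down[y] += 1
    # Moving up is correct for events; moving down is correct for nonevents.
    nri_events = (up[1] - down[1]) / total[1]
    nri_nonevents = (down[0] - up[0]) / total[0]
    return tables, nri_events, nri_nonevents


# Eight hypothetical individuals, risk categories <10%, 10-30% and >=30%.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0]
old = [0.05, 0.2, 0.2, 0.4, 0.05, 0.2, 0.2, 0.4]
new = [0.2, 0.4, 0.2, 0.2, 0.05, 0.05, 0.2, 0.4]

tables, nri_events, nri_nonevents = reclassification(outcomes, old, new, [0.1, 0.3])
print(nri_events, nri_nonevents, nri_events + nri_nonevents)  # 0.25 0.25 0.5
```

Among the four individuals with the outcome, two move up and one moves down (net 25%); among the four without, one moves down and none move up (net 25%).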
PredictABEL is a comprehensive software package, designed for the development and assessment of genetic risk prediction models. PredictABEL is part of the GenABEL software suite for statistical genomics [14, 15] and is for that reason written in R, to enable easy transfer of data from gene discovery to genetic prediction studies. A detailed manual is available that demonstrates and explains all the functions in the package. The manual is also accessible to researchers who do not regularly use R software. The manual and the package are freely available from the GenABEL project website (http://www.genabel.org) and from CRAN (http://cran.r-project.org/).
The current version of PredictABEL (version 1.1) includes all basic descriptive tables, measures and plots that are used in the assessment of risk prediction models. Planned extensions of the package include other strategies to construct risk models, e.g., using Cox Proportional Hazards analysis for prospective data, and functions to construct simulated data for the evaluation of genetic risk models [13]. Furthermore, we will optimize the interconnectivity between PredictABEL and other packages in the GenABEL suite.
Where the GRIPS Statement aims to improve the transparency, quality and completeness of reporting [5, 6], PredictABEL has similar goals for the assessment of genetic risk prediction studies. The collection of all measures and plots in a single software package gives a comprehensive overview of the various measures that are available for the assessment of risk prediction studies. This overview emphasizes that different measures are available to answer different questions in the assessment of risk models and facilitates the selection of the most appropriate measure for the question under study.
Acknowledgments
This work was supported by the Vidi grant from the Netherlands Organization for Scientific Research (NWO), the Young Investigator grant from the Erasmus University Medical Center Rotterdam and by the Center for Medical Systems Biology within the framework of the Netherlands Genomics Initiative.
Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Abbreviations
AUC: Area under the ROC curve
IDI: Integrated discrimination improvement
NRI: Net reclassification improvement
ROC: Receiver operating characteristic


1. Khoury MJ, Gwinn M, Yoon PW, Dowling N, Moore CA, Bradley L. The continuum of translation research in genomic medicine: how can we accelerate the appropriate integration of human genome discoveries into health care and disease prevention? Genet Med. 2007;9:665–674. doi: 10.1097/GIM.0b013e31815699d0.
2. Hlatky MA, Greenland P, Arnett DK, Ballantyne CM, Criqui MH, Elkind MS, et al. Criteria for evaluation of novel markers of cardiovascular risk: a scientific statement from the American Heart Association. Circulation. 2009;119:2408–2416. doi: 10.1161/CIRCULATIONAHA.109.192278.
3. McGeechan K, Macaskill P, Irwig L, Liew G, Wong TY. Assessing new biomarkers and predictive models for use in clinical practice: a clinician’s guide. Arch Intern Med. 2008;168:2304–2310. doi: 10.1001/archinte.168.21.2304.
4. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128–138. doi: 10.1097/EDE.0b013e3181c30fb2.
5. Janssens ACJW, Ioannidis JPA, van Duijn CM, Little J, Khoury MJ. Strengthening the reporting of genetic risk prediction studies: The GRIPS statement. Eur J Epidemiol. 2011; 26. doi: 10.1007/s10654-011-9552-y.
6. Janssens ACJW, Ioannidis JPA, Bedrosian S, Boffetta P, Dolan SM, Dowling N, Fortier I, Freedman AN, Grimshaw JM, Gulcher J, Gwinn M, Hlatky MA, Janes H, Kraft P, Melillo S, O’Donnell CJ, Pencina MJ, Ransohoff D, Schully SD, Seminara D, Winn DM, Wright CF, van Duijn CM, Little J, Khoury MJ. Strengthening the reporting of genetic risk prediction studies (GRIPS): elaboration and explanation. Eur J Epidemiol. 2011; 26. doi: 10.1007/s10654-011-9551-z.
7. Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med. 1997;16:965–980. doi: 10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O.
8. Nagelkerke NJ. A note on a general definition of the coefficient of determination. Biometrika. 1991;78:691–692. doi: 10.1093/biomet/78.3.691.
9. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
10. Pencina MJ, D’Agostino RB, Sr, D’Agostino RB, Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27:157–172. doi: 10.1002/sim.2929.
11. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–935. doi: 10.1161/CIRCULATIONAHA.106.672402.
12. Seddon JM, Reynolds R, Maller J, Fagerness JA, Daly MJ, Rosner B. Prediction model for prevalence and incidence of advanced age-related macular degeneration based on genetic, demographic, and environmental variables. Invest Ophthalmol Vis Sci. 2009;50:2044–2053. doi: 10.1167/iovs.08-3064.
13. Janssens AC, Aulchenko YS, Elefante S, Borsboom GJ, Steyerberg EW, Duijn CM. Predictive testing for complex diseases using multiple genes: fact or fiction? Genet Med. 2006;8:395–400. doi: 10.1097/01.gim.0000229689.18263.f4.
14. Aulchenko YS, Ripke S, Isaacs A, Duijn CM. GenABEL: an R library for genome-wide association analysis. Bioinformatics. 2007;23:1294–1296. doi: 10.1093/bioinformatics/btm108.
15. Aulchenko YS, Struchalin MV, Duijn CM. ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics. 2010;11:134. doi: 10.1186/1471-2105-11-134.