Advances in biotechnology have raised expectations that biomarkers, including genetic profiles, will yield information to accurately predict outcomes for individuals. However, results to date have been disappointing. In addition, statistical methods to quantify the predictive information in markers have not been standardized.
We discuss statistical techniques to summarize predictive information including risk distribution curves and measures derived from them that relate to decision making. Attributes of these measures are contrasted with alternatives such as receiver operating characteristic curves, R-squared, percent reclassification and net reclassification index. Data are generated from simple models of risk conferred by genetic profiles for individuals in a population. Statistical techniques are illustrated and the risk prediction capacities of different risk models are quantified.
Risk distribution curves are most informative and relevant to clinical practice. They show proportions of subjects classified into clinically relevant risk categories. In a population in which 10% have the outcome event and subjects are categorized as high risk if their risk exceeds 20%, we found that identifying as high risk more than half of those destined to have an event required either 150 genes each with an odds ratio of 1.5 or 250 genes each with an odds ratio of 1.25, when the minor allele frequencies are 10%. We show that conclusions based on ROC curves may not be the same as conclusions based on risk distribution curves.
Many highly predictive genes will be required in order to identify substantial numbers of subjects at high risk.
Predicting risk is a natural part of human life. In the context of cancer research we seek markers that can predict the risk of developing cancer, predict the chance of responding to treatment for cancer, predict the risk of recurrence after treatment for cancer, and so forth. We have greeted advances in genomic, proteomic and imaging technologies with enthusiasm in part because of their potential to help predict outcomes for individuals. The first goal of this paper is to explore the extent to which we can expect genes or other factors to be predictive of individual risk.
Statistical measures used to quantify the predictive information in a marker are often difficult to understand and not directly relevant to clinical practice. For example, the area under the receiver operating characteristic curve (AUC) has been used to quantify the potential of newly discovered SNPs and genes to improve risk prediction (1,2). But there is no direct relationship between increments in the AUC and clinically meaningful improvements in risk prediction. A second goal of this paper is to give guidance on clinically relevant ways to measure the predictive information provided by a genetic profile or other risk predictor.
Suppose two risk prediction calculators have been developed for predicting an outcome event, such as contracting cancer or dying from cancer within a specified time period, using different sets of genes and possibly other risk factors. For each subject in the dataset, his risk of a bad outcome can be calculated using both models, model A and model B. For example, according to the values of genes in model A, the risk of an event for a subject may be calculated as 30%. On the other hand when information on genes in model B is used to calculate risk, his risk is calculated to be 50%. The topic of this section concerns how to quantify and compare the predictive capacities of models A and B. In other words, in this population which set of genes does the better job of predicting risk?
We use a very large simulated dataset to illustrate statistical approaches to quantifying predictive information. On purpose we do not provide details of the data here in order to focus on the statistical method. Details are provided in the appendix. Overall 10% of subjects have an event and both risk calculators are “correct” in the following sense: the calculated risk value for a subject with genetic profile yA, where yA are the genes in model A, reflects the proportion of events amongst all individuals who have genetic profile yA and similarly for model B. In standard statistical terminology this means that both models are well calibrated. The issue is that one set of genes may be more informative than the other. Note that a risk model with completely uninformative genes would “correctly” assign risk equal to 10% to everyone, because uninformative markers tell us nothing about individual risk. We now discuss various ways to describe the predictive information provided by models A and B.
The top panels of Figure 1 show the population distributions of risk calculated according to the models. Displaying risk distributions is a fundamental step in evaluating the performance of a risk prediction model (3,4), a step that is often overlooked in practice. We can see from the risk distributions the proportions of subjects identified as high or as low risk according to the risk models. For example, suppose we want to select for preventive intervention or treatment subjects at high risk for the event where risk levels at or above 20% are considered high. Only 1% of the population is identified as high risk according to model A, while 10% are identified as high risk according to model B. In this sense model B is better at identifying high risk subjects. Model B would be more useful as a screening tool for selecting patients to a clinical trial. The curves would also be of interest to individuals who are deciding whether or not to have their genetic information measured. Suppose an individual will opt for an intervention only if his risk is > 20%. In the absence of genetic information his risk is calculated as 10% and he declines intervention. There is only a 1% chance that his decision about intervention will change if he ascertains the genetic information required for model A but a 10% chance for model B. That is, the information in model A is unlikely to alter medical decisions while information in model B is more likely to have an impact.
The information displayed in the top panels of Figure 1 is also displayed in the top panels of Figure 2 but using cumulative distribution curves rather than probability density curves. The cumulative distribution curves are more useful because one can directly read from them the proportions of subjects whose risk values lie below or above a threshold of interest.
In considering the impact of risk models on individual decision making, one should consider the costs and benefits consequent to those decisions. Subjects who would have an event in the absence of intervention, whom we call cases, will benefit on average from high risk designation because they will receive the potentially beneficial intervention. On the other hand, subjects who would not have an event in the absence of intervention, whom we call control subjects, will not benefit from intervention but will only suffer its negative effects including monetary costs. The bottom panels of Figures 1 and 2 display risk distributions separately for case subjects and for control subjects. These displays are also important and useful in evaluating risk prediction models. One observes for example that the proportions of case subjects placed in the high risk category are 26% with use of model B but only 2% with use of model A. From this we conclude that there is more benefit to be gained with use of model B. However we also see that unfortunately 8% of controls are designated as high risk with model B which is substantially more than the corresponding proportion, 1%, for model A. In this sense there is more cost associated with model B as well as more benefit. An alternative but equivalent way to display the information in Figures 1 and 2 was previously described (4) and is displayed in Figure 3.
The receiver operating characteristic (ROC) curve is derived from the case and control risk distributions. It plots the proportion of case risk values exceeding a threshold, TPR(riskH), versus the proportion of control risk values exceeding that same threshold, FPR(riskH), where riskH is the risk threshold (see Figure 4). Unfortunately the risk thresholds themselves are not visible from the ROC curve so one cannot see the correspondence between risk threshold and case and control proportions that one can see from the risk distributions in Figures 2 and 3. Moreover, from the ROC curves alone one cannot compare two risk models with regard to total, case or control population proportions that exceed a risk threshold of interest. Therefore we suggest displaying the risk distributions because they are more informative than the ROC curves.
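As a minimal sketch of how TPR(riskH) and FPR(riskH) can be read from the case and control risk values while keeping the threshold visible, consider the following. The function names and the toy risk values are hypothetical and are not the simulated data of this paper; a subject is counted as high risk when the risk is at or above the threshold.

```python
def proportion_above(risks, threshold):
    """Fraction of risk values at or above the threshold."""
    if not risks:
        raise ValueError("risks must be non-empty")
    return sum(r >= threshold for r in risks) / len(risks)


def roc_points(case_risks, control_risks, thresholds):
    """Return (threshold, FPR, TPR) triples so the threshold stays visible."""
    return [
        (t, proportion_above(control_risks, t), proportion_above(case_risks, t))
        for t in thresholds
    ]


# Hypothetical toy risks, for illustration only.
case_risks = [0.05, 0.12, 0.22, 0.35]
control_risks = [0.03, 0.06, 0.08, 0.11, 0.25]

for t, fpr, tpr in roc_points(case_risks, control_risks, [0.10, 0.20, 0.30]):
    print(f"riskH={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting only the (FPR, TPR) pairs gives the ROC curve; retaining the first element of each triple is what the risk distribution displays add.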
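A single number is often used to summarize the predictive performance of a risk prediction model. The area under the ROC curve (AUC) is the most popular statistical index. However, it has been criticized (6) and recent criticisms in the cardiovascular literature (7) have led to much debate about alternative approaches to summarizing predictive performance (3,8–10). The AUC is the probability that, for a randomly chosen case-control pair, the calculated risk of the case exceeds that of the control:

$$\mathrm{AUC} = P\left(\mathrm{risk}_{\mathrm{case}} > \mathrm{risk}_{\mathrm{control}}\right)$$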
This entity is not of interest in practice because the practical problem is not to determine the case and control identities in a random case-control pair. Rather the problem is to correctly flag subjects at high risk of an event. AUC values of .6 for model A and .7 for model B (Table 1) suggest that model B is superior but do not quantify their predictive capacities in a clinically useful way.
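The random case-control pair interpretation of the AUC can be illustrated with a short sketch. The function name and the toy risks are hypothetical; ties are counted as one half, which is the usual convention for the Mann-Whitney-Wilcoxon statistic.

```python
def auc_pairwise(case_risks, control_risks):
    """Proportion of case-control pairs in which the case has the higher risk.

    Ties contribute one half.
    """
    wins = 0.0
    for case_risk in case_risks:
        for control_risk in control_risks:
            if case_risk > control_risk:
                wins += 1.0
            elif case_risk == control_risk:
                wins += 0.5
    return wins / (len(case_risks) * len(control_risks))


# Hypothetical toy risks, for illustration only.
case_risks = [0.05, 0.12, 0.22, 0.35]
control_risks = [0.03, 0.06, 0.08, 0.11, 0.25]
print(f"AUC = {auc_pairwise(case_risks, control_risks):.2f}")  # 14 of 20 pairs: 0.70
```

Note that no risk threshold appears anywhere in the calculation, which is why the AUC cannot speak to the proportion of subjects flagged as high risk.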
Most standard statistical indices of predictive performance summarize the difference between case and control risk distributions. The AUC is the Mann-Whitney-Wilcoxon statistic for testing for differences between these two risk distributions. The R-squared statistic, which is familiar from linear regression of continuous outcomes, generalizes to dichotomous event outcomes as

$$R^2 = \mathrm{mean}\left(\mathrm{risk} \mid \mathrm{case}\right) - \mathrm{mean}\left(\mathrm{risk} \mid \mathrm{control}\right)$$
Again, this entity, the difference in mean risks, does not relate directly to the task of identifying subjects at high risk. The average risk for cases minus the average risk for controls is 0.11−0.10=0.01 for model A and 0.15−0.09=0.06 for model B. Interestingly, this version of R-squared is the same (11) as the integrated discrimination improvement (IDI) statistic recently proposed by Pencina et al (8) as an alternative to the AUC. However, it does not solve the main problem with AUC, namely lack of clinical usefulness. Other versions of R-squared average functions of the risk values (12) and lack easy interpretation as well as practical relevance.
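The difference in mean risks is simple to compute from case and control risk values. The sketch below uses a hypothetical function name and toy risks, not the data behind models A and B.

```python
from statistics import mean


def mean_risk_difference(case_risks, control_risks):
    """Average risk among cases minus average risk among controls."""
    return mean(case_risks) - mean(control_risks)


# Hypothetical toy risks, for illustration only.
case_risks = [0.05, 0.12, 0.22, 0.35]
control_risks = [0.03, 0.06, 0.08, 0.11, 0.25]
print(f"difference in mean risks = {mean_risk_difference(case_risks, control_risks):.3f}")
```

As with the AUC, the summary averages over the whole risk range and involves no threshold.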
Since the AUC and R-squared do not represent quantities of clinical relevance, what should their roles be in evaluating the population performance of risk prediction models? We recommend that they be deemphasized in reporting study results. Rather than focusing on these single numerical summaries, the risk distributions themselves should be given greater prominence. Using a few risk thresholds of interest, one should report proportions of the case, control and overall populations that have risk values exceeding those thresholds. When no specific risk thresholds or risk ranges are of interest, one could complement the risk distribution displays with AUC or R-squared summary statistics and compare risk models by basing hypothesis tests on them (13).
Two new indices, the reclassification percent and the net reclassification improvement, have been proposed recently and are both based on the idea of categorizing risk. When two risk categories are defined, high versus low, based on a single risk threshold value, these statistics are directly related to the population proportions at high risk discussed earlier. We use our example with risk threshold equal to 20% to illustrate.
Consider first comparing a model, model B say, with no model. The reclassification percent (7) is the proportion of the whole population that is classified as high risk by the model, 10%, since in the absence of genetic data, all subjects are classified as low risk by assigning them risk values equal to the population prevalence ρ=10%. The net reclassification improvement (NRI) index proposed by Pencina et al (8) is the difference between the proportions of cases and controls classified as high risk, TPR−FPR=18%, in the notation used above. We recommend reporting the two components, TPR=26% and FPR=8%, separately, because it is more informative than just reporting the difference and offers the flexibility to weight differently the benefits and costs of high risk designations for cases and controls, respectively.
Now consider NRI and reclassification percent when comparing two models, model B versus model A. In this case the NRI = (TPRB−TPRA)− (FPRB−FPRA) = 17%. Again, reporting the components, namely the change in TPR, (TPRB−TPRA)= 26%−2%=24%, and the change in FPR, (FPRB−FPRA) = 8%−1%=7%, seems much more informative than reporting the single composite 17% number. The reclassification percent for comparing models A and B is the proportion of subjects who are classified in different risk categories according to the two models. This measure can be large or small even if the two models have exactly the same performance because it is a function of the correlation between the genes in the two models. Therefore, it has been argued (10) that reclassification percent is not well suited to the task of comparing models.
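The two-category NRI and its components can be reproduced from the TPR and FPR values quoted above. This is a sketch with a hypothetical function name; "no model" is represented by TPR = FPR = 0, since everyone is then classified as low risk.

```python
def nri_two_category(tpr_new, fpr_new, tpr_old=0.0, fpr_old=0.0):
    """Return (change in TPR, change in FPR, NRI) for a single risk threshold."""
    delta_tpr = tpr_new - tpr_old
    delta_fpr = fpr_new - fpr_old
    return delta_tpr, delta_fpr, delta_tpr - delta_fpr


# Model B versus no model, risk threshold 20%.
print(nri_two_category(0.26, 0.08))
# Model B versus model A, risk threshold 20%.
print(nri_two_category(0.26, 0.08, tpr_old=0.02, fpr_old=0.01))
```

Returning the components alongside the composite keeps the benefit (change in TPR) and the cost (change in FPR) available for separate weighting.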
The reclassification percent and the NRI are both defined for settings involving more than two risk categories but suffer additional weaknesses in these settings. They do not distinguish between small and large changes in risk. Moreover they are highly dependent on the number and nature of risk categories chosen. In Table 1, using four risk categories defined by the risk thresholds 0.05, 0.10 and 0.20, we report values for reclassification percent of 60% when comparing model A with no model and 73% when comparing model B with no model. Corresponding NRI values are 0.15 and 0.47, respectively. In our opinion, these are not compelling measures of predictive performance. More informative than the single number summaries that accumulate data over multiple risk categories are the proportions of the total, case and control populations whose risks are in each of the four risk categories (Table 1). These values can be read from the risk distribution curves in Figure 2 as well. The curves have the advantage that the reader can specify his own risk categories of interest.
A subject will need to consider several factors simultaneously in evaluating the potential benefit associated with using a risk model. These factors are: his overall chance of having an event in the absence of intervention, denoted by ρ and equal to 10% in our example; the chance that the model will assign him high risk status if he is destined to be a case in the absence of treatment, denoted by TPR; and the chance that the model will assign him high risk status if he is destined to be a control in the absence of treatment, denoted by FPR. These values are TPR=26% and FPR=8%, respectively, for model B in our example (Figure 2). Finally he will need to consider the relative values of the potential benefit and of the potential cost associated with high risk designation. Interestingly, Vickers and Elkin (5) and several other papers drawing on results from decision theory, note that use of a high risk threshold, riskH, is equivalent to considering the cost-benefit ratio to be Cost/Benefit= riskH /(1− riskH). In our example the risk threshold is riskH = 20% which is equivalent to a cost-benefit ratio of 20%/(100%−20%)= 0.25. That is, use of the 20% risk threshold implies that the net benefit associated with intervention for a subject who would have an event in the absence of intervention is considered four times the net cost of intervention to subjects who would not have an event.
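Formally, the expected benefit associated with use of the risk model and assigning high risk status to those with risks exceeding riskH is calculated as:

$$\text{Expected Benefit} = \left\{ \rho \times \mathrm{TPR}(\mathrm{risk}_H) - (1-\rho) \times \frac{\mathrm{risk}_H}{1-\mathrm{risk}_H} \times \mathrm{FPR}(\mathrm{risk}_H) \right\} \times \text{Benefit}$$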
where “Benefit” denotes the benefit associated with a case being designated as high risk and this is the unit in which benefit is measured. In our example, with risk threshold riskH= 20%, the expected benefit associated with use of model B is (0.1×.258−0.9×0.25×0.081)= 0.0076. That is, the expected benefit is positive and equal to .0076 times the benefit associated with a case being designated as high risk. For model A the expected benefit is nearly zero (0.0002). Figure 5 displays expected benefit for the two models using risk thresholds ranging from 0 to .4. We see that no matter what risk threshold is employed, or equivalently no matter what cost-benefit ratio is entertained, model B yields more expected benefit. The calculations presented here ignore costs associated with obtaining the information needed to calculate the modeled risks. Therefore the expected benefit associated with any model is at least as good as not using any model, expected benefit ≥ 0. The expected benefit could be negative if the cost of genetic testing were taken into consideration.
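The arithmetic for model B can be checked with a few lines of code. This is a sketch with a hypothetical function name; the result is expressed in units of the benefit to a case designated as high risk, and the cost of obtaining the genetic information is ignored, as in the text.

```python
def expected_benefit(prevalence, tpr, fpr, risk_threshold):
    """Expected benefit per subject, in units of the benefit to a high-risk case.

    The cost-benefit ratio implied by the threshold is riskH / (1 - riskH).
    """
    cost_benefit_ratio = risk_threshold / (1.0 - risk_threshold)
    return prevalence * tpr - (1.0 - prevalence) * cost_benefit_ratio * fpr


# Model B at the 20% threshold: rho = 0.10, TPR = 0.258, FPR = 0.081.
eb_model_b = expected_benefit(0.10, 0.258, 0.081, 0.20)
print(f"expected benefit, model B = {eb_model_b:.4f}")  # 0.0076
```

Evaluating the same function over a grid of thresholds, with TPR and FPR recomputed at each one, traces out a decision curve of the kind shown in Figure 5.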
The decision curves (5) shown in Figure 5 are useful in deciding whether or not to obtain genetic information for an individual who has in mind a specific risk threshold that would lead to an action. Risk thresholds may vary from individual to individual so the expected benefit for one individual may not be the same as that expected for another. Higher expected benefits correspond to lower risk thresholds in Figure 5 because subjects with low risk thresholds perceive low cost compared with benefit. In order to summarize the expected benefit of applying a risk model across the population, one needs to integrate with the decision curve the probability distribution of risk thresholds likely to be used in practice. For example, suppose that 50% of individuals in the target population use a risk threshold equal to 0.20 but that 25% use the lower threshold of 0.10 and 25% use the larger threshold of 0.30. The average benefit in the population is then calculated as the weighted average of expected benefits associated with each of the 3 thresholds: for model A the average benefit is .003 while for model B the average benefit is .011.
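The weighted average over risk thresholds can be sketched as follows. The threshold weights are those of the example above; the expected benefit values at each threshold are hypothetical placeholders, since the per-threshold values for models A and B are not listed in the text.

```python
def population_average_benefit(benefit_by_threshold, threshold_weights):
    """Weighted average of expected benefit over the thresholds used in practice."""
    total_weight = sum(threshold_weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("threshold weights must sum to 1")
    return sum(
        weight * benefit_by_threshold[threshold]
        for threshold, weight in threshold_weights.items()
    )


# Weights from the example: 25% use 0.10, 50% use 0.20, 25% use 0.30.
threshold_weights = {0.10: 0.25, 0.20: 0.50, 0.30: 0.25}
# Hypothetical expected benefits read off a decision curve (illustration only).
benefit_by_threshold = {0.10: 0.020, 0.20: 0.008, 0.30: 0.004}

print(f"{population_average_benefit(benefit_by_threshold, threshold_weights):.3f}")
```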
In conclusion, we promote displays of case and control risk distribution curves (Figures 1, 2 or 3) in conjunction with decision curves (Figure 5) for evaluating and comparing risk prediction models. The case and control specific risk distribution curves that display the TPR and FPR values associated with the subject's risk threshold are easy to understand and are key to his decision about ascertaining his genetic profile and other risk factors in the risk prediction model. Decision curves provide additional insights by formalizing the cost-benefit analysis. Since subjects vary in their tolerance for risk, having the distributions displayed is convenient because it allows use of various risk threshold values. The overall benefit associated with use of a risk model in the population can only be summarized into a single meaningful number if one specifies a population distribution for risk thresholds used by individuals.
Janssens et al (2) evaluated the predictive potential of genetic profiling by simulating a wide variety of scenarios. We investigate the same scenarios as Janssens et al. The illustrative dataset shown in Figures 1–3 was simulated using two such scenarios. Our simulation program is publicly available so that investigators can simulate specific scenarios of interest for themselves. Use of the program is described in the appendix.
A scenario is specified by the number of subjects, the overall event rate in the population, ρ, the number of genes that confer risk, the allele frequencies for the genes, and the association of each allele with risk. We consider simple settings where each gene has 2 alleles, with genotypes and allele frequencies in Hardy Weinberg equilibrium, and no linkage disequilibrium between genes. The true risk of an event for a subject is derived from a standard additive model on the logistic scale. That is, the logarithm of the odds of having an event is a sum of terms associated with each high risk allele, high risk homozygotes contributing two equal terms, one for each allele, and there are no statistical interactions between genes. The lower frequency allele for each gene is associated with higher risk. The magnitude of the association between a gene and risk is quantified by the odds ratio for the high risk allele: OR=odds of an event for heterozygotes/odds of an event for homozygotes with the dominant lower risk allele variant. Details of how data are simulated are given in the appendix.
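A scenario of this kind can be sketched in a few lines. This is not the authors' program, which is described in the appendix; the function name is hypothetical, all genes share one minor allele frequency and one odds ratio, and the intercept is chosen by bisection so that the average risk equals ρ, which is an assumption about how the overall event rate is fixed.

```python
import math
import random


def simulate_risks(n_subjects, n_genes, maf, odds_ratio, prevalence, seed=0):
    """Simulate true risks and events under an additive logistic model.

    Genotypes are in Hardy-Weinberg equilibrium with no linkage disequilibrium:
    each subject draws two independent alleles per gene.
    """
    rng = random.Random(seed)
    log_or = math.log(odds_ratio)
    # Number of high risk (minor) alleles carried by each subject.
    allele_counts = [
        sum(rng.random() < maf for _ in range(2 * n_genes))
        for _ in range(n_subjects)
    ]

    def risk(intercept, count):
        return 1.0 / (1.0 + math.exp(-(intercept + log_or * count)))

    def mean_risk(intercept):
        return sum(risk(intercept, k) for k in allele_counts) / n_subjects

    # Assumption: calibrate the intercept so that the mean risk equals rho.
    low, high = -30.0, 30.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if mean_risk(mid) < prevalence:
            low = mid
        else:
            high = mid
    intercept = (low + high) / 2.0

    risks = [risk(intercept, k) for k in allele_counts]
    events = [rng.random() < r for r in risks]
    return risks, events


risks, events = simulate_risks(2000, 50, maf=0.10, odds_ratio=1.25, prevalence=0.10)
high_risk = sum(r >= 0.20 for r in risks) / len(risks)
print(f"proportion with risk at or above 20%: {high_risk:.3f}")
```

With only 2000 subjects the printed proportion is an estimate subject to sampling error; the tables in this paper rest on far larger samples.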
Very large sample sizes were used in our simulation studies. Consequently, the results in Tables 1, 2 and 3 show the true values (precise to 2 decimal places) of the prediction performance measures for each risk model, not estimates. We evaluate predictive performance by focusing on the proportions of high risk subjects identified from the information in their genetic profiles and expected benefit. We employ the high risk threshold equal to 20% for illustration. Tables for other risk thresholds and other scenarios are provided in the appendix. In contrast to our approach, Janssens reported AUCs and R-squared summary statistics. These are provided here as well for completeness. In addition to generating data, investigators can use our programs to calculate all of the summary indices shown in Tables 2 and 3 after specifying a risk threshold that defines the high (or low) risk category.
In the first set of simulations (Table 2) all genes in a gene profile have the same minor allele frequency and are equally predictive. We investigated settings where the number of genes associated with risk ranged from 50 to 350, the frequency of the minor allele varied from 5% to 30% and the odds ratio associated with the heterozygous genotype ranged from 1.05 to 1.5. Subjects whose risks are 20% or more are considered at high risk. This contrasts with the overall event rate of 10%.
The proportion of high risk subjects identified is generally low in the scenarios we studied. The maximum value for the proportion of high risk subjects identified was approximately 17%. For example, when the gene profile consists of 350 predictive genes each with a minor allele frequency of 5% and odds ratio equal to 1.5, 17% of the population have calculated risk values exceeding 20%.
The high risk population proportion typically increases with larger numbers of predictive genes, with stronger associations of genes with risk and with higher minor allele frequencies. However, counterexamples abound. For example, with common OR=1.5, the proportion of the population at high risk is 17.5% when 150 genes are predictive but smaller, 13.1%, when 250 genes are predictive. The overall reduction in the proportion at high risk in this example is due to the facts that fewer controls are deemed at high risk by the more predictive 250 gene model and that the bulk of the population is comprised of controls.
The sensitivity of risk models is low especially when genetic associations are weak. We see that less than half the cases are classified as high risk when odds ratios are less than or equal to 1.1, regardless of the number of genes in the profile. Even when the common odds ratio is 1.25, in order to classify > 50% of cases, at least 250 genes with allele frequencies of 10% or 150 genes with allele frequencies of 30% are required in the model. When only 50 genes are in the model, the proportion of cases classified as high risk only exceeded 50% in one scenario, namely for common genes with allele frequencies of 30% and large odds ratios equal to 1.5.
In Table 2 there are tendencies for improvements in proportions of cases and controls classified as high risk by the models with inclusion of larger numbers of predictive genes, with stronger associations of genes with risk and with higher minor allele frequencies. However there are no absolute rules evident in this regard. On the other hand, the expected benefit due to use of the risk model always improved with these 3 factors: with inclusion of larger numbers of predictive genes, with stronger associations of genes with risk and with higher minor allele frequencies.
Note that the expected benefit values displayed in Table 2 are weighted averages of the proportions of cases and controls classified as high risk. The weighting acknowledges that use of 0.20 as the high risk threshold implies that the cost for a control classified as high risk is equivalent to 1/4th the benefit for a case classified as high risk. Let's consider how to interpret expected benefit values shown in Table 2 with a concrete example. Suppose a policy maker is deciding if ascertaining information such as genotype is economically advantageous. Assume some hypothetical monetary costs for treatment, $20,000 say, for treating a subject diagnosed with disease, and $1000 for interventions to prevent disease occurring in the first place. If prevention interventions reduce the risk of disease by 25% then the expected benefit for a subject that would be a case in the absence of intervention is 0.25×($20,000)−$1000=$4,000 while the expected cost for a subject that would be a control in the absence of intervention is $1,000. The cost-benefit ratio is therefore $1,000/$4,000=1/4 in this setting leading to use of the risk threshold 0.2. The expected benefit values in Table 2 are in units corresponding to the benefit of high risk designation for a case. That is, to convert the values in Table 2 to monetary values in this hypothetical setting, we multiply by $4000. Thus, for example, the expected monetary benefit associated with the model in the last row in Table 2(a) is 0.05×$4000=$200 per person. If testing costs more than $200, there is no gain in financial terms by using this risk model. However, nonmonetary aspects must be factored into policy making as well.
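The monetary conversion in this hypothetical example can be written out as a short calculation. The function name is hypothetical; the dollar figures and the 25% risk reduction are the ones assumed in the paragraph above.

```python
def monetary_benefit_per_person(expected_benefit, treatment_cost,
                                prevention_cost, risk_reduction):
    """Convert expected benefit into money and report the implied risk threshold."""
    # Net benefit of prevention for a subject who would otherwise be a case.
    benefit_per_case = risk_reduction * treatment_cost - prevention_cost
    # Net cost of prevention for a subject who would otherwise be a control.
    cost_per_control = prevention_cost
    # Threshold riskH satisfying riskH / (1 - riskH) = cost / benefit.
    implied_threshold = cost_per_control / (cost_per_control + benefit_per_case)
    return expected_benefit * benefit_per_case, implied_threshold


money, threshold = monetary_benefit_per_person(
    expected_benefit=0.05, treatment_cost=20_000,
    prevention_cost=1_000, risk_reduction=0.25,
)
print(f"${money:.0f} per person at risk threshold {threshold:.2f}")  # $200, 0.20
```

A testing cost above the first returned value would leave no financial gain from using the risk model in this setting.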
In the second set of simulations summarized in Table 3, the genetic profiles are such that the odds ratios and minor allele frequencies both vary. The odds ratios for the strongest 20 genes vary uniformly from a maximum value displayed in Table 3 to 1.15, while the odds ratio decreases uniformly from 1.15 to 1.05 over the remaining genes. The minor allele frequency starts at .05 and increases by .005 for each gene over the first 50 genes, then by .0005 for each of the remaining genes. A key feature in these scenarios is that the strong genes are uncommon while the genes weakly associated with risk are relatively more common. Again, our scenarios mimic those reported by Janssens et al (2).
We see that the population proportions at high risk, overall, for cases and for controls, and the expected net benefit, are determined to a large extent by the relatively few genes in the strong set especially when their odds ratios are high.
Tables 2 and 3 display values of the AUC for each risk model. Janssens et al (2) use the criterion AUC ≥ 0.80 to indicate high discriminative accuracy. Others use similar criteria. However, a model may have AUC as large as 0.80 yet it may not be useful in practice. For example, the model in row 12 of Table 2 has expected benefit = 0.024. Assuming the hypothetical values mentioned earlier for monetary costs and benefits as well as risk reductions afforded by prevention interventions, the expected monetary benefit of using this test is $96 per person. If the cost of testing is $96 or more, there is no net benefit despite the fact that the AUC for the risk model is 0.801. On the other hand, Gail and Pfeiffer (2005) have shown that the modified Gail Model for breast cancer risk (model 2 in Constantino et al 1999) is useful for selecting women for prevention treatment with tamoxifen despite the fact that its AUC=0.66. As another example consider that the expected monetary benefit for the model in Table 2(a) with 50 genes each with odds ratio 1.25 is .002×$4000 = $8 per person, which is derived from its capacity to classify as high risk 7.5% of cases and 2.3% of controls using the risk threshold of 0.20, which is deemed clinically relevant in our hypothetical example. If the corresponding genetic test costs less than $8 per person, then it will be cost effective to offer it to people. Yet the AUC for this model is only 0.64.
The crucial issue is that one cannot assess the value of a risk model according to AUC which ignores the population and clinical context in which the model is to be applied. For example, the AUC does not incorporate the case prevalence in the population. Another problem with the AUC is that it does not take into consideration risk thresholds that motivate intervention in the clinical context. Consider the setting in row 12 of Table 2(a) again. If the benefit of treating a case is constant but the cost of treating a control is high, so that only subjects at very high risk, say >30%, should receive intervention, the benefit of using the model will be different than if the cost of treating a control is less where subjects with risks > 10% say, should be intervened upon. With risk threshold equal to 30%, only 32% of cases and 5% of controls satisfy the criterion for high risk and the expected benefit is 0.014. The corresponding numbers using the lower risk threshold equal to 10% are 73% of cases, 28% of controls and expected benefit of 0.045. Clearly the implications of the risk model are different in these two scenarios. Yet, AUC makes no distinction. Indeed it accumulates over all possible risk thresholds, considering all values between 0 and 1 as plausible.
The R-squared summary statistic and the NRI, also shown in Tables 2 and 3, share many of the same drawbacks as AUC. They do not incorporate the clinical context into their calculations. Interestingly R-squared does vary with population prevalence and NRI does vary with the high risk threshold. But neither is incorporated in a way that makes the resulting measure clinically relevant for evaluating the risk prediction model.
Our simulations indicate that in order to identify a sizable number of subjects at substantially increased risk for an event, large numbers of independent genes that confer at least moderately elevated relative risks are required or alternatively a few genes that are strongly associated with risk. To date whole genome analyses have yielded genes and SNPs in particular, that are only weakly associated with outcome. These are unlikely to be helpful in identifying large groups of individuals at substantially elevated risk.
Our conclusions are limited to the set of scenarios studied here. Tables for additional settings are provided in the appendix and alternative scenarios can be investigated using the general programs we have developed. A key feature of the scenarios simulated is that genes are in linkage equilibrium and that they have statistically independent effects on disease risk. Correlations between genes are likely to give rise to prediction models with poorer performance. On the other hand, it is possible that certain types of interactions between genes and interactions between environmental factors and genes may yield better capacities to predict risk.
In addition to exploring the potential predictive capacities of specific genetic profiles, we have argued for using clinically relevant, easy to understand ways of quantifying the capacity of genes, markers and other factors to predict risk. We promote the use of risk distribution plots because they are both easy to understand and because they give clinically useful information. Moreover, all statistical summaries of predictive capacity are derived from them. In addition, decision curves that are relatively simple and useful for formal cost-benefit analyses are derived from them.
We demonstrated that risk distribution curves are preferable to receiver operating characteristic (ROC) curves. In particular, criteria based on area under the ROC curve (AUC) can be misleading. A risk model that is beneficial in a particular population may not have an AUC that indicates good discrimination. A risk model that is not beneficial in a particular population may have an excellent AUC.
In this paper we do not provide technical discussion about using data to fit risk prediction models or to assess their performance. We investigated performance in the ideal setting where the true risk values can be calculated from an individual's genetic profile and a very large dataset is available to assess the true performance of the risk model in the population. Methods for estimating performance from study data along with confidence interval construction and hypothesis testing are under development (13, 14, 15, 16).
Grant support: US National Institutes of Health (R01 GM054438; R01 CA129934).