Common complex traits are typically the combined effect of genetic and environmental factors. Since no practical predictor can account for all factors and their interactions, clinical prediction can at best assign probabilistic risks rather than deterministic outcomes. Viewed on the population level, these risk assignments can be seen as comprising a risk distribution, which is an estimate of the population’s true risk distribution. Maximal predictive accuracy occurs when the estimated risk matches the true risk.
The prevalence and heritability of any trait restrict the set of possible genetic risk distributions. If we know the risk corresponding to each individual’s genetic profile in a large sample, then we can obtain an expression for broad-sense heritability ($H^2$) on the binary scale [10]:

$$H^2 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{risk}_i - \overline{\mathrm{risk}}\right)^2}{\overline{\mathrm{risk}}\left(1 - \overline{\mathrm{risk}}\right)} \qquad (1)$$

where $i$ indexes people, $n$ is the sample size, $\mathrm{risk}_i$ is individual $i$’s genetic risk (i.e. the conditional probability of the trait given genes), and $\overline{\mathrm{risk}}$ is the average genetic risk, which equals the average population risk (see Methods). The meaning of risk depends on the context: for instance, when the phenotype is current disease status, the average risk in the population is its prevalence, whereas in prediction of lifetime illness, risk is the lifetime risk of disease. (When possible, we nonetheless opt for the term prevalence.) Equation 1 mathematically expresses that heritability is the proportion of phenotypic variance explained by the genetic risk distribution.
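As a minimal sketch of Equation 1, the following Python function computes binary-scale heritability from a list of individual genetic risks. The function name and the example sample are hypothetical and chosen only for illustration; this is not the R code from the additional files.

```python
def binary_scale_heritability(risks):
    """Equation 1: variance of genetic risk divided by phenotypic variance."""
    n = len(risks)
    mean_risk = sum(risks) / n  # average genetic risk = average population risk
    if not 0.0 < mean_risk < 1.0:
        raise ValueError("average risk must lie strictly between 0 and 1")
    # Numerator: population variance of the individual genetic risks.
    genetic_variance = sum((r - mean_risk) ** 2 for r in risks) / n
    # Denominator: variance of a binary phenotype with this average risk.
    phenotypic_variance = mean_risk * (1.0 - mean_risk)
    return genetic_variance / phenotypic_variance


# Hypothetical sample: 20 people at 70% risk and 80 people at 20% risk.
example_risks = [0.7] * 20 + [0.2] * 80
example_h2 = binary_scale_heritability(example_risks)
```

When every individual's risk is exactly 0 or 1 the function returns 1, which corresponds to a fully genetically determined trait.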
To mathematically derive the risk distribution that yields the best genetic prediction, we model the distribution as a histogram with equally spaced bins from 0 to 100% that represent risk groups, where the height of each bin denotes the proportion of the population that falls into that risk group (for an example, see Figure 1). This approach can define any risk distribution. Indeed, multiple genetic risk distributions can correspond to a given combination of prevalence and heritability; each such distribution, however, can lend itself differently to genetic prediction. Our method is based precisely on determining which such distribution (for a given prevalence and heritability) would allow the best predictive accuracy. Thus, for each combination of prevalence and heritability, we searched over the set of risk distributions that satisfy that combination and maximized the AUC that would be achieved if everyone’s risk were ideally ordered; similarly, for any given specificity, prevalence, and heritability, we maximized the sensitivity over the set of risk distributions and thresholds that satisfy the constraints.
Figure 1 Example risk distribution. This distribution has a prevalence of 30% and a heritability of 10%. The mean of the distribution equals the prevalence of the trait. Variance represents the variance of risk due to genetic variation, sometimes called genetic ...
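The histogram model can be illustrated with a short Python sketch that summarizes a discrete risk distribution by its prevalence, its binary-scale heritability, and the AUC obtained when people are ranked by their true risk. The function name and both example histograms are hypothetical (they are not the distribution in Figure 1); they are constructed so that each has a prevalence of 30% and a heritability of 10%. The sketch only evaluates given distributions and does not reproduce the optimization over all admissible distributions.

```python
def summarize_risk_histogram(bin_risks, bin_weights):
    """Return (prevalence, heritability, AUC) for a discrete risk distribution.

    bin_risks:   risk of each group, assumed distinct and strictly inside
                 a distribution whose mean is between 0 and 1
    bin_weights: proportion of the population in each group
    """
    total = sum(bin_weights)
    weights = [w / total for w in bin_weights]
    prevalence = sum(w * r for r, w in zip(bin_risks, weights))
    variance = sum(w * (r - prevalence) ** 2 for r, w in zip(bin_risks, weights))
    heritability = variance / (prevalence * (1.0 - prevalence))

    # AUC = P(case risk > control risk) + 0.5 * P(case risk == control risk),
    # accumulated over risk groups in ascending order of risk.
    auc = 0.0
    controls_below = 0.0
    for r, w in sorted(zip(bin_risks, weights)):
        case_mass = w * r / prevalence                      # share of all cases
        control_mass = w * (1.0 - r) / (1.0 - prevalence)   # share of all controls
        auc += case_mass * (controls_below + 0.5 * control_mass)
        controls_below += control_mass
    return prevalence, heritability, auc


# Two hypothetical histograms, both with prevalence 0.30 and heritability 0.10.
three_group = summarize_risk_histogram([0.1, 0.3, 0.5], [0.2625, 0.475, 0.2625])
two_group = summarize_risk_histogram([0.0, 0.37], [7.0, 30.0])
```

Although the two histograms share the same prevalence and heritability, they yield different AUCs, which is why the search over distributions is needed to find the upper limit.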
Using this approach, we have derived upper limits on the genetic predictive accuracy of any binary trait given only its prevalence and heritability. These values are tabulated in Additional files 1 and 2 in terms of the AUC and sensitivity/specificity pairs, respectively. Additional file 3 contains computer code in the R software environment [14] for the algorithms we developed. Figure 2 displays AUC limits over all heritabilities for several prevalences, and it includes a comparison with the limits that would exist if genetic risk followed a beta distribution. The beta distribution is a flexible statistical distribution that is consistent with the assumptions of previous analytical approximations of the effect of prevalence and heritability on the ROC curve [12], because it can take the shape of countless smooth unimodal risk distributions. Furthermore, unlike previous approximations, which deteriorate at high heritabilities [12], the beta distribution limits do not. The limits that the beta distribution imposes on the AUC closely track these previous approximations [12] and also match a predictive genomics simulation based on a multiplicative genetic model [10].
Figure 2 Heritability vs. predictive accuracy. Relationship of heritability (computed on the observed binary scale) or proportion of variance explained to the maximal upper limit on AUC. The numbers next to the curves represent the prevalence. The maximal AUCs ...
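The beta-distribution comparison can be sketched in Python as follows. A beta distribution with parameters a and b has mean a/(a+b) and variance mean(1-mean)/(a+b+1), so matching the mean to the prevalence and the variance to heritability times prevalence(1-prevalence) gives a+b = 1/heritability - 1. The function names, the grid size, and the midpoint discretization are assumptions made for this illustration; the resulting AUC is a numerical approximation and is not taken from the tabulated values.

```python
import math


def beta_parameters(prevalence, heritability):
    """Return (a, b) of the beta distribution with the given mean and heritability."""
    if not (0.0 < prevalence < 1.0 and 0.0 < heritability < 1.0):
        raise ValueError("prevalence and heritability must lie strictly between 0 and 1")
    concentration = 1.0 / heritability - 1.0  # equals a + b
    return prevalence * concentration, (1.0 - prevalence) * concentration


def beta_auc(prevalence, heritability, bins=5000):
    """Approximate the AUC of ideal risk ranking when risk is beta distributed."""
    a, b = beta_parameters(prevalence, heritability)
    risks = [(k + 0.5) / bins for k in range(bins)]  # bin midpoints in (0, 1)
    # Unnormalized log-density at each midpoint; normalized numerically below.
    log_density = [(a - 1.0) * math.log(r) + (b - 1.0) * math.log(1.0 - r) for r in risks]
    peak = max(log_density)
    weights = [math.exp(v - peak) for v in log_density]
    total = sum(weights)
    weights = [w / total for w in weights]
    mean_risk = sum(w * r for r, w in zip(risks, weights))

    # Same pairwise-ranking AUC as for any histogram, bins already ascending.
    auc = 0.0
    controls_below = 0.0
    for r, w in zip(risks, weights):
        case_mass = w * r / mean_risk
        control_mass = w * (1.0 - r) / (1.0 - mean_risk)
        auc += case_mass * (controls_below + 0.5 * control_mass)
        controls_below += control_mass
    return auc


# Illustrative inputs: prevalence 13% with heritability 26% versus 5%.
auc_higher_h2 = beta_auc(0.13, 0.26)
auc_lower_h2 = beta_auc(0.13, 0.05)
```

When a is below 1 the density is unbounded near zero risk, so the midpoint grid slightly misstates the mass in the lowest bins; a finer grid or an exact integration routine would reduce that error.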
Knowledge of this maximal limit on accuracy is beneficial in the case of type 2 diabetes (T2D), where early targeted intervention can be costly but effective [15]. Many prediction studies of T2D have been reported, yet the genetic contribution to their predictive accuracy has been disappointing: genes alone yield ~60% AUC, and adding genes to clinical risk factors yields incremental improvements of ~1-2% AUC [16]. The heritability of T2D per se (as opposed to related continuous traits with higher heritability, e.g. body mass index) was estimated to be 26% by a population-based twin study [18], with a prevalence of 13%. Applying our method to these statistics determines the maximum sensitivity/specificity pairs displayed in Figure 3, which show that, for example, if a specificity of 99% is desired, sensitivity cannot exceed 36%, and that if a sensitivity of 99% is desired, specificity cannot exceed 74%. Similarly, they determine the maximum achievable AUC for genetic prediction of lifetime T2D to be 89%. This motivates the search for additional genetic factors influencing risk for T2D.
Figure 3 ROC curves for type 2 diabetes and breast cancer from genomic profiles. Maximal sensitivity / 1-specificity pairs for prediction of type 2 diabetes and breast cancer from full genomic profiles. The maximal pairs are compared to the pairs that would exist ...
Breast cancer has the same maximal AUC as T2D, albeit with a distinct ROC curve. Breast cancer was found to have a prevalence of 4% [19], and we calculated its heritability on the binary scale to be 11% (see Methods), which yields a maximum AUC of 89%. Although this is the same maximum AUC as for T2D, the sensitivity/specificity pairs for breast cancer (Figure 3) are not identical to those for T2D, owing to the different disease parameters. For example, to reach a specificity of 99%, sensitivity cannot exceed 24%, which is substantially lower than the corresponding maximal sensitivity of 36% for T2D. The divergence of these two ROC curves as specificity approaches 100% illustrates the importance of identifying the maximal ROC curve, rather than relying on the maximal AUC alone.
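To illustrate why a single AUC value does not determine the sensitivity at a given specificity, the following Python sketch computes the sensitivity attainable at a fixed specificity for one discrete risk distribution. It labels the highest-risk groups positive until the allowed false-positive rate is used up and interpolates within the boundary group. The function name and the two example histograms are hypothetical; the sketch evaluates given distributions only and does not search for the maximal ROC curve.

```python
def sensitivity_at_specificity(bin_risks, bin_weights, specificity):
    """Sensitivity reached at a target specificity when ranking by true risk."""
    total = sum(bin_weights)
    weights = [w / total for w in bin_weights]
    prevalence = sum(w * r for r, w in zip(bin_risks, weights))
    fpr_budget = 1.0 - specificity  # share of controls that may test positive
    sensitivity = 0.0
    # Walk through the risk groups from highest to lowest risk.
    for r, w in sorted(zip(bin_risks, weights), reverse=True):
        case_mass = w * r / prevalence
        control_mass = w * (1.0 - r) / (1.0 - prevalence)
        if control_mass <= fpr_budget:
            # The whole group can be labeled positive.
            sensitivity += case_mass
            fpr_budget -= control_mass
        else:
            # Only a fraction of this group fits within the remaining budget.
            sensitivity += case_mass * fpr_budget / control_mass
            break
    return sensitivity


# Two hypothetical histograms with prevalence 0.30 and heritability 0.10.
sens_three_group = sensitivity_at_specificity([0.1, 0.3, 0.5], [0.2625, 0.475, 0.2625], 0.99)
sens_two_group = sensitivity_at_specificity([0.0, 0.37], [7.0, 30.0], 0.99)
```

The two example distributions share the same prevalence and heritability yet reach different sensitivities at 99% specificity, so each point of the ROC curve has to be bounded separately.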
Heritability is the proportion of phenotypic variance explained by all genetic factors, but our analytic approach can treat the proportion of phenotypic variance explained by any particular set of factors. If the proportion of phenotypic variance explained by a particular set of genes is known, that proportion of variance explained could be substituted for heritability in our model. For instance, if a subset of genes could explain 50% of the genetic variance of T2D (i.e. explain 13% of phenotypic variance), then the maximum achievable AUC of this subset would be 80%.
Our method can also be applied in elucidating the maximum accuracy of predictors that integrate features such as gene expression, de novo mutation, body mass index, and lifestyle (which are not fully inherited). The proportion of variance explained by such an integrated predictor can then be greater than heritability. When there are no gene-environment interactions, this difference is the proportion of phenotypic variation that these features explain independently of genes. For example, weekly physical activity can explain 4% of phenotypic variance of T2D (see Methods), is moderately heritable [20], and was found not to interact with well-known gene variants in T2D [21]. Because physical activity is itself partly heritable, some of the variance it explains is already captured by genes; accordingly, the proportion of variance explained by the integrated predictor composed of genomic profile and physical activity does not increase by the full 4% beyond the heritability of T2D. If the proportion of T2D variance that physical activity explains independently of genes were known to be only 3%, say, then the integrated predictor’s maximum AUC would be calculated based on a proportion of variance explained of 29% (sum of 26% and 3%), which yields a maximum AUC of 90%. If, however, we did not have an estimate for the proportion of T2D variance that physical activity explains independently of genes, then we could conservatively use 4% in the previous calculation, yielding a similar AUC. This analysis applies to predictors based on non-genetic features that are supplemented by genetics. In general, the estimation of the proportion of variance explained by integrated predictors is complicated by the interaction of genetic and non-genetic features; our method can nonetheless be applied when the interaction can be estimated or bounded. Note that genetic testing alone can still accurately predict outcome for some small, extreme risk groups (such as those with highly penetrant variants), but such a test will not benefit the general population without both a high sensitivity and specificity [22].