One of the biggest challenges of complex trait prediction is the lack of statistical power that results directly from the small effects of many causal factors and from relatively small sample sizes. These problems are exacerbated when the potential for interaction among the causal factors is taken into account. We have shown that RF modeling can produce accurate results using hundreds of SNPs obtained from a relatively small study (131 cases, 291 controls). With only 417 subjects and 160 SNPs, we were able to generate a good predictive model for childhood asthma exacerbations, with an AUC > 0.66, a sensitivity of about 0.66, and a specificity of about 0.6 (Figures , ). Depending on the portion of the ROC curve that is used, this can equate to a positive predictive value (PPV) of 0.81 and a negative predictive value (NPV) of 0.74 when the proportion of exacerbators is 0.3 and the scoring threshold is chosen to correspond to sensitivity = 0.2 and specificity = 0.95 (Table ), allowing for reasonable prediction of asthma exacerbations. The permutation control, random SNP control, and independent replication results all support the validity and robustness of the random forest predictive model. The ROC curve obtained when the model is applied to the Stage 2 (independent replication) samples is very similar to that obtained for the Stage 1 (training) samples (Figure ), and the p-value for the independent replication AUC is < 0.05, indicating that the predictive accuracy of the 160-SNP model is reproducible.
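The conversion from an operating point on the ROC curve to predictive values follows from Bayes' rule. The following is a minimal sketch, not the code used in this study; the function name `predictive_values` is hypothetical, and the inputs are the approximate sensitivity, specificity, and proportion of exacerbators quoted above.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary classifier via Bayes' rule.

    All three arguments are proportions in (0, 1).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Operating point quoted in the text: sensitivity ~0.66, specificity ~0.6,
# proportion of exacerbators 0.3.
ppv, npv = predictive_values(sensitivity=0.66, specificity=0.60, prevalence=0.30)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

Because both values depend on the proportion of exacerbators, the same ROC curve yields different PPV and NPV in populations with different exacerbation rates.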
The 160 SNPs are in or near 140 genes (Additional File 1, Table S1). Among the top 160 SNPs, one SNP (rs10496476) is located within an intron of the gene DPP10, which has been shown to be associated with asthma in multiple populations, based on a recent review [27]. None of the other genes is on the list of replicated asthma genes reported in [27], suggesting that most SNPs and genes identified by our RF method are novel. Two factors may contribute to the discovery of new SNPs and genes: 1) RF evaluates individual SNPs in the context of interactions, unlike conventional statistical methods applied in GWAS, such as logistic regression, which test SNPs one by one without consideration of SNP-SNP interactions; and 2) asthma exacerbation is a phenotype related to, but distinct from, asthma diagnosis.
Our study highlights an innovative way of integrating a large number of individually weak predictors to build a reasonable predictive model for asthma exacerbations. Given that complex trait studies so far have used only up to a dozen predictors (i.e., an order of magnitude fewer than we use here) with limited consideration of interaction, and have generated relatively poor predictability, our approach of employing RF with hundreds of predictors and a relatively small sample size gives hope for additional improvement in complex trait prediction using a variety of machine learning approaches. Talmud et al. studied 20 SNPs derived from genome-wide association studies of type 2 diabetes susceptibility in a population of 5,535 subjects followed for 10 years [28]. They noted that clinical factors outperformed genetic markers in the prediction of incident diabetes and that the addition of the SNPs produced minimal improvement over risk estimation based only on clinical variables. In contrast to that study, which focused on the additive effects of 20 SNPs, our RF model simultaneously accounts for both additive and interactive effects using 160 SNPs and predicts asthma exacerbations more effectively than clinical factors alone.
Most complex traits, such as adult height, cardiovascular diseases, cancer, diabetes, autism, and asthma, are likely to be determined by a large number of both genetic and environmental factors [5, 29-32]. Asthma exacerbations, as shown in this study, are associated with at least several hundred genetic markers and environmental factors. The top 10 SNPs have an AUC of 0.57, showing marginal predictability, but with 160 SNPs the AUC of the RF model approached 0.66. Our study suggests that good prediction of a complex trait will require methods capable of integrating hundreds of predictors, such as random forests and other machine learning approaches.
Asthma exacerbations have historically been difficult to predict. Several clinical models have been designed to improve the prediction of exacerbations [21, 33-36]. Most studies [21, 33, 36] attempted to isolate predictive clinical variables individually, using odds ratios or regression analysis, without accounting for interactions. The results of these studies are reported as odds ratios or as p-values for individual factors and cannot be directly compared with our AUC-based results.
Several publications have evaluated the use of a clinical classification tree in the development of a prognostic model for asthma exacerbations [34]. One study evaluated six clinical variables, including prior-year hospitalization; the classification tree method achieved, without independent replication, 94% sensitivity and 68% specificity, better than logistic regression (87% sensitivity and 48% specificity) or an additive risk model (46% sensitivity and 93% specificity), suggesting the value of accounting for interactions among predictive variables. A recent study [35] reported 66.8% sensitivity and 85.8% specificity (with no independent validation testing) for predicting childhood asthma exacerbations (defined as rescue oral corticosteroid use, an unscheduled visit to a physician or emergency room, or hospitalization) using daytime cough, daytime wheeze, and β2-agonist use at night 1 day before the exacerbation as predictors. However, none of the clinical models developed to date has been independently validated. In our model, we used both internal controls (e.g., permutation) and external replication in an independent subset of subjects to demonstrate the predictive power of our model. Although the independent subset of subjects was derived from the same source population, there were some differences in baseline characteristics between the two samples (Table ), further supporting the generalizability of our model. RF also uses classification trees, but with a difference: it uses many trees (1500 in our models) rather than just one, and it can handle a greater number of input variables without over-fitting.
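The idea of replacing a single classification tree with a vote over many trees can be illustrated with a deliberately simplified sketch. It is not the RF implementation used in this study: each "tree" here is a one-split stump fitted to a bootstrap sample with a random subset of SNPs, whereas a full random forest grows deep trees and resamples features at every split. The genotype coding (0/1/2), the function names, and the toy data are all assumptions made for illustration.

```python
import random


def fit_stump(X, y, feature_ids):
    """Choose the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in feature_ids:
        for thr in (0, 1):          # assumed genotype coding: 0, 1, or 2
            for pol in (0, 1):      # class predicted when x[j] > thr
                errors = sum(
                    (pol if x[j] > thr else 1 - pol) != label
                    for x, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, thr, pol)
    return best[1:]


def predict_stump(stump, x):
    j, thr, pol = stump
    return pol if x[j] > thr else 1 - pol


def fit_forest(X, y, n_trees=1500, n_features_per_tree=None, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = n_features_per_tree or max(1, int(p ** 0.5))
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # sample subjects with replacement
        feats = rng.sample(range(p), m)               # random subset of SNPs
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest


def predict_score(forest, x):
    """Fraction of stumps voting for class 1 (usable as a score for an ROC curve)."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Toy data (hypothetical): SNP 0 carries signal, SNPs 1-3 are uninformative.
X = [[2, 0, 0, 0]] * 10 + [[0, 0, 0, 0]] * 10
y = [1] * 10 + [0] * 10
forest = fit_forest(X, y, n_trees=200)
print(predict_score(forest, [2, 0, 0, 0]), predict_score(forest, [0, 0, 0, 0]))
```

Averaging votes over many trees fitted to different resamples is what allows the ensemble to use many weak predictors at once; varying the threshold applied to the vote fraction traces out the ROC curve.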
A critical issue for complex disease prediction is the difficulty of extending the predictive power of a model obtained from one population to an independent population. None of the studies mentioned in the preceding paragraph tested its model in an independent population. One important factor that makes researchers hesitate to do so is concern about small sample sizes and the heterogeneity of asthma exacerbations. We applied the RF models built with the Stage 1 samples to predict the independent Stage 2 samples (Figure ). Overall, the predictive accuracy in the independent test samples paralleled that in the Stage 1 (training) samples as the number of SNPs increased up to 160. At 160 SNPs, the replication AUC reached its maximum, and it flattened out thereafter. We therefore cite the 160-SNP model as our best-performing model. The independent replication AUCs are clearly higher than 0.5, indicating true predictability of the RF models. However, they are lower than the training and internal cross-validation AUCs for 160 and 320 SNPs, suggesting that some degree of over-fitting may remain.
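Whether a replication AUC is genuinely above 0.5 can be checked with a label-permutation test. The sketch below is illustrative only and rests on a simplifying assumption: it shuffles the case/control labels against fixed model scores, whereas a permutation control may instead refit the model on permuted labels. The function names and the toy scores are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control.

    Ties count as one half (the Mann-Whitney formulation).
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def permutation_p_value(scores, labels, n_perm=2000, seed=0):
    """One-sided p-value for the observed AUC under randomly shuffled labels."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy replication set (hypothetical scores; 1 = exacerbator, 0 = control).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
observed_auc, p_value = permutation_p_value(scores, labels)
print(f"AUC = {observed_auc:.2f}, permutation p = {p_value:.3f}")
```

Applying the same `auc` function to the training and replication sets gives a direct measure of the gap attributed to over-fitting.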
As discussed above, clinical traits alone did not produce desirable predictability for asthma exacerbations. We did, however, exclude one strong predictor of severe exacerbations: prior exacerbations [37]. The rationale for excluding this predictor was that we were interested in developing a predictive model based upon determinants of exacerbations; these determinants would by their nature apply to both prior and current exacerbations. Moreover, since we sought to determine genetic predictors of exacerbations, the inclusion of prior exacerbations would attenuate the strength of the genetic association in our analyses.
One reason that clinical predictors may not have provided the same strength of prediction as our genetic models is that many of the clinical traits are themselves genetically determined. Indeed, our results (Figure ) show that SNPs alone predict as well as SNPs combined with the clinical traits, suggesting that asthma exacerbations are at least partly caused by genetic factors. For instance, among our clinical predictors, sex is genetically determined; age itself is not genetic, but it may be associated with age of onset because of the patient recruitment process, and age of onset in turn can be genetic [38, 39]; and pre-bronchodilator FEV1% is influenced by genetics, especially in children.
There were several potential limitations to our study. We have already discussed the limited sample size. Another potential problem is that with more than 160 SNPs, the training AUC keeps increasing (Figure ), but the replication AUC does not. This suggests that the chance of including false-positive SNPs increases with the number of SNPs used for prediction. One way to reduce false-positive SNPs is to increase the sample size, which is costly.