Given the computational challenges of gene-by-environment interaction analysis in the setting of GWAS, we sought to optimize a publicly available machine learning algorithm to reduce its hardware demands. The optimizations mostly concerned memory use. In comparisons of execution speed on datasets smaller than the full FHS GWAS data, the optimized code ran in about 1/30th of the time required by the original code. Future analyses could formally test computational efficiency as dataset size increases.
The optimized RF algorithm identified smoking status and the rs2011345 SNP as important classifiers. Subsequent regression models confirmed that the SNP and smoking status each had a significant main effect on the outcome, as well as a significant interaction. Although RF deemed this SNP important, it might have been overlooked had the analysis been based exclusively on regression and p-values. Of the 355,649 SNPs tested in standard tests of association using linear regression, rs2011345 had the 2,111th smallest p-value (p = 0.006) in a sex-adjusted model, and only the 29,776th smallest p-value (p = 0.079) in the sex-adjusted SNP-by-smoking status interaction model in PLINK (treating allele count as a linear variable). Other important covariates identified by the RF runs included sex, smoking status, and decade of birth, the last of which may reflect a survival bias whereby older individuals included in the study are more likely to have a later age of onset.
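For reference, a sex-adjusted SNP-by-smoking interaction model of the kind described above can be written in generic form. This is only a sketch under stated assumptions: the symbols are illustrative notation, not taken from the analysis output, with $Y_i$ the outcome for subject $i$, $G_i$ the allele count (0, 1, or 2) treated as a linear variable, $S_i$ smoking status coded here as a single indicator for simplicity, and $X_i$ sex.

```latex
Y_i = \beta_0 + \beta_G G_i + \beta_S S_i + \beta_{GS}\, G_i S_i + \beta_X X_i + \varepsilon_i
```

Under this notation, the interaction p-value would correspond to a test of $\beta_{GS} = 0$, and the main-effect model is the same expression with the $S_i$ and $G_i S_i$ terms omitted.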
Whereas the relative magnitude of this interaction compared with all other potential interactions is presently untested, the following correlative information is of interest. SNP rs2011345 is approximately 11 kb from the 3' end of the flavin-containing monooxygenase 4 gene (FMO4, Map Viewer build 36.3). This region appears to be in linkage disequilibrium with the last two exons of FMO4 in the CEU cohort from the HapMap project (build 36). This gene is part of a family of enzymes involved in the metabolism of nicotine and other tobacco-related products. Additionally, this region (1q24) has been linked to essential hypertension [8]. Thus, the highest ranked candidate of this optimized RF algorithm has at least minimal biological plausibility worth exploring in future studies of CHD.
Our study has a few limitations. First, the sample size is limited: although the interaction between rs2011345 and smoking was statistically significant, the study contained only 224 incident cases of CHD. Second, interpretation of the results is limited by a relatively crude measure of smoking exposure. The available information on smoking behavior provided only a cross-sectional glimpse of each subject's smoking habits and, in some cases, incident CHD occurred a decade or more after the last known smoking status. Third, the RF algorithm cannot account for familial correlation using traditional approaches; however, we were able to confirm the RF findings using a GEE model that accounts for familial correlation. While the 224 cases in this analysis did not have a large degree of relatedness, relatedness could be an issue in the larger FHS population.
Future work addressing practical issues of RF may enhance its attractiveness as an analysis tool for GWAS. We used runs of 500 trees in this analysis and found the top covariates in the output to be more stable than in runs of 200 trees. Runs of 1,000 and 1,500 trees did not produce markedly different results. Another important problem is reconciling the SNPs identified by RF with those identified by traditional p-value-based methods. Nonetheless, RF appears to offer an alternative to traditional regression techniques for reducing the high-dimensional data space of GWAS when searching for gene-by-environment interactions.
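One simple way to quantify the stability of top-ranked covariates across runs with different numbers of trees is the overlap between their top-k lists. The sketch below is illustrative only and is not the procedure used in this analysis: the function name `top_k_overlap`, the choice of Jaccard overlap, and all importance values and placeholder SNP names are hypothetical.

```python
def top_k_overlap(importance_a, importance_b, k=10):
    """Jaccard overlap between the top-k variables of two importance maps.

    Each argument maps a variable name to its importance score from one
    RF run; a higher score means a more important variable.
    """
    def top_k(importance):
        # Rank variable names by descending importance and keep the first k.
        ranked = sorted(importance, key=importance.get, reverse=True)
        return set(ranked[:k])

    a, b = top_k(importance_a), top_k(importance_b)
    union = a | b
    # Two empty lists are treated as identical.
    return len(a & b) / len(union) if union else 1.0


# Hypothetical importance scores from two runs (values are made up).
run_500 = {"smoking": 0.91, "rs2011345": 0.88, "sex": 0.75,
           "birth_decade": 0.60, "snp_placeholder_1": 0.10}
run_1000 = {"smoking": 0.93, "rs2011345": 0.85, "sex": 0.70,
            "birth_decade": 0.64, "snp_placeholder_2": 0.12}

print(top_k_overlap(run_500, run_1000, k=4))
print(top_k_overlap(run_500, run_1000, k=5))
```

An overlap near 1 at a given k indicates that the two runs agree on their leading covariates; comparing overlaps for 200, 500, 1,000, and 1,500 trees would give a numeric counterpart to the qualitative stability described above.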