Grammatical Evolution Neural Networks (GENN) is a computational method designed to detect gene-gene interactions in genetic epidemiology, but has so far only been evaluated in situations with balanced numbers of cases and controls. Real data, however, rarely has such perfectly balanced classes. In the current study, we test the power of GENN to detect interactions in data with a range of class imbalance using two fitness functions (classification error and balanced error), as well as data re-sampling. We show that when using classification error, class imbalance greatly decreases the power of GENN. Re-sampling methods demonstrated improved power, but using balanced accuracy resulted in the highest power. Based on the results of this study, balanced error has replaced classification error in the GENN algorithm.
Grammatical Evolution Neural Networks (GENN) uses grammatical evolution to evolve neural networks to detect gene-gene interactions in studies of complex human diseases. GENN has shown initial successes in both real and simulated data, and while these results are encouraging, previous simulation studies have used datasets with balanced numbers of cases and controls. Unfortunately, when using standard classification error as the fitness function, many machine learning methods are not robust to class imbalance.
To try to solve this problem, investigators have tried techniques such as re-sampling or altering the fitness metric. One metric that has been shown to be highly successful is balanced error/accuracy. This metric has been shown to solve the class imbalance problem for another approach designed to detect epistasis, Multifactor Dimensionality Reduction (MDR).
We assessed the performance of GENN on data with varying levels of class imbalance and show that the power of GENN using classification error decreases as the control:case ratio departs from unity. We compared three methods for addressing this concern: re-sampling methods (over- and under-sampling) and balanced accuracy as a fitness function.
The steps of GENN have been previously described in detail. For the purposes of the current study, an option was added to the configuration file to specify the fitness function used: classification error (CE) or balanced error (BE). BE is the complement of balanced accuracy, which is defined as the mean of sensitivity and specificity:
$$\mathrm{BE} = 1 - \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
where TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. This formula equally weights the errors within each class. In the case of balanced data, this is equivalent to standard CE.
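The difference between the two fitness functions can be illustrated from confusion-matrix counts. The following is a minimal sketch, not the GENN implementation; the function names and the example counts are assumptions chosen for illustration. It shows that a trivial classifier that labels every individual a control looks good under CE in 4:1 data but no better than chance under BE.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = case, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_error(tp, tn, fp, fn):
    """Fraction of all individuals that are misclassified."""
    return (fp + fn) / (tp + tn + fp + fn)


def balanced_error(tp, tn, fp, fn):
    """One minus the mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 1.0 - 0.5 * (sensitivity + specificity)


# Hypothetical 4:1 control:case dataset; the classifier calls everyone a control.
y_true = [1] * 100 + [0] * 400
y_pred = [0] * 500
counts = confusion_counts(y_true, y_pred)
ce_imbalanced = classification_error(*counts)  # 0.2: looks deceptively good
be_imbalanced = balanced_error(*counts)        # 0.5: no better than chance

# Hypothetical balanced dataset (100 cases, 100 controls): CE and BE agree.
balanced_counts = (80, 70, 30, 20)  # TP, TN, FP, FN
ce_balanced = classification_error(*balanced_counts)
be_balanced = balanced_error(*balanced_counts)
```

Under CE, an evolutionary search is rewarded for converging on the majority-class predictor; under BE, that solution earns no credit, so only networks that separate cases from controls score well.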
The intention of the data simulations for this power study was to mimic gene-gene interaction, or epistasis, in case-control genetic data to evaluate GENN using penetrance functions. Penetrance defines the probability of disease given a particular genotype combination by modeling the relationship between genetic variations and disease risk. We used two well-described purely epistatic models, each with a heritability (the proportion of trait variance due to genetics) of approximately 5%. The first is referred to as the XOR model, and the second is referred to as the ZZ model. Both are nonlinear models with no marginal main effects. Software described by Moore et al. was used to simulate the data.
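To make "no marginal main effects" concrete, the sketch below builds an XOR-style two-locus penetrance table and checks that the marginal penetrance is identical for every genotype at each locus. The penetrance values (0 and 0.1), the allele frequencies of 0.5 under Hardy-Weinberg proportions, and the heritability formula (genotypic variance in penetrance divided by K(1 - K), one common definition for penetrance models) are all illustrative assumptions; the text above does not list the values actually used in the simulations.

```python
# Hypothetical XOR-style penetrance table: rows index the genotype at locus A
# (0, 1, or 2 copies of an allele), columns index the genotype at locus B.
# Risk is elevated only when exactly one of the two loci is heterozygous.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

# Assumed genotype frequencies: Hardy-Weinberg with allele frequency 0.5.
GENO_FREQ = [0.25, 0.5, 0.25]


def marginal_penetrance(table, freqs, locus):
    """P(disease | genotype at one locus), averaging over the other locus."""
    if locus == 0:
        return [sum(freqs[b] * table[a][b] for b in range(3)) for a in range(3)]
    return [sum(freqs[a] * table[a][b] for a in range(3)) for b in range(3)]


def prevalence(table, freqs):
    """Population disease prevalence K."""
    return sum(freqs[a] * freqs[b] * table[a][b]
               for a in range(3) for b in range(3))


def heritability(table, freqs):
    """Variance of penetrance across genotypes, scaled by K * (1 - K)."""
    k = prevalence(table, freqs)
    var_g = sum(freqs[a] * freqs[b] * (table[a][b] - k) ** 2
                for a in range(3) for b in range(3))
    return var_g / (k * (1.0 - k))


marginals_a = marginal_penetrance(PENETRANCE, GENO_FREQ, locus=0)
marginals_b = marginal_penetrance(PENETRANCE, GENO_FREQ, locus=1)
k = prevalence(PENETRANCE, GENO_FREQ)
h2 = heritability(PENETRANCE, GENO_FREQ)
```

Because every marginal penetrance equals the prevalence, neither locus is informative on its own; the signal exists only in the joint genotype, which is what makes such models a demanding test for an interaction-detection method.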
For both models, we simulated data with a range of control:case ratios and sample sizes. For the first set of simulations, the total number of individuals in the dataset was held constant, at two different total sample sizes: 600 and 1200. For each sample size, three control:case ratios were simulated: 1:1, 2:1, and 4:1. To ensure the results seen were due to class imbalance instead of decreasing numbers of cases, a second set of simulations was done, holding the number of cases constant at 300 and 600. Again, for each number of cases, three control:case ratios were simulated. For each set of parameters, 100 replicates were simulated. Each dataset had a total of 100 SNPs, two of which were functional in predicting disease. For the models with imbalanced control:case ratios, re-sampling was performed. In the case of under-sampling (US), controls were randomly removed until a ratio of 1:1 was achieved. In the case of over-sampling (OS), cases were randomly re-sampled until a 1:1 ratio was achieved.
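The two re-sampling schemes described above can be sketched as follows. This is an illustrative sketch rather than the code used in the study: the function names are hypothetical, individuals are represented as arbitrary list items, and over-sampling is assumed to keep every original case and draw the additional ones with replacement.

```python
import random


def under_sample(cases, controls, rng):
    """Randomly remove controls until the control:case ratio is 1:1."""
    kept_controls = rng.sample(controls, len(cases))
    return list(cases), kept_controls


def over_sample(cases, controls, rng):
    """Randomly re-draw cases (with replacement) until the ratio is 1:1."""
    n_extra = len(controls) - len(cases)
    extra_cases = [rng.choice(cases) for _ in range(n_extra)]
    return list(cases) + extra_cases, list(controls)


# Hypothetical dataset: 600 individuals at a 4:1 control:case ratio.
rng = random.Random(0)
cases = [("case", i) for i in range(120)]
controls = [("control", i) for i in range(480)]

us_cases, us_controls = under_sample(cases, controls, rng)
os_cases, os_controls = over_sample(cases, controls, rng)
```

Under-sampling leaves 240 individuals from the original 600, which discards most of the control data, while over-sampling yields 960 rows in which cases are duplicated on average four times.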
GENN was used to analyze all epistasis models with classification error, balanced error, or classification error in combination with data re-sampling. Parameter settings remained identical between the analyses and included: 4 demes, migration every 25 generations, population size of 200 per deme, 400 generations, crossover rate of 0.9, and a reproduction rate of 0.1. Power for all analyses is reported as the number of datasets, out of 100, in which GENN identified the correct loci with no false positives.
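The power definition amounts to an exact-match count over replicates. A minimal sketch follows, assuming each analysis yields the set of SNPs included in the final model; the function name and SNP labels are hypothetical.

```python
def power_percent(selected_per_dataset, functional_loci):
    """Percentage of datasets whose selected SNPs exactly match the functional set.

    A dataset counts as a success only if every functional locus is selected
    and no non-functional locus (false positive) is included.
    """
    target = set(functional_loci)
    hits = sum(1 for selected in selected_per_dataset if set(selected) == target)
    return 100.0 * hits / len(selected_per_dataset)


# Hypothetical results from four replicate datasets.
functional = {"SNP5", "SNP10"}
results = [
    {"SNP5", "SNP10"},           # success
    {"SNP5", "SNP10", "SNP42"},  # both loci found, but one false positive
    {"SNP5"},                    # one functional locus missed
    {"SNP10", "SNP5"},           # success
]
example_power = power_percent(results, functional)
```

With 100 replicates per parameter setting, the percentage returned here coincides with the raw count of successful datasets.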
Tables 1 and 2 show the results for all analyses, with several apparent trends. With classification error (CE), increasing class imbalance greatly decreases the power of GENN. Power improves greatly when OS is used. With US, power decreases markedly in smaller datasets with large class imbalance; this trend is ameliorated somewhat in larger datasets, as well as in the datasets with fixed numbers of cases. Most significantly, for all models analyzed, power recovers completely when balanced error (BE) is used.
From these results, we conclude that balanced error should be used as the fitness metric in GENN instead of classification error, as it outperforms standard classification error and re-sampling methods. Additionally, since balanced error and classification error are mathematically equivalent when data is balanced, there is no disadvantage to using balanced error in balanced data.
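The equivalence for balanced data follows in a few steps from the definition of BE as one minus the mean of sensitivity and specificity. In the derivation below, n is an assumed symbol for the common number of cases and controls, so that TP + FN = TN + FP = n.

```latex
\begin{align*}
\mathrm{BE} &= 1 - \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \\
            &= 1 - \frac{TP + TN}{2n} \\
            &= \frac{(TP + FN) + (TN + FP) - TP - TN}{2n} \\
            &= \frac{FP + FN}{TP + TN + FP + FN} = \mathrm{CE}
\end{align*}
```

When the class sizes differ, the second step no longer holds, and BE weights errors in the minority class more heavily than CE does.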
This work was supported by National Institutes of Health grants HL65962, GM62758, and AG20135. We would also like to thank Jason H. Moore and Digna R. Velez for helpful discussions on class imbalance. This paper has been reviewed and approved for publication according to US EPA policy but does not necessarily represent the views of the Agency.
Categories and Subject Descriptors Genetics-Based Machine Learning and Learning Classifier Systems.
General Terms Algorithms