Genomic selection has gained much attention; its main goal is to increase
predictive accuracy and genetic gain in livestock using dense marker
information. Most methods that address the large p (number of covariates),
small n (number of observations) problem have considered only continuous
traits, but there are many important traits in livestock that are recorded in a
discrete fashion (e.g. pregnancy outcome, disease resistance). It is necessary to
evaluate alternatives to analyze discrete traits in a genome-wide prediction
context.
This study presents two threshold versions of Bayesian regressions (Bayes A and
Bayesian LASSO) and two machine learning algorithms (boosting and random forest)
to analyze discrete traits in a genome-wide prediction context. These methods were
evaluated using simulated and field data to predict yet-to-be-observed records.
Performances were compared based on the models' predictive ability.
The simulation showed that machine learning had some advantages over Bayesian
regressions when a small number of QTL regulated the trait under pure additivity.
However, differences were small and disappeared with a large number of QTL.
Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas
Random Forest presented the highest classification performance. Random Forest was
the most consistent method in detecting resistant and susceptible animals; its phi
correlation was up to 81% greater than that of the Bayesian regressions. Random Forest
outperformed other methods in correctly classifying resistant and susceptible
animals in the two pure swine lines evaluated. Boosting and Bayes A were more
accurate with crossbred data.
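
The phi correlation used above to compare classification performance can be sketched as follows. This is a minimal illustration that assumes a binary resistant/susceptible call summarized as a 2x2 confusion matrix; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def phi_coefficient(tp, fp, fn, tn):
    """Phi correlation between predicted and observed binary classes.

    tp, fp, fn, tn are the four cells of a 2x2 confusion matrix
    (hypothetical argument names, chosen for this sketch).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # A row or column of the table is empty; phi is undefined,
        # and this sketch returns 0.0 by convention.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Illustrative counts only: 40 + 40 correct calls, 10 + 10 wrong calls.
example_phi = phi_coefficient(tp=40, fp=10, fn=10, tn=40)
print(example_phi)
```

A value of 1 indicates perfect agreement between predicted and observed classes, 0 indicates no association, and -1 indicates complete disagreement.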
The results of this study suggest that the best method for genome-wide prediction
may depend on the genetic basis of the population analyzed. All methods were less
accurate at correctly classifying intermediate animals than extreme animals. Among
the different alternatives proposed to analyze discrete traits, machine learning
showed some advantages over Bayesian regressions. Boosting with a pseudo-Huber
loss function showed high accuracy, whereas Random Forest produced more consistent
results and an interesting predictive ability. Nonetheless, the best method may be
case-dependent, and an initial evaluation of different methods is recommended
before addressing a particular problem.
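
For readers unfamiliar with the pseudo-Huber loss named above, the following sketch shows its standard textbook form and the negative gradient that a gradient-boosting step would fit. The function names and the default value of the scale parameter `delta` are assumptions made for illustration; the abstract does not state the settings used in the study.

```python
import math


def pseudo_huber_loss(residual, delta=1.0):
    """Standard pseudo-Huber loss for a residual r = y - f(x).

    Roughly quadratic for |r| much smaller than delta and roughly
    linear for |r| much larger than delta. delta=1.0 is an assumed default.
    """
    return delta ** 2 * (math.sqrt(1.0 + (residual / delta) ** 2) - 1.0)


def pseudo_huber_negative_gradient(residual, delta=1.0):
    """Negative gradient of the loss with respect to the prediction f(x).

    In gradient boosting, each new weak learner is fitted to these values.
    """
    return residual / math.sqrt(1.0 + (residual / delta) ** 2)


# Large residuals are down-weighted relative to a squared-error loss.
for r in (0.1, 1.0, 10.0):
    print(r, pseudo_huber_loss(r), pseudo_huber_negative_gradient(r))
```

Because the gradient is bounded in absolute value by `delta`, outlying records have limited influence on each boosting iteration, which is the usual motivation for preferring this loss over squared error.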