In this work, we introduce a novel classifier-based approach for identifying genetic similarities between diseases, and we use it to identify similarities between several diseases. In this section, we first discuss the implications of these findings. We then consider challenges in applying classifiers to GWAS data. Finally, we propose possible extensions of this approach.
We identify a strong similarity between T1D and RA. Genetic factors that are common to these two autoimmune diseases were identified well before the advent of GWAS, and linked to the HLA genes (Torfs et al., Lin et al.). The original WTCCC study (WTCCC, 2007) identifies several genes that appear to be associated with both diseases. We look at the classifiers corresponding to these two diseases. The SNP with the highest information gain in T1D is rs9273363, which is located on chromosome 6, near MHC class II gene HLA-DQB1, and is also the SNP that is most strongly associated with T1D in the initial analysis of the WTCCC data, with a P-value of 4.29 × 10⁻²⁹⁸ (Nejentsev et al.). This is the strongest association reported for any disease in the WTCCC study, which explains to a large extent why the T1D classifier so clearly outperforms the classifiers for the other diseases. This SNP is also significantly associated with RA (P-value of 6.74 × 10⁻¹¹). The SNP with the highest information gain in RA is rs9275418, which is also part of the MHC region, and is strongly associated with both RA (P-value of 1.00 × 10⁻⁴⁸) and T1D (P-value of 7.36 × 10⁻¹²⁶). This shows that our approach is able to recover a known result, and uses SNPs that have been found to be significantly associated with both diseases in an independent analysis of the same data.
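To make the notion of information gain used above concrete, the following is a minimal sketch of how the information gain of a single SNP could be computed for a case/control split. It assumes genotypes are coded as minor-allele counts (0, 1, 2) and labels as 1 for cases and 0 for controls; the function names and the toy data are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(genotypes, labels):
    """Reduction in label entropy after splitting individuals by genotype."""
    total = len(labels)
    groups = {}
    for genotype, label in zip(genotypes, labels):
        groups.setdefault(genotype, []).append(label)
    # Weighted average of the label entropy within each genotype group.
    conditional = sum(len(ys) / total * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: genotype = minor-allele count, label 1 = case.
genotypes = [0, 0, 1, 1, 2, 2, 0, 2]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
gain = information_gain(genotypes, labels)
```

A SNP whose genotype groups separate cases from controls well yields a high gain, while a SNP whose genotypes are unrelated to disease status yields a gain close to zero.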
The similarity we identify between HT and BD is interesting, since there does not appear to be previous evidence of a link between the two diseases at the genetic level. However, a recent study identified an increased risk of HT in patients with BD compared with the general population, as well as compared with patients with schizophrenia, in the Danish population (Johannessen et al.). The WTCCC study only identified SNPs with moderate association to HT (lowest P-value of 7.85 × 10⁻⁶) and a single SNP with strong association with BD (P-value of 6.29 × 10⁻⁸). The decision trees for both diseases use a large number of SNPs that have a very weak association with the respective disease. Both classifiers have a classification error that is clearly below the baseline error, and provide evidence of similarity between the two diseases. This indicates that our classifier-based approach is able to use the weak signals of a large number of SNPs to identify evidence for similarities that would be missed by comparing only SNPs that show moderate or strong association with the diseases. Further analyses are necessary to identify the nature and implications of the similarity we find between HT and BD, as well as the weaker similarity we identified between these two diseases and T1D.
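The comparison between classification error and baseline error can be illustrated with a short sketch. It assumes that the baseline error is the error of always predicting the majority class, which is a common definition but is not spelled out in this section; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def baseline_error(labels):
    """Error of a trivial classifier that always predicts the majority class.

    Assumption: this is one common definition of a baseline error.
    """
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)


def classification_error(predictions, labels):
    """Fraction of individuals whose predicted label differs from the true one."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


# Hypothetical toy test set: 1 = case, 0 = control.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 0, 1]

base = baseline_error(labels)
err = classification_error(predictions, labels)
improvement = base - err  # positive when the classifier beats the baseline
```

Under this definition, a classifier carries usable signal only when its error on held-out individuals is below the baseline, even if the margin is small.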
We also show that we can train a classifier that can distinguish the two control sets, and we use it to identify diseases that are more similar to one of the control sets than the other. This is not an unexpected finding, since SNPs that were strongly associated with a control set were identified and discarded in the WTCCC study. These SNPs were also removed in the preprocessing step of our study, and the results we obtain when trying to distinguish the two control sets therefore show that the decision tree classifier is able to achieve a classification error below the baseline error even though the SNPs with the strongest association could not be used by the classifier. The similarities between some diseases and one of the control sets can most likely be explained by some subtle data quality issue. During quality control, the authors of the WTCCC study found several hundred SNPs for which some datasets exhibited a particular probe intensity clustering pattern [see the Supplementary Material of the original WTCCC study (WTCCC, 2007) for details]. This particular pattern was always observed in 58C, but not in UKBS. This matches the result obtained using our classifier-based approach, in which RA were predicted to be most similar to UKBS, and could therefore be a possible explanation of the similarities we find.
While we do find several interesting similarities between diseases, we also observe that training a classifier that distinguishes between individuals with a disease and controls using SNP data poses numerous challenges. The first is that whether someone will develop a disease is strongly influenced by environmental factors. The genetic associations that can be identified using GWAS are only predispositions, and it is therefore likely that some fraction of the control set will have the predispositions, but will not develop the disease. Furthermore, depending on the level of screening, the disease might be undiagnosed in some control individuals, and individuals that are part of a disease set might have other diseases as well. This is especially true for high-prevalence diseases like HT.
Obtaining good classifier performance by itself is not, however, the main goal of our approach. We show that we can find similarities even when the classifier performance only shows small improvements compared with the baseline error. In this work, we focus on the comparison approach, not on developing a classifier specially suited for the particular task of GWAS classification. We use decision trees because they are a simple, commonly used classification algorithm.
This work shows that classifiers can be used to identify similarities between diseases. This novel approach can be extended in several directions. First, classification performance could potentially be improved by using a different generic classifier, or by developing classifiers that take into account the specific characteristics of SNP data. Second, further methods need to be developed to analyze the trained classifiers and to identify precisely the SNPs that lead to the similarities this approach detects. Such a methodology would be useful, for example, to further analyze the putative similarity between HT and BD. Third, because our approach considers the whole genotype of an individual, it might be possible to identify subtypes of diseases, and to cluster individuals according to their subtype. Finally, modifying the approach to allow the integration of studies performed in populations of different origins or using different genotyping platforms would allow the comparison of a larger number of diseases.
Our approach identifies similarities between the genetic architecture of diseases. This is, however, only one of the many axes along which disease similarities could be described. In particular, both genetic and environmental factors interact in diseases, and the genetic architecture of two diseases could be similar while their environmental triggers differ, leading to low co-occurrence. There is therefore a need for methods that integrate similarities of different kinds that were identified using different measurement and analysis modalities. An example of such an approach is the computation of disease profiles that integrate both environmental etiological factors and genetic factors (Liu et al.