We compare the six classification techniques in the context of two classification tasks: schizophrenia versus control and bipolar versus control. In a first set of experiments, the data consist of only the gene expression features (Figures , , , , , ), and in a second batch, the data include both demographic and clinical features as well as gene expression (Figures , , , , , ).
ROC curves, schizophrenia/control, expression data.
ROC curves, schizophrenia/control, expression data, male subjects.
ROC curves, schizophrenia/control, expression data, female subjects.
ROC curves, bipolar/control, expression data.
ROC curves, bipolar/control, expression data, male subjects.
ROC curves, bipolar/control, expression data, female subjects.
ROC curves, schizophrenia/control, demographic, clinical, and expression data.
ROC curves, schizophrenia/control, all data, male subjects.
ROC curves, schizophrenia/control, all data, female subjects.
ROC curves, bipolar/control, demographic, clinical, and expression data.
ROC curves, bipolar/control, all data, male subjects.
ROC curves, bipolar/control, all data, female subjects.
Figure compares the classification techniques for the schizophrenia versus control task, using only the gene expression data. SVM outperforms the other techniques: it yields a cross-validated AUC of 0.91, which is significantly better than NSC (AUC = 0.71, p = 0.002), DT (AUC = 0.64, p = 0.0001), EOV (AUC = 0.71, p = 0.0001), NB (AUC = 0.71, p = 0.0004), and 3NN (AUC = 0.70, p = 0.0002). The same holds for the bipolar versus control task (Figure ), where SVM (AUC = 0.92) again outperforms the other techniques; the second-best technique is NSC (AUC = 0.73, p = 0.01).
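As a reminder of what the AUC values above measure, the following minimal sketch computes the area under the ROC curve as the fraction of case/control pairs that a classifier ranks correctly (the Mann-Whitney formulation). The function name `roc_auc` and the toy scores are illustrative assumptions; the text does not state which implementation was used to obtain the reported values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    labels: sequence of 0/1 values (1 = case, 0 = control)
    scores: classifier outputs, higher means "more likely a case"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    # Count case/control pairs that are ranked correctly; ties count one half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out scores, e.g. pooled over cross-validation folds.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 8 of the 9 case/control pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.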
We also present experiments on data for male subjects only (Figure and Figure ) and for female subjects only (Figure and Figure ), to assess whether the diseases can be predicted better when data from only one sex are used. The SVM result AUC = 0.91 on the combined data of Figure for schizophrenia versus control is, however, not significantly different from the result for male subjects (Figure , AUC = 0.92, p = 0.9) or that for female subjects (Figure , AUC = 0.87, p = 0.4). The same holds for the bipolar versus control task. We hypothesize that this is because the data sets with subjects of one sex only are much smaller than the combined data, so that there is less training data for each model. Even if classification is easier for single-sex data, any such gain is offset by the smaller sample size.
Figures , , , , , present a similar set of experiments for the data set that includes demographic and clinical data in addition to the gene expression data. Adding demographic and clinical information improves classification performance. SVM, for example, performs better on the schizophrenia versus control task with demographic information (Figure , AUC = 0.94) than without such additional information (Figure , AUC = 0.91, p = 0.06). The same holds for the other classification techniques and for the bipolar versus control task (Figure versus Figure ).
We again present experiments for male and female subjects separately (Figures , , and ), this time with the demographic and clinical data included. The conclusion from this set of experiments is similar to the previous conclusion: separating the subjects by sex does not significantly improve the classification performance for SVMs.
The superior performance of SVMs when compared to the other classification algorithms can be understood based on the properties of gene expression data. Gene expression data is typically characterized by a high dimension combined with a relatively low number of samples. For example, the present data set records the expression level of 22,283 probe sets for a number of samples that is two orders of magnitude smaller. Many classification algorithms are known to perform poorly on such high dimensional data. SVMs, on the other hand, are well suited to this setting because their classification performance can be independent of the dimensionality of the feature set [13]: their performance rather depends on the margin with which they separate the samples (Figure ). This explains the good performance of SVMs on high dimensional data. Additional empirical evidence is that SVMs are known to perform well on text classification problems (where each word in the vocabulary represents a dimension) [13]. Previous studies on gene expression data also illustrate the good performance of SVMs [5
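For reference, the margin referred to above has a simple closed form for a linear SVM. The notation below (weight vector w, offset b, samples x_i with labels y_i in {-1, +1}, margin γ, dimension d) is introduced here as an assumption for illustration and is not defined elsewhere in the text; the scaling condition is the usual canonical normalization of a separating hyperplane.

```latex
f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b = \sum_{j=1}^{d} w_j x_j + b,
\qquad
\min_{i}\; y_i\, f(\mathbf{x}_i) = 1
\;\Longrightarrow\;
\gamma = \frac{1}{\lVert \mathbf{w} \rVert_2}.
```

Here γ is the distance from the separating hyperplane to the closest training sample. The number of features d enters only through the norm of w, which is consistent with the observation that performance depends on the margin rather than directly on the dimensionality.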
Note that the above discussion does not imply that SVMs will always outperform other algorithms on gene expression data. For example, NSC, which implicitly performs dimensionality reduction (recall that it shrinks the class centroids towards the overall data centroid), has also been shown to work well on gene expression data [7]. Therefore, it is common practice in machine learning to evaluate different classification algorithms on a new data set and, based on this evaluation, to select the one that works best. This is also the approach that we follow in this work.
Most relevant features
To assess which features are most relevant to each of the classification tasks, we apply two techniques: (a) ranking the features by their p-value, and (b) ranking the features by their SVM weight. The first technique performs, for each feature, a two-sided t-test comparing the feature's values in the two classes. It then ranks the features by the resulting p-value. Besides the p-values, we also report q-values, which measure significance in terms of the false discovery rate. For example, if all features with a q-value ≤ 5% are called significant, then about 5% of these features are expected to be false discoveries, that is, their mean value in the two classes may actually be identical. We use the software QVALUE developed by Storey [15] to compute the q-values. The second technique ranks the features by the weight that the SVM classifier assigns to each feature in the linear equation of its classification hyper-plane. The larger the absolute value of the SVM weight, the more important the feature is to the classification task.
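The two ranking techniques can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the tables: it uses the pooled-variance (Student) form of the t-test, which the text does not specify, and ranks by |t| rather than by p-value. When every feature is measured on the same samples, the degrees of freedom are identical across features, so the two-sided p-value decreases monotonically in |t| and both orderings coincide. The gene names, expression values, and SVM weights are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance.

    Assumes the pooled variance is non-zero.
    """
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))


def rank_by_t(features, cases, controls):
    """Return (feature names sorted by decreasing |t|, dict of t statistics).

    cases / controls: dict mapping feature name -> list of values.
    """
    t = {f: pooled_t(cases[f], controls[f]) for f in features}
    return sorted(features, key=lambda f: abs(t[f]), reverse=True), t


def rank_by_weight(weights):
    """Return feature names sorted by decreasing absolute SVM weight."""
    return sorted(weights, key=lambda f: abs(weights[f]), reverse=True)


# Hypothetical expression values for three probe sets.
cases = {"geneA": [5.0, 6.0, 7.0], "geneB": [1.0, 2.0, 3.0], "geneC": [2.0, 3.0, 4.0]}
controls = {"geneA": [1.0, 2.0, 3.0], "geneB": [1.5, 2.0, 2.5], "geneC": [3.0, 4.0, 5.0]}
t_order, t_stats = rank_by_t(["geneA", "geneB", "geneC"], cases, controls)

# Hypothetical weights of a trained linear SVM.
svm_weights = {"geneA": 0.2, "geneB": -0.9, "geneC": 0.5}
w_order = rank_by_weight(svm_weights)
```

In this toy example the two orderings differ, which illustrates that a feature with little univariate signal can still carry a large weight in a multivariate model.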
In addition to the q-values, the QVALUE software computes an estimate of the proportion π0 of truly null features. For each of the schizophrenia versus control tasks it estimates π0 to be 1.0, that is, no significant features; for the bipolar versus control tasks, π0 ranges from 0.54 to 0.72. Note that the estimate for schizophrenia versus control is conservative (an overestimate): QVALUE makes certain assumptions about the p-value distribution of the data, which do not hold in this case (cf. Figure ). It is interesting that, even though QVALUE estimates that there are no significant individual features, it is still possible to build classification models that are highly accurate on previously unseen data. (Recall that SVM yields a cross-validated AUC of 0.91 for schizophrenia versus control.) This is partly because the QVALUE estimate is conservative, and partly because classification techniques do not rely on a single feature but exploit the combined effect of the set of most relevant features. Therefore, obtaining an accurate classifier is possible even if there are no individually significant features.
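To make the roles of π0 and the q-values concrete, here is a simplified sketch of both quantities. It is an illustration under assumptions and not a reimplementation of QVALUE: π0 is estimated from a single cut-off `lam` (a hypothetical parameter name), whereas the actual software is, to our understanding, more elaborate in how it chooses and combines cut-offs. With π0 = 1.0 the q-values below reduce to Benjamini-Hochberg adjusted p-values.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Single-cut-off estimate of the proportion of truly null features.

    p-values above `lam` are assumed to come almost entirely from null
    features, whose p-values are uniform on [0, 1]. Capped at 1.0.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def qvalues(pvalues, pi0=1.0):
    """q-value of each feature: the smallest estimated FDR at which it is called."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


# Hypothetical p-values for four features.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = qvalues(pvals, pi0=1.0)
```

A flat p-value histogram gives an estimate of π0 near 1.0, in which case even small p-values translate into comparatively large q-values.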
Figure 14. p-value histogram for (a) schizophrenia/control and (b) bipolar/control (expression data, all subjects). p-values of truly null features are distributed uniformly, while p-values of significant features are clustered around 0.0. This translates to a flat ...
Table (schizophrenia versus control) and Table (bipolar versus control) rank the features by p-value. The left panel of each table shows results based on expression data only; the right panel presents results that include the demographic and clinical features as well. Each table consists of three parts: the top part contains the rankings for all subjects combined, the middle part the male subjects' rankings, and the bottom part the female subjects' rankings (see additional file 3).
Genes sorted by p-value, schizophrenia versus control
Genes sorted by p-value, bipolar versus control
Comparing (a) the rankings for expression data only (the left panels of the tables) to (b) the rankings for expression and demographic data (the right panels) shows that similar features appear in (a) and (b). For example, all probe sets that appear in (b) also appear in (a) for Table , all subjects. In addition, (b) also includes a number of highly ranked demographic and clinical features. Table shows, for example, that drug use and alcohol use are ranked high for the all and male subjects cases. This indicates that some of the demographic and clinical features are important to the classification tasks. Note that we also observed this while comparing classification models: the models with demographic and clinical features are more accurate.
When comparing the features that appear in the different tables, we observe that for the schizophrenia versus control task (Table , expression data), the rankings for all subjects and male subjects have 14 features in common: LBH [GenBank:NM_030915], MT1X [GenBank:NM_002450], MT1X [GenBank:NM_005952], TNFSF10 [GenBank:NM_003810], ABCG2 [GenBank:AF098951], MT1E [GenBank:BF217861], SST [GenBank:NM_001048], CRHBP [GenBank:NM_001882], EMX2 [GenBank:AI478455], NPY [GenBank:NM_000905], MT2A [GenBank:NM_005953], MT1H/P2 [GenBank:NM_005951], SOX9 [GenBank:NM_000346], S100A8 [GenBank:NM_002964].
On the other hand, the rankings for all subjects and female subjects have only one feature (FAM107A [GenBank:NM_007177]) in common. For the bipolar versus control task (Table , expression data) all subjects and male subjects share 6 features (SST [GenBank:NM_001048], TNFSF10 [GenBank:NM_003810], NEUROD6 [GenBank:NM_022728], CD74 [GenBank:K01144], NPY [GenBank:NM_000905], TYROBP [GenBank:NM_003332]), and all subjects and female subjects also share 6 features (PPID [GenBank:NM_005038], WTAP [GenBank:BC000383], RND3 [GenBank:BG054844], DNAJA1 [GenBank:NM_001539], KIF2A [GenBank:NM_004520]). For both diseases, there is no overlap between the ranking for the female subjects and that for the male subjects. Possibly of higher interest are the features relevant to both the schizophrenia versus control and bipolar versus control tasks. Comparing the rankings (the top left rankings of Table and Table ) shows that there are 4 common features: LBH [GenBank:NM_030915], TNFSF10 [GenBank:NM_003810], SST [GenBank:NM_001048], and NPY [GenBank:NM_000905]. These are relevant to both diseases.
Table and Table show rankings based on SVM weights. Here too, the relevant features observed for expression and demographic data are similar to those found for expression data only. There is also overlap between the rankings for the different subject subsets (all, male only, and female only). Note, however, that the features identified with the SVM weights are different from those identified with the p-value method. Consider the rankings for expression data only, all subjects. For schizophrenia versus control, there are no common features in the rankings produced by the p-value method (Table ) and the SVM method (Table ). For bipolar versus control (Tables and ), there are two shared features: SST [GenBank:NM_001048] and LBH [GenBank:NM_030915]. This difference in rankings arises because the methods have essentially different goals: the p-value method looks for individual features that distinguish the two classes, while the SVM method yields a set of features that together distinguish the classes.
Genes sorted by SVM weight, schizophrenia versus control
Genes sorted by SVM weight, bipolar versus control
Of the top 20 genes identified using the p-value ranking, 11 have been previously implicated in schizophrenia in at least one study. These genes are NR4A3 [GenBank:U12767], SST [GenBank:NM_001048], NPY [GenBank:NM_000905], S100A8 [GenBank:NM_002964], CRH [GenBank:NM_000756], GAD1 [GenBank:NM_013445], FOS [GenBank:BC004490], JUN [GenBank:BG491844], DNAJB1 [GenBank:BG537255], SLC16A1 [GenBank:AL162079], and EGR2 [GenBank:NM_000399].
Overlap with the current literature occurs for bipolar disorder as well, although the overlap is smaller, primarily because of the relative immaturity of the field and the correspondingly smaller number of published results. Of the top 20 genes identified using SVM weight or p-value, 7 genes have been implicated previously in bipolar disorder. Interestingly, multiple probes for the same gene are in the top 20 for DUSP6 [GenBank:BC003143] and HLA-DRA [GenBank:M60333]. Single probes previously implicated in bipolar disorder include: SST [GenBank:NM_001048], HLA-A [GenBank:AA573862], NPY [GenBank:NM_000905], HLA-DRB3 [GenBank:NM_002125], and DNAJB1 [GenBank:BG537255].
Interestingly, most of the remaining genes in the list are known to interact with the genes that have a documented association with either bipolar disorder or schizophrenia. These interactions were determined using Ingenuity Systems software. Fourteen of the 20 genes in the schizophrenia sample are involved in the same biological pathway (Figure ). When the two networks generated by the software package are combined via 3 overlapping genes, 19 of the 20 genes fall in a single biological network. Similarly, 13 of the 20 genes are in a single pathway for bipolar disorder (Figure ). When two of the 3 generated pathways are combined through 3 overlapping genes, the resulting biological network contains 16 of the 20 genes on the list.
Biological network representing the schizophrenia p-value ranking. The network was generated using Ingenuity Systems Pathway analysis. The darker the red the more significant the correlation with the disease.
Biological network representing the bipolar disorder p-value ranking. The network was generated using Ingenuity Systems Pathway analysis. The darker the red the more significant the correlation with the disease.
One of the more remarkable features of this analysis is the difference in gene expression patterns between males and females. Much speculation has surrounded the role of gender in psychiatric disorders based on morphological and clinical comparisons between affected males and affected females. This analysis may provide further evidence to support and broaden this hypothesis. The most prevalent gender-based differences associated with mental disorders are in the structural abnormalities that have long been known in schizophrenia [38]. These have been validated using CT and MRI scans, demonstrating differences in ventricle size in males and females with schizophrenia; specifically, the left ventricles of males are known to be enlarged relative to both their healthy counterparts and affected females [39]. Another structure showing a difference between affected males and females is the corpus callosum [46]. The temporal lobe appears smaller in affected men than in women [49]. Specifically, the superior temporal gyri [50], the posterior superior temporal gyrus [51], and Heschl's gyri [51] have all been shown in one or more studies to be reduced in affected males when compared to their unaffected male counterparts or affected females. Volume reductions have also been observed in the amygdala-hippocampal complex [52]. Reduced asymmetry of the planum temporale has been observed in females in both MRI and post mortem studies [53]. In this study we provide additional evidence of gender differences, but this new evidence takes the form of molecular differences between affected males and females in both schizophrenia and bipolar disorder. These findings suggest that gender may have confounded the results of past molecular analyses of these disorders.
The ranking based on the SVM weights does not produce a significant number of genes previously implicated in schizophrenia or bipolar disorder. This does not necessarily mean that this measure provides less biological insight than the ranking based on p-value. The smaller overlap may instead arise because the SVM-based method differs more from the methods of previous studies than the p-value method does. Whereas previous studies sought individual genes, much as the ranking by p-value does, the ranking by SVM weight seeks genes that are predictive in the context of other genes. It therefore seems likely that this more global look at bipolar disorder and schizophrenia is producing genes missed in previous analyses of microarray data on brain tissue.
Impact of alcohol and drug use
Consider the schizophrenia versus control classification task. Feature rankings by p-value, such as Table , may include genes associated with alcohol use (AU) and drug use (DU), and some of these may not be associated with schizophrenia. AU and DU are known to alter gene expression, that is, there are genes that are differentially expressed between heavy AU/DU subjects and low AU/DU subjects. Such AU/DU associated genes will also be differentially expressed between the schizophrenia and control classes, simply because there are more high AU/DU subjects in the schizophrenia class and more low AU/DU subjects in the control class (Table ). Therefore, AU/DU associated genes may appear in the feature rankings (Table ).
Identical reasoning applies to the bipolar disorder versus control task, which exhibits a similar difference in AU/DU distribution between the two classes. This should be kept in mind when analyzing the rankings. Note that these differences in distribution are already present in the population and therefore difficult to avoid in the samples.
Post-stratification can potentially be used to remove the confounding effects of variables such as AU and DU. Essentially, post-stratification computes a subset of the data such that the subset's AU/DU distribution is identical in each class. We have applied post-stratification. A detailed description of the method that we used together with its results is available in additional files 1. In these results, however, post-stratification proved ineffective because it significantly reduced the amount of data, and therefore also the power of the statistical methods, resulting in unacceptably high false discovery rates. We briefly quantify this in the following paragraph.
Consider the p-value ranking for the schizophrenia versus control task, expression data only, all subjects (Table ). Reporting all top 20 features as significant results in a false discovery rate of 3.8%, that is, fewer than one feature is expected to be a false discovery. However, if we compute a similar ranking on the stratified data, which contains only 121 of the 332 samples in the original data, then reporting only the four top-ranked features as significant already yields a false discovery rate of 62.8%, that is, more than two of these four may be false discoveries. Because of this high false discovery rate, we decided not to use stratification in the paper and to accept the possibility of AU/DU associated genes in the rankings. Further analysis, possibly using more data, is required to identify such genes.
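The loss of data caused by post-stratification can be illustrated with a small sketch. The procedure below is one simple way to obtain identical confounder distributions in both classes, namely keeping equally many cases and controls within each stratum; it is an assumption made for illustration and not necessarily the method described in the additional files. The function name, the tuple layout, and the toy samples are hypothetical.

```python
from collections import defaultdict


def post_stratify(samples):
    """Return ids of a subset whose stratum distribution is identical in both classes.

    samples: list of (sample_id, label, stratum) tuples, where `label` is
    0 (control) or 1 (case) and `stratum` is a hashable confounder level,
    for example an (alcohol_use, drug_use) pair.
    """
    by_stratum = defaultdict(lambda: {0: [], 1: []})
    for sample_id, label, stratum in samples:
        by_stratum[stratum][label].append(sample_id)

    kept = []
    for groups in by_stratum.values():
        # Keep the same number of cases and controls in every stratum;
        # strata that occur in only one class are dropped entirely.
        # Taking the first k ids keeps the sketch deterministic; a random
        # choice would normally be preferable.
        k = min(len(groups[0]), len(groups[1]))
        kept.extend(groups[0][:k])
        kept.extend(groups[1][:k])
    return kept


# Hypothetical samples: cases are mostly "high" users, controls mostly "low".
samples = [
    ("s1", 1, "high"), ("s2", 1, "high"), ("s3", 1, "high"), ("s4", 1, "low"),
    ("c1", 0, "high"), ("c2", 0, "low"), ("c3", 0, "low"), ("c4", 0, "low"),
    ("c5", 0, "none"),
]
kept = post_stratify(samples)  # only 4 of the 9 samples survive
```

The more the confounder distribution differs between the classes, the more samples are discarded, which is the mechanism behind the reduction in statistical power described above.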
Note that this problem is partially mitigated by using the SVM-based ranking instead, provided that the demographic features are included in addition to gene expression. If a gene's correlation with disease arises only from its correlation with AU/DU, then the SVM will prefer to place most or all of the weight on AU and DU rather than on this gene. The gene will receive high weight only if it provides additional predictive ability for disease beyond its association with AU/DU. As a result, the ranking will mostly include genes that are truly associated with the disease, which may explain the difference between the SVM and p-value rankings. Quantifying the extent of this mitigation requires further study, which is difficult because we do not have a ground truth to compare to: it would require knowing which genes are directly associated with AU/DU and are not associated with the diseases.