Machine learning classification applied to fMRI data has shown strong potential for diagnosing cognitive disorders and identifying behavioral states (Fan et al., 2006; Zhang and Samaras, 2005; Ford et al., 2003), but drawing inference about the general population from small-sample studies can be difficult. The assumption that responses are reproducible across different fMRI runs may not be realistic (Lange et al., 1999; McKeown et al., 2003), and factors such as small sample sizes, feature selection methods, and sampling variation can make the cross-validation results reported in publications a biased estimate of the testing accuracy realized in practice. Even when some care is taken to exclude obvious artifacts, the resulting classifiers may be difficult to interpret, as they are typically formed without prior functional hypotheses. To illustrate these methodological vulnerabilities, we present and then deconstruct a classifier to test the true power of machine learning.

Anderson et al. (2010) presented a spectral classification method that classifies fMRI scans that have not been spatially aligned, using the temporal correlations among independent components. This raises the question of which components' temporal activity differs enough between groups to power the classifier. To identify the discriminative component relationships, we present a method called Common Component Classification that facilitates post-hoc identification of the components powering the classifier. Multi-session temporal concatenation (MSTC), a procedure based on independent component analysis, extracts spatial maps common across subjects as well as component-specific time series for each subject (Smith et al., 2004). Classification is performed by characterizing correlations between pairs of components, revealing which components behaved differently between patients and controls.
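As a minimal sketch of this feature construction (assuming component time series have already been extracted by group ICA; the array sizes and random data here are hypothetical, and the helper name is ours), each subject's pairwise component correlations can be vectorized into a single feature vector for classification:

```python
import numpy as np

def correlation_features(timeseries):
    """Vectorize the upper triangle of the pairwise correlation matrix.

    timeseries: (n_components, n_timepoints) array for one subject,
    e.g. the subject-specific time courses from MSTC group ICA.
    """
    corr = np.corrcoef(timeseries)            # (n_components, n_components)
    iu = np.triu_indices_from(corr, k=1)      # each component pair once
    return corr[iu]

# Hypothetical data: 10 subjects, 10 components, 200 timepoints each.
rng = np.random.default_rng(0)
subjects = rng.standard_normal((10, 10, 200))
features = np.array([correlation_features(s) for s in subjects])
print(features.shape)  # (10, 45): one correlation vector per subject
```

With 10 components there are 45 unique pairs, so each subject is summarized by 45 correlations; these vectors would then feed whatever classifier is used, and inspecting which pairs carry discriminative weight is what enables the post-hoc component identification described above.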
Our classifier is tested on data from irritable bowel syndrome (IBS) patients and healthy controls (HC) undergoing a gastrointestinal stress task. IBS is a common functional pain disorder characterized by chronic abdominal pain, discomfort, and altered bowel habits (Drossman, 2006; Mayer et al., 2006). When applied to fMRI scans acquired during controlled rectal distension, our machine-learning classifier identified which participants were IBS patients and which were HCs, and exposed entire networks that differed between groups, corresponding to identifiable neurological phenomena.
We next deconstruct this classifier by training and testing it within and across two runs to assess its sensitivity to permutation of the stimulus set as well as the reproducibility of stimulus effects across runs. We show how mistakes made in the feature selection, parameter choice, and cross-validation stages can bias models, and we measure the magnitude of this error. We further assess the strength of group-ICA methods by extracting components within and across runs, and evaluate the effectiveness of ICA-based methods for identifying and removing artifacts. The classifier is also evaluated on data that have been cleaned of physiological noise, to determine how much of its classification ability is attributable to scan artifacts such as motion rather than true neurological signal. We examine the impact of motion artifacts on the classifier and the ability to remove them without also removing signal. Finally, we examine the statistical assumptions underlying machine learning classifiers, discussing the reproducibility of stimulus effects across runs, how bias can skew the predictive accuracy of a model, and how the small sample sizes typical of fMRI studies affect Type I and Type II error rates and limit the inferences that can be drawn from such machine learning studies. Through this exercise of creating and then deconstructing a classifier, we seek to identify what is actually being learned in machine learning.
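The feature-selection bias referred to above can be demonstrated on pure noise: if the most label-correlated features are chosen on the full dataset before cross-validation, the resulting accuracy is inflated even though no real group signal exists. The sketch below uses hypothetical sizes and a simple nearest-mean classifier standing in for the actual method, and is only an illustration of the leakage mechanism, not of our pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, k = 40, 5000, 10              # small sample, many voxel-like features
X = rng.standard_normal((n, p))     # pure noise: no true group difference
y = np.repeat([0, 1], n // 2)

def top_k(Xtr, ytr, k):
    """Indices of the k features most correlated with the labels."""
    Xc = Xtr - Xtr.mean(0)
    yc = ytr - ytr.mean()
    r = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(r)[-k:]

def nearest_mean_cv(select_inside):
    """Leave-one-out CV accuracy of a nearest-group-mean classifier."""
    correct = 0
    for i in range(n):
        tr = np.delete(np.arange(n), i)
        # Correct practice: reselect features inside each training fold.
        # Biased practice: reuse columns chosen once on the full data.
        cols = top_k(X[tr], y[tr], k) if select_inside else biased_cols
        m0 = X[tr][y[tr] == 0][:, cols].mean(0)
        m1 = X[tr][y[tr] == 1][:, cols].mean(0)
        xi = X[i, cols]
        pred = int(np.sum((xi - m1) ** 2) < np.sum((xi - m0) ** 2))
        correct += pred == y[i]
    return correct / n

biased_cols = top_k(X, y, k)        # selection has seen the test labels
print(f"biased CV accuracy:   {nearest_mean_cv(False):.2f}")
print(f"unbiased CV accuracy: {nearest_mean_cv(True):.2f}")
```

Because the data are noise, the unbiased estimate hovers near chance (0.5), while selecting features on the full data first yields accuracy well above chance; this gap is exactly the kind of error whose magnitude we measure in the sections that follow.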