|Home | About | Journals | Submit | Contact Us | Français|
Classifying neurodegenerative brain diseases in MRI aims at correctly assigning discrete labels to MRI scans. Such labels usually refer to a diagnostic decision a learner infers based on what it has learned from a training sample of MRI scans. Classification from MRI voxels separately typically does not provide independent evidence towards or against a class; the information relevant for classification is only present in the form of complicated multivariate patterns (or “features”). Deep learning solves this problem by learning a sequence of non-linear transformations that result in feature representations that are better suited to classification. Such learned features have been shown to drastically outperform hand-engineered features in computer vision and audio analysis domains. However, applying the deep learning approach to the task of MRI classification is extremely challenging, because it requires a very large amount of data which is currently not available. We propose to instead use a three dimensional scattering transform, which resembles a deep convolutional neural network but has no learnable parameters. Furthermore, the scattering transform linearizes diffeomorphisms (due to e.g. residual anatomical variability in MRI scans), making the different disease states more easily separable using a linear classifier. In experiments on brain morphometry in Alzheimer's disease, and on white matter microstructural damage in HIV, scattering representations are shown to be highly effective for the task of disease classification. For instance, in semi-supervised learning of progressive versus stable MCI, we reach an accuracy of 82.7%. We also present a visualization method to highlight areas that provide evidence for or against a certain class, both on an individual and group level.
1 Over the last two decades, Magnetic Resonance Imaging (MRI) has been widely adopted for studying the human brain and diseases affecting brain tissue. The neurodegenerative processes behind these diseases are still poorly understood. Previous studies show complex and multivariate patterns of tissue damage, such as in Alzheimer's disease (Jack et al., 2004, Dyrba et al., 2015, Zhang et al., 2011, Cuingnet et al., 2011, Young et al., 2013, Moradi et al., 2015, Arbabshirani et al., 2016) and brain damage induced by HIV-infection (Su et al., 2016), compared to healthy controls.
Machine learning techniques have been proven powerful in identifying such multivariate patterns in an approach to classify patients and controls. Deep learning is a machine learning methodology currently transforming fields such as speech recognition (Hinton et al., 2012, Chorowski et al., 2015, Bengio and Heigold, 2014), image analysis (Xu et al., 2014), natural language processing (Pennington et al., 2014), high energy physics (Baldi et al., 2014), among others. Data in these domains usually look like low dimensional measurements where neighboring elements are highly correlated. Convolutional neural networks exploit this correlation and scale to massive datasets.
The situation in classifying MRI data is starkly different however. Here we collect data with millions of features, but the number of patients often does not exceed a few hundred or a few thousand. Applying high capacity deep learning methods in healthcare applications is therefore highly challenging and prone to severe overfitting.
In this work, we will show that deep models can be applied successfully to the medical imaging domain by applying a fixed (i.e. not learned) feature transformation to the input data. In detail, we apply a 3-dimension (3D) translation invariant transformation referred to as a 3D scattering transform, which is also stable to actions of small diffeomorphisms, such as deformations, to MRI data. In order to compute the scattering coefficients, a convolution network is established by cascading wavelet transforms and modulus operators (Bruna and Mallat, 2011, Mallat, 2012). The resulting 3D scattering representation of each data instance (MRI scan in our experiments), i.e. a vector containing multi-scale and multi-direction co-occurrence information, is subsequently used for classification. The performed experiments aim to show that the 3D scattering representation has more discriminative power than the original data features and than features derived from independent component analysis (ICA), and that it performs well in the low data regime. Effectively, by avoiding learning these transformations altogether, we also avoid overfitting.
Our main three contributions are as follows: First, we provide the first implementation of the feature representation referred to here as the 3D scattering transform, which is inspired by the 2D scattering transform proposed by Bruna and Mallat, 2011, Mallat, 2012. Second, we provide state-of-the-art performance on the classification in three datasets. These include the comorbidity and aging with HIV (AGEhIV) study looking into long-term effects of Human Immunodeficiency Virus (HIV)-infection on the brain, and two studies into (progression to) Alzheimer's disease (AD), namely the Open Access Series of Imaging Studies (OASIS) study (Marcus et al., 2007) and Alzheimer's Disease for Neuroimaging (ADNI) study (http://adni.loni.usc.edu). In the AD-studies, we will study gray matter volume, segmented in T1-weighted scans, while in the HIV-study we will assess white matter microstructure, measured with diffusion weighted MRI. Within the ADNI study, we will in more detail classify sub-classes with baseline data on Mild Cognitive Impairment (MCI) that will either progress into AD or remain stable. Third, we present a visualization method to highlight regions in the original MRI input that provide input for or against a certain class. This visualization method facilitates the understanding of how a particular classification decision is taken through the complex non-linear function resulting from the scattering transformation. Our method can visualize evidence on both the group and the individual level, and will be demonstrated on the ADNI-dataset.
The rest of the paper is organized as follows. Section 3 describes the 3D scattering transform, the relevant theoretical properties of this transform and the proposed visualization method. Empirical results on three MRI datasets on brain damage due to HIV (AGEhIV) and Alzheimer's disease (OASIS and ADNI) are presented in Section 4, followed by a conclusion.
The introduced paradigm consists of two principal steps. The first is to transform all data instances from their MRI representation into the scattering representation by the 3D scattering transform. The second is to perform supervised (respectively semi-supervised in the last experiment) learning. A scheme of the phases of the paradigm is displayed in Fig. 1. In the beginning, an initial procedure is performed to regress out the effect of age on the MRI scans. After estimating the age effect in a regression on the healthy controls, it is then regressed out of the data belonging to all classes. The 3D scattering transform is subsequently applied to all the age regressed MRI data, resulting in the scattering representation of each MRI dataset. A detailed description of the 3D scattering transform is provided throughout the rest of this section. Afterwards, learning is applied to the data in the scattering representation. Cross-validation is performed in all the experiments where a portion of 60% of the data is reserved for training, 20% for validation and 20% for testing. Supervised learning is performed using a linear support vector machines (SVM) classifier. The learning algorithm used in the last experiment, which is a semi-supervised learning experiment performed on the ADNI dataset, is a Laplacian SVM. Finally, we will present our method of visualization of evidence for or against a class, on both the group and individual level.
We will first review the wavelet transform, before describing in detail the structure of the scattering transform. The 3D wavelet transform (Mallat, 1999) expands a three-dimensional signal on a basis of rotated and dilated wavelets. Wavelets are constructed from a Gabor mother wavelet ψ,
where Λ is an optional diagonal metric matrix that controls the aspect-ratio of the filter and x R3. The Gaussian window of the complex Gabor wavelets is modulated by a frequency, ξ 3, which controls the frequency of the filter relative to the width σ of the Gaussian window. We use .
A wavelet ψj,θ,β,γ at scale j and angles θ, β and γ is constructed from the mother wavelet by rotation and dilation:
where Rθ,β,γ is a rotation matrix and a is a scale parameter. Rθ,β,γ is a matrix used to perform rotation in a 3D space with angles, θ, β and γ in directions x, y and z, respectively, assuming x-y-z 3D coordinates (Arfken, 1985).
For notational convenience we define λ = (j,θ,β,γ) and write ψλ = ψj,θ,β,γ.
The wavelet transform of a signal f :3 → (e.g. an MRI scan) is then computed as
where * denotes convolution.
The 3D scattering transform is computed as a sequence of multi-directional and multi-scale wavelet transforms, interleaved with modulus nonlinearities. The wavelet we use is a Gabor wavelet. The resulting computation is similar to a convolutional neural network, but does not have any learnable parameters, making it suitable for use in the low-data regime. At the first layer (layer 0), a Gaussian filter ϕJ(x) at scale J is applied leading to a blurred version of x (see Fig. 2). The averages of these subsampled coefficients are then stored as part of the scattering representation. Then, a wavelet transform is applied to the MRI scan f: the convolutions are computed for all j1, θ1, β1 and γ1 in a fixed grid (see Fig. 2) before storing the averages of these coefficients as another part of the scattering representation. After computing the modulus, we again blur and subsample (see Fig. 2), this time at scale jq (jq is the scale of last wavelet applied, as we will indicate later, this is equivalent to j2 in our case and in the classification problems in general), and then store averages of the resulting coefficients as part of the scattering representation. A final wavelet transform is applied, yielding for all combinations of j1,j2,(θ1,β1,γ1),(θ2,β2,γ2) where j2 < j1. After a final blurring and subsampling, we obtain the coefficients for the last layer.
The scattering transform of f consists of the computed coefficients at each layer:
Scattering coefficients can distinguish complex image structures, e.g. corners and junctions, from edges as they yield co-occurrence information for any pair of scales j1 and j2, and directions (θ1,β1,γ1) and (θ2,β2,γ2), where the subscript number indicates the order. Scattering coefficients are calculated only for scales j2 < j1, because, as proven in Mallat (2012), the energy in is negligible at scales j2 ≥ j1.
Dependent on the number of scales and angular resolution, the scattering transform can greatly increase the dimensionality of the data. Hence, care must be taken in choosing the resolution settings for the scattering transform, as well as the regularization parameters. In this section we derive how the number of scattering coefficients depends on the settings, and describe the settings used in our experiments.
After averaging scattering coefficients, there is one scattering coefficient at order 0, which is the average of the original instance representation. The number of averaged scattering coefficients at orders 1 and 2 (most of the energy needed for classification lies in coefficients of orders 0, 1 and 2) depend on the number of wavelet scales J, and the resolution along each rotation axis θ,β,γ that we denote by L. At layer 1, there are d1 = JL3 wavelets and hence equally many feature maps and output coefficients. For each of the JL3 feature maps, the second layer wavelet transform is computed, but only for smaller scales. Hence, the number of coefficients at layer 2 is
For example, for L = 2, J = 4, number of wavelet coefficients at orders 1 and 2 can be calculated as follows:
Thus, the scattering transform representation of the original 3D MRI volume has dimensionality 417.
Similarly for L = 4, J = 4, the number of scattering coefficients at orders 1 and 2 can be calculated as follows:
Thus, input is transformed to approximately 2.5 104 dimensions in this case. Assuming a resolution of 2 mm, the input dimensionality of an MRI-scan is approximately 106. In this situation, the dimensionality would therefore be reduced by roughly two orders of magnitude.
In brain morphometry studies, the differences between healthy and diseased states, and between diseased sub-states, are assumed to be described by non-linear deformations of the 3D volumetric data, which are mapped to modulated tissue probability maps that are to be compared.
An important step in the preprocessing is to remove any unwanted intensity variation, for instance due to anatomical misalignment between subjects by non-rigid registration. Such preprocessing will however not be able to remove all anatomical variations, such that residual misalignments will still be present in the data. Hence, it is important that the representation is stable to the actions of small diffeomorphisms (which explain most of the intra-class variability), while making the classes linearly separable.
As shown in Bruna and Mallat, 2013, Mallat, 2012, the scattering transform is Lipschitz continuous to deformations. This means that for a signal f, a diffeomorphism τ acting on the signal as fτ(x) = f(x − τ(x)), there is a constant C such that:
Here S denotes the scattering transform and supx|τ(x)| measures the magnitude of the deformation via the norm of the deformation gradient, τ(x) (Bruna and Mallat, 2013). This result shows that the difference between the scattering transform of f and its transformed version fτ is bounded by a constant C times the norm of f times the magnitude of the deformation.
It follows from the Lipschitz continuity that the effect of a small diffeomorphism on the scattering representation is well-approximated by a linear map. As we verify in our experiments, this ensures that linear classifiers are able to handle variability due to deformations well.
The idea of the Lipschitz continuity in the context of the scattering transform was first noted and illustrated in pages 2, 3 and 4 -starting from Eq.(2)- in Mallat (2012). The idea of the Lipschitz continuity in general can as well be found in Chapter 9 in O’Searcoid (2006) .
We propose a visualization method via which regions in the MRI scans providing evidence for or against a certain class can be highlighted. As such, the proposed visualization method provides insight into the decision making process of a linear classifier acting on the scattering representations of input MRIs. The visualization method is based on the gradient of the learning function with respect to the input features of an input MRI scan. We thus aim to quantify the contribution of voxel values per individual subject to the classification boundary. These are furthermore averaged per class (patients or controls), resulting in distinct class mappings. Importantly, this approach differs from the commonly visualized classifier coefficients. To reduce noise effects in computing the gradients, we compute gradients by perturbing principal components (PCs) of the input data.
More formally, for an input data sample, x, and a discrete classification output (class), y, let the the learning function be G(x) = P(y|x), hk be the PCs of the input matrix (of the original voxels), k be indices over the selected PCs, and ϵk be small values used to perturb hk. Using Taylor's expansion, we can write:
since we can safely neglect terms of o(ϵ2). Also note that hk forms an orthonormal basis, for which the following holds:
And this can be the case with small values of ϵ. Eq. (12) can be approximated by using our PCs hk. Recall that these are computed in the original voxel space, and not over the scattering operators. Thus, by perturbing hk we approximately compute the gradient of G(x) by Eq. (12) and obtain the directions of maximum change in G(x). The directions of maximum change for each input image are the weights of the respective principal directions. The number of PCs can be selected as a small fraction of the total number of included subjects in the dataset. Using fewer PCs leads to more blurring.
The learning model, which in our case is an SVM classifier on top of the scattering representation, is expressed in this visualization methodology by G(x). Using probabilities is one common way of expressing G(x), but is not necessary. We use a function of the distance from the corresponding instance, x, to the boundary of the SVM, divided by the sum of the distances of all training instances belonging to the same class, to the boundary, to compute G(x) = P(y|x). This is valid since the farther an instance is from the boundary, the more certain it is that this is the right class for such instance. An alternative is to use Platt scaling, which is calibrated but it requires post-processing and a separate validation set.
In our experiments, we will compare three different feature representations:
These will be computed on three brain MRI datasets, that will be explained in more detail in the following sections:
The classifier in use is an SVM with a soft margin. SVM is one of the most commonly used classification algorithms and it was demonstrated in previous works that it is more accurate than alternatives, e.g. Othman et al., 2011, Yang et al., 2011. The dual form of soft margin SVMs (13), which is formulated into a quadratic programming problem, is solved using sequential minimal optimization (SMO) (Platt, 1999),
where αi for i = 1...,n denote Lagrange multipliers, and αi = 0 for all training instances except the support vectors. The symbol K denotes the SVM kernel.
Linear SVM (SVM with a linear kernel) is used in all the supervised learning experiments. Cross-validation is also applied to all the performed experiments. Training is performed on only 60% of the available data, whereas 20% of the data is reserved for validation of the scattering parameters, namely number of orientations, L, and number of scales, J (respectively validation of number of independent components in case of the ICA representation), among other parameters. The test set consists of the remaining 20% of the data.
The AGEhIV Cohort Study is an ongoing study on prevalence, incidence and risk factors of ageing-associated comorbidities and organ dysfunction among HIV patients and highly comparable HIV-uninfected controls of at least 45 years of age (i.e. same geographic region with similar socio-demographic and behavioral(risk) factors) (Schouten et al., 2014).
The AGEhIV dataset consists of MRI data of 100 HIV-infected patients and 60 healthy controls, included in the Academic Medical Center (AMC) in Amsterdam, The Netherlands. Of these subjects, diffusion weighted MRI data were acquired at 2 mm resolution. Preprocessing of the data was performed with software developed in-house (Matlab, MathWorks, Natick, MA), using the HPCN-UvA Neuroscience Gateway and using resources of the Dutch e-Science Grid (Shahand et al., 2014). As a result, Fractional Anisotropy (FA) maps were computed. FA is sensitive to microstructural damage and therefore expected to be, on average, decreased in patients. Subjects were scanned on two 3.0 Tesla scanner systems, 121 subjects on a Philips Intera system and 39 on a Philips Ingenia system. Patients and controls were evenly distributed over both scanners. More details on the study, data and preprocessing can be found elsewhere (Su et al., 2016).
We perform two experiments on the AGEhIV dataset. In the main experiment, Experiment A, the class labels refer to the disease type, i.e. being diagnosed with or without HIV. The used MRI scanner is added as an additional feature to the voxel data. In Experiment B, MRI scanner is used as the class label, and HIV-status is added as a feature to the data. In these experiments we chose not to linearly regress out age, but rather add age as an additional feature to the voxel data, similar to the above-mentioned covariates.
The number of ICA components was chosen by optimizing performance on the validation set and resulted in 9 components. For the scattering representation, parameters leading to the results displayed in Table 1 have also been determined on the validation set. The angular resolution is L = 4, while the number of wavelet scales used is J = 4, which leads to a total of 24,833 features, i.e. scattering coefficients.
Table 1 shows the SVM classification results of both experiments A and B. In experiment A, SVM applied to the scattering representation of the MRI scans achieves better accuracy than SVM over the raw voxel form, and than SVM over the ICA-based representation. Adding up age and scanner information does not improve the overall classification accuracy. An equivalent improvement for the scattering representation compared to the other two representations is obtained in experiment B.
As mentioned in Section 2.3, most of the energy required to capture discriminative information for classification lies in scattering coefficients of orders 0, 1 and 2. Therefore, all results obtained here are based on two scattering layers. To validate this choice, we performed an additional experiment using one, two and three layers of coefficients. The classification accuracies obtained were 58.0%, 75.1% and 75.0% respectively. Thus, involving scattering coefficients of order 3 leads to a massive increase in the number of used scattering coefficients without any performance gain. Both of these conclusions have as well been reached by Bruna and Mallat, 2011, Bruna, 2013.
Fig. 3 (a) displays a histogram showing the difference in mean error, i.e. error of SVM applied to the original features minus the error of SVM applied to the scattering features, for different data partitions. A positive difference thus refers to a larger error of SVM applied to the original features.
Receiver operating characteristic (ROC) curves for the SVM classifier applied to the three representations of the HIV dataset are shown in Fig. 4 (a). For the classification results arising from the ICA components vs. the scattering representation, the p-value for a paired Student's t-test is equal to p = 0.045, which means that the null hypothesis – being that the pairwise difference in the results between the two representations is not significant – is rejected at a standard significance level of p = 0.05. The p-value of the raw form (original voxels) vs. the scattering representation is equal to p = 4.48 × 10 −4.
Out of the 24,833 scattering features, 12,483 features have values smaller than 10 −14, across all instances (MRI scans). By discarding such features and selecting the rest to be the input to SVM, identical accuracy results are obtained with a sped up run-time. We used a linear kernel for the results reported in Table 1. The training set consists of 60 HIV and 36 healthy scans. The validation set consists of 20 HIV as well as 12 healthy scans, and the same goes for the test set.
The Open Access Series of Imaging Studies (OASIS) dataset consists of high-resolution T1-weighted MRI scans of 182 MRI scans, including 60 Alzheimer's patients and 122 healthy controls. These data were preprocessed using the Statistical Parametric Mapping software (SPM v12). This involved tissue segmentation into gray matter and white matter probability maps, followed by spatial normalization to a pre-defined standard template, in MNI152 space. The registration parameters for the normalization were calculated using the DARTEL non-linear approach in SPM. The non-linear warping was then applied to the gray matter and white matter segmentations separately to align them to the template, during which images were resampled to 1.5 mm isotropic voxels, modulated by the extent of warping necessary in order to retain volumetric information and smoothed using a 4 mm FWHM Gaussian kernel. The resulting volumes contain (120 × 144 × 120) voxels.
Scattering representations of the 3D scans are used as the input to an SVM classifier. MRI scans are then classified into one of two classes: Alzheimer or healthy.
As mentioned above, the raw form representation of each OASIS MRI scan contains 120 × 144 × 120 features. 13 ICA components are used in the ICA-based representation experiment, which was determined on a validation set. Again, an angular resolution, L = 4, and number of wavelet scales, J = 4, were used to extract the scattering coefficients.
Experimental results on the OASIS data are displayed in Table 2. Out of the three representations, SVM applied to the scattering representation of the MRI scans achieves the highest accuracy.
Fig. 3 (b) displays a histogram showing the difference in mean error between SVM applied to the original features and SVM applied to the scattering features, i.e. error of SVM applied to the original features - error of SVM applied to the scattering features. Again, a positive difference refers to a larger error of SVM applied to the original features.
Receiver operating characteristic (ROC) curves for the SVM classifier applied to the three representations of the dataset are shown in Fig. 4 (b). For the classification results of the ICA-based representation vs. the scattering representation, the p-value of a paired T-test is equal to 0.06. The p-value for the classification accuracy of SVM applied to the raw MRI form vs. SVM applied to the scattering representation is 0.05.
Similar to Section 3.1, a simple feature selection mechanism is applied to the scattering coefficients where coefficients with value <10 −14 are removed for the sake of optimizing computational run-time. The training set consists of 36 Alzheimer and 74 healthy scans. The validation set consists of 12 Alzheimer as well as 24 healthy scans, same size as the test set.
The Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For up-to-date information, see www.adni-info.org. We conduct experiments on two subsets of the ADNI dataset, one, which stands in line with the main genesis of the work, consists of a relatively small sample of each class. On the other hand, the other sample is a bigger sample by which we assess supervised and semi-supervised learning performances based on the scattering feature representation in cases where the training data are not too small.
Focussing on the main theme of this work, which is to assess the scattering transform in small datasets, we initially work with a subset of 150 MRI scans consisting of 50 AD scans, 50 MCI scans and 50 healthy scans. The subjects were randomly chosen from each class, and they cover different sex and acquisition differences (recall that age has already been regressed out). Similar to previous works on the same dataset, we perform binary classification experiments consisting of all possible two-class combinations. We also perform an experiment combining AD and MCI scans (we refer to this composite class as “disordered”) against healthy scans.
Each ADNI MRI scan used in the experiments is a normalized 160 × 192 × 160 voxel volume. In the ICA-based representation experiment, 18 ICA components are used, as determined on a validation set. Again, an angular resolution, L = 4, and number of wavelet scales, J = 4, are used to extract the scattering coefficients. The training set consists of 30 AD, 30 MCI and 30 healthy scans. The validation set consists of 10 AD, 10 MCI and 10 healthy scans, and the same goes for the test set.
Experimental results on the ADNI dataset are displayed in Table 3. Out of the three representations, SVM applied to the MRI scattering representation achieves significantly better performance than the raw form and the ICA-based representations.
The four ROC-curves correspondent to the four experiments in the upper part of Table 3 are shown in Fig. 4 (c–f). Again, all experiments are SVM classification experiments applied to the three representations of the ADNI dataset. As can be seen in the figures, classification results of SVM over the scattering representation are more accurate than the raw form and than the ICA-based representation.
The p-values of a paired T-test for the ADNI experiments are displayed in Table 4.
The larger sample of ADNI used in this experiment consists of 835 MRI scans. One of the advantages of using this ADNI sample in particular is that - contrary to the majority of experiments performed on the ADNI dataset in the literature - it allows us to compare on common ground with a previous work on the data, since the same sample was used in the experiments performed in Moradi et al. (2015). This sample consists of 200 Alzheimer's disease (AD) MRI scans, 231 healthy MRI scans and 404 MCI MRI scans. We perform two classification tasks and a semi-supervised learning task on this ADNI sample. The first is a binary classification task of the AD vs. healthy MRI scans. Two MCI-based tasks are performed on the MCI subset of this ADNI sample, one classification and one semi-supervised learning task. The goal of the MCI classification task is to discriminate between progressive MCI (pMCI) scans and stable MCI (sMCI) scans. In this task, the goal is to predict whether an MCI patient will convert to an AD over a 3-year period (referred to as pMCI) or not (referred to as sMCI) (Moradi et al., 2015). The MCI MRI scans we have in this sample include 164 pMCI and 100 sMCI scans besides other MCI scans not known as to which MCI state they belong (unlabeled MCI scans).
Regarding the classification of the AD and healthy MRI scans, the subsample used for this task is composed of 200 AD and 231 healthy MRI scans. The number of ICA components in use in the ICA-based representation experiment is 17, as determined on a validation set. As in the corresponding AGEhIV, OASIS and small ADNI sample experiments, an angular resolution, L = 4, and number of wavelet scales, J = 4, are used to extract the scattering coefficients. A proportion of 60% of the data is used for training, and 20% of the data is used for validation and for testing. Results of the AD vs. healthy classification experiment are shown in the lower part of Table 3. The classification accuracy achieved by the proposed scattering-based classifier is 84.9%, compared to state-of-the-art leading to corresponding accuracy of 82% by Moradi et al. (2015). The scattering-based classifier leads to state-of-the-art results on this sample of ADNI, better than the other two representations, and better than the result reported in Moradi et al. (2015). We perform a single additional experiment to reduce the number of scattering coefficients by PCA. The first 20 principal components (PCs) were selected and given as an input to the SVM classifier, instead of the scattering features. This results not only in speeding up the classifier, but also in an improvement in the classification accuracy, as can be seen in Table 3.
Results of the ADNI MCI experiments are displayed in Table 5. Data for the classification experiment consist of 164 pMCI and 100 sMCI MRI scans. By applying a linear SVM classifier to the scattering coefficients of these scans, a classification accuracy of 73.5% is achieved, based on cross-validation (upper half of Table 5), compared to state-of-the-art accuracy on the same data ranging from 66.0% to 69.2%, as reported in Table 3 in Moradi et al. (2015).
Data for the semi-supervised learning (SSL) experiment consisted of 164 pMCI, 100 sMCI and 140 unlabeled MCI scans. The SSL algorithm we use on top of the scattering representation is a variation of semi-supervised SVM (Melacci and Belkin, 2011) where Laplacian SVM is efficiently trained in the primal with preconditioned conjugate gradient based on the L2 hinge loss2 . The learning accuracy resulting from using Laplacian SVM on the scattering coefficient representation is superior to the best SSL classification accuracy reported in Table 3 in Moradi et al. (2015). It can also be noticed that adding unlabeled MRI data to data of both algorithms improves the learning accuracy over their fully supervised learning counterparts.
We present a visualization of the classification of scattering coefficients on the supervised learning experiment of pMCI vs. sMCI only. Based on Eq. (12), gradients of all MRI scans belonging to both the pMCI and sMCI classes are computed.
Fig. 5 shows an example on two single pMCI and two sMCI brains, with minimal and maximal distance to the classification boundary. Sub-cortical regions receive higher weight when classifying pMCI, while cortical gray matter provides more evidence for classifying sMCI. The cingulate cortex provides evidence for both classes.
Additionally, averaging those maps over subjects per class provides more generic information. Of interest, in the stable MCI group, larger cortical gray matter areas receive a high weight. In the progressive MCI group, we see effects in the hippocampus, amygdala, entorhinal cortex, precuneus and cingulate cortex.
In this paper, we propose a 3D scattering transform as a fixed representation with no learnable parameters, that can replace deep learning methods in paradigms where we do not have at disposal massive amounts of data points available for use. We present a visualization method of evidence for belonging to a certain class, both for individual cases and class averages. In our experiments we have extensively validated our method on different datasets, showing improved performance over currently established SVMs, also with ICA added as a dimensionality reduction step.
The scattering transform and deep learning are directly related (see Bruna and Mallat (Bruna and Mallat, 2011, Bruna and Mallat, 2013), Mallat (2012) for an in depth explanation). For instance, there is a pooling operation in scattering networks implemented as a subsampling step and there is also a nonlinearity implemented as the absolute value of the activities (instead of a rectified linear unit (ReLU)). The fact that features are extracted from every layer also has an equivalent in Convolutional Neural Networks (CNNs) called “skip-connections” (Graves, 2013, He et al., 2016, Huang et al., 2016). Finally, the SVM is not all that different from the final fully connected logistic regression layers in CNNs, since both are linear classifiers. Since in our case where we do not train the parameters of the CNN (it is given by a fixed scattering transform) the SVM is more convenient because software packages exist that very quickly and reliably optimize the SVM's parameters. Also, very good software packages for semi-supervised learning exist for the SVM. Next we describe CNNs before moving to the principal conclusions of this work.
Convolutional neural networks (CNNs) are a type of feed-forward artificial neural networks. CNNs convolve the input image (or voxel grid) by a set of learned filters, resulting in a so-called feature map for each filter. Each response value (neuron activation) in a feature map is the result of a small filter operating on a small region of the input referred to as the receptive field of the neuron. CNNs achieve insensitivity to small translations and deformations by pooling activations in small regions of the feature maps, for example by computing the maximum over a 2 ×2 region (max-pooling) or average (mean-pooling). Deep networks can be constructed by stacking convolutions, pooling operators and point-wise nonlinearities such as rectified linear activation functions. CNNs have been intensively applied to vision and image recognition problems (Farabet et al., 2010, Matusugu et al., 2013, Ciresan et al., 2013, Behnke, 2003, Yaniv et al., 2015, Masci et al., 2013), document recognition (LeCun et al., 1998) and medical signal processing (Graupe et al., 1988, Graupe et al., 1989), among numerous other applications. However, the learned CNN transformations have mostly led to very good classification results in cases where a lot of data is available. Since learning the accompanying parameters necessitates the availability of significant amounts of data, a high risk of overfitting occurs when using deep learning techniques in low-data regimes.
The proposed 3D scattering representation is a fixed representation with no learnable parameters, which is one of the reasons why it is much less prone to severe overfitting than CNNs in low-data regimes. The scattering transform is translation invariant and stable to small deformations. The scattering transform involves convolutions followed by an averaging step (can as well be seen as a mean-pooling step), and a supervised learner (classifier) is applied on top of it so that labels can be learned. As demonstrated by the experiments, the premise is that the scattering transform can be a better representation for the data in terms of explaining the important variations and discarding the rather unimportant variations, and this ultimately leads to better classification accuracy.
We have developed and implemented a feature extraction method based on three dimensional scattering transformations and tested it for its ability to discriminate HIV and Alzheimer's disease from healthy subjects and subjects with mild cognitive impairment. We have clearly shown that our proposed methodology achieves higher accuracy than the best competing methods, in particular an SVM applied to features extracted using ICA. We believe that one of the main reasons that scattering is successful in the neuroimaging domain is its stability against small deformations of the input image. Also its ability to detect differences in “tissue textures” may be important. We believe scattering representations are particularly useful in the high-dimensional but small-number-of-patients regimes that are typical in medical imaging.
In our experiments, we have used two open source databases. In Yang et al. (2011), a fast ICA representation was implemented and used for the classification of OASIS MRI signals. It achieved an average accuracy of 70.7%, which is outperformed by the 73% classification performance of our scattering representation. Regarding the ADNI-study, the average MCI vs. Healthy classification accuracy over the fast ICA representation implemented in Yang et al. (2011) is 72%, compared to a classification accuracy of 79.4% achieved by the scattering representation. For the AD vs. Healthy experiment, the classification accuracy reported in Yang et al. (2011) is 76.9%, whereas the corresponding scattering classification accuracy is 78.5%. The scattering representation outperforms Fast ICA in spite of the fact that we use only 50 MRI scans per class, in contrast with 202 AD, 410 MCI and 236 healthy MRI scans used in the experiments conducted in Yang et al. (2011). The work in Gupta et al. (2013), which is based on a high-data regime where a sparse auto-encoder was used to learn a set of bases from natural images, uses 755 ADNI MRI scans per class, compared to our 50 scans per class in our experiments, and they consequently achieve higher classification results. In the context of classifying progressive MCI, Moradi et al. (2015) showed that they classified beyond the state of the art. Here we show that the proposed scattering transform allows to further improve upon the classification. Still, our classification performance is not an outlier compared to earlier studies (Arbabshirani et al., 2016), reducing the likelihood of possible overtraining. In the ADNI-dataset age effects were regressed out on voxel-level in a univariate way, whereas for the AGEhIV dataset we followed an alternative approach by adding nuisance variables as independent features. Thus, we have carried out different but equally valid approaches to handle this issue throughout the different performed experiments.
Visualizing the classification result is of high clinical importance. In the current practice, a classification experiment is interpreted by studying the classifiers' weight vector. In this respect, previous work studied the interpretation of weight vectors in backward models (Haufe et al., 2013), and provided an analytical approximation to permutation testing for running computationally efficient experiments (Gaonkar and Davatzikos, 2013).
We propose a method that allows to study the probability of a single subjects' brain to be assigned to a specific class. When averaging over the progressive MCI group, as can be seen in Fig. 6, we observed that the hippocampus, amygdala and entorhinal cortex were involved. These regions have been reported earlier in the context of predicting progression in MCI (Ye et al., 2012) and discriminating MCI from healthy controls (Desikan et al., 2009). Atrophy in the enthorinal cortex has been shown to predict cognitive decline in Alzheimer's disease (Velayudhan et al., 2013). Hippocampal atrophy has been observed in earlier stages of Alzheimer research (Henneman et al., 2009). An additional strong effect that we observe in the precuneus and cingulate cortex has been shown to be associated to hypometabolism in MCI (Bailly et al., 2015).
Although we have not done so, we believe that regression problems could also benefit from applying the scattering transform to the data. This would provide a more natural embedding of regressing out covariates such as age and scanner. We have done this in a separate step and a univariate way, which is a limitation of our experiments.
In conclusion, we propose a scattering transform that proved to be highly effective in small datasets, under different experimental conditions and for multiple disease types. The classification can be visualized on both the individual and group level.
We gratefully acknowledge support from the NWO IPPSI-KIEM program no.: 628.005.012.
We warmly thank Dr. James Cole for providing us with preprocessed OASIS data, and Dr. Jussi Tohka for providing us with the preprocessed ADNI data of their paper Moradi et al. (2015).
The work described in this paper is partially performed using the AMC Neuroscience Gateway, using resources of the Dutch e-Science Grid with the support of SURF Foundation. The acquisition of the AGEhIV HIV-data was supported by the Nuts-OHRA Foundation (grant no. 1003-026), Amsterdam, The Netherlands, as well as by The Netherlands Organisation for Health Research and Development (ZonMW) together with AIDS Fonds (grant nos. 300020007 and 2009063, respectively). Additional unrestricted scientific grants were received from Gilead Sciences, ViiV Healthcare, Janssen Pharmaceutica N.V., Bristol-Myers Squibb, Boehringer Ingelheim, and Merck&Co. None of these funding bodies had a role in the design or conduct of the study, the analysis and interpretation of the results, or the decision to publish.
The OASIS dataset is supported by grant numbers P50 AG05681, P01 AG03991, R01 AG021910, P50 MH071616, U24 RR021382, R01 MH56584.
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
1Most of the changes performed in this resubmission are colored blue. However, in cases, like figures and equations, where we thought that having two different colors might cause confusion, color remains black.
2Code for Laplacian SVM used in our experiments is available at http://sourceforge.net/projects/lapsvmp/