Med Image Comput Comput Assist Interv. Author manuscript; available in PMC 2011 January 1.
Published in final edited form as:
Med Image Comput Comput Assist Interv. 2010; 13(Pt 3): 611–618.
PMCID: PMC2951627

Sparse Bayesian Learning for Identifying Imaging Biomarkers in AD Prediction


We apply sparse Bayesian learning methods, automatic relevance determination (ARD) and predictive ARD (PARD), to Alzheimer’s disease (AD) classification in order to make accurate predictions and, at the same time, identify critical imaging markers relevant to AD. ARD is one of the most successful Bayesian feature selection methods. PARD is a powerful Bayesian feature selection method that provides sparse models that are easy to interpret. PARD selects the model with the best estimated predictive performance rather than the one with the largest marginal model likelihood. A comparative study with support vector machines (SVMs) shows that ARD/PARD in general outperform SVM in terms of prediction accuracy. An additional comparison with surface-based general linear model (GLM) analysis shows that the regions with the strongest signals are identified by both GLM and ARD/PARD. While the GLM P-map returns significant regions all over the cortex, ARD/PARD provide a small number of relevant and meaningful imaging markers with predictive power, including both cortical and subcortical measures.

1 Introduction

Neuroimaging is a powerful tool for characterizing the neurodegenerative process in the progression of Alzheimer’s disease (AD) and can provide potential surrogate biomarkers for therapeutic trials. This paper focuses on identifying relevant imaging biomarkers from structural magnetic resonance imaging (MRI) data for AD classification. Machine learning methods have been applied to many problems in computational neuroscience, including computer-aided diagnosis for AD [1, 3, 4, 6, 7, 9]. While popular methods like support vector machines (SVMs) [15] can achieve decent prediction accuracy, most of them are not optimized for selecting sensitive features.

This paper presents the results of applying novel sparse Bayesian learning methods, automatic relevance determination (ARD) and predictive ARD (PARD) [13], to MRI-based AD classification in order to achieve two goals at the same time: (1) an accurate prediction rate and (2) selection of relevant imaging biomarkers. Linear SVM and general linear model (GLM) based cortical thickness analyses were also performed on the same data for comparison with ARD/PARD. Our overarching goal is to learn sparse Bayesian models from these data so that they are easy to interpret while maintaining high predictive power.

2 Materials and Methods

MRI data used in this study were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. ADNI is a landmark investigation, sponsored by the NIH and industrial partners, designed to collect longitudinal neuroimaging, biological and clinical information from 800 participants in order to track the neural correlates of memory loss from an early stage. Further information can be found in [11]. Following a previous imaging genetics study [14], 378 non-Hispanic Caucasian participants (203 healthy control (HC) and 175 AD participants) were selected for this work. For one baseline scan of each participant, FreeSurfer V4 was employed to automatically label cortical and subcortical tissue classes [2, 5] and to extract target region volume and cortical thickness, as well as total intracranial volume (ICV), as previously described [14]. For each hemisphere, thickness measures of 34 cortical regions of interest (ROIs) (Fig. 1(a–f)) and volume measures of 15 cortical and subcortical ROIs (Fig. 1(c–f)) were included in this study. All these measures were adjusted for baseline age, gender, education, handedness, and baseline ICV using the regression weights derived from the HC participants. Participant characteristics are summarized in Table 1.
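The covariate adjustment described above can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit on the HC participants only, followed by subtraction of the fitted covariate effects (but not the intercept) from every participant; the text does not state these details, and the function names and toy numbers below are hypothetical.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_covariate_weights(covariates: Sequence[Sequence[float]],
                          values: Sequence[float]) -> List[float]:
    """Least-squares weights [intercept, b_1, ..., b_k] via the normal equations."""
    rows = [[1.0] + list(c) for c in covariates]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve_linear(xtx, xty)


def adjust_measure(covariates: Sequence[Sequence[float]],
                   values: Sequence[float],
                   is_hc: Sequence[bool]) -> List[float]:
    """Remove covariate effects from all participants using HC-derived weights."""
    hc_cov = [c for c, h in zip(covariates, is_hc) if h]
    hc_val = [v for v, h in zip(values, is_hc) if h]
    slopes = fit_covariate_weights(hc_cov, hc_val)[1:]  # assumption: keep the intercept
    return [v - sum(b * x for b, x in zip(slopes, c))
            for c, v in zip(covariates, values)]


# Toy data (hypothetical): covariates are (age, ICV); four HC and two AD participants.
covs = [(60.0, 1.4), (70.0, 1.5), (65.0, 1.6), (80.0, 1.3), (75.0, 1.45), (68.0, 1.55)]
is_hc = [True, True, True, True, False, False]
raw = [(2.0 if h else 1.0) - 0.01 * age + 0.5 * icv
       for (age, icv), h in zip(covs, is_hc)]
adjusted = adjust_measure(covs, raw, is_hc)
```

Fitting on HC only keeps the disease effect out of the covariate weights, so the group difference survives the adjustment.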

Fig. 1
(a–b) Heat maps of PARD weights −w in cortical analyses using 34 thickness measures. (c–f) Heat maps of PARD (c,d) and ARD (e,f) weights −w in analyses using 15 volume (top) and 34 thickness (bottom) measures. 10-fold cross-validation ...
Table 1
Participant characteristics

ARD and Predictive ARD

We apply ARD and predictive ARD (PARD) [13] to classify the imaging features. ARD is one of the most successful Bayesian feature selection methods [8, 12]. It is a hierarchical Bayesian approach where there are hyperparameters which explicitly represent the relevance of different input features. These relevance hyperparameters determine the range of variation for the parameters relating to a particular input, usually by modeling the width of a zero-mean Gaussian prior on those parameters. If the width of that Gaussian is zero, then those parameters are constrained to be zero, and the corresponding input cannot have any effect on the predictions, therefore making it irrelevant. ARD optimizes these hyperparameters to discover which inputs are relevant.

Predictive ARD improves upon ARD in the following aspects. First, the Laplace approximation used in ARD [8] is replaced by the more accurate expectation propagation (EP) [10]. Second, EP computes an estimate of leave-one-out predictive performance without requiring expensive cross-validation experiments. This estimate of predictive performance can be used as an important criterion for ARD to avoid the overfitting problem associated with evidence maximization. Last, predictive ARD uses a fast sequential optimization method such that we can efficiently prune and add new features without updating a full covariance matrix for the classifier.

Now we describe ARD for linear classification. A linear classifier classifies a point $x$ according to $t = \mathrm{sign}(w^T x)$ for some parameter vector $w$ (the two classes are $t = \pm 1$). Given a training set $D = \{(x_1, t_1), \ldots, (x_N, t_N)\}$, the likelihood for $w$ can be written as

$$p(t \mid w, X) = \prod_{i=1}^{N} p(t_i \mid x_i, w) = \prod_{i=1}^{N} \Psi\left(t_i\, w^T \phi(x_i)\right)$$

where $t = \{t_i\}_{i=1}^{N}$, $X = \{x_i\}_{i=1}^{N}$, and $\Psi(\cdot)$ is the cumulative distribution function for a Gaussian. One can also use the step function or logistic function as $\Psi(\cdot)$. The basis function $\phi(x_i)$ allows the classification boundary to be nonlinear in the original features. This is the same likelihood used in logistic regression and in Gaussian process classifiers. Given a new input $x_{N+1}$, we approximate the predictive distribution:

$$p(t_{N+1} \mid x_{N+1}, t) = \int p(t_{N+1} \mid x_{N+1}, w)\, p(w \mid t)\, dw$$

$$\approx p(t_{N+1} \mid x_{N+1}, \langle w \rangle)$$

where $\langle w \rangle$ denotes the posterior mean of the weights, called the Bayes Point.
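As a minimal sketch of the likelihood and the Bayes-point prediction, the following assumes the Gaussian-CDF (probit) choice for Ψ and an identity basis function φ(x) = x. The function names are hypothetical, and the posterior mean of the weights is simply passed in rather than inferred.

```python
import math
from typing import Sequence


def probit(z: float) -> float:
    """Standard Gaussian cumulative distribution function, used here as Psi."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def log_likelihood(w: Sequence[float],
                   xs: Sequence[Sequence[float]],
                   ts: Sequence[int]) -> float:
    """log of prod_i Psi(t_i * w^T x_i), with labels t_i in {-1, +1}."""
    return sum(math.log(probit(t * dot(w, x))) for x, t in zip(xs, ts))


def predict_proba(w_mean: Sequence[float], x_new: Sequence[float]) -> float:
    """Bayes-point approximation: P(t = +1 | x_new, <w>)."""
    return probit(dot(w_mean, x_new))
```

A feature whose weight is zero in `w_mean` drops out of `dot`, which is how a sparse weight vector translates into a small set of imaging markers.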

The basic idea in ARD is to give the feature weights independent Gaussian priors:

$$p(w \mid \alpha) = \prod_{i} \mathcal{N}\left(w_i \mid 0, \alpha_i^{-1}\right)$$

where $\alpha = \{\alpha_i\}$ is a hyperparameter vector that controls how far away from zero each weight is allowed to go. The hyperparameters $\alpha$ are trained from the data by maximizing the Bayesian ‘evidence’ $p(t \mid \alpha)$, which can be done using a fixed point algorithm or an expectation maximization (EM) algorithm treating $w$ as a hidden variable [8]. The outcome of this optimization is that many elements of $\alpha$ go to infinity such that the classifier $w$ would have only a few nonzero weights $w_j$. This naturally prunes irrelevant features in the data.
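For reference, the two kinds of hyperparameter update mentioned above are commonly written as follows. This is a sketch of the standard forms, assuming a Gaussian approximation to the posterior over the weights with mean $m$ and covariance $\Sigma$ (symbols introduced here, not in the text); the exact variant used in [8] or [13] may differ in detail.

```latex
% Assumed Gaussian approximation: p(w \mid t, \alpha) \approx \mathcal{N}(w \mid m, \Sigma)
\begin{align}
  \gamma_i &= 1 - \alpha_i \Sigma_{ii},
  & \alpha_i^{\text{new}} &= \frac{\gamma_i}{m_i^{2}}
  && \text{(fixed-point update)} \\
  & & \alpha_i^{\text{new}} &= \frac{1}{m_i^{2} + \Sigma_{ii}}
  && \text{(EM update, } w \text{ as hidden variable)}
\end{align}
```

In both forms, a weight whose posterior mean and variance shrink toward zero drives $\alpha_i$ toward infinity, which is the pruning behaviour described above.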

Unlike previous approaches that use the EM algorithm and find a solution that maximizes the evidence, the predictive-ARD (PARD) algorithm trains the sparse classifier as follows: (1) Initialize the model so that it only contains a small fraction of features. (2) Sequentially update the classifiers via a fast sequential optimization method and calculate the required statistics by EP until the algorithm converges. The sparsity level of the classifiers increases along the optimization iterations. (3) From all the classifiers, choose the classifier with minimum predictive error probability estimate.
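Step (3) can be sketched as a selection over the sequence of classifiers visited during optimization. The sketch below assumes each classifier is stored with an approximate log evidence and an estimated leave-one-out error probability; the `Candidate` class, the feature names and all numbers are hypothetical and only illustrate how the two criteria can disagree.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Candidate:
    features: Tuple[str, ...]  # features with nonzero weight
    log_evidence: float        # approximate log marginal likelihood
    loo_error: float           # estimated leave-one-out error probability


def select_by_evidence(path: Sequence[Candidate]) -> Candidate:
    """Evidence maximization: keep the classifier with the largest evidence."""
    return max(path, key=lambda c: c.log_evidence)


def select_by_predictive_error(path: Sequence[Candidate]) -> Candidate:
    """PARD-style choice: keep the classifier with the smallest error estimate."""
    return min(path, key=lambda c: c.loo_error)


# Illustrative optimization path (made-up values, not results from the study).
path = [
    Candidate(("f1", "f2", "f3", "f4"), -120.0, 0.18),
    Candidate(("f1", "f2", "f3"), -112.0, 0.15),
    Candidate(("f1", "f2"), -105.0, 0.13),
    Candidate(("f1",), -101.0, 0.16),
]
by_evidence = select_by_evidence(path)
by_prediction = select_by_predictive_error(path)
```

In this toy path the evidence keeps rising as features are pruned, while the error estimate bottoms out one step earlier, so the two rules return different classifiers.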


A linear support vector machine (SVM) was applied in our study to provide a comparison to ARD/PARD in terms of prediction accuracy. SVMs represent a new generation of learning systems based on recent advances in statistical learning theory [15]. The aim in training a linear SVM is to find the separating hyperplane with the largest margin; the expectation is that the larger the margin, the better the generalization of the classifier. We employed the OSU SVM Matlab Toolbox in this work.
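To make the margin criterion concrete, the sketch below computes the geometric margin of a given separating hyperplane, i.e. the smallest signed distance from any training point to the boundary. It is not the OSU SVM toolbox and does not train an SVM; the function name and toy points are hypothetical.

```python
import math
from typing import Sequence


def geometric_margin(w: Sequence[float], b: float,
                     xs: Sequence[Sequence[float]],
                     ts: Sequence[int]) -> float:
    """Smallest value of t_i * (w^T x_i + b) / ||w|| over the training set."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(t * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, t in zip(xs, ts))


# Two hyperplanes that both separate the same toy data.
xs = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
ts = [1, 1, -1, -1]
margin_a = geometric_margin((1.0, 0.0), 0.0, xs, ts)
margin_b = geometric_margin((1.0, 1.0), 0.0, xs, ts)
```

Both hyperplanes classify every point correctly, but an SVM would prefer the first because its margin is larger.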

We also performed surface-based analysis to identify thickness changes on the brain cortex and to compare these regions with the imaging markers detected by ARD/PARD. We consider the following general linear model (GLM): $y = X\Psi + Z\Phi + \varepsilon$, where the dependent variable $y$ is cortical thickness; $X = (x_1, \cdots, x_p)$ are the variables of interest (diagnosis in our case); $Z = (z_1, \cdots, z_k)$ are the variables whose effects we want to exclude (age, gender, education, handedness and ICV in our case); $\Psi = (\psi_1, \cdots, \psi_p)^T$ and $\Phi = (\phi_1, \cdots, \phi_k)^T$ are the coefficients; and $\varepsilon$ is the error term. The goal is to test if $X$ is significant (i.e., $\Psi \neq 0$) for some $y \in \partial\Omega$, where $\partial\Omega$ is the cortical surface manifold. To test GLMs, we used SurfStat [16], a Matlab toolbox for the statistical analysis of univariate and multivariate surface and volumetric data using linear mixed effects models and random field theory (RFT) [17].
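For a single variable of interest, the test at one surface location reduces to the usual least-squares t statistic. The sketch below assumes $n$ subjects, a stacked design matrix $M = [X, Z]$ (with an intercept column, if used, counted in $Z$) and a contrast vector $c$ that selects the diagnosis coefficient; these symbols are introduced here for illustration, and the RFT correction applied across the surface by SurfStat is not shown.

```latex
% y: n-vector of thickness values at one surface location; M = [X, Z]
\begin{align}
  \hat{\beta} &= (M^{T} M)^{-1} M^{T} y, \qquad
  \hat{\sigma}^{2} = \frac{\lVert y - M\hat{\beta} \rVert^{2}}{n - \operatorname{rank}(M)}, \\
  t &= \frac{c^{T}\hat{\beta}}{\sqrt{\hat{\sigma}^{2}\, c^{T} (M^{T} M)^{-1} c}} .
\end{align}
```

Repeating this at every location yields a T-map, and correcting the resulting p values for the search over the surface yields a P-map.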

3 Results

Classification was performed on each hemisphere separately, using two sets of imaging features: (1) 34 thickness measures (Fig. 1(a–b)), and (2) 34 thickness measures and 15 volume measures (Fig. 1(c–f)). 10-fold cross-validation was performed for accuracy estimation. Table 2 shows the performance comparison among ARD, PARD and SVM. ARD and PARD outperformed SVM except when using both thickness and volume measures from the right hemisphere. PARD outperformed ARD except when using both thickness and volume measures from the left hemisphere. PARD was designed to improve on the predictive performance of ARD based on theoretical considerations; empirically it worked better in most cases but not all. Using thickness measures only, the best prediction rate, 85.3%, was obtained by PARD on the left hemisphere. Using both thickness and volume measures, the best prediction rate improved to 87.6%, obtained by applying ARD to the left hemisphere data. In all cases, the prediction rates improved after including the 15 additional volume measures in the analyses, indicating that both cortical and subcortical changes were related to AD.
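The evaluation protocol can be sketched as a plain k-fold loop that reports the mean and standard deviation of the per-fold test error. This is a minimal illustration with unstratified, randomly shuffled folds (the text does not say how folds were formed); `fit` and `predict` are placeholders for any classifier, and the midpoint classifier below is only a toy stand-in.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs: Sequence, ts: Sequence[int],
                   fit: Callable, predict: Callable,
                   k: int = 10, seed: int = 0) -> Tuple[float, float]:
    """Return (mean, std) of the test error rate over k folds."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ts[i] for i in train_idx])
        wrong = sum(predict(model, xs[i]) != ts[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return statistics.mean(errors), statistics.stdev(errors)


def fit_midpoint(xs: Sequence[float], ts: Sequence[int]) -> float:
    """Toy 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, t in zip(xs, ts) if t == 1]
    neg = [x for x, t in zip(xs, ts) if t == -1]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2.0


def predict_midpoint(threshold: float, x: float) -> int:
    return 1 if x > threshold else -1
```

With 378 participants and k = 10, this split gives eight folds of 38 and two folds of 37.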

Table 2
Performance comparison. Training and testing error rates (mean±std) of 10-fold cross validation are shown for SVM, PARD (Predictive ARD) and ARD.

A linear classifier is usually characterized by a weight vector w, which projects each individual data point (i.e., a feature vector) into a 1-D space to obtain a discriminative value. Each weight measures the contribution of the corresponding feature to the final discriminative value. ARD and PARD aim to reduce the number of nonzero weights so that the relevant features can be identified by examining these weights. For consistency, we always visualize negative weights −w so that larger values (red) correspond to more grey matter in HC. Fig. 1(a–b) shows the heat maps of PARD weights −w in cortical thickness analysis for one run of 10-fold cross-validation for both hemispheres. The weight vectors (i.e., columns in the map) derived by different trials in cross-validation are very similar. Most weights are close to zero, indicating a small number of relevant imaging markers. While entorhinal cortex (EntCtx) appears to be a strong predictor on both sides, rostral middle frontal gyri (RostMidFrontal) are strong only on the left and inferior parietal gyri (InfParietal) only on the right.
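The projection and the reading of weights across cross-validation trials can be sketched as follows. The ranking rule (by absolute mean weight over folds) is an assumption made for illustration, and the function names, feature names and numbers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def discriminant(w: Sequence[float], x: Sequence[float]) -> float:
    """Project a feature vector onto the weight vector: w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def rank_by_mean_weight(names: Sequence[str],
                        fold_weights: Sequence[Sequence[float]]
                        ) -> List[Tuple[str, float, int]]:
    """Average each feature's weight over folds, then rank by |mean weight|."""
    means = [mean(column) for column in zip(*fold_weights)]
    order = sorted(range(len(names)), key=lambda j: abs(means[j]), reverse=True)
    return [(names[j], means[j], rank) for rank, j in enumerate(order, start=1)]


# Illustrative weights from two folds for three features (made-up values).
names = ["roi_a", "roi_b", "roi_c"]
fold_weights = [[-0.5, 0.0, 0.25],
                [-1.5, 0.0, 0.75]]
ranking = rank_by_mean_weight(names, fold_weights)
```

A feature whose weight is zero in every fold, such as `roi_b` here, contributes nothing to the discriminative value and lands at the bottom of the ranking.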

These weights can be back-projected to the original image space for an intuitive visualization. Fig. 2(c–d) shows such a visualization for PARD and ARD results using thickness data. Since we only examine the mean thickness of each cortical subregion in our analysis, the entire region is painted with the same color defined by the corresponding weight. The patterns of imaging marker selection by PARD and ARD are very similar to each other. For comparison, surface-based GLM analysis using SurfStat was also performed to examine the diagnosis effect (HC-AD) on cortical thickness; Fig. 2(a–b) shows the resulting T-map and P-map. Regions with the strongest signals, such as the entorhinal cortex on both sides and the left middle temporal gyri, are picked up by both GLM and ARD/PARD. While the GLM P-map returns significant regions across the entire cortex, PARD/ARD maps provide a small number of selective regions with predictive power.
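Back-projection of region-level weights can be sketched as a lookup from each surface vertex's region label to that region's weight. The sketch assumes a per-vertex list of region labels (as a cortical parcellation would provide) and plots −w as in the text; the function name, labels and weights below are hypothetical.

```python
from typing import Dict, List, Sequence


def back_project(vertex_labels: Sequence[str],
                 region_weight: Dict[str, float],
                 default: float = 0.0) -> List[float]:
    """Give every vertex the value -w of the region it belongs to."""
    return [-region_weight.get(label, default) for label in vertex_labels]


# Five vertices across three regions; one region was pruned (no weight stored).
vertex_labels = ["roi_a", "roi_a", "roi_b", "roi_c", "roi_c"]
region_weight = {"roi_a": -1.0, "roi_c": 0.5}
vertex_values = back_project(vertex_labels, region_weight)
```

Because every vertex in a region receives the same value, the resulting map is piecewise constant over the parcellation.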

Fig. 2
(a–b) GLM results of diagnosis effect (HC-AD) on cortical thickness include (a) the map of the t statistics and (b) the map of corrected P values for peaks and clusters (only regions with corrected p ≤ 0.01 are shown), where positive t ...

Heat maps of ARD/PARD weights −w in combined thickness and volume analyses are shown in Fig. 1(c–f). Again, the patterns are very similar between ARD and PARD. Shown in Table 3 are top imaging markers selected by ARD using thickness and volume measures (PARD data not shown but extremely similar to ARD) and by PARD using thickness measures (ARD data not shown but extremely similar to PARD). While most top markers are thickness measures from cortical regions, two markers are volume measures from subcortical structures including hippocampus and amygdala.

Table 3
Top imaging markers: “mean weight, rank” shown in each cell.

4 Discussion

We presented a novel application of sparse Bayesian learning methods, ARD and PARD, to AD classification. Our strategy was to minimize the complexity of both data and methods in order to derive a simple model that is easy to interpret. For methods, we focused on linear classifiers and showed that ARD/PARD in general outperformed SVM. For data, we focused on summary statistics (i.e., thickness and volume) of anatomically meaningful grey matter regions across the whole brain, and showed that promising prediction accuracy (87.6%) could be achieved with a small number of relevant imaging measures. Most prior studies (e.g., [1, 3, 4]) performed feature selection/extraction before classification. Our method integrated feature selection into the learning process to form a simple and principled procedure. Prior research [6] also integrated feature selection into classification and reported lower prediction rates (77–82%) when analyzing a subset of the same ADNI MRI data. Comparison to other feature selection schemes merits further investigation. While some prior studies [3, 4, 7, 9] reported better prediction rates, they analyzed many more imaging variables in much smaller data sets. One interesting future topic is to apply our method to more detailed imaging features to determine whether better prediction rates and refined imaging marker maps can be achieved. It is unclear whether disease duration of AD is comparable between the ADNI cohort examined by us and by [1, 4, 6] and the other cohorts examined by [3, 7, 9], and this could have an effect on prediction rates. Incorporating disease duration in predictive models warrants further investigation.

To sum up, the contributions of this work include: (1) a simple and unified learning method that inherently performs feature selection and enables biomarker discovery while maintaining high predictive power; (2) a much larger AD sample tested with far fewer variables, resulting in better power; and (3) promising rates for predicting mild AD with identified biomarkers that are known to be related to AD.


Acknowledgments

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (U01 AG024904). This project was also supported in part by Indiana CTSI IUSM/CTR(RR025761), 1RC 2AG036535, NIA R01 AG19771, Foundation for the NIH, IEDC #87884, NIBIB R03 EB008674, U01 AG032984, and P30 AG10133.


References

1. Batmanghelich N, Taskar B, Davatzikos C. A general and unifying framework for feature construction, in image-based pattern classification. Inf Process Med Imaging. 2009;21:423–34.
2. Dale A, Fischl B, Sereno M. Cortical surface-based analysis. I. Segmentation and surface reconstruction. Neuroimage. 1999;9(2):179–94.
3. Duchesne S, Caroli A, Geroldi C, Barillot C, Frisoni GB, Collins DL. MRI-based automated computer classification of probable AD versus normal controls. IEEE Trans Med Imaging. 2008;27(4):509–20.
4. Fan Y, Batmanghelich N, Clark CM, Davatzikos C. Spatial patterns of brain atrophy in MCI patients, identified via high-dimensional pattern classification, predict subsequent cognitive decline. Neuroimage. 2008;39(4):1731–43.
5. Fischl B, Sereno M, Dale A. Cortical surface-based analysis. II: Inflation, flattening, and a surface-based coordinate system. Neuroimage. 1999;9(2):195–207.
6. Hinrichs C, Singh V, et al. Spatially augmented LP boosting for AD classification with evaluations on the ADNI dataset. Neuroimage. 2009;48(1):138–49.
7. Kloppel S, Stonnington CM, et al. Automatic classification of MR scans in Alzheimer’s disease. Brain. 2008;131(Pt 3):681–9.
8. MacKay DJ. Bayesian interpolation. Neural Computation. 1992;4(3):415–447.
9. Magnin B, Mesrob L, Kinkingnehun S, et al. Support vector machine-based classification of Alzheimer’s disease from whole-brain anatomical MRI. Neuroradiology. 2009;51(2):73–83.
10. Minka TP. Expectation propagation for approximate Bayesian inference. 17th Conf. in Uncertainty in Artificial Intelligence; 2001. pp. 362–369.
11. Mueller SG, Weiner MW, et al. The Alzheimer’s disease neuroimaging initiative. Neuroimaging Clin N Am. 2005;15(4):869–77. xi–xii.
12. Neal RM. Bayesian Learning for Neural Networks. Lecture Notes in Statistics, 118. Springer; New York: 1996.
13. Qi Y, Minka T, et al. Predictive automatic relevance determination by expectation propagation. 21st Int. Conf. on Machine learning; 2004. pp. 671–678.
14. Shen L, Kim S, et al. Whole genome association study of brain-wide imaging phenotypes for identifying quantitative trait loci in MCI and AD: A study of the ADNI cohort. Neuroimage. 2010 doi: 10.1016/j.neuroimage.2010.01.042.
15. Vapnik V. Statistical Learning Theory. John Wiley and Sons; 1998.
16. Worsley KJ. SurfStat.
17. Worsley KJ, Andermann M, Koulis M, et al. Detecting changes in non-isotropic images. Human Brain Mapping. 1999;8:98–101.