


Inf Process Med Imaging. Author manuscript; available in PMC 2017 September 21.

Published online 2017 May 23. doi: 10.1007/978-3-319-59050-9_15

PMCID: PMC5607873. NIHMSID: NIHMS858944

Jie Zhang,^{1,*} Qingyang Li,^{1,*} Richard J. Caselli,^{2} Paul M. Thompson,^{3} Jieping Ye,^{4} and Yalin Wang^{1}

Alzheimer’s Disease (AD) is the most common type of dementia. Identifying the correct biomarkers may help detect pre-symptomatic AD subjects and enable early intervention. Recently, multi-task sparse feature learning has been successfully applied to many computer vision and biomedical informatics problems. It aims to improve generalization performance by exploiting the features shared among different tasks. However, most existing algorithms are formulated as supervised learning schemes, which suffer from either insufficient feature numbers or missing label information. To address these challenges, we formulate an unsupervised framework for multi-task sparse feature learning based on a novel dictionary learning algorithm. To solve the unsupervised learning problem, we propose a two-stage Multi-Source Multi-Target Dictionary Learning (MMDL) algorithm. In stage 1, we propose a multi-source dictionary learning method that utilizes the common and individual sparse features across different time slots. In stage 2, supported by a rigorous theoretical analysis, we develop a multi-task learning method to solve the missing label problem. Empirical studies on an *N* = 3970 longitudinal brain image data set, which involves 2 sources and 5 targets, demonstrate the improved prediction accuracy and speed of MMDL in comparison with other state-of-the-art algorithms.

Alzheimer’s disease (AD) is the most common type of dementia. It is a slowly progressive neurodegenerative disorder leading to loss of memory and reduction of cognitive function. Many clinical/cognitive measures, such as the Mini Mental State Examination (MMSE) and the Alzheimer’s Disease Assessment Scale cognitive subscale (ADAS-Cog), have been designed to evaluate a subject’s cognitive decline. Subjects are commonly divided into three groups: AD, Mild Cognitive Impairment (MCI) and Cognitively Unimpaired (CU), defined clinically on the basis of behavioral measures and the assessments above. It is crucial to predict AD-related cognitive decline so that early intervention or prevention becomes possible. Prior research has shown that measures from brain magnetic resonance (MR) images correlate closely with cognitive changes and have great potential to provide early diagnostic markers that predict cognitive decline presymptomatically in a sufficiently rapid and rigorous manner.

The main challenge in AD diagnosis or prognosis with neuroimaging arises from the fact that the data dimensionality is intrinsically high while only a small number of samples are available. In this regard, machine learning has played a pivotal role in overcoming this so-called “large *p*, small *n*” problem. A dictionary allows us to represent the original features as a superposition of a small number of its elements, reducing a high-dimensional image to a small number of features. Dictionary learning [8] has been proposed to use a small number of basis vectors to represent local features effectively and concisely and to support image content analysis. However, most existing work on dictionary learning focuses on predicting a target at a single time point [19] or in some region of interest [18]. In general, a joint analysis of tasks from multiple sources is expected to improve performance but remains a challenging problem.

Multi-Task Learning (MTL) has been successfully explored for regression across different time slots. The idea of multi-task learning is to utilize the intrinsic relationships among multiple related tasks in order to improve prediction performance. One way of modeling multi-task relationships is to assume all tasks are related and the task models are connected to each other [6], or that the tasks are clustered into groups [21]. Alternatively, one can assume that tasks share a common subspace [4] or a common set of features [1]. Recently, Maurer *et al.* [12] proposed a sparse coding model for MTL problems based on generative methods. In this paper, we propose a novel unsupervised multi-source dictionary learning method that learns the different tasks simultaneously, utilizing shared and individual dictionaries to encode both consistent and individual imaging features for longitudinal image data analysis.

Although a general unsupervised dictionary learning method may overcome the missing label problem when obtaining the sparse features, we still need to predict labels at different time points after the sparse features are learned. A straightforward method is to perform linear regression at each time point and determine the weight matrix *W* separately. However, even with a common dictionary that models the relationship among different tasks, prediction purely based on linear regression treats all tasks independently and ignores the useful information preserved in the change along the time continuum, leaving a strong bias when predicting clinical scores at multiple future time points.

To exploit the correlations among the cognitive scores, several multi-task models have been put forward. Wang *et al.* [14] proposed a sparse multi-task regression and feature selection method to jointly analyze neuroimaging and clinical data in predicting memory performance. Zhang and Shen [17] exploited an *l*_{2,1}-norm based group sparse regression method to select features that can jointly represent the clinical status and two clinical scores (MMSE and ADAS-Cog). Xiang *et al.* [16] proposed a sparse regression-based feature selection method for AD/MCI diagnosis that maximally utilizes features from multiple sources by focusing on the missing modality problem. However, the clinical scores of many patients are missing at some time points, i.e., the target vector *y _{i}* may be incomplete, and the above methods all fail to model this issue. A simple strategy is to remove all patients with missing target values; it, however, significantly reduces the number of samples. Zhou *et al.* [21] proposed the Temporal Group Lasso (TGL) progression model, which takes missing labels into account.

In this paper, we propose a novel integrated unsupervised framework, termed the Multi-Source Multi-Target Dictionary Learning (MMDL) algorithm, in which we utilize shared and individual dictionaries to encode both consistent and changing imaging features along longitudinal time points. Meanwhile, we formulate clinical score prediction at different time points as multi-task learning and overcome missing target values in the training process. The pipeline of our method is illustrated in Fig. 1. We evaluate the proposed framework on *N* = 3970 longitudinal images from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database and use longitudinal hippocampal surface features to predict future cognitive scores. Our experimental results outperform other state-of-the-art methods and demonstrate the effectiveness of the proposed algorithm.

The pipeline of our method. We extracted hippocampi from MRI scans (a), registered hippocampal surfaces (b) and computed surface multivariate morphometry statistics (c). Image patches were extracted from the surface maps to initialize the dictionary.

Our main contributions can be summarized as threefold. First, we consider the variance of subjects across different time points (multi-source) and propose an unsupervised dictionary learning method in stage 1 of the MMDL algorithm, in which not only does a patient share features between different time slots but different patients share common features within the same time point. We also explore the relationship between the shared and individual dictionaries in stage 1. Second, we use the sparse features learned from dictionary learning as input and multiple future clinical scores as corresponding labels (multi-target) to train the multi-task prediction model in stage 2 of the MMDL algorithm. To the best of our knowledge, this is the first learning model that unifies multiple source inputs and multiple target outputs with dictionary learning for brain imaging analysis. Lastly, we take the incomplete label problem into account: we handle missing labels during the regression process and theoretically prove the correctness of the regression model. Our extensive experimental results on the ADNI dataset show that MMDL achieves faster running speed, lower estimation errors, and reasonable prediction scores compared with other state-of-the-art algorithms.

Given subjects from *T* time points: {*X*_{1}, *X*_{2}, …, *X _{T}*}, our goal is to learn a set of sparse codes {*Z*_{1}, *Z*_{2}, …, *Z _{T}*}, one for each time point.

For the subject matrix *X _{t}* of a particular time point, MMDL learns a dictionary ${D}_{t}=[{\widehat{D}}_{t},{\overline{D}}_{t}]$, where ${\widehat{D}}_{t}$ is the part shared across all time points and ${\overline{D}}_{t}$ is the individual part for time point *t*. The objective function is:

$$\underset{{Z}_{1},\cdots ,{Z}_{T}}{\underset{{D}_{1},\cdots ,{D}_{T}\in {\mathit{\Psi}}_{t}}{min}}\phantom{\rule{0.16667em}{0ex}}\sum _{t=1}^{T}\frac{1}{2}{\Vert {X}_{t}-[{\widehat{D}}_{t},{\overline{D}}_{t}]{Z}_{t}\Vert}_{F}^{2}+\lambda \sum _{t=1}^{T}{\Vert {Z}_{t}\Vert}_{1},\text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to:}\phantom{\rule{0.16667em}{0ex}}{\widehat{D}}_{1}=\cdots ={\widehat{D}}_{T}$$

(1)

where ${\mathit{\Psi}}_{t}$ is the feasible set of dictionaries at time point *t*, whose columns are constrained to at most unit *ℓ*_{2} norm, and the constraint ${\widehat{D}}_{1}=\cdots ={\widehat{D}}_{T}$ enforces a common shared dictionary across all time points.
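To make the objective concrete, here is a minimal numpy sketch of evaluating Eq. (1); `mmdl_objective` and the toy shapes are hypothetical, and sharing one `D_shared` array across time points enforces the equality constraint by construction:

```python
import numpy as np

def mmdl_objective(X_list, D_shared, D_indiv_list, Z_list, lam):
    """Evaluate the Eq. (1) objective: per-time-point reconstruction
    error with dictionary [D_hat_t, D_bar_t], plus an l1 penalty.
    Using the same D_shared object for every t enforces
    D_hat_1 = ... = D_hat_T by construction."""
    total = 0.0
    for X_t, D_bar_t, Z_t in zip(X_list, D_indiv_list, Z_list):
        D_t = np.hstack([D_shared, D_bar_t])          # [D_hat_t, D_bar_t]
        total += 0.5 * np.linalg.norm(X_t - D_t @ Z_t, 'fro') ** 2
        total += lam * np.abs(Z_t).sum()
    return total

# toy check: zero data with zero codes gives a zero objective
rng = np.random.default_rng(0)
X = [np.zeros((4, 3)) for _ in range(2)]
D_shared = rng.standard_normal((4, 5))
D_indiv = [rng.standard_normal((4, 5)) for _ in range(2)]
Z = [np.zeros((10, 3)) for _ in range(2)]
print(mmdl_objective(X, D_shared, D_indiv, Z, lam=0.1))  # → 0.0
```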

Fig. 2 illustrates the framework of MMDL with ADNI subjects from three different time points, represented as *X*_{1}, *X*_{2} and *X*_{3}, respectively. Through the multi-source dictionary learning stage of MMDL, we obtain the dictionary and sparse codes for the subjects of each time point *t*: *D _{t}* and *Z _{t}*.

Illustration of the learning process of MMDL on ADNI datasets from multiple time points to predict clinical scores at multiple future time points.

The initialization of the dictionaries in MMDL is critical to the whole learning process. We propose a random patch method to initialize the dictionaries of the different time points. The main idea of the random patch method is to randomly select *l* image patches from the *n* subjects {*x*_{1}, *x*_{2}, …, *x _{n}*} to construct the initial dictionaries.
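A minimal sketch of such a random patch initialization, under the assumption (not spelled out in the text) that a patch is a contiguous slice of a subject's feature vector, normalized to unit norm; `random_patch_init` is a hypothetical name:

```python
import numpy as np

def random_patch_init(subjects, patch_size, l, rng=None):
    """Initialize a dictionary with l randomly sampled patches.
    Each subject is a 1-D feature vector; a 'patch' is a contiguous
    slice of length patch_size, normalized to unit l2 norm."""
    rng = rng or np.random.default_rng()
    cols = []
    for _ in range(l):
        x = subjects[rng.integers(len(subjects))]     # random subject
        start = rng.integers(len(x) - patch_size + 1) # random offset
        patch = x[start:start + patch_size].astype(float)
        norm = np.linalg.norm(patch)
        cols.append(patch / norm if norm > 0 else patch)
    return np.stack(cols, axis=1)                     # (patch_size, l)

subjects = [np.random.rand(400) for _ in range(5)]
D0 = random_patch_init(subjects, patch_size=100, l=32)
print(D0.shape)  # (100, 32)
```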

In the longitudinal AD study, we measure the cognitive scores of the selected patients at multiple time points. Instead of treating the prediction of cognitive scores at each time point as a separate regression task, we formulate the prediction of clinical scores at multiple future time points as a multi-task regression problem. We employ multi-task regression formulations rather than solving a set of independent regression problems because the intrinsic temporal smoothness among the tasks can be incorporated into the model as prior knowledge. However, the clinical scores of many patients are missing at some time points, especially in the 36- and 48-month ADNI data. It is therefore necessary to formulate a multi-task regression problem with missing target values to predict clinical scores.

In this paper, we use a matrix $\mathit{\Theta}\in {\mathbb{R}}^{{m}_{t}\times {n}_{t}}$ to indicate missing target values, where an entry of *Θ* is 0 if the corresponding label in *Y _{t}* is missing and 1 otherwise. The stage-2 objective is then:

$$\underset{{W}_{1},\cdots ,{W}_{T}}{min}\sum _{t=1}^{T}{\Vert \mathit{\Theta}({Y}_{t}-{W}_{t}{Z}_{t})\Vert}_{F}^{2}+\xi \sum _{t=1}^{T}{\Vert {W}_{t}\Vert}_{F}^{2}$$

(2)

**Algorithm 1.** The MMDL algorithm.

Input: samples and corresponding labels from different time points: {*X*_{1}, *X*_{2}, …, *X _{T}*} and {*Y*_{1}, *Y*_{2}, …, *Y _{T}*}

Output: the models for the different time points: {*W*_{1}, …, *W _{T}*}

1: **Stage 1: Multi-Source Dictionary Learning**

2: **for** *k* = 1 to *κ* **do**

3: For each image patch *x _{t}*(*i*) from sample *X _{t}*, *i* ∈ {1, …, *n _{t}*} and *t* ∈ {1, …, *T*}:

4: Update ${\widehat{D}}_{t}^{k}$: ${\widehat{D}}_{t}^{k}=\mathit{\Phi}$.

5: Update ${z}_{t}^{k+1}(i)$ and index set ${I}_{t}^{k+1}(i)$ by a few steps of CCD:

6: $[{z}_{t}^{k+1}(i),{I}_{t}^{k+1}(i)]=\mathit{CCD}({\widehat{D}}_{t}^{k},{\overline{D}}_{t}^{k},{x}_{t}(i),{I}_{t}^{k}(i),{z}_{t}^{k}(i))$.

7: Update ${\widehat{D}}_{t}$ and ${\overline{D}}_{t}$ by one step of SGD:

8: $[{\widehat{D}}_{t}^{k+1},{\overline{D}}_{t}^{k+1}]=\mathit{SGD}({\widehat{D}}_{t}^{k},{\overline{D}}_{t}^{k},{x}_{t}(i),{I}_{t}^{k+1}(i),{z}_{t}^{k+1}(i))$.

9: Normalize ${\widehat{D}}_{t}^{k+1}$ and ${\overline{D}}_{t}^{k+1}$ based on the index set ${I}_{t}^{k+1}(i)$.

10: Update the shared dictionary $\mathit{\Phi}$: $\mathit{\Phi}={\widehat{D}}_{t}^{k+1}$.

11: **end for**

12: Obtain the learnt dictionaries and sparse codes: {*D*_{1}, …, *D _{T}*}, {*Z*_{1}, …, *Z _{T}*}.

13: **Stage 2: Multi-Target Regression with Incomplete Labels**

14: **for** *t* = 1 to *T* **do**

15: For the *j*-th column *Y _{t}*(*j*) in *Y _{t}*, compute the *j*-th model *w _{t}*(*j*) in *W _{t}*:

16: ${w}_{t}(j)={({\stackrel{\sim}{Z}}_{t}{\stackrel{\sim}{Z}}_{t}^{T}+\xi I)}^{-1}{\stackrel{\sim}{Z}}_{t}{\stackrel{\sim}{Y}}_{t}(j)$

17: **end for**

Although Eqn (2) involves missing label values, we show that it has a closed-form solution and present the theoretical analysis of stage 2 as follows:

*For the data matrix pair* (*Z _{t}*, *Y _{t}*) *at time point t, let* ${\stackrel{\sim}{Z}}_{t}$ *and* ${\stackrel{\sim}{Y}}_{t}(j)$ *denote the sparse codes and the j-th target column restricted to the samples whose labels are observed (as indicated by Θ). Then, for each column j, Eqn* (2) *is equivalent to:*

$$\underset{{w}_{t}(j)}{min}{\Vert ({\stackrel{\sim}{Y}}_{t}(j)-{w}_{t}(j){\stackrel{\sim}{Z}}_{t})\Vert}_{2}^{2}+\xi {\Vert {w}_{t}(j)\Vert}_{2}^{2}$$

(3)

Eqn (3) is known as ridge regression [7]. To optimize the problem, we calculate the gradient and set it to zero, which yields the optimal *w _{t}*(*j*):

$$\begin{array}{c}2{\stackrel{\sim}{Z}}_{t}({\stackrel{\sim}{Z}}_{t}^{T}{w}_{t}(j)-{\stackrel{\sim}{Y}}_{t}(j))+2\xi {w}_{t}(j)=0,\\ {\stackrel{\sim}{Z}}_{t}{\stackrel{\sim}{Z}}_{t}^{T}{w}_{t}(j)-{\stackrel{\sim}{Z}}_{t}{\stackrel{\sim}{Y}}_{t}(j)+\xi {w}_{t}(j)=0,\\ ({\stackrel{\sim}{Z}}_{t}{\stackrel{\sim}{Z}}_{t}^{T}+\xi I){w}_{t}(j)={\stackrel{\sim}{Z}}_{t}{\stackrel{\sim}{Y}}_{t}(j),\\ {w}_{t}(j)={({\stackrel{\sim}{Z}}_{t}{\stackrel{\sim}{Z}}_{t}^{T}+\xi I)}^{-1}{\stackrel{\sim}{Z}}_{t}{\stackrel{\sim}{Y}}_{t}(j).\end{array}$$

After solving *w _{t}*(*j*) for every column *j*, we obtain the model *W _{t}* for each time point *t*.
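The stage-2 closed form with missing labels can be sketched directly; `ridge_with_missing` is a hypothetical helper that, for each target column, keeps only the observed samples (the Z̃, Ỹ of the text) and applies the formula above:

```python
import numpy as np

def ridge_with_missing(Z, Y, mask, xi):
    """Closed-form ridge per target column with missing labels.
    Z: (d, n) sparse codes; Y: (n, c) targets; mask: (n, c) boolean,
    True where the label is observed. For each column j we restrict
    to the observed samples (Z_tilde, Y_tilde) and solve
    w = (Z Z^T + xi I)^{-1} Z y."""
    d, _ = Z.shape
    W = np.zeros((d, Y.shape[1]))
    for j in range(Y.shape[1]):
        obs = mask[:, j]
        Zt = Z[:, obs]                               # Z_tilde
        yt = Y[obs, j]                               # Y_tilde(j)
        W[:, j] = np.linalg.solve(Zt @ Zt.T + xi * np.eye(d), Zt @ yt)
    return W

# toy check: with noiseless targets and tiny xi, we recover w_true
rng = np.random.default_rng(1)
Z = rng.standard_normal((6, 50))
w_true = rng.standard_normal(6)
Y = (Z.T @ w_true)[:, None]
mask = rng.random((50, 1)) > 0.3                     # ~30% labels missing
W = ridge_with_missing(Z, Y, mask, xi=1e-6)
print(np.allclose(W[:, 0], w_true, atol=1e-3))       # True
```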

Our MMDL algorithm is summarized in Algorithm 1. *k* denotes the epoch number, where *k* ∈ {1, …, *κ*}. *Φ* represents the shared part of each dictionary *D _{t}* and is initialized by the random patch method. For each image patch *x _{t}*(*i*), we alternately update the sparse codes and the dictionaries as follows.

After we pick an image patch *x _{t}*(*i*), we fix the dictionaries and update the sparse code *z _{t}*(*i*) by optimizing:

$$\underset{{z}_{t}(i)}{min}F({z}_{t}(i))=\frac{1}{2}{\Vert {x}_{t}(i)-[{\widehat{D}}_{t},{\overline{D}}_{t}]{z}_{t}(i)\Vert}_{2}^{2}+\lambda {\Vert {z}_{t}(i)\Vert}_{1}.$$

(4)

This is known as the Lasso problem [13]. Coordinate descent [3] is one of the state-of-the-art methods for solving it, and in this study we perform cyclic coordinate descent (CCD) to optimize Eqn (4). Empirically, the iterations may take thousands of steps to converge, which is time-consuming in the optimization process of dictionary learning. However, we observed that after a few steps the support of the coordinates, i.e., the locations of the non-zero entries in *z _{t}*(*i*), becomes stable. Based on this observation, we split the optimization into two phases:

- (a) Perform *P* steps of CCD to update the locations of the non-zero entries ${I}_{t}^{k+1}(i)$ and the model ${z}_{t}^{k+1}(i)$.
- (b) Perform *S* steps of CCD to update ${z}_{t}^{k+1}(i)$ on the index set ${I}_{t}^{k+1}(i)$.

In (a), at each CCD step we pick the *j*-th coordinate and update the model ${z}_{t}^{k+1}{(i)}_{j}$ by:

$$g={[{\widehat{D}}_{t}^{k},{\overline{D}}_{t}^{k}]}_{j}^{T}(\mathit{\Omega}([{\widehat{D}}_{t}^{k},{\overline{D}}_{t}^{k}],{z}_{t}^{k}(i),{I}_{t}^{k}(i))-{x}_{t}(i)),$$

(5)

$${z}_{t}^{k+1}{(i)}_{j}={\mathit{\Gamma}}_{\lambda}({z}_{t}^{k}{(i)}_{j}-g),$$

(6)

where *Ω* is a sparse matrix multiplication function with three input parameters. Taking *Ω*(*A*, *b*, *I*) as an example, *A* denotes a matrix, *b* a vector, and *I* an index set that records the locations of the non-zero entries in *b*. The return value of *Ω* is defined as *Ω*(*A*, *b*, *I*) = *Ab*. When multiplying *A* and *b*, we only touch the non-zero entries of *b* and the corresponding columns of *A* given by the index set *I*, speeding up the calculation by exploiting the sparsity of *b*. *Γ* is the soft-thresholding shrinkage function [5], defined by:

$${\mathit{\Gamma}}_{\phi}(x)=\mathit{sign}(x){(\mid x\mid -\phi )}_{+}.$$

(7)

At the end of (a), we count the non-zero entries in ${z}_{t}^{k+1}(i)$ and store the non-zero indices in ${I}_{t}^{k+1}(i)$. In (b), we perform *S* steps of CCD considering only the non-zero entries of ${z}_{t}^{k+1}(i)$. As a result, for each index *μ* in ${I}_{t}^{k+1}(i)$, we calculate the gradient *g* and update ${z}_{t}^{k+1}{(i)}_{\mu}$ by:

$$g={[{\widehat{D}}_{t}^{k},{\overline{D}}_{t}^{k}]}_{\mu}^{T}(\mathit{\Omega}([{\widehat{D}}_{t}^{k},{\overline{D}}_{t}^{k}],{z}_{t}^{k+1}(i),{I}_{t}^{k+1}(i))-{x}_{t}(i)),$$

(8)

$${z}_{t}^{k+1}{(i)}_{\mu}={\mathit{\Gamma}}_{\lambda}({z}_{t}^{k+1}{(i)}_{\mu}-g).$$

(9)
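The CCD updates of Eqs. (5)-(9) can be sketched in a dense, non-optimized form; `ccd_pass` and the toy data are hypothetical, and the sparse bookkeeping via *Ω* and the index sets is omitted for clarity:

```python
import numpy as np

def soft_threshold(x, phi):
    """Gamma_phi(x) = sign(x) * max(|x| - phi, 0): Eq. (7) with the
    positive-part operator made explicit."""
    return np.sign(x) * np.maximum(np.abs(x) - phi, 0.0)

def ccd_pass(D, x, z, lam):
    """One sweep of cyclic coordinate descent for the Lasso in Eq. (4):
    for each coordinate j, compute the gradient g of the smooth term
    (Eq. 5) and soft-threshold (Eq. 6). Assumes D has unit-norm
    columns, so a unit step size is the exact coordinate minimizer."""
    for j in range(len(z)):
        g = D[:, j] @ (D @ z - x)             # Eq. (5), dense version
        z[j] = soft_threshold(z[j] - g, lam)  # Eq. (6)
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 8))
D /= np.linalg.norm(D, axis=0)                # unit-norm columns
x = rng.standard_normal(20)
z = np.zeros(8)
obj0 = 0.5 * np.linalg.norm(x - D @ z) ** 2 + 0.5 * np.abs(z).sum()
for _ in range(30):                           # a few CCD sweeps
    z = ccd_pass(D, x, z, lam=0.5)
obj1 = 0.5 * np.linalg.norm(x - D @ z) ** 2 + 0.5 * np.abs(z).sum()
print(obj1 <= obj0)  # coordinate descent never increases the objective
```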

Since we only focus on the non-zero entries of the model, and *P* is fewer than 10 iterations while *S* is a much larger number, we accelerate the learning of the sparse codes significantly.

We update the dictionaries by fixing the sparse codes. The optimization problem then becomes:

$$\underset{{\widehat{D}}_{t},{\overline{D}}_{t}}{min}F({\widehat{D}}_{t},{\overline{D}}_{t})=\frac{1}{2}{\Vert {x}_{t}(i)-[{\widehat{D}}_{t},{\overline{D}}_{t}]{z}_{t}(i)\Vert}_{2}^{2}$$

(10)

After updating the sparse codes, we already know the non-zero entries of ${z}_{t}^{k+1}(i)$. Another key insight of MMDL is that we only need to update the dictionary columns corresponding to those non-zero entries, rather than all columns of the dictionaries, which accelerates the optimization dramatically: when we update the entry in the *j*-th row and *μ*-th column of the dictionary *D*, its gradient is non-zero only if the *μ*-th entry of the sparse code is non-zero. We accumulate the Hessian matrix *H _{t}* as:

$${H}_{t}^{k+1}={H}_{t}^{k}+{z}_{t}^{k+1}(i){z}_{t}^{k+1}{(i)}^{T}.$$

(11)

We perform one step of SGD to update the dictionaries ${\widehat{D}}_{t}^{k+1}$ and ${\overline{D}}_{t}^{k+1}$. To speed up the computation, we use a vector *R* to store the residual *Dz* − *x*:

$$R=\mathit{\Omega}([{\widehat{D}}_{t}^{k},{\overline{D}}_{t}^{k}],{z}_{t}^{k+1}(i),{I}_{t}^{k+1}(i))-{x}_{t}(i).$$

(12)

For the dictionary entry in the *μ*-th column and *j*-th row, the learning procedure takes the form:

$${[{\widehat{D}}_{t}^{k+1},{\overline{D}}_{t}^{k+1}]}_{j,\mu}={[{\widehat{D}}_{t}^{k},{\overline{D}}_{t}^{k}]}_{j,\mu}-\frac{1}{{H}_{t}^{k+1}(\mu ,\mu )}{z}_{t}^{k+1}{(i)}_{\mu}{R}_{j},$$

(13)

where *μ* is a non-zero index stored in ${I}_{t}^{k+1}(i)$. For the *μ*-th column of the dictionary, we set the learning rate to the inverse of the corresponding diagonal element of the Hessian matrix, $1/{H}_{t}^{k+1}(\mu ,\mu )$. Since ${\widehat{D}}_{t}$ is shared across all time points, the shared dictionary *Φ* is then updated with ${\widehat{D}}_{t}^{k+1}$ (line 10 of Algorithm 1).
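A minimal sketch of this dictionary update (Eqs. 11-13), assuming a single patch, a dense residual in place of *Ω*, and a hypothetical `sgd_dict_step` helper; renormalizing the touched columns mirrors line 9 of Algorithm 1:

```python
import numpy as np

def sgd_dict_step(D, z, x, H_diag, support):
    """One SGD step on the dictionary for a single patch. Only the
    columns in `support` (non-zero entries of the sparse code z) are
    updated; the learning rate for column mu is 1 / H[mu, mu], the
    diagonal of the accumulated Hessian H += z z^T (Eq. 11). Updated
    columns are renormalized to unit length."""
    H_diag = H_diag + z * z                       # diagonal of Eq. (11)
    R = D[:, support] @ z[support] - x            # residual, Eq. (12)
    D = D.copy()
    for mu in support:
        D[:, mu] -= (z[mu] / H_diag[mu]) * R      # Eq. (13)
        D[:, mu] /= np.linalg.norm(D[:, mu])      # keep unit norm
    return D, H_diag

rng = np.random.default_rng(3)
D = rng.standard_normal((16, 6))
D /= np.linalg.norm(D, axis=0)                    # unit-norm columns
z = np.array([1.2, 0.0, -0.7, 0.0, 0.0, 0.3])     # sparse code
support = np.flatnonzero(z)                       # index set I
x = rng.standard_normal(16)
D_new, H = sgd_dict_step(D, z, x, np.ones(6), support)
print(np.allclose(np.linalg.norm(D_new, axis=0), 1.0))  # True
```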

We studied multi-time-point structural MR images from the ADNI baseline (837 subjects) and 6-month (733 subjects) datasets. The responses are the MMSE and ADAS-Cog scores at 5 different time points: M12, M18, M24, M36 and M48. In total we analyzed 3970 images, combining 2 sources and 5 targets. The sample sizes corresponding to the 5 targets are 728, 326, 641, 454 and 251. For the experiments, we used hippocampal surface multivariate statistics [15] as learning features, a 4 × 1 vector on each of the 15,000 vertices of every hippocampal surface.

We built a prediction model for the above datasets using the MMDL algorithm. To train the prediction models, 1102 patches of size 10 × 10 were extracted from the surface mesh structures, so each patch has dimension 400. The model was trained on an Intel(R) Core(TM) i7-6700K CPU with 4.0 GHz processors, 64 GB of globally addressable memory and a single Nvidia GeForce GTX TITAN X GPU. In the experimental setting of stage 1 of MMDL, the sparsity parameter is *λ* = 0.1. We ran 10 epochs with a batch size of 1 and 3 iterations of CCD (*P* is set to 1 and *S* to 3). Once the dictionaries and sparse codes were learned, max-pooling was used to generate the features, giving a 1 × 1000 feature vector for each image. In stage 2, 5-fold cross validation was used to select the model parameter *ξ* on the training data (between 10^{−3} and 10^{3}).
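The max-pooling step can be sketched in one line; `max_pool_features` is a hypothetical name, and pooling the absolute values of the codes is an assumption:

```python
import numpy as np

def max_pool_features(Z_patches):
    """Max-pooling over the patch-level sparse codes of one image:
    Z_patches has shape (dict_size, n_patches); the pooled image
    feature is the element-wise maximum of |z| over all patches,
    giving one dict_size-dimensional vector per image."""
    return np.abs(Z_patches).max(axis=1)

Z_img = np.random.rand(1000, 1102)   # 1102 patches, dictionary size 1000
feat = max_pool_features(Z_img)
print(feat.shape)  # (1000,)
```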

To evaluate the model, we randomly split the data into training and testing sets using a 9:1 ratio and used 10-fold cross validation to avoid data bias. We evaluated the overall regression performance using the weighted correlation coefficient (wR), and the root mean square error (rMSE) as the task-specific performance measure. The two measures are defined as
$wR(Y,\widehat{Y})={\sum}_{i=1}^{t}\mathit{Corr}({Y}_{i},{\widehat{Y}}_{i}){n}_{i}/{\sum}_{i=1}^{t}{n}_{i}$ and $\mathit{rMSE}(y,\widehat{y})=\sqrt{{\Vert y-\widehat{y}\Vert}_{2}^{2}/n}$, where, for wR, *Y _{i}* is the ground truth for task *i*, ${\widehat{Y}}_{i}$ is the corresponding prediction, and *n _{i}* is the number of samples of task *i*; for rMSE, *y* is the ground truth of a single task and *ŷ* the corresponding prediction over its *n* samples.
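The two measures translate directly into code; `rmse` and `weighted_r` are hypothetical names:

```python
import numpy as np

def rmse(y, y_hat):
    """rMSE(y, y_hat) = sqrt(||y - y_hat||_2^2 / n)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.sum((y - y_hat) ** 2) / len(y)))

def weighted_r(Y_list, Yhat_list):
    """wR: per-task Pearson correlation, weighted by task size n_i."""
    num = den = 0.0
    for y, y_hat in zip(Y_list, Yhat_list):
        num += np.corrcoef(y, y_hat)[0, 1] * len(y)
        den += len(y)
    return num / den

y = np.array([20.0, 24.0, 27.0, 30.0])
print(rmse(y, y))                            # 0.0
print(round(weighted_r([y], [y + 1.0]), 6))  # 1.0 (perfect correlation)
```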

We compared MMDL with multiple state-of-the-art methods: ODL-L, single-task online dictionary learning [11] followed by Lasso; L21, the multi-task *L*_{2,1}-norm regularization with least squares loss [1]; TGL, the multi-task disease progression model Temporal Group Lasso [21]; as well as Ridge and Lasso. For parameter selection, we used the same method as in the experimental setting of our stage 2.

In stage 1 of MMDL, the common dictionary is assumed to be shared by the different tasks, so it is necessary to evaluate what an appropriate size for this common dictionary is. We therefore fixed the dictionary size at 1000 and partitioned the dictionary in different proportions: 125:875, 250:750, 500:500, 750:250 and 875:125, where the left number is the size of the common dictionary and the right one the size of the individual dictionary for each task. Fig. 3 shows the rMSE of the MMSE and ADAS-Cog predictions. As Fig. 3 shows, the rMSE of both MMSE and ADAS-Cog is lowest when we split the dictionary in half, which suggests that the common and individual dictionaries are of equal importance in the multi-task learning. In all experiments, we use the 500:500 split for the common and individual dictionaries, so the dimension of each sparse code in MMDL is 1000.

We compared the efficiency of the proposed MMDL with state-of-the-art online dictionary learning (ODL). In this experiment, we focus on the single batch setting, i.e., we process one image patch in each iteration, and vary the dictionary size over 500, 1000 and 2000. For MMDL, the ratio between the common dictionary and the individual parts is 1:1. We report the results in Table 1. We observe that MMDL uses less time than ODL, and as the dictionary size increases, MMDL becomes more efficient and achieves a higher speedup over ODL.

We report the results of MMDL and the other methods on the MMSE prediction model for the ADNI group in Table 2. The proposed MMDL outperformed ODL-L, Lasso and Ridge in terms of both rMSE and the correlation coefficient wR at four different time points. The results of Lasso and Ridge are very close, while the sparse coding methods are superior to both. Among the sparse coding models, MMDL obtained lower rMSE and higher correlation than the traditional sparse coding method ODL-L because we consider the correlation between different time slots across tasks and the relationship among the different time points of the same patient. We also notice MMDL's significant accuracy improvement at later time points. This may be due to the data sparseness at later time points, as sparsity-inducing models are expected to achieve better prediction performance in this case.

We followed the same experimental procedure as in the MMSE study and explored the prediction model for the ADAS-Cog scores. The prediction performance is shown in Table 3. The best performance in predicting ADAS-Cog scores is achieved by MMDL at four time points.

Compared with L21, after MMDL handles the missing labels, the results are more consistent, reasonable and accurate. Because the sample sizes at M36 and M48 are small, it is hard to learn a complete model there. TGL also considers the issue of missing labels; however, MMDL still achieved better results because it incorporates data from multiple sources and uses common and individual dictionaries. Although the results of MMDL still show some bias, MMDL achieved the best results among the six compared methods in predicting both MMSE and ADAS-Cog, which shows that our method is more effective at dealing with missing data.

We show scatter plots of the predicted versus actual values of MMSE and ADAS-Cog at M12 and M48 in Fig. 4. In the scatter plots, the predicted values and the actual clinical scores are highly correlated, and the prediction performance for ADAS-Cog is better than that for MMSE.

In this paper, we propose a novel Multi-Source Multi-Target Dictionary Learning algorithm for modeling cognitive decline, which allows simultaneous selection of a common set of biomarkers for multiple time points and specific sets of biomarkers for individual time points using dictionary learning. We treat the prediction of future clinical scores as a multi-task problem and deal with the missing label problem. The effectiveness of the proposed progression model is supported by extensive experimental studies, whose results demonstrate that it is more effective than other state-of-the-art methods. In the future, we will extend our algorithm to multi-modality data and develop more complete multi-source, multi-target algorithms.

The research was supported in part by NIH (R21AG049216, RF1AG051710, U54EB020403) and NSF (DMS-1413417, IIS-1421165).

1. Argyriou A, Evgeniou T, Pontil M. Convex multi-task feature learning. Machine Learning. 2008;73(3):243–272.

2. Boureau YL, Ponce J, LeCun Y. A theoretical analysis of feature pooling in visual recognition. Proceedings of the 27th Annual ICML. 2010:111–118.

3. Canutescu AA, Dunbrack RL. Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Science. 2003;12(5):963–972.

4. Chen J, et al. A convex formulation for learning shared structures from multiple tasks. Proceedings of the 26th Annual ICML; ACM; 2009:137–144.

5. Combettes PL, Wajs VR. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation. 2005;4(4):1168–1200.

6. Evgeniou T, Micchelli CA, Pontil M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research. 2005:615–637.

7. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67.

8. Lee H, Battle A, Raina R, Ng AY. Efficient sparse coding algorithms. Advances in Neural Information Processing Systems. 2006:801–808.

9. Lin B, et al. Stochastic coordinate coding and its application for drosophila gene expression pattern annotation. arXiv preprint arXiv:1407.8147. 2014.

10. Lv J, et al. Modeling task fMRI data via supervised stochastic coordinate coding. MICCAI; Springer; 2015:239–246.

11. Mairal J, Bach F, Ponce J, Sapiro G. Online dictionary learning for sparse coding. Proceedings of the 26th Annual ICML; ACM; 2009:689–696.

12. Maurer A, Pontil M, Romera-Paredes B. Sparse coding for multitask and transfer learning. Proceedings of the 30th Annual ICML; Atlanta, GA, USA; 2013:343–351.

13. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological). 1996:267–288.

14. Wang H, et al. Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance. ICCV; IEEE; 2011:557–562.

15. Wang Y, et al. Surface-based TBM boosts power to detect disease effects on the brain: an N=804 ADNI study. NeuroImage. 2011;56(4):1993–2010.

16. Xiang S, et al. Bi-level multi-source learning for heterogeneous block-wise missing data. NeuroImage. 2014;102:192–206.

17. Zhang D, Shen D, et al. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage. 2012;59(2):895–907.

18. Zhang J, et al. Hyperbolic space sparse coding with its application on prediction of Alzheimer’s disease in mild cognitive impairment. MICCAI; Springer; 2016:326–334.

19. Zhang J, et al. Applying sparse coding to surface multivariate tensor-based morphometry to predict future cognitive decline. IEEE 13th International Symposium on Biomedical Imaging (ISBI); IEEE; 2016:646–650.

20. Zhang T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. Proceedings of the 21st Annual ICML; ACM; 2004:116.

21. Zhou J, Liu J, Narayan VA, Ye J. Modeling disease progression via fused sparse group lasso. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM; 2012:1095–1103.
