We were able to assess different clinical scores with respect to the same structural data using RVRs. Our results imply strong linear relationships between DRS, MMSE and ADAS-Cog scores and GM segments of T1-weighted whole brain images, but not with the AVLT. The normalized RMS results verify that the DRS, closely followed by MMSE in set 1, and the ADAS-Cog, closely followed by MMSE in sets 2 and 3, provided the best predictions. Whole brain images gave a better correlation with MMSE, DRS, and ADAS-Cog because these instruments test multiple domains, unlike the AVLT. The AVLT largely tests the single domain of memory, which is associated with medial temporal lobe structures. In this case, brain regions outside this territory may have contributed relatively more noise than discriminant signal. Thus, for prediction of single domain test scores from structural images, using a well-placed VOI may prove useful. Set 1 did not include an MCI group; the prediction accuracies may have been inflated by the large group of CN subjects with many scores at ceiling. The correlations were also likely stronger in Set 1 due to inclusion of more severe AD subjects, reflected by lower MMSE scores. The difference in disease severity between Sets 1 and 2 is also evident when comparing their MMSE weighted images (). Removing the large group of CN subjects from Set 1 lowered prediction accuracies to more closely reflect those of Set 2. Conversely, when Set 1 (no MCI group and a more severe AD group) was used for training and Set 2 (large MCI group) was used for testing, the prediction accuracy worsened, probably because comparable subjects were missing from the training set.
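To make the two accuracy measures discussed above concrete, the following minimal Python sketch computes a correlation between actual and predicted scores and a normalized RMS error. Dividing the RMS error by the instrument's score range is an assumption made here for illustration, since the exact normalization is not restated in this section. The function names and the example scores are hypothetical.

```python
import math

def pearson_r(actual, predicted):
    """Pearson correlation between actual and predicted scores."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

def normalized_rms(actual, predicted, score_min, score_max):
    """RMS prediction error divided by the score range (assumed normalization)."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / (score_max - score_min)

# Hypothetical MMSE-like scores on a 0 to 30 scale.
actual = [30, 28, 24, 19, 15]
predicted = [28.5, 27.0, 25.0, 21.0, 17.5]
print(pearson_r(actual, predicted))
print(normalized_rms(actual, predicted, 0, 30))
```

Normalizing by the score range lets errors be compared across instruments with different scales, which is the kind of comparison the paragraph above draws between DRS, MMSE, and ADAS-Cog.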
Demonstration of stability between different datasets is important for the future clinical use of machine learning methods. Training with one dataset and testing with another demonstrated stability between them when the training and testing groups were comparable, e.g., set 1 and set 2 with no MCI group, or when the training set included a wider group than the testing set, e.g., set 2 for training and set 1 for testing. Therefore, the prediction accuracy is likely more trustworthy with a widely distributed range of scores and scans and a large number of training samples, such as in set 2 with the large group of MCI subjects in addition to AD and CN subjects. The inclusion of more severe AD subjects in the training set would likely improve the performance further due to an even wider range of both scores and structural changes.
One proposed use of prediction accuracy by RVR is to test how well a particular score correlates with structure for any disease. Future studies should evaluate RVRs of whole brain images with other instruments, such as the Short-Test of Mental Status (Kokmen et al., 1987) and the Montreal Cognitive Assessment (Nasreddine et al., 2005). Using prediction accuracy to determine which of the commonly used clinical global assessment screens are most accurately predicted from brain images of patients with MCI and early AD should prove a useful validation method of the instrument and might establish an optimal short battery of screening tests for tracking disease progression. From the individual patient perspective, this method may prove useful when clinical score data are not available. For example, predicting performance on global cognitive screening tests from an MRI scan may help to distinguish delirium from dementia in patients presenting to an emergency department with confusion and no prior records reflecting previous mental status.
There are several cautions and limitations when interpreting our results. The results of the ratio of relevance vectors suggest that the training is not very sparse; the low sparsity suggests that information from many images contributes towards predictions, which may indicate that more scans would provide additional information. Even though statistically significant, the more modest correlations with a limited set consisting of only AD or MCI patients caution us against drawing firm conclusions regarding the clinical significance of the procedure at this juncture. Further validation studies with differing sample sizes and ranges of disease severity will help to clarify this issue with respect to RVR. We restricted our analyses to GM segmented images. It is possible that certain clinical tests reflect white matter (WM) changes whereas others reflect GM changes (Baxter et al., 2006). However, analyses using a kernel of GM plus WM performed on the same data sets did not improve the accuracy of any predictions. Given that atrophy in GM is a more established attribute of AD, the WM images likely added more noise than useful information. Future studies that also incorporate WM hyperintensities reflecting vascular pathology, which is known to occur in parallel with GM changes, may add more useful information. Also, after eliminating subjects whose testing was outside 3 months of a scan, CN subjects were slightly younger than the AD patients in set 1. However, the contribution from age should be relatively small, and further univariate analysis in which we removed the effect of age at each voxel by treating it as a confounding variable improved rather than diminished the correlations (MMSE=0.72, DRS=0.76, and AVLT=0.63). Similarly, the slight group differences in gender distribution and education in sets 2 and 3 are unlikely to have substantially affected prediction accuracy. Furthermore, comparing prediction accuracy of one clinical test with another within the same set should not be affected by such a bias, since the prediction accuracy of each clinical score would be subject to the same inhomogeneity.
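One common way to treat age as a confounding variable at each voxel is to regress the voxel values on age across subjects and keep the residuals. The sketch below illustrates that idea with ordinary least squares in plain Python; it is an assumption about the general approach, not a description of the exact procedure used in this study, and the function names and toy data are hypothetical.

```python
def residualize(values, covariate):
    """Remove the linear effect of a covariate from one voxel's values.

    Fits values ~ intercept + slope * covariate by least squares and
    subtracts the covariate-dependent part, keeping the voxel mean.
    """
    n = len(values)
    mean_c = sum(covariate) / n
    mean_v = sum(values) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    if sxx == 0:
        # Covariate is constant across subjects: nothing to remove.
        return list(values)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = sxy / sxx
    return [v - slope * (c - mean_c) for v, c in zip(values, covariate)]

def remove_age_effect(images, ages):
    """Apply residualize() voxel by voxel.

    images: list of subjects, each a list of voxel values (same length).
    ages:   list of subject ages, same order as images.
    """
    n_voxels = len(images[0])
    columns = [
        residualize([img[j] for img in images], ages) for j in range(n_voxels)
    ]
    # Transpose back to one list of voxel values per subject.
    return [[columns[j][i] for j in range(n_voxels)] for i in range(len(images))]

# Toy data: 3 subjects, 2 voxels; the first voxel varies linearly with age.
images = [[1.0, 0.5], [2.0, 0.5], [3.0, 0.5]]
ages = [60, 70, 80]
print(remove_age_effect(images, ages))
```

The adjusted images could then be passed to the regression in place of the original GM maps, so that any remaining correlation with clinical scores is not attributable to a linear age effect.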
It is possible that for tests showing a good correlation with structure, the prediction error (the actual score minus the predicted score) may provide useful clinical information. Since RVR gives probabilistic predictions, it is possible to measure the distance in standard deviations between predicted and actual scores. For example, in those who have learned compensatory strategies or can tolerate progressive brain pathology without manifesting cognitive symptoms, i.e., have a greater cognitive reserve (Stern, 2006), the expectation would be a predicted score lower than the clinical score. Years of education (one factor thought to provide cognitive reserve) and the prediction error (actual score minus predicted score) were significantly correlated for the MMSE and ADAS-Cog in the ADNI data-set. It is interesting to note that the 3 obvious outliers in (Set 2) include 2 MCI subjects, each with 18 years of education: one with an ADAS-Cog score of 10, but a predicted ADAS-Cog score of 39.04 (actual MMSE of 28 and predicted MMSE of 23.47) and the other with an actual MMSE of 29 and predicted MMSE of 23.99 (actual ADAS-Cog of 23.33 and predicted ADAS-Cog of 30.14). Conversely, the third outlier was a CN subject with 12 years of education, actual MMSE of 25 and predicted MMSE of 30.68 (actual ADAS-Cog 7.33 and predicted ADAS-Cog 3.8). Visual inspection of the MRI scans for these subjects reveals more atrophy in the MCI subjects than the CN subject, indicating that the predicted score is more reflective of brain pathology than actual scores in these cases, but that factors such as educational level may have boosted performance on clinical screening tests. Furthermore, among MCI subjects with an MMSE score of 30, the predicted score was significantly lower in those who subsequently converted to AD than those who did not (P=0.0006). However, more studies are needed to assess this method for making predictions about cognitive reserve, including additional factors such as exercise and leisure activities; to determine whether fatigue or depression is a factor in subjects whose actual test scores are lower than predicted scores; and to evaluate and correct for the possibility that the RVR is under-estimating higher and over-estimating lower scores.
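The distance in standard deviations between predicted and actual scores mentioned above can be sketched as follows. The example uses the actual and predicted MMSE values reported for one of the MCI outliers (28 and 23.47). The predictive standard deviation of 2.0 is a hypothetical placeholder, since no predictive variances are reported in this section, and the function names are likewise illustrative.

```python
def prediction_error(actual, predicted):
    """Actual score minus predicted score, as defined in the text."""
    return actual - predicted

def standardized_distance(actual, predicted_mean, predicted_sd):
    """Prediction error expressed in units of the predictive standard deviation."""
    if predicted_sd <= 0:
        raise ValueError("predicted_sd must be positive")
    return prediction_error(actual, predicted_mean) / predicted_sd

# MCI outlier from the text: actual MMSE 28, predicted MMSE 23.47.
# The predictive SD below is a made-up value for illustration only.
hypothetical_sd = 2.0
error = prediction_error(28, 23.47)
z = standardized_distance(28, 23.47, hypothetical_sd)
print(error, z)
```

A positive error means the subject scored better than the structural image predicted, the pattern the text associates with greater cognitive reserve; a negative error is the pattern for which the text raises fatigue or depression as possible factors.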
RVR offers a novel, multivariate method to test specific inter-regional dependencies between structural changes and clinical scores. As expected, and consistent with results from VBM studies, our results support the utility of the DRS, MMSE, and ADAS-Cog for screening and tracking AD. Perhaps more intriguing is RVR's ability to aid in making predictions for individual subjects. In the subset of MCI subjects from the ADNI data-set, correlation of predicted ADAS-Cog or MMSE scores with days to conversion to AD was not substantially better than that of the actual scores. Nonetheless, it is possible that other imaging modalities might work better for this purpose and RVR may well prove useful in the prediction of imminent disease. Future studies will be directed at developing and assessing methods to combine clinical scores with MRI, PET, and CSF biomarkers for the purpose of predicting clinical outcome.