The test set (generalization) accuracy of the voxel-based AD-Control classifier, built using 70 training samples per class, was 0.89 (86 of 88 Controls and 16 of 27 AD subjects were correctly classified). This classifier, with high specificity for Controls, was then applied to a population of MCI subjects to determine the CT and NT subgroups.
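As a quick check of the figures quoted above, the per-class rates and overall accuracy can be recomputed from the reported counts. This is only an illustrative sketch: treating AD as the "positive" class and Controls as the "negative" class is our labeling assumption, and the function name is hypothetical.

```python
def rates(controls_correct, controls_total, ad_correct, ad_total):
    """Return (specificity, sensitivity, accuracy), with AD assumed to be the positive class."""
    specificity = controls_correct / controls_total   # fraction of Controls correctly classified
    sensitivity = ad_correct / ad_total               # fraction of AD subjects correctly classified
    accuracy = (controls_correct + ad_correct) / (controls_total + ad_total)
    return specificity, sensitivity, accuracy


# Counts reported in the text: 86 of 88 Controls, 16 of 27 AD subjects.
spec, sens, acc = rates(86, 88, 16, 27)
print(f"specificity={spec:.3f} sensitivity={sens:.3f} accuracy={acc:.3f}")
# specificity=0.977 sensitivity=0.593 accuracy=0.887
```

The overall value of 0.887 rounds to the reported 0.89, and the gap between the two per-class rates is what is meant by "high specificity for Controls".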
3.2.1 Classification experiments for the MCI population
shows the sizes of the converter-by-CDR (C-CDR) and nonconverter-by-CDR (N-CDR) groups within the ADNI MCI cohort for a typical experiment in our work. shows the same population broken up as converters-by-trajectory (CT) and nonconverters-by-trajectory (NT). Superimposing the two charts, illustrates their overlap, where converters by both definitions are accordingly indicated by orange. Since converters-by-CDR are relatively scarce, we used a large majority of them (

, i.e. 39 individuals among the 48) for the by-CDR classifier's training set, with the rest (

) put into the test set. We reiterate that a general disadvantage of the by-CDR approach is its scarcity of converter examples – by contrast, a more balanced number of examples is available for by-trajectory training (at least

, rather than

, training samples per class, as in ). Note also that if we were to use a “probable AD”, rather than a by-CDR converter definition, where converters are required to have undergone both CDR and MMSE changes, there would necessarily be even
fewer converter examples, which makes “probable AD” even less attractive than by-CDR from the standpoint of having an adequate sample for classifier training and testing. This also raises the possibility of an alternative clinically-based definition, based on a logical OR-ing of CDR-based and MMSE-based conversions, i.e. where a converter must have undergone either a CDR change or an MMSE change (or both). One difficulty here is how to define MMSE-based conversion. In [13], MMSE scores were averaged over all visits in order to reduce noise. One can accordingly declare MMSE-based conversion if a subject's MMSE score, averaged over all visits, falls below a given threshold. The average MMSE score over all MCI subjects is 25.85. To assess the number of additional cognitive score-based converters one could obtain by considering MMSE, we varied the threshold on the average MMSE score, evaluating cutoffs of 24, 23, and 22, and found that the numbers of additional converters declared in this way were 55, 35, and 22, respectively. Thus, especially for an MMSE cutoff of 24, one can obtain a substantial number of extra converters using MMSE in addition to CDR. However, it is unclear what a proper choice for the MMSE threshold is: simply choosing a threshold of 24 because it leads to more converters is somewhat arbitrary, without a strong objective basis. Accordingly, in the sequel, for evaluating clinically-based conversion, we will only experimentally evaluate by-CDR conversion, as used in [4], [5], [13]. Specification and validation of a principled combined CDR- and MMSE-based conversion definition is a good subject for future work.
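The threshold sweep described above can be sketched as follows. The subject IDs and scores are made-up toy values, the function names are hypothetical, and reading "falls below" as a strict inequality is our assumption; the sketch only illustrates how the OR-ed definition and the count of additional converters would be computed.

```python
from statistics import mean


def mmse_converters(mmse_by_subject, cutoff):
    """Subjects whose MMSE score, averaged over all visits, falls below `cutoff`."""
    return {sid for sid, scores in mmse_by_subject.items() if mean(scores) < cutoff}


def or_converters(cdr_converters, mmse_by_subject, cutoff):
    """Logical OR of by-CDR and MMSE-based conversion."""
    return set(cdr_converters) | mmse_converters(mmse_by_subject, cutoff)


# Toy data (hypothetical): per-visit MMSE scores and the set of by-CDR converters.
mmse = {"s1": [26, 25, 24], "s2": [24, 23, 22], "s3": [22, 21, 20], "s4": [29, 28, 30]}
cdr = {"s4"}

for cutoff in (24, 23, 22):
    extra = mmse_converters(mmse, cutoff) - cdr   # converters gained beyond by-CDR
    print(cutoff, sorted(extra))
```

Lowering the cutoff can only shrink the set of additional converters, which matches the monotone decline (55, 35, 22) reported in the text.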
A fair performance comparison between by-trajectory and by-CDR classification requires: 1) using the same per-class training set size (i.e. 39) for both by-CDR and by-trajectory training, and 2) making the test set sizes the same for both classifiers. There are several different ways in which the data can be partitioned into training and test sets, consistent with these two conditions: i) we can perform simple random selection on a class-by-class basis, ensuring only that the two classifiers are given the same training/test set sizes (but not the same sets) – note that this means that the training sets for the converter and nonconverter classes of the conversion-by-CDR classifier are randomly selected from the yellow and white regions in , respectively, with no consideration of trajectory-based (i.e. red/white) labeling illustrated in ; ii) we can make the training sets of the two classifiers identical rather than merely same-sized, as well as make the test sets identical. This latter approach, though, will have some bias because, in selecting samples for the by-trajectory classifier, we will have to make use of knowledge of the samples' conversion-by-CDR status (and vice versa for the by-CDR classifier). The first approach, on the other hand, clearly does not have this bias. As both approaches are valid ways of dealing with by-CDR data limitations, we will compare generalization accuracies of by-CDR and by-trajectory classifiers under both these data selection schemes, respectively, referring to these approaches as “random” and “identical” in the sequel.
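A minimal sketch of the “random” scheme, under stated assumptions: each classifier receives its own class-balanced random training set and a random test set of a common size, with no attempt to make the sets themselves coincide. The toy cohort, the sizes, and the function name are hypothetical; age-matching is omitted for brevity.

```python
import random


def random_split(labels, n_train_per_class, n_test, rng):
    """Draw a class-balanced training set, then a test set from the remaining subjects."""
    train = []
    for cls in (0, 1):
        members = sorted(s for s, y in labels.items() if y == cls)
        train.extend(rng.sample(members, n_train_per_class))
    remaining = sorted(set(labels) - set(train))
    test = rng.sample(remaining, n_test)
    return train, test


# Hypothetical toy cohort of 20 subjects (IDs and labels are made up).
subjects = [f"s{i:02d}" for i in range(20)]
by_cdr = {s: int(i < 6) for i, s in enumerate(subjects)}              # scarce converters
by_trajectory = {s: int(i % 2 == 0) for i, s in enumerate(subjects)}  # balanced classes

rng = random.Random(0)
cdr_train, cdr_test = random_split(by_cdr, 4, 8, rng)
traj_train, traj_test = random_split(by_trajectory, 4, 8, rng)
print(len(cdr_train), len(traj_train), len(cdr_test), len(traj_test))
```

The “identical” scheme would instead construct one training set and one test set and hand both to the two classifiers, which requires consulting both labelings during selection and is the source of the bias noted above.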
Our training/test set selection procedure for the “identical approach” is as follows. For the C-CDR-CT group (), randomly select

of the group (the yellow striped group of size 30 in ) such that a corresponding group within N-CDR (white portion in ) can be found that is both NT and age-matched. This corresponding group is illustrated in as the white striped group (of size 30), placed opposite the yellow striped area with which it is paired (matched). Likewise for the C-CDR-NT group (), randomly select

of the group (the yellow striped group of size 9 in ) such that a corresponding group within N-CDR can be found that is both CT and age-matched. This corresponding group is illustrated in as the white striped group (of size 9). Notice by comparing this figure to that the two white striped areas are separated by the CT-NT border. We take the training set, shared by the by-CDR and by-trajectory classifiers, to be precisely the union of these four striped areas. (For the by-CDR classifier, the class membership of any of these four subsets of the training set is illustrated by the color being yellow or white in . Likewise, for the by-trajectory classifier, class membership is illustrated by red or white color in .) Subsequently, we take the test set, shared by the two classifiers, to be the subjects who are in neither 1) the training set (striped areas) nor 2) the special set of subjects shown in solid gray in (also shown identically in ). We exclude this “special set” (in gray) from the test set so that all our experiments under the “identical approach” can have a shared, fixed test set (for fair comparison with each other), including, crucially, an experiment that will include this “special set” of samples in the training set. That is, the test set is the tiled areas in (or, identically, in ).
Note that some random selection is employed in choosing the training/test sets even in the “identical approach” (whereas in the “random approach” the selection is completely random). Thus, for both approaches, the performance comparison will benefit from averaging accuracy results over multiple training/test split “trials”, where the training and test sets vary from trial to trial according to the random selection that is built into the data selection procedure. This essentially amounts to a bootstrap procedure, which aims to work with a finite (limited) amount of data and, at the same time, both build accurate models and accurately assess the model's generalization accuracy [38]. Results averaged across 10 trials are given in for a linear-kernel SVM (for generating all classification results herein, including those in , we used SVM classifiers that were built by employing the common approach of bootstrap-based validation for selecting the classifier's (trial's) hyperparameter values [26]);

notation is used to indicate the mean

and standard deviation

of quantities across the trials, which are shown rounded up. Note that by-trajectory's generalization performance is as high as 0.83, whereas by-CDR's generalization performance is very poor, as poor as random guessing (see the 0.5 and 0.56 table values), due mainly to poor performance on nonconverters-by-CDR. and show by-trajectory and by-CDR results, respectively, for one of the 10 trials (for the “random approach”), with each bar indicating distance to the classification boundary for an MCI subject in a test population of size 88 and nonconverters/converters shown in left/right figures, respectively. Positive/negative distance means nonconverter/converter side of the boundary, respectively. Among the 88 subjects, by-trajectory correctly classified 79 whereas by-CDR correctly classified only 40.
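The sign convention and the per-trial summary above can be illustrated with a short sketch. The function names are hypothetical, the per-trial accuracies in `trial_accuracies` are invented for illustration only (they are not the paper's values), the treatment of a distance of exactly zero is our assumption, and the use of the sample standard deviation is likewise an assumption since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def predict_from_distance(distance):
    """Positive signed distance: nonconverter side; otherwise converter side."""
    return "nonconverter" if distance > 0 else "converter"


def accuracy(distances, true_labels):
    """Fraction of subjects whose boundary side matches their true label."""
    hits = sum(predict_from_distance(d) == y for d, y in zip(distances, true_labels))
    return hits / len(true_labels)


# Accuracies implied by the counts reported for the example trial (88 test subjects).
print(round(79 / 88, 3), round(40 / 88, 3))   # 0.898 0.455

# Invented per-trial accuracies, only to show the mean/standard-deviation summary.
trial_accuracies = [0.80, 0.85, 0.83, 0.82, 0.84]
print(f"{mean(trial_accuracies):.2f} +/- {stdev(trial_accuracies):.2f}")
```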
| Table 1. Test set accuracy comparison of by-CDR and by-trajectory classification: average test set classification accuracy using all features. |
Recently, similarly poor by-CDR classification performance was also reported in [4], where it was found that the majority of (by-CDR) nonconverters “had sharply positive SPARE-AD scores indicating significant atrophy similar to AD patients”. Since the SPARE-AD score is produced by a classifier that was trained to discriminate Control and AD patients [32], [16], this comment and the associated results are consistent with both our conjecture in the Introduction and our histogram results above, which suggest that there may be a significant number of patients undergoing physiological brain changes consistent with conversion, yet without clinical manifestation.
The results above indicate that the conversion-by-CDR definition's two classes are not well-discriminated, and thus, clinical usefulness of this definition for our prognostic Aim is expected to be poor. The much greater generalization accuracy of the by-trajectory definition (coupled with its inherent plausibility as a conversion definition) indicates its greater utility.
Increasing the By-Trajectory Image-Based Feature Resolution: In a separate experiment, we evaluated using one of 27 subsamples (rather than one of 216 subsamples), i.e. a

10-fold increase in the number of (voxel-based) features, and found that the by-trajectory generalization accuracy rose to 0.91 in the “random” case. We then tried building 27 separate by-trajectory classifiers, one for every 1/27th subsample (thus effectively using the whole 3D image), with majority-based voting used to combine the 27 decisions. This ensemble scheme again achieved 0.91 accuracy, i.e. there was no further accuracy benefit beyond that from a

10-fold increase in the number of voxel features.
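The majority-vote combination of the 27 per-subsample decisions can be sketched as below. The function name and the class strings are illustrative assumptions; with an odd number of binary voters no tie can occur, so no tie-breaking rule is needed here.

```python
from collections import Counter


def majority_vote(decisions):
    """Return the class label chosen by the largest number of classifiers."""
    return Counter(decisions).most_common(1)[0][0]


# Hypothetical decisions for one test subject from 27 subsample-specific classifiers.
decisions = ["CT"] * 14 + ["NT"] * 13
print(majority_vote(decisions))   # CT
```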
Increasing the By-Trajectory Training Set Size: Note that the converter-by-CDR sample scarcity and class-balancing (via age-matching) in the experiments above had the effect of artificially limiting the by-trajectory classifier training set size. Next, we investigated how much the generalization accuracy of by-trajectory classification improves when this limitation is removed. The tiled areas in are identical, illustrating that in this new experiment ( and ) we used the same test set as previously, for fairness of comparison. However, as indicated by differences in the total striped area between these two charts, we now make the training set much larger than previously. Specifically, for the “identical” case, we used the previous 10 trials but simply augmented each trial's training set with the two large, previously excluded gray sets, shown in , with size 60, as these two sets do age-match each other. The results, averaged across the 10 trials, are given in . Notice in this figure the now larger per-class training size (on average

rather than

), and that the random approach uses this size as well. The by-trajectory results in indicate that accuracy improved from

() to

for the “identical approach”, and modestly worsened for the “random approach” (from

to

).
| Table 2. By-trajectory average test set classification accuracy for the larger training set size ( per class) and features. |
A Definition Based on Both Neuroimaging and Clinical Change: We now consider a third definition of conversion that combines the first two definitions as follows. Let “converters” consist of individuals who converted either by-trajectory or by-CDR (non-white areas in ), with the “nonconverter” class consisting of the remaining MCI individuals (white area). We give two points of view on this new definition, “conversion-by-union”. First, despite the disadvantages with CDR and MMSE pointed out in sections 1 and 2 of this paper, our strictly neuroimaging-based by-trajectory definition is not using any clinical information in defining the phenotype, i.e. psychological effects are not being taken into account. Second, this “union” definition is more inclusive in defining an MCI subpopulation at risk, which may benefit from early treatment or diagnostic testing. While from that perspective the new definition is reasonable, the fact that grouping individuals by CDR has a role in this definition may be its disadvantage, considering that by-CDR classification was previously shown to perform not much better than random guessing. Results, averaged across the same 10 trials used in , are given in and indicate that conversion-by-union generalizes somewhat worse than conversion-by-trajectory. Note that “by-union” is, by definition, an instance of the “identical approach”. To ensure fairness of comparison with the by-trajectory definition of conversion, our test sets, and training set sizes, in these two cases were identical. In fact, we chose the by-union training set to be as similar to by-trajectory's, in every trial, as possible. 
Referring to (which represents a trial example), the by-union training set was chosen to include 1) the two large striped groups (red and white); 2) the small “special” gray group (of size seven in this trial example) and its age-matched counterpart within the small white-striped group; and 3) a subset of the second small “special” gray group (two of five individuals in this trial example) and its age-matched counterpart within the small red-striped group. We do not further evaluate “conversion-by-union” here. We do, however, identify the “optimal” definition of a multidimensional phenotype and associated conversion, based on neuroimaging, multiple cognitive measures (e.g., CDR and MMSE), genetic markers, and CSF markers (if routinely measured), as a good direction for future work.
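The “conversion-by-union” labeling itself is a simple logical OR of the two binary labelings, as the following sketch shows. The subject IDs and labels are toy values and the function name is hypothetical.

```python
def by_union(by_trajectory, by_cdr):
    """Label a subject as a converter (1) if it converts by-trajectory OR by-CDR."""
    return {s: int(bool(by_trajectory[s]) or bool(by_cdr[s])) for s in by_trajectory}


# Toy labels (hypothetical): 1 = converter, 0 = nonconverter.
traj = {"a": 1, "b": 0, "c": 0, "d": 1}
cdr = {"a": 0, "b": 1, "c": 0, "d": 1}
print(by_union(traj, cdr))   # {'a': 1, 'b': 1, 'c': 0, 'd': 1}
```

Only subjects that are nonconverters under both definitions remain in the nonconverter class, which is why the union definition is the more inclusive one.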
| Table 3. Average test set accuracy of by-union classification for per-class training samples and features. |
3.2.2 Validation on Known AD Conversion Biomarkers
To validate the proposed conversion definitions with respect to desideratum 3, we performed correlation tests on the MCI population between the binary class variable

and known AD conversion biomarkers consisting of: 1) volume in reported AD-affected regions (Table 2 in [11]), which we measured for each individual's final-visit MRI (As discussed in the supplemental Document S1, we measured normalized region volume. Note also that our regions are defined based on the atlas (Atlas2) we used. The correspondence between the regions in [11] and our defined regions is given in . Finally, note that a subject's final visit is not always the sixth visit.); 2) the following CSF-based markers, as considered previously in [12]: tau, p-tau, Aβ1-42; and 3) the clinical MMSE measure. The stronger the correlation, the more accurately the biomarker is predicted from the class variable and the greater the separation between the biomarker histograms, conditioned on the two classes. In particular, we would expect that a good converter definition should have statistically significant correlation between its class variable and region volume at final visit for known marker regions such as the hippocampus. We note, however, that for measuring correlation between the by-trajectory class label

and the final-visit MRI-derived region volume biomarkers, some care is required to avoid statistical bias. In particular, note that the by-trajectory label is obtained by applying the Control-AD classifier to each visit's MRI, with conversion declared if any of the visits (including the final one) is classified as “AD”. Since the final visit is also used to measure the region volume biomarkers we will use to validate the by-trajectory labels, this “dual use” of the final visit would be a source of bias. There are 22 MCI subjects who convert by-trajectory only at the final visit. To avoid bias in validating the by-trajectory approach, we excluded these 22 subjects from the brain region volume-based statistical validation of the by-trajectory definition. The full MCI population (including these 22 subjects) was used in all our other validation testing.
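The by-trajectory labeling rule and the exclusion criterion just described can be written compactly. This is a sketch under stated assumptions: per-visit classifier outputs are represented as the strings "AD" and "Control", visits are in chronological order, and the function names are hypothetical.

```python
def converts_by_trajectory(visit_predictions):
    """Converter if any visit (including the final one) is classified as AD."""
    return any(p == "AD" for p in visit_predictions)


def converts_only_at_final(visit_predictions):
    """True if the final visit is the only one classified as AD."""
    *earlier, final = visit_predictions
    return final == "AD" and all(p != "AD" for p in earlier)


# Toy per-visit classifier outputs (hypothetical), in chronological order.
subjects = {
    "u1": ["Control", "Control", "AD"],       # converts only at the final visit
    "u2": ["Control", "AD", "AD"],            # converts earlier
    "u3": ["Control", "Control", "Control"],  # nonconverter
}
excluded = {s for s, v in subjects.items() if converts_only_at_final(v)}
print(sorted(excluded))   # ['u1']
```

Subjects in `excluded` would be dropped only from the region-volume validation of the by-trajectory labels, since their label depends entirely on the same final-visit image used to measure the volumes.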
| Table 4. Correspondence between the regions in [11] (left) (except “Total GM” and “Total WM”) and our defined regions (right). |
Before presenting correlation test results, we first illustrate in the increased separation of the histograms of hippocampus volume for the converter and nonconverter groups in the by-trajectory case, compared with by-CDR. Next, we performed comprehensive statistical tests for a number of suggested AD biomarkers. The R statistical computing package was used to perform all tests, with statistical significance set at the 0.05 level. In , the correlation coefficients for by-trajectory and by-CDR are shown for each biomarker, along with their associated p-values [39]. Note that for 10 out of 14 brain regions, the correlation with by-trajectory is greater than the correlation with by-CDR (in bold), with by-trajectory meeting the significance threshold in 9 of these 10 regions. Further, for only two of the remaining four biomarkers (posterior cingulate and the clinical MMSE measure) does the correlation with by-CDR meet the significance threshold. Most notably, well-established markers for AD such as the hippocampus, lateral ventricles, and inferior parietal exhibited strong correlation with the by-trajectory definition. To further assess the statistical significance of the comparison between by-trajectory and by-CDR correlations, we performed a correlated correlation test [40], the appropriate test given that the same MCI sample population (excepting the 22 excluded subjects for the by-trajectory brain region volume tests) was used in measuring correlations for both by-CDR and by-trajectory. This test () reveals that the larger correlation of by-trajectory is statistically significant at the 0.05 level in six brain regions (in bold), most notably the hippocampus, with a very low p-value (2.16e-10). By contrast, conversion-by-CDR does not achieve a statistically significant advantage for any of the brain regions, nor with respect to MMSE.
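For readers unfamiliar with tests for comparing correlated correlations, the following sketch implements one widely used form, the z-test of Meng, Rosenthal and Rubin. Whether this is exactly the procedure of [40] is an assumption on our part; the text only cites the reference and reports that R was used. Here `r1` and `r2` are the correlations of one biomarker with the two class variables, `r12` is the correlation between the two class variables themselves, and `n` is the common sample size; all numeric inputs below are invented for illustration.

```python
import math


def meng_z_test(r1, r2, r12, n):
    """Two-sided z-test for H0: rho1 == rho2 when both correlations share one sample."""
    z1, z2 = math.atanh(r1), math.atanh(r2)        # Fisher z-transforms
    rbar2 = (r1 ** 2 + r2 ** 2) / 2                # mean squared correlation
    f = min((1 - r12) / (2 * (1 - rbar2)), 1.0)    # f is capped at 1
    h = (1 - f * rbar2) / (1 - rbar2)
    z = (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))
    p = math.erfc(abs(z) / math.sqrt(2))           # two-sided normal p-value
    return z, p


# Invented example values, not taken from the paper's tables.
z, p = meng_z_test(r1=0.6, r2=0.2, r12=0.3, n=200)
print(f"z={z:.2f}, significant at 0.05: {p < 0.05}")
```

Because a binary class variable is involved, each individual correlation is a point-biserial correlation, which is numerically the ordinary Pearson correlation with the class coded as 0/1. When the biomarker correlates negatively with conversion (e.g. a shrinking volume), the sign convention of the inputs needs to be chosen consistently before comparing magnitudes.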
| Table 5. Correlation coefficients and associated p-values: (a) Correlation test results; (b) Correlated correlation test results for each of the regions in (a). |
Statistical testing results for the CSF markers are shown in . As seen in the table, by-trajectory has larger correlation with tau and Aβ1-42 than by-CDR conversion. Moreover, for by-trajectory, these correlations are statistically significant. However, the correlated correlation test did not indicate that the comparison of correlations reached a statistically significant level. Correlations with p-tau were comparable for the two conversion definitions.
| Table 6. Correlation coefficients and associated p-values: (a) Correlation test results; (b) Correlated correlation test results for each of the CSF biomarkers in (a). |
To summarize, testing on both brain region and CSF-based markers validates that by-trajectory is more consistent with conversion to AD than the by-CDR definition.
3.2.3 Identification of prognostic brain “biomarker” regions
In the previous section, we validated conversion definitions using established (diagnostic) AD biomarker brain regions (with volumes measured at final visit). In this section, we will identify key prognostic biomarker brain regions (from the baseline visit image) via supervised feature selection, aiming first to identify the “essential” subset of voxel features, i.e. the voxels (at initial visit) necessary for our classifier to discriminate the CT and NT classes well. The brain regions (consistent with a registered brain atlas) within which these select voxels principally reside then identify our prognostic brain biomarker regions. Similarly, we will identify diagnostic regions, critical for discriminating between AD and Control subjects (using our AD-Control classifier). In both cases, the accuracy of the selected brain region biomarkers rests heavily on the accuracy of the supervised feature selection algorithm we employ. In , we compare MFE and RFE feature elimination (i.e. feature selection via feature elimination) for both Control-AD classification and CT-NT classification (for one representative example trial). The curves show test set accuracy as a function of the number of retained features (which is reduced going from right to left). Note that MFE (the “MFE/MFE-slack” hybrid method [26]) outperforms RFE for both brain classification tasks, achieving lower test set error rates with far fewer retained features. The circle, determined without use of the test set based on the rule in [26], marks the point at which we stopped eliminating features by MFE, thus determining the (trial's) retained voxel set. This MFE-RFE comparison (and the previous comparison in [26]) supports our use of MFE to determine brain biomarkers.
To relate the retained voxel set to anatomic regions in the brain, we overlaid the retained voxel set onto a registered atlas space. For CT-NT classification, to improve robustness, the final voxel set was formed from the union of the retained voxel sets from each of ten feature elimination trials (each using a different, randomly selected training sample subset). For AD-Control classification, the final voxel set came from a single trial (the only trial, from which the 10 CT-NT trials stemmed). For each of these two cases, overlaying the final voxel set onto the co-registered atlas (Atlas2, defined in the supplemental Document S1) yielded between 70 and 80 anatomic regions. For data interpretation purposes, we then identified a subset of (biomarker) regions using the following procedure. First, for each brain region, we measured the percentage of the region's voxels that are retained, sorted these percentages, and then plotted them. As shown in , the resulting curve for the AD-Control case has a distinct knee, which we thus used as a threshold (0.125) to select the final, retained (diagnostic) regions for AD-Control. We used the same threshold for the CT-NT curve, shown in . This choice of threshold yields a reasonable number of regions: 19 for the CT-NT (prognostic) case and 21 for the AD-Control (diagnostic) case.
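The region-selection procedure above (union of per-trial voxel sets, per-region retained fraction, thresholding at 0.125) can be sketched as follows. The toy atlas, voxel indices, and function name are hypothetical, and treating the threshold as inclusive is our assumption; the knee itself was located by inspecting the sorted curve, which this sketch does not attempt to automate.

```python
from collections import Counter


def select_regions(atlas_labels, retained_voxels, threshold=0.125):
    """Return (selected regions sorted by retained fraction, all per-region fractions)."""
    total = Counter(atlas_labels.values())
    kept = Counter(atlas_labels[v] for v in retained_voxels if v in atlas_labels)
    fractions = {region: kept[region] / total[region] for region in total}
    ranked = sorted(fractions.items(), key=lambda kv: kv[1], reverse=True)
    return [region for region, frac in ranked if frac >= threshold], fractions


# Toy atlas (hypothetical): voxel index -> region name.
atlas = {v: "hippocampus" for v in range(8)}
atlas.update({v: "precuneus" for v in range(8, 24)})

# Union of the retained voxel sets from several feature elimination trials.
trial_sets = [{0, 1}, {1, 8}]
final_voxels = set().union(*trial_sets)

selected, fractions = select_regions(atlas, final_voxels)
print(selected, fractions)
```

In this toy case the hippocampus retains 2 of 8 voxels (0.25) and is selected, while the precuneus retains 1 of 16 (0.0625) and is not.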
The resulting sets of identified prognostic and diagnostic biomarkers are given in , along with their intersection. The diagnostic markers in the table include the majority of the known brain regions in the medial temporal lobe involved in AD pathology. For example, hippocampus atrophy and lateral ventricle enlargement, particularly in the anterior aspects of the temporal horn, are considered the most prominent diagnostic markers for AD. Entorhinal cortical regions, including the perirhinal cortex, are presumably the earliest sites of degeneration [41]. Thus, independent identification by our AD-Control classifier of known AD diagnostic biomarkers establishes a reasonable basis for applying the same approach to identify prognostic biomarkers. The brain regions listed as CT-NT prognostic markers include most known AD diagnostic markers (including 8 of the 12 regions from [11] (marked by *), 4 of which are also diagnostic markers), indicating that some AD-linked pathological changes in these brain regions had already occurred and remained active in a subset of MCI subjects who are likely to progress to AD rapidly. Conversely, the brain areas appearing only on the prognostic marker list are likely the most active areas of degeneration during this stage of progression to dementia. These structures tend to be the brain regions further away from the entorhinal cortex, extending onto the parietal (Supramarginal gyrus, Precuneus) and temporal cortex (Superior temporal gyrus and Middle temporal gyrus) regions. All the brain structures listed in the table are known to be involved in AD [41], [42], [43]. Thus, the markers in suggest an interesting anatomic pattern of trajectory for MCI conversion to AD which conforms with the Braak and Braak hypothesis and previous imaging findings [42], [43]. Moreover, the CT-NT regions uniquely found by our MFE-based procedure in may be viewed as “putative” prognostic markers, and may warrant further investigation.
| Table 7. Brain regions identified as biomarkers using voxel-based features and MFE. |
Finally, we note that we have used a particular criterion (percentage of a region's voxels that are retained) to identify biomarker regions, starting from MFE-retained voxels. While our identified regions are plausible, it is possible that other (equally plausible) criteria may produce different biomarker region results. Thus, the biomarkers we identify should be viewed as anecdotal, identifying regions that figure prominently in our classifier's decision-making and also potentially assisting researchers in forming hypotheses about MCI-to-AD disease progression. However, we do not view the identified regions as definitive.