For all *local* predictive models, no interaction between ethnicity and the other covariates was significant, so no interaction term was added to any of them. The *local* ARIC and SAHS models with one combined race variable for Malays and Asian Indians fitted better than their counterparts with two separate race variables. For the *local* SAHS models, the Akaike Information Criterion (AIC) values for the models with one and two race variables were 588.6 and 590.6, respectively. For the *local* ARIC models, the AIC values were 587.9 and 589.8, respectively.
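The AIC gaps reported above are close to 2, which is what the penalty term alone contributes for one extra parameter. The following is a minimal sketch of that arithmetic, assuming the standard definition AIC = 2k − 2 ln L and assuming that each two-variable model has exactly one more parameter than its one-variable counterpart (the text does not report parameter counts). The function name is illustrative only.

```python
def implied_deviance_gain(aic_small: float, aic_large: float,
                          extra_params: int = 1) -> float:
    """Reduction in -2*log-likelihood bought by the extra parameters.

    With AIC = 2k - 2*lnL:
        aic_large - aic_small = 2*extra_params - gain
    so  gain = 2*extra_params - (aic_large - aic_small).
    """
    return 2 * extra_params - (aic_large - aic_small)


# AIC values quoted in the text: (one race variable, two race variables).
# ASSUMPTION: the larger model adds exactly one parameter.
sahs_gain = implied_deviance_gain(588.6, 590.6)
aric_gain = implied_deviance_gain(587.9, 589.8)

print(f"SAHS deviance gain: {abs(sahs_gain):.1f}")
print(f"ARIC deviance gain: {abs(aric_gain):.1f}")
```

Under these assumptions the second race variable improves the deviance by roughly 0.0 (SAHS) and 0.1 (ARIC), so almost the entire AIC difference is the penalty for the additional parameter.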

Table 2 compares the coefficients of the three multivariate predictive functions estimated using the NHS-92 cohort with the *published* coefficients from the SAHS, ARIC and Framingham models.

| **Table 2** Comparison of *published* and locally-estimated multivariate predictive functions |

For the ARIC model, the 95% confidence intervals of the following risk factors did not include the *published* effect size from the original study: age, FPG at baseline, waist circumference and triglycerides.

For the Framingham model, all locally-estimated effect sizes agreed well with the *published* ones, except for the effect of overweight, which was significantly higher in our *local* model.

However, at this stage, we did not know to what extent the estimates from the *local* models had been influenced by the exclusion of 942 subjects who lacked an FPG measurement at follow-up or had incomplete baseline information.

Table 3 shows the multiple imputation estimates for the three *local* models. Interestingly, after taking into account the dropout effects, the discrepancies between the *local* and *published* models were less marked. In particular, for the SAHS and ARIC models, the discrepancies in terms of FPG and measures of adiposity (BMI or waist circumference) were no longer statistically significant. There were, however, significantly smaller age and gender effect sizes when compared to the *published* models. With the Framingham model, we found a significantly larger effect of overweight, quite possibly because a Caucasian definition of overweight (BMI ≥ 25) had been used instead of the WHO recommendation for Singapore, which used a cut-off of BMI ≥ 23 to define overweight individuals [22].

| **Table 3** Comparisons of *published* and locally-estimated multivariate predictive functions after inclusion of 942 subjects with incomplete baseline and follow-up measurements using multiple imputations |

Discrimination power

Table 4 compares the AUC for the various predictive functions. Although the AUCs for all three locally-estimated multivariate models were slightly higher than the corresponding statistic for their *published* counterparts, only for the SAHS model did this difference achieve statistical significance. All locally-estimated multivariate models achieved better discrimination power than the model that used FPG only (all *P* < 0.001, Table 4), while the locally-estimated Framingham model was the only one that was not statistically better than the model that used only 2hPG (*P* = 0.110).

| **Table 4** Comparisons of area under the Receiver Operating Characteristic curve (AUC) for various predictive models evaluated using NHS-92 Cohort |
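As a reminder of what the AUC measures, it can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. The sketch below computes it by direct pairwise comparison. The function name and the toy risks are invented for illustration and are not NHS-92 data; the significance tests for differences between AUCs reported in the text are not reproduced here.

```python
def auc(scores_cases, scores_noncases):
    """Fraction of case/non-case pairs ranked correctly; ties count as half."""
    wins = 0.0
    for c in scores_cases:
        for n in scores_noncases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_noncases))


# Toy predicted risks, invented for illustration (NOT study data).
toy_cases = [0.30, 0.22, 0.10]
toy_noncases = [0.25, 0.08, 0.05, 0.10]

toy_auc = auc(toy_cases, toy_noncases)
print(f"AUC = {toy_auc:.3f}")
```

In this toy example 9.5 of the 12 case/non-case pairs are ranked correctly, giving an AUC of about 0.79.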

Of the three *published* functions, the ARIC model had the highest discrimination power (AUC = 0.847), followed by the SAHS model (AUC = 0.839) and the Framingham model (AUC = 0.805). The performance of the *published* SAHS and ARIC models was not statistically different (*P* = 0.230), but both models had significantly higher discrimination power than the Framingham model (*P* = 0.028 and 0.007, respectively). More importantly, the *published* SAHS and ARIC models were significantly better at discriminating T2DM cases from non-cases than the *local* model that used FPG only (*P* < 0.001) or 2hPG only (*P* = 0.021 and 0.011, respectively).

The NRI statistic revealed that overall, the *published* ARIC model was only marginally better than the *published* SAHS model (NRI = 0.127, *P* = 0.060). When we looked at cases and non-cases separately, the ARIC model was not significantly better than the SAHS model in terms of reclassifying cases (Figure ). Specifically, compared to the SAHS model, 32 cases were appropriately reclassified by the ARIC model at a cost of 21 cases being reclassified inappropriately (NRI = 0.100, *P* = 0.131). However, the ARIC model was better at reclassifying non-cases. In total, 110 non-cases were appropriately reclassified using the ARIC model, at a cost of 76 non-cases being reclassified inappropriately (NRI = 0.026, *P* = 0.013).
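The component NRI values above follow from the reclassification counts once the group sizes are known. The sketch below reproduces the arithmetic under an explicit assumption: this section does not state the number of cases and non-cases, so 110 cases and 1,291 non-cases are inferred here from the 7.8% observed incidence and N = 1,401 and should be treated as approximate. The function name is illustrative only.

```python
def nri_component(appropriate: int, inappropriate: int, group_size: int) -> float:
    """Net proportion of a group (cases or non-cases) reclassified correctly."""
    return (appropriate - inappropriate) / group_size


# ASSUMPTION: group sizes are not given in this section; they are inferred
# from the 7.8% incidence and N = 1,401 and may differ from the true counts.
N_CASES = 110
N_NONCASES = 1291

# Reclassification counts quoted in the text (ARIC relative to SAHS).
nri_cases = nri_component(32, 21, N_CASES)
nri_noncases = nri_component(110, 76, N_NONCASES)
nri_total = nri_cases + nri_noncases

print(f"cases: {nri_cases:.3f}  non-cases: {nri_noncases:.3f}  total: {nri_total:.3f}")
```

With these assumed group sizes the components come out at 0.100 and 0.026, matching the reported values, and their sum is about 0.126, close to the reported overall NRI of 0.127; the small gap presumably reflects rounding or the assumed group sizes.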

Calibration

The calibration inspections revealed that the *local* models showed good calibration properties (Table 5). In particular, the H-L statistics for the *local* models were all less than 11.5, and the predicted incidence rates under all *local* models agreed well with the observed incidence rate over the 13-year period in the NHS-92 cohort, which was 7.8%. However, the three *published* models showed poor calibration, with the Framingham model being the worst. The *published* SAHS and ARIC models overestimated the incidence rate, with estimates of 13.5% and 9.8%, respectively, while the Framingham model underestimated it, with an estimate of 2.0%.

| **Table 5** Calibration quality of various predictive models evaluated using NHS-92 Cohort (N = 1,401) |
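For readers unfamiliar with the H-L statistic, the sketch below shows one common way to compute it: subjects are sorted by predicted risk, split into equal-sized groups, and the observed and expected numbers of cases are compared within each group. The number of groups, the function name and the toy data are assumptions made for illustration; this is not necessarily the exact procedure used in the study.

```python
def hosmer_lemeshow(predicted, observed, groups=10):
    """Hosmer-Lemeshow chi-square over equal-sized groups of sorted predicted risk."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of cases
        obs = sum(y for _, y in chunk)        # observed number of cases
        mean_p = expected / size
        denom = size * mean_p * (1 - mean_p)
        if denom > 0:
            stat += (obs - expected) ** 2 / denom
    return stat


# Toy data, invented for illustration (NOT study data): 1 case among 10
# low-risk subjects and 5 cases among 10 high-risk subjects.
toy_outcomes = [1] + [0] * 9 + [1] * 5 + [0] * 5
well_calibrated = [0.1] * 10 + [0.5] * 10
overestimating = [0.2] * 10 + [0.5] * 10

hl_good = hosmer_lemeshow(well_calibrated, toy_outcomes, groups=2)
hl_poor = hosmer_lemeshow(overestimating, toy_outcomes, groups=2)
print(f"well calibrated: {hl_good:.3f}  overestimating: {hl_poor:.3f}")
```

Smaller values indicate closer agreement between predicted and observed numbers of cases; in the toy example, doubling the predicted risk in the low-risk group raises the statistic from about 0 to 0.625.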

Recalibration improved the calibration quality of the ARIC model (Figure ), but the same cannot be said for the Framingham model. The recalibration procedure seemed to work reasonably well for the SAHS model for subjects in the lowest three quintiles (Figure ); however, the SAHS model still overestimated the number of cases in the two highest quintiles even after recalibration.

The poor performance of the Framingham model could be due to the fact that the Framingham cohort used to derive the model consisted almost exclusively of one race, while the Singapore population consists of three races, with Chinese being different from Malays and Indians. To investigate this possibility, we performed local fitting and recalibration of the three *published* models separately in the Chinese and non-Chinese populations, with the race terms removed from the ARIC and SAHS models.

In the Chinese population, the H-L statistics for the locally-fitted ARIC, Framingham and SAHS models were 4.22, 3.43 and 1.38, respectively, indicating good calibration properties. However, only the recalibrated ARIC and SAHS models showed acceptable calibration quality, with H-L statistics of 7.31 and 2.23, respectively. The recalibrated Framingham model still showed poor calibration quality (H-L statistic = 26.12).

Among the non-Chinese population, the findings were very similar. The H-L statistics for the locally-fitted ARIC, Framingham and SAHS models were 1.53, 2.74 and 3.34, respectively. Among the recalibrated *published* models, only ARIC showed acceptable calibration quality, with an H-L statistic of 6.87. The recalibrated Framingham and SAHS models had poor calibration quality, with H-L statistics of 19.29 and 140.17, respectively.

Thus, the poorer performance of the Framingham model is unlikely to be due only to differences in race effects between the Framingham cohort and the Singapore population. It is more likely that differences in the effect sizes of some of the risk factors also contribute to the poor performance.