Our models discriminated between ESCC cases and controls in about 77% of cases, and between individuals with and without squamous dysplasia in about 70% of cases. The dysplasia model, however, had a positive predictive value of only 6.8%, owing to the low prevalence of this lesion.
Our results were better than those of a previous study conducted in another high-risk area in China [8]. In that study, Wei et al. observed a sensitivity of 57% and a specificity of 54% for predicting squamous dysplasia, even though many components of the model were significantly associated with the risk of dysplasia. The area under the curve in their study was 0.58 without any validation method; this value was already very low, and any validation would have decreased it even further [8]. We found better sensitivity and specificity, and a higher area under the curve despite using cross-validation (even the lower bound of the confidence interval in our study was higher than 0.58). Cross-validation tests the model in individuals other than those used for model building, and thus gives a more realistic estimate of the model's discriminating power.
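The idea of testing the model only in individuals who were not used to build it can be illustrated with a minimal sketch. It assumes a k-fold scheme with pooled out-of-fold predictions; the fold count, the pooling strategy, and the `fit` and `predict` callables are illustrative assumptions, not a description of the procedure used in this study.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    case scores higher than a randomly chosen control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(rows, labels, fit, predict, k=5, seed=0):
    """Score every individual with a model fitted without that individual,
    then compute a single AUC on the pooled out-of-fold scores.

    `fit(train_rows, train_labels)` and `predict(model, row)` are
    hypothetical user-supplied callables (e.g. a logistic regression)."""
    scores = [None] * len(rows)
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(rows)) if i not in held_out]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        for i in test_idx:
            scores[i] = predict(model, rows[i])
    return auc(scores, labels)
```

Because each score comes from a model that never saw the corresponding individual, the resulting AUC is typically lower than the apparent AUC obtained by scoring the same data used for fitting.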
Studies using questionnaire data and symptoms to predict esophageal adenocarcinoma have had similar results. Models using symptoms and individual characteristics for the diagnosis of Barrett’s esophagus, a precursor lesion of esophageal adenocarcinoma, had AUCs of 0.72 [10] and 0.76 [9]. One major difference is that Barrett’s esophagus is closely related to gastroesophageal reflux disease (GERD), so GERD symptoms can be used in the model, whereas dysplasia is rarely symptomatic.
In validation studies, a model may discriminate well between cases and controls yet fail to predict the probability of the event correctly [20]. Calibration refers to the agreement between predicted probabilities and observed proportions [16]. In our data, the base models (without weight loss) showed good calibration according to the Hosmer-Lemeshow statistic. Interestingly, although the ESCC model with weight loss had the highest area under the curve, it had a poor fit. This model also has little practical application, since weight loss develops during the symptomatic stage of ESCC, when endoscopy is strongly indicated and risk screening is no longer useful.
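To make the notion of calibration concrete, the following minimal sketch computes a Hosmer-Lemeshow-type statistic by comparing observed and expected event counts within groups of increasing predicted risk. The function name, the default of ten equal-sized groups, and the handling of degenerate groups are assumptions made for illustration; this is not the code used for the analyses reported here.

```python
def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return (statistic, degrees_of_freedom) for a Hosmer-Lemeshow-type test.

    probs    -- predicted probabilities of the event, one per individual
    outcomes -- observed outcomes (1 = event, 0 = no event)
    groups   -- number of risk groups (10 "deciles of risk" assumed here)
    """
    # Sort by predicted probability only; the sort is stable, so ties are
    # not reordered by outcome.
    pairs = sorted(zip(probs, outcomes), key=lambda t: t[0])
    n = len(pairs)
    statistic = 0.0
    used = 0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue  # fewer individuals than groups
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        mean_p = expected / size
        variance = size * mean_p * (1 - mean_p)
        if variance == 0:
            raise ValueError("a group has mean predicted risk of 0 or 1")
        statistic += (observed - expected) ** 2 / variance
        used += 1
    # The statistic is conventionally compared with a chi-square
    # distribution with (number of groups - 2) degrees of freedom.
    return statistic, used - 2
```

A small statistic relative to its chi-square reference indicates that predicted and observed proportions agree, whereas a large one indicates poor fit, which can occur even in a model with a high area under the curve.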
One of the limitations of our study was the small number of dysplastic cases, which led to wider confidence intervals for the dysplasia model than for the ESCC models. On the other hand, although the ESCC model had more cases, the prediction of ESCC is not the main purpose of risk screening. The ESCC cases in this series had advanced symptomatic disease, and by the time ESCC has reached this stage, it is usually too late for any intervention.
Unlike high-risk areas in China, where about 30% of the general population have squamous dysplasia [8], in our sample this rate was only about 4%. Disease prevalence determines posttest probabilities in different populations [21], so even with the best estimates of accuracy, given this low prevalence, the predictive values will be so low that the use of a risk factor model for individual risk stratification is not advisable. With a 62% sensitivity and a 70% specificity, about 93% of positive results will be false positives, and the positive predictive value (PPV) for dysplasia will be only around 7%. This implies that, on average, out of 100 endoscopies performed in individuals classified as high-risk by the model, only 7 will show dysplasia, compared with 4 in a randomly selected sample of the general population. On the other hand, if the prevalence of dysplasia were similar to that in China, we would have a positive predictive value of 42.9%, which would make the model more suitable as an initial selection step for endoscopic screening. The reasons for such a low prevalence of dysplasia, despite the high incidence of ESCC, remain to be determined and may range from differences in ESCC pathogenesis between the two populations (e.g. faster progression from dysplasia to ESCC) to technical differences in the screening methods used.
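The dependence of predictive values on prevalence follows directly from Bayes' theorem, and can be checked with a short sketch. The function name is illustrative, and the inputs are the rounded figures quoted above (62% sensitivity, 70% specificity, prevalences of about 4% and about 30%). With these rounded inputs the formula gives a PPV of roughly 8% and 47%, slightly above the values of about 7% and 42.9% reported in the text, which presumably rest on unrounded estimates.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Rounded inputs taken from the text; the exact study estimates may differ.
for label, prevalence in [("about 4%", 0.04), ("about 30%", 0.30)]:
    ppv, npv = predictive_values(0.62, 0.70, prevalence)
    print(f"prevalence {label}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

The same sensitivity and specificity thus yield a several-fold higher PPV when the lesion is common, which is why a model of this accuracy may be useful in one population and impractical in another.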
Lack of reproducibility is a general problem with many risk models [10]. Different populations may have different risk factors for the disease (e.g. the risk factors used to build the prediction model in China were different from those used in Iran). Also, studies validating these models usually use internal validation (like the cross-validation used in our study) rather than true external validation, which requires testing the model on new data collected from another population [22]. In addition, many clinicians are reluctant to use statistical models for risk assessment and stratification [21]. These models, however, may prove useful in research studies, where a researcher is interested in selecting a group at higher potential risk of developing the precursor lesion and ultimately cancer.
The latest cancer registry data from Golestan show a very high incidence of ESCC, especially in the population above 50 years of age [23]. This high incidence, together with the poor prognosis of ESCC in this region, underlines the importance of finding alternative strategies for early detection. Sponge balloon cytology has recently been shown to be effective for detecting Barrett’s esophagus [24]. Similarly, a non-endoscopic esophageal sampling technique coupled with a biomarker is a potential alternative that could be tested against chromoendoscopy for early detection of squamous dysplasia and ESCC. The risk model could be used to increase the pretest probability for such screening strategies by selecting a particularly high-risk group.