For data set 1, using the start criterion SA and a termination threshold of 0 (i.e. no termination threshold), the algorithm classified 88.5% (n = 46) of the participants correctly. Of the misclassified participants (n = 6), half were of moderate literacy incorrectly classified as low literacy while the others were of high literacy classified as moderate.
All the misclassified cases appear to result from the same type of outlier question/response pair. For instance, one participant with a score of 45 was classified as low literacy (true category: moderate literacy). An analysis of the response vector and state entropies revealed that this can be attributed to the participant's response to one particular question, which the participant answered incorrectly, unlike all other participants of moderate literacy. Because the algorithm uses a leave-one-out approach, the estimate of P(Qi = 0 | moderate literacy) for this question is 0 once this participant is left out, so the single incorrect response rules out the moderate class and produces the misclassification.
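The effect of a zero conditional probability can be illustrated with a minimal sketch. It assumes the classifier multiplies a class prior by per-question response probabilities (a naive-Bayes-style update, which this section does not spell out). The function name `posterior` and all numbers are hypothetical and are not taken from the study data.

```python
def posterior(prior, p_correct, responses):
    """Posterior over classes given 0/1 responses (assumed naive-Bayes update).

    prior:     dict class -> P(L)
    p_correct: dict class -> list of P(Q_i = 1 | L), one entry per question
    responses: list of 0/1 answers, one per question
    """
    unnorm = {}
    for cls, p_l in prior.items():
        like = p_l
        for p, r in zip(p_correct[cls], responses):
            like *= p if r == 1 else (1.0 - p)
        unnorm[cls] = like
    total = sum(unnorm.values())
    return {cls: v / total for cls, v in unnorm.items()}

# Hypothetical calibration: every other moderate-literacy participant answered
# the third question correctly, so P(Q3 = 0 | moderate) = 0 under leave-one-out.
prior = {"low": 0.3, "moderate": 0.4, "high": 0.3}
p_correct = {
    "low":      [0.40, 0.30, 0.60],
    "moderate": [0.80, 0.70, 1.00],
    "high":     [0.95, 0.90, 1.00],
}
responses = [1, 1, 0]  # the third answer is the outlier

post = posterior(prior, p_correct, responses)
predicted = max(post, key=post.get)
print(post, predicted)
```

In this toy case the two correct answers favour the moderate and high classes, yet the single incorrect answer multiplies both of their likelihoods by zero, leaving "low" as the only class with non-zero posterior mass.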
The use of a random moderate-difficulty start question (start criterion SB) did not result in any discernible difference in the number or type of errors produced by the algorithm.
As expected, increasing the threshold decreases the number of questions to be answered and increases the error rate. Figure shows the average number of questions per participant and the error rates for threshold values from 0 to 0.2 in increments of 0.01. At the smallest threshold tried, 0.01, the average number of questions to be answered was 25 (42%), with one additional misclassification.
Entropy threshold vs. Average question count and Error rate using start criterion SA (dotted lines) and SB (solid lines).
To determine the largest possible reduction in questions without increasing errors, threshold values between 0 and 0.01 were investigated further; the results are shown in Figure . A threshold value of 0.005 appears to be the most promising, as it would halve the number of questions without causing any additional misclassifications. For all participants, the predicted class at this threshold is identical to the predicted class when no termination criterion was specified.
Entropy threshold vs. Average question count and error count, for entropy thresholds in range 0 to 0.01 with start criterion SA.
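The trade-off between threshold and question count can be sketched as a sequential update with an entropy-based stopping rule. This is an illustration under stated assumptions, not the authors' implementation: entropy is taken as Shannon entropy in bits, questions are asked in a fixed order (the actual question-selection rule is not given in this section), and the names `entropy`, `update`, `classify` and all probabilities are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a discrete class distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def update(dist, p_correct_q, answer):
    """One Bayes update of the class distribution after a single 0/1 answer."""
    unnorm = {
        cls: p * (p_correct_q[cls] if answer == 1 else 1.0 - p_correct_q[cls])
        for cls, p in dist.items()
    }
    total = sum(unnorm.values())
    return {cls: v / total for cls, v in unnorm.items()}

def classify(prior, questions, answers, threshold=0.0):
    """Ask questions in order; stop once entropy <= threshold.

    A threshold of 0 is treated as "no termination threshold".
    Returns (predicted class, number of questions asked).
    """
    dist = dict(prior)
    asked = 0
    for p_correct_q, answer in zip(questions, answers):
        dist = update(dist, p_correct_q, answer)
        asked += 1
        if threshold > 0 and entropy(dist) <= threshold:
            break
    return max(dist, key=dist.get), asked

# Hypothetical item pool: eight questions with identical response probabilities.
prior = {"low": 1 / 3, "moderate": 1 / 3, "high": 1 / 3}
questions = [{"low": 0.2, "moderate": 0.6, "high": 0.9}] * 8
answers = [1] * 8  # a participant who answers everything correctly

label_all, asked_all = classify(prior, questions, answers, threshold=0.0)
label_thr, asked_thr = classify(prior, questions, answers, threshold=0.45)
print(label_all, asked_all, label_thr, asked_thr)
```

In this toy run the thresholded version reaches the same class with fewer questions, which is the behaviour the figures describe; a threshold set too high would stop before the distribution has settled and could change the predicted class.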
Figure also shows the result of using start criterion SB in combination with different entropy thresholds. As can be seen in the figure, the reduction in questions is comparable to our earlier results. However, the method tends to misclassify a slightly larger number of participants.
Data set 2 has a larger number of subjects but a small item pool. Figure shows the changes in the average number of questions to be answered and the error rate for the two-class classification scheme. Without a threshold, 96.9% of the participants were correctly classified. The number of misclassifications remains constant as the threshold increases up to an entropy threshold of 0.17, where the average number of questions to be answered falls to 5.14 (64.25%).
Entropy threshold vs. Average question count and error count using start criterion SA, for data set 2 with two-class classification scheme.
A three-class classification can be expected to have more uncertainty than the two-class classification scheme, producing a higher error rate and a smaller reduction in question count. Figure shows that for the three-class classification scheme on data set 2, without a threshold, 93.8% of the subjects were correctly classified. Though this is lower than the corresponding value for the two-class classification scheme, it is higher than the accuracy observed for the three-class classification scheme on data set 1 (88.5%). The number of questions to be answered, 6.59 (82.3%), is also higher than that observed in the two-class classification scheme.
Entropy threshold vs. Average question count and error count using start criterion SA, for data set 2 with three-class classification scheme.
To estimate how sensitive the performance of the method is to the calibration data, we tried two calibration alternatives to the leave-one-out approach described above: (a) calibration with a random half of the population; (b) calibration using the online sample. In both cases, only subjects not used for calibration were used for testing. For example, in scheme (b), P(Qj = 1 | Li) and P(Li) were computed using the subjects recruited online (n = 100) and used to classify subjects recruited at the clinic (n = 62). Table and Table list P(Qj = 1 | Li) and P(Li) observed in the three calibration schemes.
Actual P(Qj = 1 | Li) for various subsets of data set 2: complete sample, a random half of the sample, online sample (n = 100), and clinic sample (n = 62)
Actual P(Li) of the calibration sets for data set 2 using three different calibration schemes: leave-one-out, a random half of the sample, and online sample (n = 100)
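The calibration step amounts to estimating P(Li) and P(Qj = 1 | Li) from whichever subjects are placed in the calibration set. A minimal sketch follows, assuming plain relative frequencies with no smoothing (the estimator is not specified in this section); the function `calibrate` and the toy sample are hypothetical.

```python
from collections import Counter

def calibrate(subjects):
    """Estimate P(L_i) and P(Q_j = 1 | L_i) by relative frequencies.

    subjects: list of (label, responses), where responses is a list of 0/1
              answers of equal length for every subject.
    """
    counts = Counter(label for label, _ in subjects)
    n_items = len(subjects[0][1])
    correct = {label: [0] * n_items for label in counts}
    for label, responses in subjects:
        for j, r in enumerate(responses):
            correct[label][j] += r
    prior = {label: c / len(subjects) for label, c in counts.items()}
    p_correct = {
        label: [k / counts[label] for k in correct[label]] for label in counts
    }
    return prior, p_correct

# Hypothetical calibration sample of four subjects and three items.
calibration_set = [
    ("low", [0, 0, 1]),
    ("low", [1, 0, 0]),
    ("high", [1, 1, 1]),
    ("high", [1, 0, 1]),
]
prior, p_correct = calibrate(calibration_set)
print(prior, p_correct)
```

Under scheme (a) or (b), the same function would be applied to the chosen calibration subjects only, and the resulting estimates used to classify the held-out subjects.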
Using the leave-one-out approach, the average number of questions to be answered was 6.6 at 93.8% classification accuracy (the accuracy attainable when no threshold is used). For calibration scheme (a), 6.8 questions needed to be answered and a higher accuracy of 97.5% was observed. Calibration scheme (b) resulted in a classification accuracy of 91.9%, with subjects answering 5.3 items on average. As can be seen in Table for scheme (b), the calibration sample has a very different distribution from the testing sample. For example, in the online sample, 16% of the subjects were of low numeracy and 37% of high numeracy, whereas in the clinic population 52% were of low numeracy and only 5% of high numeracy. The decrease in performance using scheme (b) can probably be attributed to this difference.
Table shows the actual and predicted distribution of the numeracy classes in the testing sets of the three calibration schemes. The misclassifications resulted from overestimating the numeracy of subjects in the low numeracy group. When the calibration set is not representative of the testing sample, as in scheme (b), these misclassifications increased.
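One plausible mechanism for the effect of a mismatched calibration sample is that the class prior carries over from the calibration set. The sketch below uses the low and high proportions reported above; the moderate proportions (47% online, 43% clinic) are derived as the remainder under the assumption that the three classes are exhaustive. The likelihood values and the function `map_class` are hypothetical, and the example illustrates the idea only, not the authors' analysis.

```python
def map_class(prior, likelihood):
    """Return the most probable class and the normalized posterior."""
    unnorm = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnorm.values())
    post = {c: v / total for c, v in unnorm.items()}
    return max(post, key=post.get), post

# Low/high proportions as reported; moderate is the assumed remainder.
online_prior = {"low": 0.16, "moderate": 0.47, "high": 0.37}
clinic_prior = {"low": 0.52, "moderate": 0.43, "high": 0.05}

# Hypothetical likelihood of one response pattern that mildly favours "low".
likelihood = {"low": 0.5, "moderate": 0.3, "high": 0.2}

label_online, post_online = map_class(online_prior, likelihood)
label_clinic, post_clinic = map_class(clinic_prior, likelihood)
print(label_online, label_clinic)
```

With the online prior, the same response pattern is assigned to the moderate class, whereas the clinic prior assigns it to the low class, which is consistent with numeracy being overestimated for low-numeracy subjects when the calibration sample contains few of them.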
P(Li) predicted for the testing sets using the three calibration schemes
As illustrated by the ROC curves in Figure , the algorithm achieves high sensitivity and specificity. The top ROC curve shows the sensitivity and false positive rate on the numeracy data set at an entropy threshold of 0 (area under the curve = 0.96); the lower ROC curve shows very little decrement in performance at an entropy threshold of 0.6 (area under the curve = 0.93).
Figure 6 Sensitivity and specificity of the algorithm for Data set 2 with two class classification scheme. The top ROC curve shows the sensitivity and false positive rate on the numeracy data set at an entropy threshold of 0 (area under the curve = 0.96).
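For readers who want to reproduce this kind of summary, the area under an ROC curve can be computed from per-subject scores and true labels. The sketch below assumes the score is the algorithm's posterior probability of the positive class, which this section does not state; the functions `roc_points` and `auc` and the toy data are hypothetical.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) points, sweeping the
    decision threshold from the highest score to the lowest."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Subjects with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under a list of ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical scores for six subjects (label 1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
print(round(area, 4))
```

Here eight of the nine positive/negative pairs are ranked correctly, so the area is 8/9; an area of 0.96 or 0.93, as reported above, corresponds to a correspondingly high proportion of correctly ordered pairs.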