We proposed the I-spline Smoothing approach for calibrating predictive models by solving a nonlinear monotone regression problem. We took advantage of I-spline properties to obtain globally optimal solutions while keeping the computational cost low. Numerical studies based on three data sets provided empirical evidence that I-spline Smoothing improves calibration (i.e., 1.6x, 1.4x, and 1.4x on the three data sets compared to the average of the competitors: Binning, Platt Scaling, Isotonic Regression, Monotone Spline Smoothing, and Smooth Isotonic Regression) without deterioration of discrimination.
Learning models focused on maximizing discrimination (i.e., the ability to separate positive cases from negative cases) often ignore calibration, which relates to the correctness of predicted values. The latter aspect is nevertheless important to medical decision-making, since clinicians may use predictive model estimates as surrogates for individualized risk scores [2, 3]. If the predictive model is not calibrated (e.g., raw outputs of a Support Vector Machine are used to represent risk), the decisions may be wrong. As molecular markers from genomics and proteomics are increasingly considered in predictive models and become available to consumers [4, 5], calibration is even more crucial to enable reliable risk assessment, diagnosis, and prognosis based on individual genomics and proteomics [6–8].
This challenge is becoming even more critical as medicine becomes more and more “personalized”, which requires predicted scores to faithfully reflect the probability of outcomes for individual patients. Unfortunately, many popular predictive models (e.g., Decision Trees and Naive Bayes classifiers) do not optimize calibration [9]. To improve the calibration of a predictive model without deteriorating its discrimination, we need to develop novel and practical approaches.
There have been a number of attempts to improve the calibration of predictive models. To understand the pros and cons of each of them, we briefly review state-of-the-art methods. The most intuitive idea is binning [10], which sorts and groups predicted scores into bins and replaces the predicted scores with the fraction of positive cases within each bin, so as to reduce the discrepancy between predictions and the unknown true probabilities. This intuitive approach, although capable of improving calibration, may decrease discrimination due to the loss of rankings within each bin. Platt suggested a rescaling model [11] that uses an additional logistic regression model to refit predictions (i.e., predictor variables) against class labels (i.e., the target variable). This method can convert arbitrary predicted scores (e.g., outputs of a Support Vector Machine model) into estimated probabilities. However, the approach is parametric, which limits its ability to calibrate predictive models. Zadrozny and Elkan proposed another approach [12] named Isotonic Regression (IR), a model that minimizes the mean squared error while respecting monotonicity constraints, i.e., keeping the order of the predictions and hence not altering the ROC curve. These authors showed that IR can achieve superior performance over some baseline methods. A problem with this approach is that the lack of smoothness might decrease its generalization performance [13].
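To make the monotonicity constraint of Isotonic Regression concrete, the following is a minimal sketch of the pool-adjacent-violators procedure that is commonly used to fit it. The function names `pool_adjacent_violators` and `isotonic_calibrate` are illustrative and not taken from any of the cited works, and tied scores are not given special treatment.

```python
def pool_adjacent_violators(labels_sorted_by_score):
    """Least-squares non-decreasing fit to labels already ordered by score."""
    blocks = []  # each block is [sum_of_labels, count]
    for label in labels_sorted_by_score:
        blocks.append([float(label), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def isotonic_calibrate(scores, labels):
    """Return calibrated values aligned with the original order of `scores`."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    fitted_sorted = pool_adjacent_violators([labels[k] for k in order])
    calibrated = [0.0] * len(scores)
    for rank, k in enumerate(order):
        calibrated[k] = fitted_sorted[rank]
    return calibrated
```

The fitted values form a step function of the score, which illustrates the lack of smoothness mentioned above.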
Yet another approach, a monotone spline smoothing technique proposed by Wang and Li [14], offers both smoothness and nonparametricity. The method is theoretically sound, but it is complicated to implement in practice (high dimension of the problem due to the large number of spline knots, complicated constraints for the monotonicity of the estimate, etc.), and, as acknowledged by the authors, more efficient ways of choosing the penalty term parameters are necessary. A more recent attempt by Jiang et al. [13] uses a two-step approach to obtain a smoothed isotonic regression: (1) fitting an isotonic regression model to obtain the knot points; (2) using these knot points to refit a Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) model. However, this heuristic approach has no theoretical guarantee of optimality. The following table summarizes the characteristics of a number of popular calibration approaches. The last row of the table summarizes the properties of I-spline Smoothing, a new calibration method introduced in this paper.
All of the above-mentioned techniques [11–14], except for binning, aim at solving a monotonic regression problem

$$\min_{f} \sum_{k=1}^{n} \left( c_k - f(p_k) \right)^2 \quad \text{subject to } f \text{ being monotonically non-decreasing},$$

where p_{k} is the precalibrated estimate from the model, and c_{k} is the observed binary outcome. Therefore, another way to understand the differences and challenges among the various methods is to look at their specific assumptions about the function f(·). For example, Platt assumes f(·) to be an inverse logit function [11], Isotonic Regression assumes a free-form f(·) that only has values at p_{k}, and Monotone Spline Smoothing assumes f(·) to be a natural cubic spline [14]. Due to these assumptions, each method has its own challenge, as we discussed earlier. This paper introduces a new approach in which we assume the function f(·) to be a member of the cubic spline family with an I-spline basis. Thanks to the compelling properties of I-splines, the computation of I-spline Smoothing is much easier than optimizing natural cubic splines using Monotone Spline Smoothing [14]. The following figure illustrates the adjusted estimates of probabilities using five different calibration approaches and the predictions of a logistic regression (LR) model on a linearly separable data set.
I-splines are monotone splines whose most obvious applications are in monotone nonlinear regression problems, as discussed by Ramsay [15]. Though I-splines are attractive for their simple expression and theoretical properties, very few articles have described real applications of I-spline techniques. Lu et al. [16] and Wu [17] used I-splines in solving Maximum Likelihood Estimation (MLE) problems. In this paper, we identify a new application of I-splines: solving calibration problems. Let us start with an introduction to the basic concepts of I-splines. The l-th order I-splines based on a knot sequence are defined by Ramsay [15] as
with L ≤ s ≤ U, where L and U are the left and the right end knots of the knot sequence, respectively. The number of I-splines is q = n^{(knot)} + l + 1, where n^{(knot)} is the number of interior knots (i.e., the knots that are not end knots in the knot sequence). Note that i corresponds to the index of the I-splines. Wu [17] identified an interesting relationship: each I-spline is related to B-splines [18], and therefore B-splines can be used to construct I-splines.
It is easy to compute I-splines using formula (2), since B-splines can be efficiently computed and are already available in statistical packages. De Boor [19] showed that the B-splines sum to one and that each B-spline is nonnegative, which implies that the I-splines in (1a) and (1b) are monotone and have function values between 0 and 1. The nonparametricity, the monotonicity, and the restriction of the range to [0, 1] together make I-splines good candidates for modeling distribution functions.
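As a small illustration of these properties, the sketch below builds monotone basis functions as tail sums of B-splines, a standard identity for integrated splines. It is only a sketch under stated assumptions: the function names, the clamped knot vector, and the indexing are illustrative, and the number of basis functions it returns may not match the paper's convention for q.

```python
def bspline(j, degree, knots, x):
    """Cox-de Boor recursion for the j-th B-spline of the given degree."""
    if degree == 0:
        if knots[j] <= x < knots[j + 1]:
            return 1.0
        # Close the last non-empty interval so the basis is defined at x = U.
        if x == knots[-1] and knots[j] < knots[j + 1] == knots[-1]:
            return 1.0
        return 0.0
    value = 0.0
    left_width = knots[j + degree] - knots[j]
    if left_width > 0:
        value += (x - knots[j]) / left_width * bspline(j, degree - 1, knots, x)
    right_width = knots[j + degree + 1] - knots[j + 1]
    if right_width > 0:
        value += ((knots[j + degree + 1] - x) / right_width
                  * bspline(j + 1, degree - 1, knots, x))
    return value


def ispline_basis(x, interior_knots, lower=0.0, upper=1.0, degree=3):
    """Evaluate monotone basis functions at x as tail sums of B-splines."""
    # Clamped knot vector: end knots repeated (degree + 1) times (an assumption).
    knots = ([lower] * (degree + 1) + sorted(interior_knots)
             + [upper] * (degree + 1))
    n_bsplines = len(knots) - degree - 1
    b = [bspline(j, degree, knots, x) for j in range(n_bsplines)]
    # Each tail sum rises from 0 at `lower` to 1 at `upper`.
    return [sum(b[i + 1:]) for i in range(n_bsplines - 1)]
```

Because the B-splines are nonnegative and sum to one, every tail sum stays within [0, 1], which is the property the calibration method relies on.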
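Given precalibrated prediction probabilities P = {p_{1}, …, p_{n}} and class labels C = {c_{1}, …, c_{n}}, we now show how to use I-spline-based smoothing techniques to calibrate predictive models by solving a nonlinear monotone least-squares regression problem.
Define

$$\Omega = \left\{ f : f(s) = \sum_{i=1}^{q} \alpha_i I_i^{l}(s),\ \alpha_i \ge 0 \text{ for } i = 1, \dots, q,\ \sum_{i=1}^{q} \alpha_i \le 1 \right\}$$

as the space of I-spline functions. The monotone least-squares regression finds f^{*} ∈ Ω that minimizes:

$$\sum_{k=1}^{n} \left( c_k - f(p_k) \right)^2. \tag{3}$$

As mentioned before, I-splines are monotone and their values are between 0 and 1, so the constraints on α_{i} in Ω guarantee that each f ∈ Ω is monotone with function values lying between 0 and 1. Given a knot sequence, the I-splines are fixed; hence this monotone regression problem amounts to minimizing (3) with respect to the I-spline coefficients under the constraints α_{i} ≥ 0 for i = 1, …, q and Σ_{i=1}^{q} α_{i} ≤ 1. We can rewrite this problem as a maximization problem:

$$\max_{\alpha}\ -\sum_{k=1}^{n} \left( c_k - \sum_{i=1}^{q} \alpha_i I_i^{l}(p_k) \right)^2 \quad \text{subject to } \alpha_i \ge 0 \text{ for } i = 1, \dots, q, \text{ and } \sum_{i=1}^{q} \alpha_i \le 1. \tag{4}$$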
The next question is how to pick interior knots, which is always critical to any spline-based technique. Intuitively, there are two general rules:
The second rule is easy to follow: once the number of interior knots is decided, we can position them according to the sample percentiles. The number of interior knots itself, however, has to be decided on a case-by-case basis. Ramsay [15] mentioned that very few interior knots are necessary, say, 1 or 2, for I-spline-based regression problems. However, both Lu et al. [16] and Wu [17] chose the cube root of the sample size as the number of interior knots for MLE-based spline estimation, and their experiments supported this choice. Given our sample size n, we used max{1, (n^{1/3} − 4)} as the number of interior knots, which worked best for the estimation proposed in this paper when cubic splines are applied.
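A minimal sketch of this knot selection follows. Two choices are assumptions made for illustration: the cube root is rounded to the nearest integer (the text does not state how non-integer values are handled), and the knots are placed at equally spaced sample quantiles via the standard-library `statistics.quantiles`. The function names are hypothetical.

```python
import statistics


def number_of_interior_knots(n):
    """max{1, n^(1/3) - 4}, with the cube root rounded (an assumption)."""
    return max(1, round(n ** (1.0 / 3.0)) - 4)


def interior_knots(scores, m):
    """Place m interior knots at equally spaced sample quantiles of the scores."""
    # quantiles(..., n=m + 1) returns the m cut points between m + 1 groups.
    return statistics.quantiles(scores, n=m + 1)
```

For example, `interior_knots(scores, number_of_interior_knots(len(scores)))` yields a knot sequence that follows the distribution of the precalibrated scores.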
The maximization problem (4) can be solved by a generalized gradient projection algorithm [1]. First we rewrite the constraints in (4) as Xα ≤ y, where X = (x_{1}, x_{2}, …, x_{q+1})^{T} with x_{1} = (−1, 0, …, 0)^{T}, x_{2} = (0, −1, 0, …, 0)^{T}, …, x_{q} = (0, …, 0, −1)^{T}, x_{q+1} = (1, …, 1)^{T}; α = (α_{1}, …, α_{q})^{T}; and y = (0, …, 0, 1)^{T}. If some I-spline coefficients equal 0, or all coefficients sum to 1, we say that the corresponding constraints are active and let X̄α = ȳ represent all active constraints, where the rows of X̄ and ȳ are a subset of the rows of X and y. An index vector Λ is used to facilitate the computation.
Initially we put integers representing the active constraints in the vector Λ (the indexes of the I-spline coefficients that equal 0, and (q + 1) when all coefficients sum to 1). The vector Λ with r scalars corresponds to an r × q matrix X̄. For example, if Λ = (2, 1, (q + 1)), then X̄ = (x_{2}, x_{1}, x_{q+1})^{T}.
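The bookkeeping for the constraints and the index vector Λ can be sketched as follows. The function names and the numerical tolerance are assumptions introduced for illustration; indices are 1-based to match the text.

```python
def constraint_system(q):
    """Build X and y so that X @ alpha <= y encodes alpha_i >= 0 and sum(alpha) <= 1."""
    X = []
    for i in range(q):
        row = [0.0] * q
        row[i] = -1.0  # -alpha_i <= 0
        X.append(row)
    X.append([1.0] * q)  # alpha_1 + ... + alpha_q <= 1
    y = [0.0] * q + [1.0]
    return X, y


def active_indices(alpha, tol=1e-12):
    """1-based indices of active constraints (the vector Lambda in the text)."""
    q = len(alpha)
    active = [i + 1 for i, a in enumerate(alpha) if abs(a) <= tol]
    if abs(sum(alpha) - 1.0) <= tol:
        active.append(q + 1)
    return active


def active_system(X, y, active):
    """Select the rows of X and y listed in `active`, in that order."""
    return [X[i - 1] for i in active], [y[i - 1] for i in active]
```

With q = 3 and Λ = (2, 1, 4), `active_system` returns the rows x_{2}, x_{1}, x_{4}, mirroring the example in the text.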
We denote the target function in (4) as F(α). Let ∇F(α) and H(α) be the gradient and the Hessian matrix of F(α) with respect to α, respectively. Let W = −H(α) + γI, where I is the identity matrix and γ is set large enough to make W positive definite. With that introduced, the generalized gradient projection algorithm is implemented as Algorithm 1.
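For readers who want something executable, the constrained least-squares problem can also be approached with a plain projected gradient method. This is a deliberately simpler, swapped-in technique and not the generalized gradient projection algorithm of this paper. The sketch assumes a precomputed design matrix `design` whose entry in row k and column i is the value of the i-th basis function at p_{k}; all names are hypothetical.

```python
def project_onto_constraints(alpha):
    """Euclidean projection onto {alpha_i >= 0, sum(alpha) <= 1}."""
    clipped = [max(a, 0.0) for a in alpha]
    if sum(clipped) <= 1.0:
        return clipped
    # Otherwise the projection lies on the simplex {alpha_i >= 0, sum(alpha) = 1}.
    ordered = sorted(alpha, reverse=True)
    running, theta = 0.0, 0.0
    for j, value in enumerate(ordered, start=1):
        running += value
        candidate = (running - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(a - theta, 0.0) for a in alpha]


def squared_error(design, labels, alpha):
    """Sum of squared residuals for coefficients alpha."""
    return sum(
        (c - sum(a * b for a, b in zip(alpha, row))) ** 2
        for row, c in zip(design, labels)
    )


def fit_coefficients(design, labels, iterations=2000):
    """Projected gradient descent on the squared error (design must be non-zero)."""
    q = len(design[0])
    alpha = [1.0 / q] * q  # feasible starting point
    # Conservative step: 1 / (2 * squared Frobenius norm) bounds the curvature.
    step = 1.0 / (2.0 * sum(b * b for row in design for b in row))
    for _ in range(iterations):
        residuals = [
            c - sum(a * b for a, b in zip(alpha, row))
            for row, c in zip(design, labels)
        ]
        gradient = [
            -2.0 * sum(r * row[i] for r, row in zip(residuals, design))
            for i in range(q)
        ]
        alpha = project_onto_constraints(
            [a - step * g for a, g in zip(alpha, gradient)]
        )
    return alpha
```

Because the objective is a convex quadratic and the feasible set is convex, this simple scheme approaches the same global optimum, though typically more slowly than a second-order method that uses the Hessian.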
To compare different calibration methods, we used two indices, the Area Under the ROC Curve (AUC) [20] and the decile-based Hosmer-Lemeshow goodness-of-fit test (HL-test) [21], to assess a model’s discrimination and calibration, respectively. We compared the original logistic regression (LR) model and four calibration approaches: Platt Scaling (PS), Isotonic Regression (IR), Smooth Isotonic Regression (SIR), and I-spline Smoothing (IS). Because I-spline Smoothing is strictly monotonic, we expect that it would not decrease the AUC of an input model. The experiment is designed to verify this expectation and to evaluate whether IS has the potential to improve calibration.
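As a reminder of what the decile-based HL-test measures, the following sketch computes the usual Hosmer-Lemeshow statistic from predicted probabilities and binary outcomes. The function names are illustrative, and comparing the statistic against the chi-square critical value with 8 degrees of freedom (15.507 at the 0.05 level) is an assumption about the test setup, not a detail taken from this paper.

```python
CHI2_CRITICAL_DF8_ALPHA05 = 15.507  # 0.95 quantile of chi-square with 8 df


def hosmer_lemeshow_statistic(probabilities, outcomes, groups=10):
    """Sum over groups of (observed - expected)^2 / (n_g * p_g * (1 - p_g))."""
    order = sorted(range(len(probabilities)), key=lambda k: probabilities[k])
    n = len(order)
    statistic = 0.0
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        if not members:
            continue
        size = len(members)
        observed = sum(outcomes[k] for k in members)
        expected = sum(probabilities[k] for k in members)
        mean_p = expected / size
        variance = size * mean_p * (1.0 - mean_p)
        if variance > 0:  # skip degenerate groups with mean 0 or 1
            statistic += (observed - expected) ** 2 / variance
    return statistic


def passes_hl_test(probabilities, outcomes):
    """True when the statistic stays below the assumed critical value."""
    return hosmer_lemeshow_statistic(probabilities, outcomes) < CHI2_CRITICAL_DF8_ALPHA05
```

A model "passes" when the observed and expected event counts agree closely within each decile of predicted risk.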
We used three real-world data sets to evaluate the performance of the proposed I-spline Smoothing calibration method. Table 2 summarizes the data in terms of their feature dimension, sample size, training to test set ratio, and short descriptions of the data.
Algorithm 1. Generalized gradient projection algorithm
Step 1 (Computing the feasible search direction) Compute
d = (d_{1}, d_{2}, …, d_{q}) = {I − W^{−1} X̄^{T} (X̄ W^{−1} X̄^{T})^{−1} X̄} W^{−1} ∇F(α).
Step 2 (Forcing the updated α to fulfill the constraints) Compute the step length γ. The execution guarantees that α_{i} + γd_{i} ≥ 0 for i = 1, 2, …, q, and Σ_{i=1}^{q} (α_{i} + γd_{i}) ≤ 1.
Step 3 (Updating the solution by step-halving line search [1]) Find the smallest integer ω, starting from 0, such that the step-halving acceptance condition is satisfied. Replace α by α + min{(1/2)^{ω} γ, 0.5} · d.
Step 4 (Updating Λ and X̄) If k = 0 and γ ≤ 0.5, modify Λ by adding the indexes of the I-spline coefficients that newly become 0, or adding (q + 1) when Σ_{i=1}^{q} α_{i} becomes 1, and modify X̄ accordingly.
Step 5 (Checking the stopping criterion) If ‖d‖ ≥ ε, for small ε, go to Step 1; otherwise compute λ = (X̄ W^{−1} X̄^{T})^{−1} X̄ W^{−1} ∇F(α).
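The guarantee stated in Step 2 can be met by a standard ratio test, sketched below. The formula is an assumption chosen so that the guarantee holds; it is not claimed to be the exact expression used in the algorithm, and the function name is hypothetical.

```python
def max_feasible_step(alpha, direction):
    """Largest step keeping alpha + step * direction inside the constraints."""
    step = float("inf")
    # Non-negativity: only coefficients that decrease can hit zero.
    for a, d in zip(alpha, direction):
        if d < 0:
            step = min(step, -a / d)
    # Sum constraint: only an increasing total can reach 1.
    total_direction = sum(direction)
    if total_direction > 0:
        step = min(step, (1.0 - sum(alpha)) / total_direction)
    return step
```

The function returns infinity when no constraint limits the move along the given direction, in which case the cap of 0.5 in Step 3 determines the actual step.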

Following the aforementioned ratios, we randomly divided the data into training and test sets 100 times to evaluate the performance of the methods. Figure 2 below shows the box plots of the AUCs in the first row, where each color represents a method and every subplot corresponds to one data set, as denoted in the caption. In the second row of Figure 2, we illustrate the rate of ‘passing’ the HL-test at the significance level of 0.05 for each method in all three data sets.
The figure visually demonstrates that all five models were comparable in terms of AUCs, while I-spline Smoothing stood out in terms of calibration performance. Table 3 lists the actual values of these comparisons. For the GSE2034 data, I-spline Smoothing ranked second in calibration, with an HL-test passing rate of 43%, compared to LR (28%), PS (17%), IR (44%), and SIR (16%). The AUCs of I-spline Smoothing were not significantly smaller than those of any of the other methods. Note that we used a one-tailed paired t-test to compare different AUCs.
The results on the Edin (MI) data showed similar patterns for calibration. I-spline Smoothing had the third highest rate of passing the HL-test (29%), which is close to SIR (30%) and IR (30%), and better than LR (14%) and PS (9%). Regarding discrimination, the AUCs of I-spline Smoothing were not significantly smaller than those of any other model. Finally, in the experiments using the PIMATR data, I-spline Smoothing outperformed all the other methods in calibration with an HL-test passing rate of 73%, followed by PS (66%), LR (57%), IR (37%), and SIR (32%). The AUCs of I-spline Smoothing were not smaller than those of any of the other models.
In this paper, we introduced a novel method called I-spline Smoothing (IS) for calibrating predictive models as an alternative to existing approaches. The advantages of IS lie in the following aspects: (1) IS is a nonparametric monotonic transformation that provides more flexibility in calibrating predictive models than parametric approaches like Platt Scaling. (2) IS is globally optimized, as opposed to Smooth Isotonic Regression, which is a heuristic approach. (3) IS is easy to implement compared to Monotone Spline Smoothing. The results using three real-world data sets showed empirical advantages of IS in both discrimination and calibration. In these experiments, IS demonstrated superior calibration without significant deterioration of discrimination. Although these experiments were conducted at a small scale, they suggest that future research on IS is warranted and may improve calibration.
A limitation of this technique is that we need to set the number of interior knots heuristically. Even though it worked well in our experiments, a theoretical result describing systematic ways to choose the number of interior knots is needed.
The authors were funded in part by the National Library of Medicine (R01LM009520) and NHLBI (U54 HL10846). We thank Dr. Fraser and Dr. El-Kareh for making the data sets available for this study.