Demand forecasting is the area of predictive analytics devoted to predicting future volumes of services or consumables. Fair understanding and estimation of how demand will vary facilitates the optimal utilization of resources. In a medical laboratory, accurate forecasting of future demand, that is, test volumes, can increase efficiency and facilitate long-term laboratory planning. Importantly, in an era of utilization management initiatives, accurately predicted volumes compared to the realized test volumes can form a precise way to evaluate utilization management initiatives. Laboratory test volumes are often highly amenable to forecasting by time-series models; however, the statistical software needed to do this is generally either expensive or highly technical.
In this paper, we describe an open-source web-based software tool for time-series forecasting and explain how to use it as a demand forecasting tool in clinical laboratories to estimate test volumes.
This tool has three different models, that is, Holt-Winters multiplicative, Holt-Winters additive, and simple linear regression. Moreover, these models are ranked and the best one is highlighted.
This tool will allow anyone with historic test volume data to model future demand.
Forecasting the future demand for medical services is a key component of health-care planning. This becomes increasingly important in laboratory medicine where unsustainable increases in service requests have occurred in recent years.[1,2,3,4,5] Annual increases in test volumes are the norm in clinical laboratories. However, medical utilization data also often exhibits a strong element of periodicity, meaning that volumes exhibit a repeating temporal pattern, with the baseline tending to increase on a yearly basis. The association of these patterns is a crucial element in predicting future volumes because the traditional method for assessing trends and predicting future volumes (i.e., linear regression) is sensitive only to the baseline change and cannot be used to model short-term variations in volumes.
Time-series forecasting methods have been applied heavily in many fields, for example, economics, bio-medical, meteorology, and electricity consumption.[7,8] Time-series methods are used to analyze historical data and estimate the future values. They have become an essential tool in the modern industrial environment for making decisions.
Time-series methods can be classified as parametric and nonparametric. The parametric approach emphasizes representing the time-series using a statistical model. Modeling a time-series using a statistical approach, for example, Holt-Winters, requires the validation of the model assumptions that describe the structural statistical norms of the process generating the time-series, that is, the residual error is random and normally distributed. If the data can comply with the model assumptions, then the model under investigation can be used to detect future values of the data. If the assumptions cannot be validated then nonparametric time-series analysis models, for example, neural networks, can be used to represent the data and predict the future values. A comprehensive classification of various time-series forecasting methods is available.
Figure 1 illustrates a flow diagram to model a given time-series using the tool described in this paper. The starting point is to understand the underlying characteristics of the time-series under investigation. The time-series characteristics indicate the appropriate selection from among the candidate models. The characteristics may include: (1) the time-series trend, for example, linear, multiplicative, or additive, (2) the seasonality index that describes if the value is above or below the time-series trend, (3) the periodicity of the time-series that describes if a pattern in the data has a specific frequency. These characteristics may indicate candidate parametric or nonparametric models to fit the data. Each model is trained using part of the data and the model's performance parameters are calculated and then the best model is selected and used to forecast the future values of the time-series. If the predicted values are within the 95% prediction interval (PI), the selected model can be used to forecast the future values of the time-series, otherwise, the new recorded actual values are appended to the raw data of the time-series and the whole process restarted. In forecasting, any percentage may be used as a PI, however, it is common to calculate 80% and 95% PI to check for wide ranges of variation around the predicted values.
In this paper, we present a new web-based open-source software based on the R statistical package which is designed to (1) provide user-friendly clinical laboratory volume forecasting, (2) compare different models head-to-head and select the one that best fits the users’ data, and (3) provide downloadable predicted test volume data for the time span chosen by the user. It is intended that this publication serves as the citable reference to this software in the published literature.
In this section, we describe the models that we use to develop the forecasting tool, the data characterizations that should lead to selection of a certain model, and the selection/ranking criteria of the models.
The Holt-Winters forecasting model is a triple exponential smoothing model. An exponential smoothing model is a forecasting model that bases its predicted values on the history of the time-series data. Exponential smoothing models assume that the historical and predicted data of the time-series are relatively continuous and have common repeated patterns, and thus, exponential smoothing models are well-matched to short-term predictions. The exponential smoothing models employ smoothing parameters to base the future values on the past ones. Different values of the smoothing parameters give a different exponentially decreasing emphasis to the recent values compared to the more distant values in the time-series data.
The Holt-Winters models for time-series analysis have three data components: level, trend, and seasonality. The goal of the exponential smoothing model is to estimate the values of the level, trend, and seasonal pattern. These values are then used to construct the Holt-Winters models for predicting future values. The time-series components vary over time and may have different values at the beginning and end of the time-series. In addition, there is a random noise component that is completely independent of the time-series components.
An exponential smoothing model for a time-series with high variation and low noise requires high values for the smoothing parameters. This places more emphasis on the most recent values, which represent the future values more accurately than older values do. However, an exponential smoothing model for a noisy time-series requires more historical data to cancel out the noise and estimate the future values accurately.
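The effect of the smoothing parameter can be illustrated with a minimal Python sketch of simple (single) exponential smoothing. The function name and the sample series are illustrative assumptions and are not part of the tool described in this paper, which is implemented in R.

```python
def exponential_smoothing(values, alpha):
    """Return the smoothed series s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    smoothed = [values[0]]  # initialise with the first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical series with a level shift from 10 to 20.
series = [10, 10, 10, 20, 20, 20]
fast = exponential_smoothing(series, alpha=0.9)  # follows the shift quickly
slow = exponential_smoothing(series, alpha=0.1)  # averages over more history
```

With a high value of alpha the smoothed series reacts almost immediately to the shift, whereas a low value spreads the weight over many past observations, which is what helps to average out noise.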
There are two types of Holt-Winters models, namely, additive and multiplicative. The additive models generate constant seasonal variations that are independent of the time-series trend, whereas the multiplicative models generate seasonal patterns whose amplitude changes as the trend increases or decreases.
Holt-Winters is a statistical method of modeling, applied to time-series that exhibit a trend and seasonality, which is founded on the basis of the exponential moving average. The Holt-Winters model has three parts; an equation of the forecasting model characterizes each. The model has two types: (1) additive seasonality (i.e., linear trend) and (2) multiplicative seasonality (i.e., nonlinear trend). In the case of multiplicative models, the seasonality index increases with an increase in the level of the time-series. The additive Holt-Winters model can be used if the seasonal index does not depend on the current level of the time-series.
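The difference between the two forms of seasonality can be made concrete with a small Python sketch. The helper name, the levels, and the seasonal values below are hypothetical and chosen only for illustration.

```python
def seasonal_swing(level, seasonal, multiplicative):
    """Peak-to-trough seasonal range around a given level."""
    if multiplicative:
        values = [level * s for s in seasonal]  # seasonal factors scale the level
    else:
        values = [level + s for s in seasonal]  # seasonal offsets shift the level
    return max(values) - min(values)


# Hypothetical seasonal patterns: offsets of +/-10 units, factors of +/-10%.
offsets = [-10.0, 10.0]
factors = [0.9, 1.1]

additive_low = seasonal_swing(100.0, offsets, multiplicative=False)
additive_high = seasonal_swing(200.0, offsets, multiplicative=False)
multiplicative_low = seasonal_swing(100.0, factors, multiplicative=True)
multiplicative_high = seasonal_swing(200.0, factors, multiplicative=True)
```

The additive swing is the same at both levels, while the multiplicative swing doubles when the level doubles.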
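The following equations represent the multiplicative Holt-Winters model:
Level: $L_t = \alpha \times (Y_t / S_{t-m}) + (1 - \alpha) \times (L_{t-1} + b_{t-1})$ (1)
Trends: $b_t = \beta \times (L_t - L_{t-1}) + (1 - \beta) \times b_{t-1}$ (2)
Seasonal Index: $S_t = \gamma \times (Y_t / L_t) + (1 - \gamma) \times S_{t-m}$ (3)
Forecast: $F_{t+k} = (L_t + k \times b_t) \times S_{t+k-m}$ (4)
The following equations represent the additive Holt-Winters model:
Level: $L_t = \alpha \times (Y_t - S_{t-m}) + (1 - \alpha) \times (L_{t-1} + b_{t-1})$ (5)
Trends: $b_t = \beta \times (L_t - L_{t-1}) + (1 - \beta) \times b_{t-1}$ (6)
Seasonal Index: $S_t = \gamma \times (Y_t - L_t) + (1 - \gamma) \times S_{t-m}$ (7)
Forecast: $F_{t+k} = (L_t + k \times b_t) + S_{t+k-m}$ (8)
Where $m$ is the number of data points in the seasonal cycle, $k$ is the number of periods ahead being forecast, $t$ is the time of recording, $Y_t$ is the recorded data at time $t$, and $F_{t+k}$ is the forecast for time $t + k$. The smoothing factors are $\alpha$, $\beta$, and $\gamma$, where $0 \le \alpha \le 1$, $0 \le \beta \le 1$, and $0 \le \gamma \le 1$. In the additive model, the seasonal index represents the difference between the data and the current level at each point of the seasonal cycle; in the multiplicative model, it represents the corresponding ratio.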
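The recursions can be sketched in a few lines of Python. This is a minimal illustration and not the tool's R implementation: it uses the standard form of the level update, in which the previous level and trend are summed, and the initialisation scheme (first-cycle mean for the level, average change between the first two cycles for the trend) is an assumption made here for simplicity.

```python
def holt_winters(y, m, alpha, beta, gamma, horizon, multiplicative=False):
    """Run the Holt-Winters recursions over y and forecast `horizon` steps ahead."""
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasonal cycles")
    # Assumed initialisation: level and trend from the first two cycles.
    level = sum(y[:m]) / m
    trend = (sum(y[m:2 * m]) - sum(y[:m])) / (m * m)
    if multiplicative:
        season = [v / level for v in y[:m]]
    else:
        season = [v - level for v in y[:m]]

    for t in range(m, len(y)):
        s_old = season[t - m]      # seasonal index from one cycle earlier
        prev_level = level
        if multiplicative:
            level = alpha * (y[t] / s_old) + (1 - alpha) * (prev_level + trend)
            season.append(gamma * (y[t] / level) + (1 - gamma) * s_old)
        else:
            level = alpha * (y[t] - s_old) + (1 - alpha) * (prev_level + trend)
            season.append(gamma * (y[t] - level) + (1 - gamma) * s_old)
        trend = beta * (level - prev_level) + (1 - beta) * trend

    n = len(y)
    forecasts = []
    for k in range(1, horizon + 1):
        # Reuse the latest seasonal index for the matching position in the cycle.
        s = season[n - m + (k - 1) % m]
        base = level + k * trend
        forecasts.append(base * s if multiplicative else base + s)
    return forecasts


# Hypothetical quarterly series with a fixed seasonal pattern and no trend.
history = [10.0, 20.0, 30.0, 40.0] * 3
next_five = holt_winters(history, m=4, alpha=0.3, beta=0.1, gamma=0.2, horizon=5)
```

In practice the smoothing parameters would be chosen by minimising the fitting error rather than fixed by hand as they are in this example.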
The root mean square error (RMSE) measure is used to validate the goodness-of-fit and is calculated by the following equation:
$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (Y_t - F_t)^2}$ (9)
Where $n$ is the total number of data points, $Y_t$ is the recorded value, and $F_t$ is the fitted or forecast value at time $t$.
The RMSE measures the goodness-of-fit of the model and describes the magnitude of the error in the same units as the data, which is relatively more useful to decision makers compared to other error measures.
The coefficient of determination ($R^2$) is used to measure the relative enhancement in the forecasting of the future values of the regression model compared to the mean model (i.e., the average value of the observations). $R^2$ can have values from 0 to 1, where zero indicates the failure of the model to improve the forecasting over the mean model and one indicates perfect forecasting. $R^2$ can be calculated as:
$R^2 = 1 - \frac{\sum_{t=1}^{n} (Y_t - F_t)^2}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}$ (10)
where $\bar{Y}$ is the average value of the observations.
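Both measures are straightforward to compute. The following Python sketch uses the standard definitions of the RMSE and of the coefficient of determination relative to the mean model; the function names and the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between observed and fitted/forecast values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical monthly volumes and model output.
observed = [100.0, 110.0, 120.0, 130.0]
fitted = [102.0, 108.0, 123.0, 129.0]
error = rmse(observed, fitted)      # sqrt(18 / 4)
fit = r_squared(observed, fitted)   # 1 - 18 / 500
```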
Linear regression is a method for modeling the linear relationship between a scalar dependent variable (response variable) denoted as Y and one or more independent variables (explanatory variables) denoted as X. The case of one explanatory variable is known as simple linear regression.
The simple linear regression model assumes a linear relationship between the independent and dependent variables. The linearity assumption can be tested visually with a scatter plot of the independent variable on the X-axis against the dependent variable on the Y-axis. The simple linear regression analysis requires the independent variable to be normally distributed. If the independent variable is not normally distributed, a nonlinear transformation, e.g., log-transformation, may be used to transform it to a normally distributed variable. In addition, the residual error is assumed to be independent of the explanatory variable. Moreover, simple linear regression analysis requires that there is little or no autocorrelation in the data.
A linear regression model represents the relationship between two variables (X and Y) by fitting a line to the recorded data. The X variable is the explanatory/independent variable, and the Y variable is the predicted/dependent variable. A linear regression line can be described as:
Y = a + b × X (11)
Where a is the intercept of the line and b is the slope of the line.
The least-squares method is used to calculate the model parameters by finding the best line that can fit the recorded data by minimizing the sum of the squares of the error from each data point to the line.
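For a single explanatory variable the least-squares solution has a closed form, which the following Python sketch implements. The function name and the sample data are illustrative assumptions.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for the line Y = a + b * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through the means
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2 * X, with X as a time index.
time_index = [1.0, 2.0, 3.0, 4.0]
volumes = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(time_index, volumes)
```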
In the development of the time-series forecasting model, we train three different models (i.e., Holt-Winters multiplicative, Holt-Winters additive, and linear models). To use these models for forecasting, it is required to select the optimal model, the initial values, and the values of the parameters α, β, and γ.
Akaike information criterion (AIC)[16,17] is a method used to estimate the relative quality of a model by balancing its likelihood against the number of estimated parameters. We calculate the AIC per model and select the one that minimizes the AIC value.
Bayesian information criterion (BIC) is another method for model selection. BIC measures the trade-off between model fit and complexity. A lower AIC or BIC value indicates a better fit.
The following formulas are used to calculate the AIC and BIC of a model:
AIC = −2 × ln (L) + 2 × k (12)
BIC = −2 × ln (L) + ln (N) × k (13)
Where L is the value of the likelihood function calculated at the parameter estimates, N is the number of observations, and k is the number of estimated parameters.
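As a minimal Python sketch, both criteria can be computed from a model's residuals. Two assumptions are made here that the text does not state: the errors are taken to be Gaussian, so that the maximised log-likelihood follows from the residual variance, and the BIC uses its standard penalty of ln(N) × k. The function names are hypothetical.

```python
import math


def gaussian_log_likelihood(residuals):
    """Maximised log-likelihood assuming independent Gaussian errors."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of the variance
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)


def aic(log_lik, k):
    """Akaike information criterion for k estimated parameters."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik, k, n):
    """Bayesian information criterion (standard ln(N) * k penalty)."""
    return -2.0 * log_lik + math.log(n) * k


# Hypothetical residuals from a model with three estimated parameters.
residuals = [1.0, -1.0, 1.0, -1.0]
log_lik = gaussian_log_likelihood(residuals)
model_aic = aic(log_lik, k=3)
model_bic = bic(log_lik, k=3, n=len(residuals))
```

Because only differences between models matter, either criterion is meaningful only when the candidate models are fitted to the same observations.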
Forecasting model validation is the process of testing a model against unseen samples and recording the prediction error. The prediction error can be used as a criterion to select among different models. The validation process is a method of measuring the predictive performance of a statistical model. Model goodness-of-fit statistics, such as RMSE on the training data, are not an ultimate indicator of how well a model will predict future values, because it is easy to over-fit the training dataset to minimize the goodness-of-fit error, in which case the predictions of the model on an unseen dataset will generally get worse.
To construct a predictive model, the dataset is first divided into training and validation datasets. The training dataset is used to estimate the model parameters and decide upon the model's complexity to mitigate the effect of overfitting. The validation dataset is then used to test the model against unseen data and record the generalization error (prediction accuracy) of the predictions. The predictive accuracy of a model can be measured by RMSE on the validation dataset (testing dataset).
There are many methods for validating predictive models, among them k-fold, leave-one-out, and hold-out validation.[18,19,20] These methods assume that the observations in the input dataset are independent of each other. However, the observations in a time-series are not, and thus, the validation process becomes more difficult, as leaving out random observations does not remove all the associated information because of the time-dependency between observations.
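One way to respect the time-dependency is a hold-out split that keeps the observations in order, training on the earliest part and validating on the most recent part. The Python sketch below illustrates such a split and ranks candidate models by validation RMSE. The two forecasters are deliberately trivial stand-ins and the series is hypothetical; they are not the models or data used by the tool.

```python
import math


def time_ordered_split(series, n_train):
    """Split a series into an early training part and a later validation part."""
    if not 0 < n_train < len(series):
        raise ValueError("n_train must leave data on both sides of the split")
    return series[:n_train], series[n_train:]


def validation_rmse(forecaster, train, valid):
    """RMSE of a forecaster's predictions over the validation period."""
    predicted = forecaster(train, len(valid))
    squared = sum((a - p) ** 2 for a, p in zip(valid, predicted))
    return math.sqrt(squared / len(valid))


def rank_models(models, train, valid):
    """Return (name, RMSE) pairs sorted from best to worst."""
    scores = {name: validation_rmse(f, train, valid) for name, f in models.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Stand-in forecasters, used only to make the example runnable.
def mean_forecast(train, horizon):
    return [sum(train) / len(train)] * horizon


def last_value_forecast(train, horizon):
    return [train[-1]] * horizon


series = [float(v) for v in range(40)]          # hypothetical rising series
train, valid = time_ordered_split(series, 24)   # 24 training, 16 validation
ranking = rank_models(
    {"mean": mean_forecast, "last value": last_value_forecast}, train, valid
)
```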
In this paper, the time-series forecasting models are trained and validated as follows:
In this paper, we used three different datasets to illustrate the usage of the forecasting software tool with real-life use cases (see the Result and Discussion section for model training and testing results per dataset).
A dataset of the monthly test volumes of all clinical tests was recorded for the period of April 2011–March 2015 from all medical facilities located in the Province of Alberta, Canada. This dataset was collected by the Alberta Health Services Laboratory Utilization Office in Alberta, Canada. The dataset consists of 40 observations; the first 24 observations are used for training and the remaining 16 observations are used for validation. This dataset can be downloaded from the software (see section using the software). Many parameters influence clinical laboratory test orders, among them patient severity, patient assurance, and number of patient visits, and these should be used to normalize the clinical laboratory test volumes. However, it was not possible to collect these parameters within the scope of this paper because of concerns for patient privacy.
This dataset represents a monthly precipitation time-series (January 1887–December 1950) with a high level of noise. This dataset has 768 observations and is divided into a training dataset (first 718 observations) and a validation dataset (last 50 observations).
This dataset represents the number of international passengers per month on an airline in the United States and was obtained from the Federal Aviation Administration for the period 1946–1960. This dataset has an exponentially rising trend. This dataset has 135 observations and is divided into a training dataset (first 85 observations) and a validation dataset (last 50 observations).
The forecasting software tool is implemented using the R statistical packages and the Web interface is built using the Shiny R package. In the following section the layout of the Web interface, functionalities, and the tool usage are described.
The forecasting tool software is freely available from the authors. The software can be accessed online through the following link: https://github.com/ClinicalLaboratory/Clinical-Laboratory
Figure 2 shows a linear regression model fitted to the monthly clinical laboratory test volumes for the period of April 2011–March 2015 from all medical facilities located in the Province of Alberta, Canada. The vertical dotted line represents the starting point to forecast future values. Figure 3 shows a Holt-Winters multiplicative model fitted to the same data; here, the fitted values are closer to the actual values than those illustrated in Figure 2. Moreover, the predicted values in Figure 2 have a wider 95% PI than the predicted values shown in Figure 3. For predictions at the level of monthly test volumes, linear regression is therefore inadequate, whereas the Holt-Winters multiplicative model can provide more accurate results. This is because the value fitted or predicted by the linear regression model at time t is completely independent of the value fitted or predicted at time t − 1, whereas the Holt-Winters models, that is, multiplicative and additive, provide this dependency through the smoothing parameters α, β, and γ. Moreover, the independent variable (X) in the linear regression model is represented as a numerical time index, which does not reflect the seasonality that exists in the dependent variable (Y). If the X-axis is restructured to reflect the seasonality index, for example, using repeated categorical values such as the name of the month instead of the numerical time index, the model can capture only the seasonal variation and misses the year-over-year trend in the data. By contrast, the Holt-Winters models include separate representations for the level, trend, and seasonality of the data, which makes them better suited to represent clinical laboratory test volume data.
Time-series analysis has been employed by a number of authors to model epidemiology,[23,24,25,26,27,28] physiology,[29,30,31] and resource utilization. Although its usage in modeling laboratory test volumes was suggested over 35 years ago, it is rarely used for clinical laboratory test volume prediction.
Indeed, the choice of the best statistical model to use in a given solution is often difficult. In addition to linear regression, there are several variations of time-series model from which to choose.[16,34] Moreover, the use of these models generally involves advanced programming knowledge for the open-source versions or the purchase of proprietary software packages.
This software is primarily designed to be used in medical laboratory settings to estimate clinical test volumes. However, in this section the results of applying the forecasting models are illustrated using 3 different datasets representing different data characteristics case studies.
The models used in the forecasting software tool (Holt-Winters multiplicative, Holt-Winters additive, and linear regression) are trained with the three different datasets (see data preparation section for details). Figures 3–5 show the fitting and prediction results of the best model per dataset. The predictions are then used to compute the RMSE per model per dataset and the results are illustrated in Table 1.
Table 1 shows the RMSE per model per dataset. The clinical laboratory dataset is best modeled by the Holt-Winters multiplicative model as the dataset shows a multiplicative trend. The same is true of the airline passenger dataset, which shows a multiplicative and exponentially rising trend.
The Holt-Winters models best fit a dataset that has a continuity property, as illustrated by the clinical laboratory test volumes and airline passenger datasets. This continuity is not present in the precipitation dataset: the sudden changes in the trend and seasonal patterns cannot be captured correctly by the Holt-Winters models, and the linear regression model fits the data best in this case, with the minimum RMSE.
In this section the software architecture is explained. It is designed in a multi-tier architecture and is comprised of two tiers. These tiers are illustrated in Figure 6 and are explained in the following:
The client tier interacts with the users to obtain the prediction results. Since the software application conforms to a two layered services application it hosts the presentation layer components, that is, web interface/browser. For the forecasting web application, the client tier comprises the user workstations/computers, and other devices that host a web browser, e.g., tablets. The data are stored on the local file system of the client tier.
The servers used in the application tier are responsible for hosting all the application's libraries and the Web servers are provided by the RShiny server. In this case a user does not have to install the RStudio or any forecasting packages, e.g., the CARET package. Moreover, the RShiny server is responsible for instantiating the application per user and running the user commands.
Separating the client computer from the application logic supports the development and distribution of thin-client applications that require minimum software at the client tier, for example, a web browser.
The initial version of the forecasting tool was deployed on the RShiny server; however, owing to data privacy concerns, we chose to upload the code to a GitHub repository as described below.
R and RStudio must be installed on the user machine before the tool can be used. The next step is to download the project files from the following GitHub repository: https://github.com/ClinicalLaboratory/Clinical-Laboratory.
After downloading all the files, open the “ClinicalLaboratory.Rproj” file, press the “Run App” button in the RStudio interface, and finally click on “Open in Browser” to use all the functionalities of the tool as described below.
The start-up screen illustrated in Figure 7 shows the following areas:
Simple models are easier to build, implement, interpret, and update. Increasing model complexity leads to complex implementation and interpretation. In most cases, the ability to understand the model and its parameters is preferred over a complex model that may be harder to interpret. Linearity and continuity are common assumptions for time-series modeling and are considered weak assumptions. Coupling weak assumptions with complex algorithms is less efficient than using more data with simpler algorithms. This is because a training dataset is a subset of the relevant data and, with more data, the estimates of the future values can be more accurate under the weak assumptions. With much more data, the sample variation accurately represents the underlying population and the future estimates tend to be more accurate.
Readymade algorithms are used as a “black box” that is impossible to understand or modify, which leads to a very complex training phase and model validation that may not be user-friendly for many users. R and RStudio provide a programming environment to design and implement different time-series prediction algorithms. However, this requires trained personnel to design, implement, and validate the models, select the best one, and interpret the model parameters. The open-source software described in this paper provides a user-friendly interface and makes it easier to load a time-series dataset, build three different models to predict the future values of the time-series data, and choose the best model.
In this paper, we present a new open-source program for future demands prediction based on a comparison of linear regression and two forms of time-series analysis, that is, Holt-Winters multiplicative and additive models. This software fills an important gap in the available open-source software and greatly simplifies the process of demand forecasting. Although the software was developed with the clinical laboratory in mind, the software could be equally useful in other areas of medicine or business.
In clinical laboratories the authors foresee two main applications. First the tool can be used to predict future test volumes for the purpose of reagent, staffing, and analyzer needs. This may help to reduce waste, staff overtime, and testing delays due to inadequate resources.
A second and more innovative use involves the evaluation of utilization management initiatives. Measures designed to promote the cost-effective use of medical laboratory tests are widespread in regions of Europe and North America.[2,4] These “utilization management” initiatives often result in changes in overall test volumes in the range of 5%–10%. However, as seen in Figure 2, actual observed test volumes may vary by up to 20% from month to month, potentially completely masking any effect of a utilization management initiative. The use of the new demand forecasting tool can detect utilization management effects as small as 1%–2% in some instances. To do this, the user would need at least 24 months of historical data to establish the pattern of predicted future volumes. Forecasting is simplified if the planned intervention begins on the first of a month. The period of the historic forecasting would then include the month immediately prior to the start of the intervention and the predicted demand would begin on the 1st day of the intervention. As the software generates 95% PI, it is a simple matter to compare the observed intervention volumes with the predicted volumes. If the observed volumes fall outside of the 95% PI, it could be concluded that the intervention had a significant effect. The percentage change attributable to the intervention could then be determined by comparing the observed and predicted values. This method may detect intervention effects as small as a few percentage points as soon as 1 month after the start of a utilization management intervention.
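The comparison described above can be sketched in Python as follows. The function name and all numbers are hypothetical; in practice the predicted values and the 95% PI bounds would come from the forecast produced by the tool.

```python
def evaluate_intervention(observed, predicted, lower, upper):
    """For each month, flag whether the observed volume falls outside the
    prediction interval and report the percentage change from the forecast."""
    results = []
    for obs, pred, lo, hi in zip(observed, predicted, lower, upper):
        outside = obs < lo or obs > hi
        pct_change = 100.0 * (obs - pred) / pred
        results.append((outside, pct_change))
    return results


# Hypothetical volumes for the first two months after an intervention.
observed = [950.0, 1010.0]
predicted = [1000.0, 1000.0]
lower_95 = [970.0, 970.0]
upper_95 = [1030.0, 1030.0]

report = evaluate_intervention(observed, predicted, lower_95, upper_95)
# -> [(True, -5.0), (False, 1.0)]
```

In this made-up example only the first month lies outside the interval, so only that month would be read as evidence of an intervention effect, here a 5% reduction relative to the forecast.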
The forecasting software tool has the following advantages compared to the popular tool WEKA:
We examine the software tool using two other use-cases of real-life data and show how to validate the models' performance.
The time-series methods described in this article are of the parametric type. The model assumptions must be verified for a model to be considered valid. Another limitation of these models is their sensitivity to outliers, which may cause significant errors in the predicted values. The parameters of the Holt-Winters models required by the forecasting tool, for example, “The cycle time of the data,” must be entered manually. This requires the user to be aware of the characteristics of the time-series data.
The future enhancement of this tool is to fully automate the data characterization process, i.e. the software should be able to identify the periodicity and handle the outliers.
There are no conflicts of interest.
Available FREE in open access from: http://www.jpathinformatics.org/text.asp?2017/8/1/7/201109