A new approach for developing multimodel streamflow forecasts is presented. The methodology combines streamflow forecasts from individual models by evaluating their skill, represented by the rank probability score (RPS), contingent on the predictor state. Using the average RPS estimated over the chosen neighbors in the predictor state space, the methodology assigns higher weights to a model that has better predictability under similar predictor conditions. We assess the performance of the proposed algorithm by developing multimodel streamflow forecasts for Falls Lake Reservoir in the Neuse River Basin, North Carolina (NC), by combining streamflow forecasts developed from two low-dimensional statistical models that use sea-surface temperature conditions as underlying predictors. To evaluate the proposed scheme thoroughly, we consider a total of seven multimodels that include existing multimodel combination techniques, such as combining based on the long-term predictability of individual models and simple pooling of ensembles. Detailed nonparametric hypothesis tests comparing the performance of the seven multimodels with the two individual models show that the reduction in RPS achieved by multimodel forecasts developed using the proposed algorithm is statistically significant relative to the RPSs of the individual models and to the RPSs of existing multimodel techniques. The study also shows that adding climatological ensembles improves multimodel performance, resulting in a reduced average RPS. Contingency analyses on categorical (tercile) forecasts show that the proposed multimodel combination technique reduces the average Brier score and the total number of false alarms, resulting in improved reliability of the forecasts.
However, adding multiple models with climatology also increases the number of missed targets (in comparison to the individual models' forecasts), which primarily results from a loss of the sharper resolution that the individual models' forecasts exhibit across the range of forecast probabilities.
Seasonal to interannual (long-lead) streamflow forecasts based on climate information are essential for short-term water management and for setting up contingency measures during years of extreme climatic conditions. Interest in the development and application of long-lead streamflow forecasts has grown tremendously over the last decade, primarily because improved monitoring of sea-surface temperature (SST) in the tropical Pacific has led to a better understanding of hydroclimatic teleconnections, and because various centers and research institutions now issue experimental climate forecasts from General Circulation Models (GCMs) on a monthly basis. Since GCM-predicted fields of precipitation and temperature are usually available at large spatial scales (2.5° × 2.5°, typically 275 km × 275 km in the tropics), one needs to apply either dynamical or statistical downscaling to develop streamflow forecasts. Dynamical downscaling nests a regional climate model (RCM) with GCM outputs as boundary conditions to obtain precipitation and temperature at the watershed scale (60 km × 60 km). The downscaled precipitation and temperature at the watershed scale can then be used as inputs into a watershed model to obtain seasonal streamflow forecasts [Leung et al., 1999; Roads et al., 2003; Seo et al., 2003; Carpenter and Georgakakos, 2001]. An alternative is statistical downscaling, which maps the GCM precipitation and temperature forecasts to streamflow forecasts at a given point through a statistical relationship [Robertson et al., 2004a, 2004b; Landman and Goddard, 2002; Gangopadhyay et al., 2005].
One could also develop a low-dimensional statistical model without using GCM outputs by relating the observed streamflow to identified climatic precursors (e.g., El Nino Southern Oscillation (ENSO) indices) that influence the streamflow potential at the given site [Grantz et al., 2005; Hamlet and Lettenmaier, 1999; Souza and Lall, 2003; Sankarasubramanian and Lall, 2003]. Seasonal streamflow forecasts obtained using these approaches are expressed probabilistically in the form of ensembles to represent the uncertainty due to changing boundary conditions (SSTs) and initial conditions (atmospheric and land surface conditions). Apart from these uncertainties resulting from initial and boundary conditions, the model employed for developing streamflow forecasts also introduces uncertainty in prediction. In other words, even if the models underlying dynamical downscaling are forced with observed boundary and initial conditions (perfect forcings), the simulated streamflows will inevitably exhibit prediction error, otherwise known as model error/uncertainty. A common approach to reducing model uncertainty is refinement of the parameterizations and process representations of the model under consideration (e.g., GCMs, RCMs, or hydrologic models). Given that developing and running GCMs is time consuming, recent efforts have focused on reducing model error by combining multiple GCMs to issue experimental climate forecasts [Rajagopalan et al., 2002; Robertson et al., 2004a, 2004b; Barnston et al., 2003; Doblas-Reyes et al., 2000; Krishnamurti et al., 1999]. Similarly, studies have shown that developing multimodel forecasts by combining different low-dimensional streamflow forecasting models results in considerable improvement over the performance of individual models [Regonda et al., 2006].
Thus combining streamflow forecasts from multiple models seems to be a good alternative for improving the overall predictability of seasonal streamflow forecasts.
The main goal of this study is to develop and apply a new scheme for combining forecasts from multiple models, which could be either streamflow forecasts from low-dimensional models or GCM forecasts available at large spatial scales, by assessing each model's predictive skill conditioned on the predictor state. The better performance of multimodel ensembles stems primarily from the incorporation of realizations from various models, which increases the number of ensemble members available to represent the conditional distribution of hydroclimatic attributes. Recent studies on improving seasonal climate forecasts using optimal multimodel combination techniques assign weights to individual models based on their ability to predict the climatic variable over the entire period for which the GCM simulations are available [Rajagopalan et al., 2002; Robertson et al., 2004a, 2004b; Barnston et al., 2003]. Given that each model's predictive skill could also vary depending on the state of the predictor (SSTs for GCMs), we develop a new methodology for multimodel combination that assigns weights to each model by assessing the skill of the models contingent on the predictor state. The proposed methodology is applied to two low-dimensional seasonal probabilistic streamflow forecasting models that primarily use tropical Pacific and Atlantic SST conditions to develop multimodel ensembles of streamflow forecasts. Though the proposed approach is demonstrated by combining two statistical streamflow forecasting models, we believe that the approach presented in section 3.2 could also be extended to combine multiple GCMs. Since the streamflow forecasts are represented probabilistically, we use the rank probability score (RPS) [Candille and Talagrand, 2005] as a measure to assess the models' predictive skill.
Section 2 provides a brief background on multimodel combination techniques that are currently pursued in the literature for developing climate and seasonal streamflow forecasts. Section 3 presents the proposed multimodel combination scheme, which assesses the skill of each model contingent on the predictor state. In section 4, we briefly discuss the two low-dimensional streamflow forecasting models that were employed for developing probabilistic forecasts of the summer flows (July-August-September, JAS) into Falls Lake, Neuse River Basin, NC. Finally, in section 5, we illustrate the application of the proposed multimodel combination presented in section 3 to develop improved probabilistic streamflow forecasts of JAS inflows into Falls Lake, NC.
Efforts to address model uncertainty by combining outputs from multiple models have been investigated in climate and weather forecasting [Doblas-Reyes et al., 2000; Rajagopalan et al., 2002; Krishnamurti et al., 1999] and in streamflow simulation through calibration [Boyle et al., 2000; Marshall et al., 2006; Ajami et al., 2006, 2007]. Perhaps the simplest approach to developing multimodel forecasts is to pool the predicted values or the ensembles from all the models, thus giving equal weights to all the models [Palmer et al., 2000]. Recent research from PROVOST (PRediction Of climate Variations On Seasonal to interannual Time-scales) and from the International Research Institute for Climate and Society (IRI) shows that multimodel ensembles of climate forecasts provide improved reliability and resolution relative to individual model forecasts [Palmer et al., 2000; Doblas-Reyes et al., 2000; Barnston et al., 2003]. Though the improved predictability of multimodel ensembles partly arises from the increase in sample size, studies comparing the performance of single models having the same number of ensembles as the pooled multimodel ensembles have shown that the multimodel approach naturally offers better predictability because of its ability to incorporate outcomes from multiple models, thereby encompassing different underlying process parameterizations and schemes [Hagedorn et al., 2005]. Since the advantage gained through multimodel combination is a better representation of the conditional distribution of hydroclimatic attributes, it is important to evaluate probabilistic forecasts developed from multimodel ensembles through various performance evaluation measures and to analyze the predictability for various geographic regions [Hagedorn et al., 2005]. Recent studies have also considered climatology as one of the candidate forecasting schemes in developing multimodel ensembles [Rajagopalan et al., 2002; Robertson et al., 2004a, 2004b].
Another approach that is currently gaining attention is to develop a strategy for combining multimodel ensembles using either optimization methods [Rajagopalan et al., 2002; Robertson et al., 2004a, 2004b] or statistical techniques [Krishnamurti et al., 1999]. Under the optimal combination approach, weights are obtained for each model as fractions such that the chosen skill/performance measure of the multimodel ensembles obtained using these fractions is maximized [Rajagopalan et al., 2002; Robertson et al., 2004a, 2004b; Regonda et al., 2006]. The simplest way to assign weights for multimodel ensembles is to give a higher weight to the model that has the lower forecast error. Methods based on statistical techniques such as linear regression have also been employed so that the multimodel forecasts have better skill than single models [Krishnamurti et al., 1999]. However, applying the optimal combination approach using either statistical or optimization techniques requires observed climatic or streamflow attributes at a particular grid point or station. Studies have also used advanced statistical techniques such as the canonical variate method [Mason and Mimmack, 2002] and the Bayesian hierarchical method [Stephenson et al., 2005] for developing multimodel combinations. Hoeting et al. [1999] show that the mean of the posterior distribution of the predictand obtained by averaging over all the models, each weighted by its probability of occurrence, provides better predictive ability (measured by the logarithmic scoring rule) than the mean of the posterior distribution of the predictand obtained from a single model. For a detailed review of Bayesian averaging over various statistical models (e.g., generalized linear models and nonlinear regression models), see Hoeting et al. [1999].
The multimodel combination method proposed here is motivated by the fact that the skill of GCM forecasts or downscaled streamflow forecasts depends on the predictor conditions that determine the conditional distribution of the hydroclimatic attributes. Studies focusing on the skill of GCMs show that the overall predictability of GCMs over North America is enhanced during ENSO years [Brankovic and Palmer, 2000; Shukla et al., 2000; Quan et al., 2006]. Similarly, studies have shown the importance of various oscillations or climatic conditions in influencing the predictability of GCMs over various parts of the globe. For instance, Giannini et al. show that tropical Atlantic variability plays a preconditioning role in the development of the ENSO-related teleconnection that determines a GCM's ability to predict rainfall over Northeast Brazil (Nordeste), a region shown to have significant skill in seasonal climate prediction [Moura and Shukla, 1981; Ropelewski and Halpert, 1987, and references therein]. Giannini et al. show that the predictability of Nordeste rainfall using the Community Climate Model Version 3 (CCM3) GCM [Kiehl et al., 1998] is poor particularly when the North Atlantic SSTs exhibit anomalous conditions opposite to those of the tropical Pacific SSTs. More precisely, the skill of CCM3 in predicting Nordeste rainfall is negative under positive SST anomalies in the tropical Pacific (warm) combined with negative SST anomalies (cold) in the North Atlantic, as well as under cold tropical Pacific (negative SST anomalies) and warm North Atlantic (positive SST anomalies) conditions. Naturally, under these predictor conditions, one would prefer to use climatology instead of climate forecasts, since the model simulations are negatively correlated with the observed rainfall. On this basis, we propose that in postprocessing individual models' forecasts to develop multimodel ensembles, one needs to assess the skill of the individual model ensembles based on the predictor state.
By considering climatology as one of the candidate forecasts, we develop a multimodel combination scheme that formally assesses and compares the skill of the competing models under given predictor conditions, so that lower weights are assigned to a model that has poor predictability under those conditions. The next section formally develops a multimodel combination scheme using the rank probability score (RPS) as the basic measure for evaluating forecasting skill.
Error in climate forecasts is primarily of two types: (1) uncertainty in initial and boundary conditions and (2) model error [Hagedorn et al., 2005]. The first source of error is typically addressed by representing the uncertainties in initial and boundary conditions in the form of ensembles. The second source of error arises from process representation, and it can be reduced by combining forecasts from multiple models that incorporate various process representations and model physics to develop an array of possible outcome scenarios. Developing multimodel ensembles thus reduces both sources of error. However, even after developing multimodel ensembles, observations could lie outside the realm of these models [Hagedorn et al., 2005]. Similarly, the performance of individual models and multimodel ensembles may be poor during certain boundary/SST conditions owing to a limited relationship between SST conditions and precipitation/temperature over a particular location/grid [Goddard et al., 2003]. Under such climatic conditions, with all models having poor predictability, it may be useful to consider climatology as a forecast.
Figure 1 demonstrates the motivation behind the proposed methodology by employing a mixture of regression models that depends on two predictors, with the dominant predictor (x1) influencing the predictand only if it crosses a certain threshold (|x1| > 1.0). On the basis of the mixture model, a total of n = 1000 realizations is generated. The estimated correlation between y1 and x1 is 0.671 and between y1 and x2 is 0.134, which would lead one to give higher importance to predictor x1. Figure 1b shows the skill of the regression model, expressed as the correlation between fitted y1 and y1 conditioned on x1, over the entire range of x1. To estimate this conditional correlation, we consider a bandwidth of 1 on the given x1, so that only the fitted values of y1 obtained using values of the predictor x1 within that bandwidth are considered. We can clearly infer from Figure 1b that, because of the limited influence of x1 on y1 between x1 = −1 and 1, the skill of the model under those predictor conditions is very low. Developing a model based on the predictor x1 alone would result in poor prediction particularly when |x1| is below the threshold value.
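The conditional-correlation diagnostic behind Figure 1 can be sketched as follows. The mixture form, noise level, and function names here are illustrative assumptions, not the exact model used to produce the figure:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Two predictors; x1 influences the predictand only when |x1| > 1 (threshold mixture).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y1 = np.where(np.abs(x1) > 1.0, x1, 0.0) + 0.2 * x2 + rng.normal(scale=0.3, size=n)

def conditional_corr(x, y, center, bandwidth=1.0):
    """Correlation between x and y using only points with |x - center| <= bandwidth / 2."""
    mask = np.abs(x - center) <= bandwidth / 2.0
    if mask.sum() < 10:  # too few points to estimate a correlation
        return np.nan
    return np.corrcoef(x[mask], y[mask])[0, 1]

# Skill collapses inside the threshold band and recovers in the tails.
skill_center = conditional_corr(x1, y1, center=0.0)
skill_tail = conditional_corr(x1, y1, center=2.0)
```

With this setup, the conditional correlation is near zero inside the threshold band and substantially higher in the tails, reproducing the qualitative behavior described above.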
Our approach to multimodel combination addresses this issue by assessing model performance based on the predictor state. For instance, if the predictability of all models is very poor under a particular condition, then one would replace the model forecasts with climatology by assigning higher weights to the climatological ensembles. The following section formally describes the multimodel combination methodology, which can be employed for a given set of forecasts from multiple models and the predictors that influence those forecasts.
Let us suppose that we have streamflow forecasts, q^m_{i,t}, where m = 1, 2, …, M denotes the forecasts from M different models, i = 1, 2, …, Nm represents ensembles of the conditional distribution of streamflows with Nm denoting the total number of ensembles under each model, and t denotes the time (season/month) for which the forecast is issued. Assume that we have a total of t = 1, 2, …, n years for which the forecasts, q^m_{i,t}, are available and that the models share a common predictor vector, Xt, which influences the conditional distribution of hydroclimatic attributes represented using the ensembles. The predictor Xt = [x1t x2t … xpt] could be either SST conditions modulating the flow/moisture pathways for the given site or a winter snow pack that could potentially determine the spring-melt flows. Figure 2 provides a flow chart indicating the steps in implementing the proposed multimodel combination conditioned on the predictor state. Note that the proposed approach requires at least one common predictor among the M competing models. Even if the models do not have a common predictor, particularly in the context of GCM forecasts, one could use the leading principal components (PCs) of the underlying boundary conditions (for instance, SSTs) as the common predictor across all the models. As mentioned before, developing multimodel ensembles based on the optimal combination method requires the observed climatic/streamflow variables Ot, using which one can assess the skill of the probabilistic forecasts based on the rank probability score (RPS) [Murphy, 1970; Candille and Talagrand, 2005; Anderson, 1996] to obtain the weights w^m_t. It is important to note that RPS is evaluated for each year using the ensembles (Nm = 10000) representing the conditional distribution, which is quite different from correlation, for which one needs a minimum of two years of forecasts.
The main advantage of using RPS is that it quantifies the error in estimating the entire conditional distribution, whereas measures such as correlation and Root Mean Square Error (RMSE) consider only the error in predicting the conditional mean. One could also employ the continuous rank probability score, which compares the observed event with the forecasted probabilities using a parametric functional form [Candille and Talagrand, 2005; Wilks, 1995]. The rank probability skill score (RPSS) represents the level of improvement of the forecast RPS in comparison to a reference forecast RPS, which is usually taken to be climatology. Appendix A provides details on obtaining RPS and RPSS for given probabilistic forecasts.
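For concreteness, the RPS and RPSS computations described above (and detailed in Appendix A) can be sketched for a categorical evaluation; the three-category (tercile) setup and the function names are illustrative assumptions:

```python
import numpy as np

def rps(ensemble, observed, thresholds):
    """Rank probability score of an ensemble forecast against one observation.

    `thresholds` define the category boundaries (e.g., terciles of climatology);
    RPS is the sum over categories of the squared difference between the
    cumulative forecast and cumulative observed category probabilities.
    """
    ensemble = np.asarray(ensemble, dtype=float)
    edges = np.concatenate(([-np.inf], thresholds, [np.inf]))
    fcst_p, _ = np.histogram(ensemble, bins=edges)   # forecast category counts
    fcst_p = fcst_p / len(ensemble)                  # -> category probabilities
    obs_p = np.zeros_like(fcst_p)
    obs_p[np.searchsorted(thresholds, observed, side="right")] = 1.0
    return float(np.sum((np.cumsum(fcst_p) - np.cumsum(obs_p)) ** 2))

def rpss(rps_forecast, rps_reference):
    """Rank probability skill score relative to a reference forecast (climatology)."""
    return 1.0 - rps_forecast / rps_reference
```

A forecast whose ensemble falls entirely in the observed category scores RPS = 0, while a flat climatological forecast scores higher; RPSS = 1 then indicates a perfect forecast relative to climatology.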
Let us denote the RPS and RPSS of the probabilistic forecasts, q^m_{i,t}, for each time step as RPS^m_t and RPSS^m_t, respectively. For the purpose of multimodel combination, we assess the skill of each model by analyzing its predictability under similar climatic conditions. One could identify similar predictor conditions in the predictor state space by choosing a metric that computes the distance between the current predictor state, Xt, and the historical predictor vectors, Xl. For instance, if the current state of the predictors indicates El Nino conditions, then past El Nino conditions could be considered similar conditions. The distance metric could be either the Euclidean distance or a more generalized measure such as the Mahalanobis distance, which is more useful if the predictors exhibit correlation among themselves. Compute the distances d_{tl} between the current conditioning state Xt and each historical predictor vector Xl using the Mahalanobis distance:

d_{tl} = [(Xt − Xl) Σ^{−1} (Xt − Xl)^T]^{1/2}   (1)
where Σ denotes the variance-covariance matrix of the historical predictor vectors X. Note that if l = t, the distance metric d_{tl} reduces to zero. Similarly, if Xt consists of the principal components of the original SST fields, then the Mahalanobis distance reduces to the Euclidean distance. Using the distance vector d, the ordered set of nearest-neighbor indices J can be identified; the jth element of J gives the jth closest state Xl to the current state Xt. Using this information, we assess the performance of each model in the predictor state space as

\bar{RPS}^m_t = (1/K) Σ_{j=1}^{K} RPS^m_{J(j)}   (2)
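A minimal sketch of the distance computation and neighbor ordering in equation (1), assuming the historical predictor states are stored as the rows of a matrix; the function name is hypothetical:

```python
import numpy as np

def neighbor_order(X_hist, x_t):
    """Order historical predictor states by Mahalanobis distance to the current state x_t.

    X_hist: (n, p) matrix, one row per historical year.
    Returns the ordered index set J (J[0] is the most similar year) and the distances d.
    """
    X_hist = np.asarray(X_hist, dtype=float)
    S_inv = np.linalg.inv(np.cov(X_hist, rowvar=False))  # inverse covariance of predictors
    diff = X_hist - np.asarray(x_t, dtype=float)
    # Quadratic form (X_t - X_l) S^{-1} (X_t - X_l)^T for every historical year l.
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))
    return np.argsort(d), d
```

If the predictors are principal components (already uncorrelated and standardized), the covariance matrix is close to diagonal and the ordering matches that of the Euclidean distance, as noted in the text.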
where RPS^m_{J(j)} denotes the skill of forecasting model m for the year that represents the jth closest condition (obtained from J) to the current condition Xt. In other words, \bar{RPS}^m_t summarizes the average skill of forecasting model m over the K years that most closely resemble the current condition, Xt. Using \bar{RPS}^m_t obtained for each model at each time step, we obtain the weights for the multimodel combination, so that a model with better performance under particular climatic conditions is represented with a larger number of ensemble members than a model with lower predictability under those conditions. It is important to note that RPS is a measure of the error in predicting the probabilities and is evaluated from the entire set of ensembles that represent the conditional distribution of streamflows. We use the average skill of the model, \bar{RPS}^m_t, to obtain the weights for each model based on equation (3):

w^m_t = (1 / \bar{RPS}^m_t) / Σ_{m=1}^{M} (1 / \bar{RPS}^m_t)   (3)
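One plausible reading of this weighting step, consistent with the handling of models whose average RPS is zero, is inverse average RPS over the K analog years, normalized across models. The inverse-RPS form and the function name below are assumptions for illustration:

```python
import numpy as np

def multimodel_weights(rps_by_model, J, K):
    """Weights for M models from their average RPS over the K nearest predictor states.

    rps_by_model: (M, n) array of per-year RPS values for each model.
    J: neighbor indices ordered by similarity to the current predictor state.
    A lower average RPS over the K analog years yields a higher weight; if any
    model has an average RPS of zero, the weight is split equally among such
    models and the rest receive zero.
    """
    avg = np.asarray(rps_by_model, dtype=float)[:, J[:K]].mean(axis=1)
    if np.any(avg == 0.0):
        w = (avg == 0.0).astype(float)   # perfect models share all the weight
    else:
        w = 1.0 / avg                    # inverse-RPS weighting
    return w / w.sum()
```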
If \bar{RPS}^m_t is zero for a subset of models M1 ≤ M, then the weights are distributed equally (1.0/M1) among the models for which \bar{RPS}^m_t is zero, and the remaining (M − M1) models are assigned zero weight. For instance, if two models out of five have an average RPS (\bar{RPS}^m_t) of zero taken over the K neighbors, then these two models are each assigned a weight of 0.5 and the remaining three models are assigned a weight of zero. The multimodel forecasts for each time step can then be developed by drawing ensembles from each model in proportion to these weights to constitute the multimodel ensembles. Thus one has to specify the number of neighbors K to implement this approach. It is also important to note that choosing a small K does not imply that the multimodel forecasts are developed using only the observed predictand and predictors from the identified K similar conditions. In fact, the forecasts q^m_{i,t} are developed from the observed values of the predictand and predictor over a particular training period (for leave-one-out cross-validated forecasts, we use n − 1 years of record as the training period; for adaptive forecasts, we develop the model using n1 years and reserve the remaining (n − n1) years for validation). Thus we use the weights, w^m_t, only to draw the ensembles from q^m_{i,t}, which are obtained from the chosen training period for developing the forecasts. The simplest approach for selecting the number of neighbors is to find a fixed K that yields the lowest average RPS for the multimodel forecast. The other approach is to choose a different number of neighbors Kt for each year. Under this approach, "Varying Kt", the chosen Kt corresponds to the minimum RPS that could be obtained from the multimodel ensembles for that year. Obviously, Varying Kt will ensure the lowest average RPS over the entire forecast period, but it is computationally intensive.
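Drawing the multimodel ensemble in proportion to the weights might look like the following sketch; the rounding rule and function name are assumptions:

```python
import numpy as np

def combine_ensembles(model_ensembles, weights, n_out, rng=None):
    """Draw a multimodel ensemble: model m contributes about weights[m] * n_out members."""
    rng = np.random.default_rng(rng)
    counts = np.round(np.asarray(weights) * n_out).astype(int)
    counts[-1] = n_out - counts[:-1].sum()  # make the counts sum exactly to n_out
    out = [rng.choice(ens, size=c, replace=True)
           for ens, c in zip(model_ensembles, counts) if c > 0]
    return np.concatenate(out)
```

A model assigned zero weight contributes no members, matching the zero-weight case above; a fixed K could then be selected by evaluating the average RPS of such combined ensembles over the hindcast period for each candidate K.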
The performance of the multimodel ensembles is compared with the individual models' predictive skill using various verification measures such as the average RPS, the average RPSS, and the anomaly correlation.
For the purpose of demonstrating the multimodel approach proposed in section 3, we first develop probabilistic seasonal streamflow forecasts from two different models for Falls Lake, Neuse River Basin, North Carolina (NC). We develop streamflow forecasts based on two low-dimensional statistical models, one based on a parametric regression approach and the other on a nonparametric approach based on resampling [Souza and Lall, 2003]. We first provide brief baseline information about the Neuse Basin and its importance to the water management of the Research Triangle Park area of NC.
Falls Lake (location shown in Figure 3a) is a multipurpose reservoir authorized for flood control, water supply, water quality, recreation, and fish/wildlife protection. Given that water demand in the Triangle area has been growing rapidly over the last decade, multiyear droughts (1998–2002) and the ensuing restrictions have increased the importance of long-lead forecasts for better management of water supply systems. Observed streamflow information at Falls Lake is available from 1928 to 2002 from the United States Army Corps of Engineers (USACE) (http://epec.saw.usace.army.mil/fall05.htm). Figure 3a shows the seasonality of inflow into Falls Lake. Typically, 46% of the annual inflow occurs during January–February–March (JFM), and the low flows during July–August–September (JAS) contribute 14% of the annual inflows. From a water management perspective, developing streamflow forecasting models for the low-flow season is important, since maintaining the operational rule curve of 251.5′ is challenging during those months, resulting in mandatory restrictions [Golembesky et al., 2008]; this difficulty arises because of the increased demand for water quality and water supply releases in the summer (in comparison to the winter demand) (http://epec.saw.usace.army.mil/Falls_WC_Plan.pdf). In addition, demand in Wake County, NC, grew by about 20%–62% from 1995 to 2000 [Golembesky et al., 2008], and the region has experienced three severe droughts (the summers of 2002, 2005, and 2007) in the past five years. Thus managing the reservoirs during the summer is very important from the perspective of invoking restrictions to meet end-of-season target storage conditions without compromising the flood risk arising from hurricanes [Golembesky et al., 2008]. The following section provides a brief overview of climate and streamflow teleconnections in the US.
Climatic variability at interannual and interdecadal time scales, resulting from ocean-atmosphere interactions, modulates the moisture delivery pathways and has significant projections on continental-scale rainfall patterns [Trenberth and Guillemot, 1996; Cayan et al., 1999] and on streamflow patterns at global and hemispheric scales [Dettinger and Diaz, 2000] as well as at regional scales [e.g., Guetter and Georgakakos, 1996; Piechota and Dracup, 1996]. Efforts to understand the linkages between exogenous climatic conditions, such as tropical sea-surface temperature (SST) anomalies, and regional hydroclimatology over the U.S. have offered the scope of predicting the rainfall/streamflow potential on a season-ahead and long-lead (12 to 18 months) basis [Hamlet and Lettenmaier, 1999; Georgakakos, 2003; Wood et al., 2002, 2005]. Interannual modes such as the El Nino-Southern Oscillation (ENSO), resulting from anomalous SST conditions in the tropical Pacific Ocean, influence the interannual variability of precipitation and temperature over many regions of the globe [Rasmusson and Carpenter, 1982; Ropelewski and Halpert, 1987]. Most studies focusing on climatic variability over the Southeastern US have shown that warm tropical Pacific conditions during October–December lead to above-normal precipitation during winter, and to below-normal precipitation during summer if warm pool conditions prevail in the tropical Pacific during the spring [Schmidt et al., 2001; Lecce, 2000; Hansen et al., 1998; Zorn and Waylen, 1997]. Studies have also reported ENSO-related teleconnections between precipitation and temperature over NC during both the winter and summer seasons [Roswintiarti et al., 1998; Rhome et al., 2000]. We therefore develop a low-dimensional model by identifying SST conditions that influence the seasonal streamflow into Falls Lake during July–September (JAS).
Our objective is to estimate the conditional distribution of streamflows, f(Qt|Xt), for the summer season based on the climatic conditions Xt using the chosen statistical models. The next two sections (sections 4.3.1 and 4.3.2) discuss in detail the predictor selection and the performance of the forecasts developed from the two low-dimensional statistical models.
To identify predictors that influence the streamflow into Falls Lake during JAS, we consider SST conditions during April–June (AMJ), which can be obtained from the International Research Institute for Climate and Society (IRI) data library (http://iridl.ldeo.columbia.edu/expert/SOURCES/.KAPLAN/.EXTENDED/.ssta). Figure 3b shows the Spearman rank correlation between the observed streamflow during JAS at Falls Lake and the SST conditions during AMJ. From Figure 3b, we see clearly that SST grids over the ENSO region (170E–90W and 5S–5N), the North Atlantic region (80W–40W and 10N–20N), and the NC Coast region (75W–65W and 22.5N–32.5N) influence the summer flows into Falls Lake. This is in line with earlier findings [Roswintiarti et al., 1998; Rhome et al., 2000] suggesting that warm conditions in the tropical Pacific and tropical North Atlantic result in above-normal inflow conditions at Falls Lake. It is important to note that we consider only SST regions whose correlations are significant, i.e., greater in magnitude than the threshold value of 2/√n, where n is the total number of years (n = 78 years for Falls Lake) of observed records used for computing the correlation. We did not consider atmospheric conditions such as mean sea level pressure and geopotential height for developing the streamflow forecasting model, since the National Centers for Environmental Prediction (NCEP) reanalysis data are available only from 1950. Further, adding atmospheric predictors for predicting post-1950 flows did not result in any significant improvement in forecast skill (results not shown).
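The grid screening step can be sketched as follows, assuming each SST grid cell is provided as a time series aligned with the flow record. The 2/√n threshold is an approximate large-sample significance level, and the function names are illustrative:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def significant_grids(sst_grids, flow, threshold=None):
    """Keep SST grid cells whose |Spearman correlation| with flow exceeds ~2/sqrt(n)."""
    n = len(flow)
    if threshold is None:
        threshold = 2.0 / np.sqrt(n)   # approximate two-sided significance level
    corrs = np.array([spearman(g, flow) for g in sst_grids])
    return np.where(np.abs(corrs) > threshold)[0], corrs
```

For n = 78 years this threshold is roughly 0.23, of the same order as the ±0.21 significance level quoted for the principal component correlations later in the text.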
Given that the SST fields are correlated with each other, we apply Principal Components Analysis (PCA) to identify the dominant modes in the SST grids. PCA, also known as empirical orthogonal function (EOF) analysis, transforms the correlated predictors (e.g., SST grids) into uncorrelated principal components by applying singular value decomposition (SVD) to the covariance/correlation matrix of the predictors. The importance of each principal component is quantified by the fraction of the original predictor variance it represents, which is usually summarized in a scree plot. The mathematics of PCA and the issues in selecting the number of principal components can be found in Wilks [1995].
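A compact sketch of this transformation, applying SVD to the centered data matrix of pooled grids (which yields the same components as an eigendecomposition of the covariance matrix); the function name is illustrative:

```python
import numpy as np

def pca(X):
    """Principal components of a predictor matrix X (years x grid cells) via SVD.

    Returns the component scores (time series), the eigenvectors (loadings),
    and the fraction of variance explained by each component (the scree-plot quantities).
    """
    Xc = X - X.mean(axis=0)                 # center each grid cell
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # principal component time series
    var_frac = s**2 / np.sum(s**2)          # fraction of total variance per component
    return scores, Vt, var_frac
```

In the application above, the first two columns of `scores` would play the role of the two retained principal components used as regression predictors.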
We applied PCA by performing SVD on the covariance matrix of the grid points of SSTs over the three different regions. We did not spatially average the SSTs over the three regions before performing PCA; instead, we pooled the significantly correlated SST grids from the three regions and then performed PCA. Figure 4a shows the percentage of variance explained by each principal component; the first two components account for 73% of the total variance in the predictor field. On the basis of the eigenvectors obtained from PCA (figure not shown), the first component, representing the ENSO region, has a correlation of 0.36 with the observed streamflow, and the second component, representing both the Pacific and the Atlantic, has a correlation of −0.23 (significance level ±0.21 for 78 years of record) with the inflows at Falls Lake. We employ these two principal components to develop seasonal streamflow forecasts for JAS for Falls Lake.
We consider two nonlinear models, parametric regression (with the predictand being the cube root of the flows) and a semiparametric resampling model [Souza and Lall, 2003], in developing multimodel forecasts. The linear relationship between the cube root of the flows and the dominant principal component (Figure 4b) suggests regression as a favorable candidate for developing streamflow forecasts for Falls Lake. However, the conditional distribution from the parametric regression model (of cube root transformed flows) is normal and depends purely on the conditional mean and the point-forecast error. Hence we also consider the semiparametric resampling model, which inherently addresses this issue (being not dependent on two moments alone) by resampling the entire conditional distribution from historical flows. Using the observed streamflow Qt and the predictors Xt, the first two principal components from PCA, the conditional distribution of streamflows is estimated based on the parametric regression and semiparametric resampling methods.
Given that the observed summer inflows had a skewness of 1.9, we applied a cube root transform to bring the data toward a normal distribution. After transformation, the skewness of the transformed flows reduced to 0.25. Figure 4b suggests that the relationship between the first principal component (PC1) and the transformed (cube root) flows is linear with the exception of a few outlying observations, which were primarily contributed by summer hurricanes. Thus, for the parametric regression model, we consider the cube root transformed flows as the predictand and the first two principal components of SSTs as the predictors. Residual analyses of the regression based on quantile plots and skewness (=0.08) tests on the residuals showed that the normal assumption for the transformed flows is valid. The estimates of the conditional mean and conditional standard deviation of the transformed flows are obtained from the regression estimate and from the point forecast error, respectively. Using the conditional mean and conditional standard deviation of the transformed flows, we generate ensembles from the normal distribution and transform them back into the original space, by taking the cube, to obtain the conditional distribution of flows.
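A minimal sketch of this parametric ensemble generator, assuming ordinary least squares on the cube-root flows and a synthetic single-PC predictor (all names and numbers are illustrative):

```python
import numpy as np

def regression_ensembles(pcs, flows, n_ens=500, seed=1):
    """Fit a linear regression of cube-root flows on the leading PCs,
    then build a forecast ensemble for each year by sampling from a
    normal with the fitted mean and the residual (point-forecast error)
    standard deviation, and cubing back to flow space."""
    rng = np.random.default_rng(seed)
    y = np.cbrt(flows)
    A = np.column_stack([np.ones(len(y)), pcs])   # intercept + predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma = resid.std(ddof=A.shape[1])            # point-forecast error
    mean = A @ beta
    # one ensemble per year: sample transformed flows, cube back
    return rng.normal(mean[:, None], sigma, size=(len(y), n_ens)) ** 3

# Illustrative use with synthetic data (78 "years", one predictor)
rng = np.random.default_rng(2)
pc = rng.standard_normal(78)
flows = (5 + 0.8 * pc + 0.3 * rng.standard_normal(78)) ** 3
ens = regression_ensembles(pc[:, None], flows)
```

The cube back-transform means the ensembles inherit the positive skew of the original flows even though sampling is done in the near-normal transformed space.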
The second approach, the semiparametric resampling algorithm developed by Souza and Lall [2003], is data-driven and estimates the conditional distribution by resampling the observed flow values that occurred under climatic conditions similar to the current predictor conditions. To identify similar climatic conditions, or neighbors, the semiparametric method employs regression coefficients (estimated between cube root transformed flows and the predictors for Falls Lake inflow forecasts) as weights to compute the distance between the current conditions and past predictor conditions. For additional details, see Souza and Lall [2003]. A total of thirty-three neighbors was considered for obtaining the forecasts from the semiparametric model. This number was chosen as the neighborhood size that gave the highest correlation between observed flows and the conditional mean of the leave-one-out cross-validated forecasts, a procedure that is detailed below.
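The neighbor search can be sketched as follows; the decreasing 1/rank resampling kernel is a common choice for such semiparametric resamplers and is an assumption here, since the excerpt does not reproduce the kernel of Souza and Lall:

```python
import numpy as np

def knn_resample(x_now, X_past, q_past, weights, k, n_ens=100, seed=0):
    """Sketch of the semiparametric forecast: find the k past years whose
    predictor states are closest to the current state under a weighted
    Euclidean distance (the weights playing the role of the regression
    coefficients), then resample their observed flows to form the
    conditional distribution."""
    rng = np.random.default_rng(seed)
    d = np.sqrt((((X_past - x_now) * weights) ** 2).sum(axis=1))
    nbr = np.argsort(d)[:k]                       # k nearest neighbors
    ranks = np.arange(1, k + 1)
    p = (1.0 / ranks) / np.sum(1.0 / ranks)       # decreasing 1/rank kernel
    return rng.choice(q_past[nbr], size=n_ens, p=p)
```

Because the forecast is resampled from observed flows, the conditional distribution is not tied to the two moments of a normal, which is the property motivating this model above.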
Leave-one-out cross-validation is a rigorous model validation procedure that is carried out by leaving out the predictand and predictors from the observed data set (Qt, Xt, t = 1, 2, …, n) for the validating year, with the model developed using the remaining n − 1 observations [Craven and Wahba, 1979]. For instance, to develop retrospective leave-one-out forecasts from parametric regression, a total of n regression models are developed by leaving out the observation in each validating year. By employing the forecasting model developed with n − 1 observations, the left-out observation (Q−t, with −t denoting the validating year) is predicted using the state of the predictors/principal components (X−t) in that validating year. By utilizing the two principal components from PCA, we develop both leave-one-out cross-validated retrospective streamflow forecasts and adaptive streamflow forecasts for the JAS season using the two statistical models. The conditional distribution of streamflow forecasts from each model is represented in the form of ensembles.
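The leave-one-out procedure for the regression model can be sketched as follows (a minimal mean-forecast version; the paper's version produces full ensembles):

```python
import numpy as np

def loocv_mean_forecast(X, y):
    """Leave-one-out cross-validation for a linear regression: for each
    year t the model is refit on the other n - 1 years and used to
    predict the held-out year, giving a retrospective forecast series."""
    n = len(y)
    preds = np.empty(n)
    for t in range(n):
        keep = np.arange(n) != t
        A = np.column_stack([np.ones(keep.sum()), X[keep]])
        beta, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        preds[t] = np.concatenate(([1.0], np.atleast_1d(X[t]))) @ beta
    return preds
```

Each prediction uses a model that never saw the validating year, so the resulting skill estimates are honest out-of-sample measures.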
To obtain adaptive streamflow forecasts, we develop the forecasting models based on the observed streamflow and the two dominant principal components from 1928–1975 and employ the developed model to predict the streamflow for the 30-year period 1976–2005. Table 1 gives various performance measures of the cross-validated forecasts and adaptive forecasts for both models. Figure 5 shows the adaptive streamflow forecasts for both parametric regression and the semiparametric model. The correlation between the observed streamflows and the ensemble mean of the cross-validated forecasts for the resampling and regression approaches is 0.38 and 0.35, respectively, which is significant for the 78 years of observations. From Table 1, we infer that the correlation between the observed streamflows and the ensemble mean of the adaptive forecasts is 0.44 and 0.49 for the resampling (Figure 5a) and regression (Figure 5b) approaches, respectively. Table 1 also shows other performance evaluation measures, such as average RPS and average RPSS, for adaptive and leave-one-out cross-validated forecasts for both models. On the basis of average RPS and RPSS, we find that the regression model seems to perform better than the resampling model under both cross-validated and adaptive forecasts. Since the correlation between the observed flows and the ensemble mean is significant for both models under leave-one-out cross-validated forecasts and adaptive forecasts, we combine the forecasts from both these models for developing multimodel ensembles for the Falls Lake system.
In this section, we apply the multimodel combination algorithm discussed in section 3.2 to combine the forecasts from the individual models (parametric regression and semiparametric resampling) with climatological ensembles. For this purpose, we consider seven different multimodels by combining: (1) resampling and climatology ensembles (MM1), (2) regression and climatology ensembles (MM2), (3) MM1 and MM2 ensembles (MM3), (4) resampling, regression and climatology ensembles in one step (MM4), (5) resampling and regression ensembles (MM5), (6) resampling and regression ensembles with equal weights (MM6), and (7) resampling and regression ensembles combined based on their long-term predictability (MM7). Multimodel MM6, which is analogous to pooling of ensembles without optimal combination, and multimodel MM7 provide the baseline comparison with the current state of the art in multimodel ensembles development. For MM6, we give equal weights by enforcing wm,t = 1/2 for each model in each year in equation (3). To develop ensembles from MM7, we assign weights inversely proportional to each model's long-term average RPS, w1 = (1/RPS1)/(1/RPS1 + 1/RPS2) and w2 = 1 − w1, where RPS1 and RPS2 denote the average RPS for the resampling and regression models (from Table 1), indicating their long-term predictability.
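The MM6/MM7-style weighting and the drawing of a multimodel ensemble can be sketched as follows; the inverse-RPS normalization and the rounding of ensemble counts are illustrative readings of equation (3), which is not reproduced in this excerpt:

```python
import numpy as np

def model_weights(avg_rps):
    """Weights inversely proportional to each model's average RPS
    (the MM7-style long-term combination); equal weights (MM6) are
    recovered when the scores are identical."""
    inv = 1.0 / np.asarray(avg_rps, dtype=float)
    return inv / inv.sum()

def draw_multimodel(ensembles, weights, n_out=100, seed=0):
    """Build a multimodel ensemble by drawing members from each
    candidate model's ensemble in proportion to its weight."""
    rng = np.random.default_rng(seed)
    counts = np.round(np.asarray(weights) * n_out).astype(int)
    draws = [rng.choice(e, size=c) for e, c in zip(ensembles, counts)]
    return np.concatenate(draws)
```

A model with lower average RPS thus contributes more members, which is the mechanism the proposed algorithm applies year by year with neighborhood-conditioned rather than long-term RPS.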
The motivation for considering climatology as one of the candidates in the multimodel combinations (except MM5, MM6, and MM7) is based on the presumption that if the observation falls outside the predictive ability of all the models under certain predictor conditions, then climatology should be preferred over individual model forecasts. Further, recent studies have shown that a two-step procedure, of first combining each individual model's forecasts separately with climatology and then combining the resulting "M" combinations at the second step, improves the skill of multimodel ensembles [Goddard et al., 2003]. Since implementing the proposed multimodel algorithm requires selection of neighbors, we primarily employ the "Varying Kt" approach for comparing the performance of the seven multimodels with the two candidate models. In the next section, however, we employ a fixed number of neighbors just for demonstrating the proposed multimodel combination methodology.
The primary motivation in the proposed approach for multimodel combination is to evaluate the competing models' predictability in the neighborhood of the predictor state and assign appropriate weights, based on equation (3), to all the models to develop multimodel ensembles. Figure 6 analyzes the performance of the regression and resampling streamflow forecasts shown earlier (Figure 5) from the predictor state space by considering a fixed K = 15. The performance of the streamflow forecasts is evaluated using the average RPS and correlation over the identified neighbors in the predictor state space. From Figure 6a, one may prefer forecasts from the resampling model over forecasts from parametric regression, since the correlation of the regression is negative if the dominant/first principal component (PC1) < −4. This can also be seen from Figure 6b, with the RPS of resampling being lower than the RPS of regression. Figures 6c and 6d show the relative performance of the two models against each other. From Figure 6c, we can see that one would prefer climatological ensembles over forecasts from the candidate models, particularly when the correlations estimated from the neighborhood are negative for both models. From Figure 6d, we can also identify conditions (above the diagonal line) with the RPS of the regression model being higher than the RPS of resampling, indicating poor performance of the regression ensembles. Thus the multimodel combination algorithm in section 3.2 primarily identifies these conditions based on RPS using equation (2) and develops a general procedure for multimodel combination. The next section compares the performance of the seven different multimodels in developing cross-validated and adaptive forecasts using the "Varying Kt" approach.
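The RPS and its neighborhood-conditioned average, the quantities behind Figures 6b and 6d, can be sketched as follows (the tercile setup and the 1-D predictor are illustrative):

```python
import numpy as np

def rps(forecast_probs, obs_category):
    """Rank probability score for one categorical (e.g. tercile)
    forecast: squared error between the cumulative forecast
    probabilities and the cumulative 0/1 observation vector."""
    F = np.cumsum(forecast_probs)
    O = np.cumsum(np.eye(len(forecast_probs))[obs_category])
    return np.sum((F - O) ** 2)

def neighborhood_rps(pc, rps_series, pc_now, k):
    """Average RPS of a model over the k years whose predictor state is
    closest to the current one: the skill measure the combination
    algorithm uses in place of long-term skill."""
    nbr = np.argsort(np.abs(pc - pc_now))[:k]
    return rps_series[nbr].mean()
```

A climatological tercile forecast (1/3, 1/3, 1/3) scores 5/9 whatever is observed, so a model's neighborhood-averaged RPS can be compared against that benchmark to decide when climatology should receive weight.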
The performance of the seven different multimodel combinations is compared in Table 1 with the performance of the two individual models under leave-one-out cross-validated forecasts and under adaptive forecasts for the period 1976–2005. For developing climatological ensembles for multimodel combinations, we simply bootstrap the observed JAS streamflows into Falls Lake assuming each year has an equal probability of occurrence. This is a reasonable assumption since there is no year-to-year correlation between the summer flows.
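The climatological ensemble described above amounts to a simple bootstrap (the function name is illustrative):

```python
import numpy as np

def climatology_ensemble(obs_flows, n_ens=500, seed=0):
    """Climatological ensemble formed by bootstrapping the observed
    seasonal flows with equal probability for every year, which is
    reasonable here since the summer flows show no year-to-year
    correlation."""
    rng = np.random.default_rng(seed)
    return rng.choice(obs_flows, size=n_ens, replace=True)
```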
Under leave-one-out cross-validated forecasts, we can clearly see that forecasts from MM3 perform better than the regression and resampling models' forecasts based on all three statistics considered (Table 1). MM3 also performs better than the rest of the six multimodels. To ensure that the skill exhibited by MM3 is statistically significant relative to the rest of the six multimodels and the two statistical models, we perform detailed hypothesis tests based on average RPS. The hypothesis test employed is based on a nonparametric approach under which we resample the RPS computed from each model for each year to develop the null distribution [Hamill, 1999]. Details on the hypothesis test and the test statistic suggested by Hamill [1999] are briefly discussed in Appendix B.
Table 2 provides the percentile values of the test statistic based on the constructed null distribution using (B4) and (B5) for each pair of models A and B. Thus, if the alternate hypothesis is that the average RPS of model A is greater (smaller) than the average RPS of model B, then we would like the observed test statistic to fall under the right (left) tail of the null distribution to reject the null hypothesis. On the basis of the average RPS values given in Table 1 for the chosen models A and B, one can reject or accept the null hypothesis based on the chosen level of significance, α. For instance, to test whether the average RPS of the resampling model is greater than that of MM1 at the α = 10% significance level, from Table 2 we can see that the percentile value of the test statistic is >0.99, implying a p-value < 0.01. Hence we can reject the null hypothesis and accept the alternate hypothesis, which implies that the skill exhibited by MM1 is significantly greater than the skill of the resampling model. Thus we utilize the information given in Table 2 to compare Model A (in columns) with Model B (in rows) to test whether the increased skill exhibited by the multimodel forecasts is statistically significant compared to the skill of the individual model forecasts. We also discuss various related issues that include the utility of climatological ensembles in improving the skill and the effectiveness of the two-step multimodel combination approach.
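A sketch of a resampling test in the spirit of Hamill [1999], assuming the null distribution is built by randomly swapping the two models' yearly RPS values (Appendix B is not reproduced in this excerpt, so the details here are illustrative):

```python
import numpy as np

def hamill_test(rps_a, rps_b, n_resample=2000, seed=0):
    """Locate the observed difference in mean RPS within a null
    distribution built by randomly swapping the two models' yearly
    scores; returns the percentile of the observed statistic."""
    rng = np.random.default_rng(seed)
    rps_a, rps_b = np.asarray(rps_a, float), np.asarray(rps_b, float)
    obs_stat = rps_a.mean() - rps_b.mean()
    n = len(rps_a)
    null = np.empty(n_resample)
    for i in range(n_resample):
        swap = rng.random(n) < 0.5          # swap each year's pair at random
        a = np.where(swap, rps_b, rps_a)
        b = np.where(swap, rps_a, rps_b)
        null[i] = a.mean() - b.mean()
    return np.mean(null < obs_stat)         # percentile of observed statistic
```

A percentile near 1 then corresponds to model A having significantly larger average RPS (poorer skill) than model B, matching the right-tail reading used above.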
From Table 2, we can clearly see that the reduced average RPS of the multimodels (in Table 1) is statistically significant relative to the average RPS of the resampling and regression models. This can be inferred from the percentiles of the test statistic in Table 2 (Model A: Resampling or Regression) falling on the right tails of the constructed null distribution, with the p-value being less than 0.10 for all the multimodels (Model B: MM3-MM7). Similarly, we also see that there is no significant difference between the average RPS of the regression model and the average RPS of the resampling model (α = 0.10). Further, we also understand that combining individual models with climatological ensembles alone (Model B: MM1 and MM2) leads to an improved representation of the conditional distribution.
To compare the performance of the proposed multimodel combination algorithm in section 3.2 with existing multimodel techniques, we have included multimodel forecasts based on equal weights (MM6: pooling of ensembles) and based on each model's long-term predictability (MM7). From Table 2, we infer that the average RPS of MM5 (resampling and regression alone) is significantly smaller than the average RPS of MM6 and MM7. Similarly, comparing the performance of MM6 and MM7 with MM3 and MM4, we observe that the average RPS of MM3 and MM4 is statistically smaller than the average RPS of MM6 and MM7. This shows that combining multiple models based on predictor conditions performs better than combining multiple models by pooling or based on long-term predictability.
Multimodels MM1, MM2, MM3 and MM4 use climatological ensembles to develop multimodel forecasts. An interesting fact that we can observe from Table 2 is that adding climatological ensembles to individual models to develop multimodel forecasts (Model B: MM1, MM2) leads to an average RPS that is statistically smaller than the individual models' average RPS. Similarly, comparing the performance of MM1 and MM2 (Model A) with MM5, MM6 and MM7 (Model B), we see that the percentiles of the test statistic are around 0.2–0.4, implying no statistical significance, which suggests that combining multiple models without climatology may not lead to a significant reduction in average RPS. However, adding climatology to multiple models, which is represented by MM4 (Model B), reduces the average RPS, but the reduction is not statistically significant compared to the average RPS of MM1 and MM2. Further, comparing the performance of MM4 (Model B) with MM5 (Model A) shows that there is certainly an improvement from adding climatology ensembles (p-value = 0.16). To summarize, adding climatological ensembles to individual model forecasts certainly leads to improvement in the representation of the conditional distribution, but the improvements are much more substantial upon combining climatological ensembles with forecasts from multiple candidate models.
To understand the effectiveness of the two-step procedure (individual models are combined first with climatological ensembles and the resulting "M" forecasts are combined at the second step) of multimodel combination, we compare the performance of MM3 (Model A, two-step combination) with MM4 (Model B, one-step combination). From Table 2, the computed test statistic has a percentile value of 0.18 under the constructed null distribution. This indicates that the null hypothesis cannot be rejected at a significance level of α = 0.10.
Thus we conclude that though the difference in average RPS between MM3 and MM4 is not significant (p-value of 0.18), the two-step procedure of developing multimodel forecasts with climatology overall improves the conditional distribution estimation, as indicated by a smaller average RPS, compared to the one-step combination. To recapitulate, at a significance level of α = 0.10, the average RPS of MM3 is significantly smaller than the average RPS of MM1, MM6 and MM7. Upon comparing MM3's performance with the rest of the multimodels (MM2, MM4 and MM5), we find that the average RPS of MM3 is weakly significant against the average RPS of MM2 (p-value = 0.17), MM4 (p-value = 0.18) and MM5 (p-value = 0.17). This could possibly be due to the slightly better performance of the regression model in comparison to the resampling model.
We also compared the performance of the resampling and regression models with the best performing multimodel, MM3, under different flow regimes. On this basis, the average RPS of resampling, regression and MM3 is 0.524, 0.480 and 0.460, respectively, under below-normal flow conditions. Under normal flow conditions, the average RPS of resampling, regression and MM3 is 0.276, 0.237 and 0.214, respectively, whereas the average RPS of resampling, regression and MM3 is 0.487, 0.509 and 0.480, respectively, under above-normal flow conditions. This indicates clearly that regression (resampling) performs better than resampling (regression) during below-normal (above-normal) flow conditions. But the multimodel (MM3), since it constitutes a larger number of ensembles from the best performing model contingent on the predictor state, performs better than the two individual models under all flow conditions.
One main purpose of performing multimodel combination based on the predictor state is to reduce false alarms and missed targets in the issued forecasts. Figure 7 summarizes the comparison between the two candidate models, regression and resampling, and five multimodels, MM1, MM2, MM3, MM4 and MM5, based on the number of hits, false alarms, missed targets and correct rejections for the below-normal and above-normal tercile categories. We do not report the performance of MM6 and MM7, since they are exactly the same as that of MM5 under both tercile categories. We choose a threshold probability of 0.5 to issue a forecast alarm under a particular tercile category. It is important to note that contingency analysis of forecasts (in Figure 7) is very sensitive to the chosen threshold and the number of forecasts [Wilks, 1995]. From Figure 7b, we clearly see that the false alarms drop for all the multimodel forecasts. However, multimodel forecasts tend to increase the missed targets and reduce the number of hits. In the case of MM3, the drop in the number of hits is 4 and 6 for the below-normal and above-normal categories, but the number of false alarms reduces considerably. This suggests that adding climatology to individual models certainly reduces the number of hits and tends to produce multimodel forecasts (MM1, MM2, MM3 and MM4) with reduced resolution (ability of the model to distinguish itself from climatology). This can also be inferred from the plots for MM1 and MM2 in Figure 7. Reasons for the reduced false alarms and increased missed targets are discussed further in the next section.
A better measure for evaluating categorical probabilistic forecasts is the average Brier score (Table 3). Table 3 also provides the average reliability and average resolution, along with the average ignorance, for the regression, resampling and seven multimodel forecasts for the below-normal and above-normal categories. The Brier score is similar to RPS, but more appropriate for dichotomous events (e.g., whether the observed event is below-normal or not). The Brier score, specifying the squared error in categorical forecasts, can be split into reliability, resolution and uncertainty. For additional details, see Wilks [1995]. For the Brier score to be close to zero, it is important that the reliability term be close to zero and the resolution be large [Wilks, 1995]. Average ignorance, otherwise known as the logarithmic scoring rule, is a better measure for evaluating probability forecasts, since it generalizes the evaluation of categorical forecasts beyond the binary case [Roulston and Smith, 2002]. The forecasts are considered to be useful if the ignorance of the forecasts is smaller than the ignorance of climatology.
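The Murphy-style decomposition of the Brier score into reliability, resolution and uncertainty can be sketched as follows, with forecasts grouped into probability bins (the bin edges are an illustrative choice):

```python
import numpy as np

def brier_decomposition(probs, outcomes, bins=np.linspace(0, 1, 11)):
    """Decompose the Brier score of a dichotomous-event forecast as
    BS = reliability - resolution + uncertainty, grouping forecasts
    into probability bins."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    n, obar = len(probs), outcomes.mean()
    idx = np.clip(np.digitize(probs, bins) - 1, 0, len(bins) - 2)
    rel = res = 0.0
    for b in range(len(bins) - 1):
        mask = idx == b
        if mask.any():
            nk = mask.sum()
            fk, ok = probs[mask].mean(), outcomes[mask].mean()
            rel += nk * (fk - ok) ** 2     # calibration error in this bin
            res += nk * (ok - obar) ** 2   # departure from climatology
    unc = obar * (1 - obar)
    return rel / n, res / n, unc
```

Small reliability with large resolution drives the Brier score down, which is exactly the trade-off discussed for the multimodels: combining with climatology shrinks the reliability term but also erodes resolution.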
From Table 3, we clearly see that the average Brier scores of MM3 and MM4 are smaller than the average Brier scores of the resampling and regression models. Since Brier scores represent squared error in probabilities, even a small difference could help in reducing the error in estimating the probability under a particular category. For instance, the difference between the regression and MM3 average Brier scores is only 0.0024 in identifying the below-normal category, but this amounts to an average error reduction of 5% (√0.0024 ≈ 0.05) in quantifying the below-normal category. Thus, based on average Brier scores, we infer that multimodel forecasts (MM3) improve the probability of predicting the appropriate tercile category of occurrence.
To ensure that the differences between the Brier scores of MM3 and those of the individual models are statistically significant, we perform hypothesis testing on Brier scores similar to the hypothesis testing on RPS suggested by Hamill [1999]. The p-values of the hypothesis tests comparing the Brier scores of the regression and resampling models (Model A) with the Brier score of MM3 (Model B) are 0.02 and 0.25 for resampling and regression, respectively, under the below-normal category. Similarly, the p-values for the above-normal category upon comparing the Brier scores of the two candidate models' forecasts with the MM3 forecasts are 0.01 and 0.02 for resampling and regression, respectively. We did not perform separate hypothesis tests for comparing the Brier scores of MM4, MM5, MM6 and MM7 with the individual model forecasts. Thus we infer that the reduced Brier score of the MM3 forecasts is significant in comparison to the two candidate models in identifying the above-normal category.
Figures 8a and 8b compare the reliability of MM3 forecasts with the reliability of individual model forecasts for below-normal and above-normal categories. Reliability diagrams provide information on the correspondence between the forecasted probabilities for a particular category (e.g., above-normal, normal and below-normal) and how often (frequency) that category is being observed under that forecasted probability [Wilks, 1995]. For instance, if we forecast the occurrence of below-normal category as 0.9 over n1 years (n1 ≤ n), then we expect the actual outcome to fall under below-normal category for 0.9 * n1 times over the long-term. To construct Figure 8, we utilized leave-one-out cross-validated forecasts and divided the forecasted probability for each category into percentiles. Table 4 provides the total number of forecasts that were issued with different forecast probabilities under below-normal and above-normal categories.
Though MM3 and MM4 show overconfidence (increased observed relative frequency) for high forecast probabilities (0.6 for below-normal and 0.6 for above-normal) (Figure 8), we can infer from Table 3 that the average reliability scores of MM3 and MM4 are smaller than that of the rest of the models. The increased observed relative frequency could also be due to smaller number of forecasts issued under those forecast probabilities. For instance, we can infer from Table 4 that the number of forecasts issued with a forecast probability of 0.6 for MM3 under above-normal category is just 1. Similarly, under below-normal category, we see that the number of forecasts issued for MM3 and MM4 are 4(2) and 3(3) for a forecast probability of 0.6 (0.7) respectively. For predicting above-normal category, MM1, MM2 and MM3 have the same reliability, but MM4, combining multiple models with climatology, has the lowest Brier score. This is primarily because MM4 shows higher resolution than MM1, MM2 and MM3.
An interesting observation from Table 3 is that combining multiple models with climatology always reduces the resolution, but increases the reliability of forecasts. For instance, resampling is over-confident (high observed relative frequency) for a forecast probability of 0.1 under below-normal category, which leads to high resolution. Similarly, under above-normal category, we see that individual models have good resolution for a forecast probability of 0.7, whereas multimodels do not have any forecasts issued under that forecast probability. Reduced resolution from multimodel forecasts will naturally lead to reduced hits. This is in line with some of the earlier studies on multi-model combination [Barnston et al., 2003; Robertson et al., 2004a, 2004b]. This reduced resolution leads to increased missed targets under multimodel combinations with climatology, but reduces the false alarms which results from the overconfidence of individual models (Figure 7).
On the basis of average ignorance, we understand that all multimodels and individual models have reduced average ignorance in comparison to the ignorance of climatology. Further, under below-normal category, multimodels (MM1, MM2, MM3, MM4 and MM5) developed based on the algorithm in section 3.2 have lower ignorance score in comparison to MM6 and MM7 with MM3 having the smallest ignorance score of all the models. Under above-normal category, MM2 has the smallest average ignorance score with MM3 and MM4 performing almost similarly. Average ignorance score also suggests that adding climatology is important in improving the performance of multimodels, since MM5 has a higher ignorance score than other multimodels (MM1, MM2, MM3 and MM4).
Utilizing the probabilistic streamflow forecasts from these three models (resampling, regression and multimodel) for invoking restrictions to improve the end-of-season target storage shows that multimodels perform consistently in identifying appropriate (above-normal or below-normal) storage conditions in September, thus reducing false alarms in invoking restrictions [Golembesky et al., 2008]. To relate how the reduced RPS from multimodels would result in improving Falls Lake operation, Golembesky et al. [2008] show that a 5% reduction in the risk of not meeting the end-of-season target storage (corresponding to 251.5 ft) during below-normal inflow years would require more than a 10% reduction in water supply releases during the season. To summarize the discussion, the proposed multimodel algorithm in section 3.2 overall improves the predictability of tercile forecasts in comparison to the individual model forecasts and the multimodel forecasts developed by pooling (MM6) and based on long-term predictability (MM7).
The proposed algorithm in section 3.2 employs the average RPS over the chosen K neighbors to evaluate the performance of a given forecasting model. In this section, we evaluate the sensitivity of the algorithm to two different performance metrics, the squared error between the conditional mean and the observed flows (SECM) and the L1 norm, in developing multimodel forecasts. Given ensembles of streamflow forecasts from an individual model, we estimate the SECM from the conditional mean and the L1 norm (using equation (A2)) for each year of forecasts. Thus we replace RPS in equation (2) with estimates of SECM and the L1 norm to quantify the average performance of the given model over the considered "K" neighbors.
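The two alternative metrics can be sketched as follows; since equation (A2) is not reproduced in this excerpt, the L1 metric is shown in one plausible form (mean absolute deviation of the ensemble members from the observation):

```python
import numpy as np

def secm(ensemble, obs):
    """Squared error between the ensemble (conditional) mean and the
    observed flow for one year."""
    return (np.mean(ensemble) - obs) ** 2

def l1_metric(ensemble, obs):
    """One plausible reading of the L1 metric of equation (A2): the
    mean absolute deviation of the ensemble members from the
    observation (an assumption, not the paper's exact definition)."""
    return np.mean(np.abs(np.asarray(ensemble, dtype=float) - obs))
```

Either quantity can be substituted for the yearly RPS in the neighborhood average of equation (2), which is all the sensitivity experiment requires.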
Table 5 summarizes the average RPS and average RPSS for the seven multimodels under the two performance evaluation metrics. From Table 5, we understand that the multimodels MM2, MM3 and MM4 developed in this study show lower RPS than the RPS of the individual models and of the existing techniques for multimodel combination (MM6 and MM7). However, comparing the performance of the multimodel forecasts from SECM and the L1 norm (Table 5) with the performance of the multimodel forecasts obtained using RPS (in Table 1), we clearly see that multimodel forecasts developed using SECM and RPS (as performance evaluation metrics) perform similarly in developing improved multimodel forecasts. Under the L1 norm, the reduction in RPS is slightly less, indicating only a marginal improvement in the multimodel combination. One possible reason for the improved performance under SECM and RPS as performance metrics is that they penalize poor forecasts more severely by considering the squared error in flows and in cumulative probabilities, respectively. This suggests that the algorithm is sensitive to the choice of the performance evaluation metric for the individual forecasts, which therefore needs to be selected carefully to ensure improved multimodel forecasts. In the next section, we further evaluate the proposed algorithm by developing adaptive streamflow forecasts for the period 1991–2005.
Figure 9 shows the adaptive streamflow forecasts from MM3 for the period 1991–2005, obtained by training the individual models on the data available during 1928–1976. Table 1 provides various performance measures of the adaptive forecasts for all the models based on the 30-year validation period. From Table 1, we can clearly see that the skill of the MM3 and MM4 forecasts is much higher than the skill of the individual models and the rest of the multimodels. We did not perform any hypothesis tests on the reported skill measures, since the skill of the multimodels (MM3 and MM4) was shown to be significantly better than that of the individual model forecasts based on the cross-validated forecasts. From Figure 9a, we can clearly see that the MM3 forecasts perform better than the individual model forecasts shown in Figure 5. This can be understood by focusing on the forecasts from the three models, resampling (Figure 5a), regression (Figure 5b) and multimodel (Figure 9a), for the years 2004 and 2002. For instance, in 2004, an above-normal inflow year, regression incorrectly suggests it as a normal year (forecasted probabilities for below-normal (BN), normal (N) and above-normal (AN) of 0.36, 0.39 and 0.35, respectively), whereas resampling (forecasted BN, N, AN probabilities of 0.18, 0.38 and 0.44, respectively) and the multimodel (forecasted BN, N, AN probabilities of 0.25, 0.36 and 0.39, respectively) correctly identify it as an above-normal year. On the other hand, in the year 2002, a below-normal inflow year, regression (forecasted BN, N, AN probabilities of 0.34, 0.36 and 0.30, respectively) incorrectly suggests it as a normal inflow year, whereas resampling (forecasted BN, N, AN probabilities of 0.44, 0.32 and 0.24, respectively) and the multimodel (forecasted BN, N, AN probabilities of 0.40, 0.31 and 0.29, respectively) correctly identify it as a below-normal inflow year.
The model did not accurately predict the high summer flows in the years 1996 and 1999, mainly because these flows were primarily due to hurricanes (1996: Hurricane Fran; 1999: Hurricane Floyd). Further, in both these years, more than 60% of the seasonal flows occurred during the last 20 days of the season because of hurricanes. Thus, to better predict the flows in such years, one needs to develop updated forecasts all through the season [Sankarasubramanian et al., 2007]. Analyses of the weights (figure not shown) showed that, as expected, the weights of the regression and resampling models were higher than the weights of climatology during below-normal and above-normal inflow years. However, the weights of the climatological ensembles were almost similar to the weights of the resampling and regression models, particularly when flow values were in the normal category. Figure 9b compares the average RPSS between the individual model forecasts and the MM3 and MM4 forecasts as a function of the number of years of validation of the adaptive forecasts. For each validation period (e.g., 30 years), we used the remaining data from 1928 onward for developing the forecasts. From Figure 9b, we can also see that as the length of the validation period increases, the performance of the multimodel forecasts improves.
Since we obtain the number of neighbors by "Varying Kt", which varies depending on the predictor state, we plot the distance between the chosen neighbors and the predictor state PC1t. We also show the distance of the chosen neighbor from the predictor state for each model. It is important to note that PC1 primarily denotes ENSO conditions (correlation between PC1 and Nino3.4 = 0.36), with positive (negative) PC1 denoting El Niño (La Niña) conditions. From Figure 10, we can clearly see that during normal conditions, the distance between the current predictor state and the chosen neighbor's predictor state is small. During extreme predictor conditions, the distance is large, since the nearest neighbor that results in reduced RPS could be far away from the conditioning state of the predictor. This shows that the multimodel algorithm developed in this study identifies similar predictor conditions in improving the performance of the individual model forecasts.
To understand how the prediction intervals of the multimodel forecasts compare with those of the individual model forecasts, Figure 10b shows the ratio of the interquartile range of the conditional distribution to the median of the conditional distribution (IQRM, analogous to a coefficient of variation) for streamflow forecasts from three models, resampling, regression, and multimodel (MM3), contingent on the principal component PC1. We express this as a ratio, rather than as a pure measure of the spread of the distribution, because the ratio incorporates the shift in the conditional distribution (conditional bias) across the models. Over the 78 years of forecasts, the IQRM of the MM3 forecasts was in between (greater than) the IQRM of the regression model in 45 years (31 years). In general, the IQRM of the multimodel forecasts was slightly higher during below-normal years (low PC1 values), since the median flow values are much smaller in those years.
A new methodology for developing optimal multimodel combinations is presented and demonstrated by combining probabilistic streamflow forecasts from two low-dimensional streamflow forecasting models developed for the Falls Lake reservoir in the Neuse River Basin, NC. The proposed approach develops multimodel combinations by evaluating the skill of the candidate forecasting models contingent on the predictor(s) state. By identifying "Kt" conditions similar to the current predictor state, the algorithm evaluates each model's skill by computing its average RPS over the "Kt" neighboring conditions. Using the average RPS, we obtain weights for drawing ensembles from each model to develop the multimodel ensembles. This leads to increased representation of ensembles from a candidate model whose performance is relatively better than that of the other candidate models under the given predictor conditions. To evaluate the proposed scheme, we consider a total of seven different multimodels that include multimodels with no optimal combination (pooling of ensembles) and multimodel combinations based on long-term predictability. The study also evaluates the ability of the multimodels to improve tercile forecasts using Brier scores and reliability plots.
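The combination scheme summarized above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes a single predictor (e.g., PC1), a fixed neighborhood size in place of the state-dependent "Kt", and an inverse-average-RPS weighting rule, since the paper does not spell out the exact mapping from average RPS to weights.

```python
import numpy as np

def multimodel_weights(pc1_hist, rps_hist, pc1_t, k=15):
    """Weight candidate models by their skill under similar predictor states.

    pc1_hist : (T,) historical predictor states (e.g., leading SST PC)
    rps_hist : (T, M) rank probability score of each of M models per year
    pc1_t    : current predictor state
    k        : number of nearest neighbors (fixed here; "Kt" in the paper)
    """
    # identify the k historical years whose predictor state is closest
    dist = np.abs(pc1_hist - pc1_t)
    neighbors = np.argsort(dist)[:k]
    # average RPS of each model over those neighboring conditions
    avg_rps = rps_hist[neighbors].mean(axis=0)
    # lower average RPS means higher skill, so weight inversely (assumed rule)
    w = 1.0 / avg_rps
    return w / w.sum()

def draw_multimodel_ensemble(rng, model_ensembles, weights, n=100):
    """Draw a combined ensemble, sampling each candidate model's
    ensemble in proportion to its weight."""
    counts = rng.multinomial(n, weights)
    pooled = [rng.choice(ens, size=c, replace=True)
              for ens, c in zip(model_ensembles, counts)]
    return np.concatenate(pooled)
```

A model that performs poorly under the neighboring conditions receives a small weight and thus contributes few members to the combined ensemble, which is how climatology can come to dominate when all candidate models lack skill.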
By comparing the performance of these seven multimodels with the individual statistical (regression and resampling) models using various measures such as correlation, average RPS, and average rank probability skill score, we show that forecasts from the proposed methodology have improved predictability compared to (1) forecasts from the individual models, (2) multimodel ensembles with no optimal combination (pooling), and (3) multimodel forecasts based on long-term predictability. To ensure that the improved skill exhibited by the multimodel scheme is statistically significant relative to the skill of the individual models, we perform detailed nonparametric hypothesis tests, as suggested by Hamill, comparing the average RPS of the proposed scheme with those of the other schemes. Results (p-values) from the hypothesis tests show that the reduced RPSs of the seven multimodels are statistically significantly lower than the RPSs of the individual models. Comparing the performance of the proposed methodology with multimodel forecasts based on long-term predictability shows that developing optimal model combinations contingent on the predictor certainly leads to improved predictability. Further, the study also shows that adding climatological ensembles to the individual statistical models leads to a significant reduction in average RPS. However, depending on the threshold probability chosen, adding climatology could also reduce the number of hits and increase missed targets, which arises primarily from the reduced resolution of the multimodel forecasts. The two-step procedure of multimodel combination (MM3), in which the individual models are first combined with climatology and the resulting combinations are then combined in a second step, and the one-step combination (MM4) reduce the average RPS considerably and perform better than the rest of the multimodels and the low-dimensional statistical models.
Both MM3 and MM4 also improve the reliability of the forecasts and reduce the error, in terms of average Brier score, in developing tercile forecasts. The study also shows that adaptive forecasts from the proposed multimodel approach perform better than adaptive multimodel forecasts obtained based on long-term predictability.
The proposed multimodel algorithm combines multiple models by analyzing their skill contingent on the predictor state. As shown in Figure 6, if the predictability of all the models is poor under a particular condition, our approach will eventually replace the multimodel ensembles with climatological ensembles. Though the proposed approach is demonstrated by combining two low-dimensional streamflow forecasting models, it could be extended to combine outputs from multiple GCMs. One difficulty in implementing this for multiple GCMs is the necessity of having at least one dominant climatic mode that influences all the considered models. An obvious candidate for such a common predictor is the leading principal components of global SSTs. However, as pointed out by Hagedorn et al., it is important to develop multimodel ensembles based on the region as well as by identifying the SSTs or climatic modes that influence the streamflow/precipitation potential of the region. Our future studies will investigate the scope for combining multiple GCMs to develop multimodel ensembles of precipitation forecasts and for combining streamflow forecasts obtained from different downscaling schemes.
This study was supported by the North Carolina Water Resources Research Institute. We would also like to thank the three anonymous reviewers and the associate editor, whose valuable comments led to significant improvements in the manuscript. Thanks to Tom Fransen, NC DENR, and Terry Brown, USACE, for sharing the Falls Lake inflow data.
Given that seasonal forecasts are better represented probabilistically using ensembles, expressing the skill of the forecasts using correlation requires summarizing the forecasts with a measure of central tendency such as the mean or median of the conditional distribution, which gives no credit to the probabilistic information in the forecast. The rank probability skill score (RPSS) computes the cumulative squared error between the categorical forecast probabilities and the observed category relative to a reference forecast [Wilks, 1995]. Here, "category" refers to dividing the climatological/observed streamflow, Q, into d = 1, 2, …, D divisions and expressing the marginal probabilities as Pd(Q). Typically, the divisions are made equally probable, with D = 3 categories known as terciles, each having a 1/3 probability of occurrence. These three categories are known as below-normal, normal, and above-normal, and their end points provide the streamflow values bounding each category. Thus, for a total of D categories, the end points based on climatological observations for the dth category can be written as Qd, Qd+1 (for d = 1, Q1 = 0; for d = D, QD+1 = Qmax). Given streamflow forecasts at time "t" from the mth model with i = 1, 2, …, N ensemble members, the forecast probabilities for the dth category can be obtained by computing the fraction of ensemble members falling between Qd and Qd+1. To compute RPSS, the first step is to compute the rank probability score (RPS). Given D categories, we can express the RPS, otherwise known as the L = 2 norm, for a particular year "t" from the mth model as

RPS_t^m = \sum_{d=1}^{D} (CF_d - CO_d)^2 \quad (A1)
where CF_d is the cumulative probability of the forecasts up to category d and CO_d is the cumulative probability of the observed event up to category d. Similar measures for evaluating the entire conditional distribution of forecasts have also been proposed [Muller et al., 2005]. For instance, the L = 1 norm (equation (A2)) can be written as the absolute sum of the deviations between the cumulative probabilities of the forecasts and of the observed event:

\sum_{d=1}^{D} |CF_d - CO_d| \quad (A2)
Thus, if the observed streamflow Qt falls in the dth category, CO_q = 0 for 1 ≤ q ≤ d−1 and CO_q = 1 for d ≤ q ≤ D. Given RPS, we can compute RPSS in relation to a reference forecast, which is usually the climatological forecast with equal probability of occurrence in each category:

RPSS_t^m = 1 - \frac{RPS_t^m}{RPS_t^{clim}}
A low RPS indicates high skill and vice versa. If the relative measure RPSS is positive, the forecast skill exceeds that of the climatological probabilities. RPSS can give an overly pessimistic view of forecast performance and is a demanding metric for evaluating probabilistic forecasts [Goddard et al., 2001]. For a detailed example of how to compute RPS and RPSS for a given forecast, see Goddard et al. In this study, we computed RPS and RPSS for each year for both the regression and resampling ensembles, assuming D = 3.
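The RPS and RPSS computations described in this appendix can be condensed into a short routine. This is a sketch under stated assumptions: the category edges are taken as given (e.g., climatological terciles), and the handling of values falling exactly on an edge is an implementation choice the paper does not specify.

```python
import numpy as np

def rps(ensemble, obs, edges):
    """Rank probability score: cumulative squared error between the
    forecast and observed category probabilities.

    ensemble : forecast ensemble members for one year
    obs      : observed value for that year
    edges    : interior category boundaries (e.g., climatological terciles)
    """
    D = len(edges) + 1                         # total number of categories
    # forecast probability of each category from the ensemble counts
    cats = np.searchsorted(edges, ensemble)
    pf = np.bincount(cats, minlength=D) / len(ensemble)
    cf = np.cumsum(pf)                         # cumulative forecast probabilities CF_d
    # cumulative observed probabilities CO_d: 1 from the observed category on
    co = (np.searchsorted(edges, obs) <= np.arange(D)).astype(float)
    return np.sum((cf - co) ** 2)

def rpss(rps_model, rps_clim):
    """Skill score relative to climatology; positive means the forecast
    beats the climatological probabilities."""
    return 1.0 - rps_model / rps_clim
```

A perfect categorical forecast (all members in the observed category) scores RPS = 0, while a uniform tercile forecast against an observation in an outer tercile scores 5/9, the climatological baseline used in the skill score.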
This appendix briefly summarizes the hypothesis tests employed to determine whether the difference in skill of the streamflow forecasts, indicated by average RPS, between any two of the models given in Table 1 is statistically significant. For testing the null hypothesis on the probabilistic forecast skill measure, the average RPS, we employ nonparametric hypothesis tests based on a resampling approach. For a detailed discussion of the methodology employed, see Hamill. The null hypothesis for testing the average RPS of Models A and B can be written as

H_0: \overline{RPS}^A - \overline{RPS}^B = 0
Using \overline{RPS}^A and \overline{RPS}^B, we now generate the null distribution. To develop this, we generate an indicator variable I_t taking the value 1 (Model A) or 2 (Model B) with equal probability. The indicator is used to select either Model A or B in each year, from which the resampled statistic is developed as

\overline{RPS}^* = \frac{1}{T}\sum_{t=1}^{T} RPS_t^{(1)} - \frac{1}{T}\sum_{t=1}^{T} RPS_t^{(2)} \quad (B4)

where

RPS_t^{(1)} = RPS_t^A,\ RPS_t^{(2)} = RPS_t^B \ \text{if } I_t = 1; \qquad RPS_t^{(1)} = RPS_t^B,\ RPS_t^{(2)} = RPS_t^A \ \text{if } I_t = 2 \quad (B5)
Basically, equations (B4) and (B5) ensure that the resampled statistic is computed using both Models A and B with equal probability. By obtaining 10,000 such resampled statistics, we construct the null distribution and use it to calculate the percentile of the estimated test statistic, the difference in average RPS between the two candidate models A and B. On the basis of the reported percentile, one obtains the p-value used to accept or reject the null hypothesis at the chosen significance level, α.
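The resampling test of this appendix can be sketched as follows; the function name, the two-sided p-value convention, and the fixed seed are illustrative assumptions, with the test statistic taken as the difference in average RPS as described above.

```python
import numpy as np

def rps_resampling_test(rps_a, rps_b, n_resamples=10000, seed=0):
    """Nonparametric test of H0: the average RPS of Models A and B are equal.

    Builds the null distribution by swapping each year's scores between
    the two models with equal probability (the indicator I_t of the text),
    then locates the observed statistic within that distribution.
    Returns the observed statistic and a two-sided p-value.
    """
    rng = np.random.default_rng(seed)
    rps_a, rps_b = np.asarray(rps_a), np.asarray(rps_b)
    observed = rps_a.mean() - rps_b.mean()
    T = len(rps_a)
    null = np.empty(n_resamples)
    for i in range(n_resamples):
        swap = rng.integers(0, 2, size=T).astype(bool)   # indicator I_t per year
        s1 = np.where(swap, rps_a, rps_b)                # resampled "Model 1" scores
        s2 = np.where(swap, rps_b, rps_a)                # resampled "Model 2" scores
        null[i] = s1.mean() - s2.mean()
    p = np.mean(np.abs(null) >= abs(observed))
    return observed, p
```

Because each year's pair of scores is kept together and only the model labels are shuffled, the test respects the year-to-year dependence between the two forecast series, which is the point of this resampling design.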