Technometrics. Author manuscript; available in PMC 2010 June 2.
Published in final edited form as:
Technometrics. 2009 November 1; 51(4): 464–474.
doi:  10.1198/TECH.2009.08126
PMCID: PMC2879656

Simultaneous Determination of Tuning and Calibration Parameters for Computer Experiments


Tuning and calibration are processes for improving the representativeness of a computer simulation code to a physical phenomenon. This article introduces a statistical methodology for simultaneously determining tuning and calibration parameters in settings where data are available from a computer code and the associated physical experiment. Tuning parameters are set by minimizing a discrepancy measure, while the distribution of the calibration parameters is determined based on a hierarchical Bayesian model. The proposed Bayesian model views the output as a realization of a Gaussian stochastic process with hyperpriors. Draws from the resulting posterior distribution are obtained by Markov chain Monte Carlo simulation. Our methodology is compared with an alternative approach in examples and is illustrated in a biomechanical engineering application. Supplemental materials, including the software and a user manual, are available online and can be requested from the first author.

Keywords: Gaussian stochastic process model, Hierarchical Bayesian model, Kriging


Complex computer simulation codes (computer experiments) are used extensively in scientific and engineering studies (see Santner, Williams, and Notz 2003; Fang, Li, and Sudjianto 2005; and Forrester, Sobester, and Keane 2008). Many computer codes were developed for applications where the results of physical experiments are also available to describe the true input/output relationship of interest. In this article we are interested in settings where both types of data are available, but the number of computer code runs is limited because they have lengthy run times. In such a case, statistical models are needed to extend the set of inputs at which output is available by predicting the response from the computer and physical experiments based on a limited number of training runs.

Such statistical models must incorporate different types of input variables. This article focuses on computer experiments having three types of inputs: control variables, tuning parameters, and calibration parameters. Control variables are (design) inputs to both the simulation code and the corresponding physical experiment (Santner, Williams, and Notz 2003, p. 15). A primary goal of the physical experiment is to study the relationship between the response and control variables.

Tuning parameters are inputs that are present only in the simulation code. Tuning parameters are typically numerical quantities that control the solution of a numerical algorithm implemented by a computer experiment. Thus, tuning parameters have no meaning in the physical experiment. For example, Rawlinson et al. (2006) described a finite element analysis (FEA) computer code simulating the forces and movements of a knee prosthesis under a given loading regimen. The two tuning parameters for their model were the mesh density of the tibial tray and the amount of discretization in the curve used to describe the knee loading. Tuning is the process of determining the values of the tuning parameters so that a computer simulation best represents a physical experiment.

Calibration parameters are also present only in the simulation code. Unlike tuning parameters, calibration parameters have meaning in the physical experiment, but their values are either unknown or unmeasured during the running of the physical experiment. For example, Kennedy and O’Hagan (2001) described a computer code for simulating the deposition of ruthenium 106 in the Tomsk-7 chemical plant, which experienced an accident in 1993. In their code, one calibration parameter was the deposition velocity. Calibration is the process of determining plausible values of the calibration parameters to match the results of the physical experiment.

Some computer experiments have both tuning and calibration parameters. For example, in the FEA code of Rawlinson et al. (2006) two calibration parameters were position of the femur relative to the tibia in the initial gait cycle and friction between the bone and the prosthesis.

Although considerable research has been dedicated to setting either tuning or calibration parameters, there has been little study on the problem of setting both simultaneously. In this article, we introduce a methodology for the simultaneous determination of tuning and calibration parameters. Section 2 reviews current methods for tuning and calibration and discusses their limitations. Section 3 introduces a hierarchical Bayesian model describing the response from a physical experiment together with the output from the computer simulation code having both tuning and calibration parameters. Section 4 proposes a methodology for simultaneously setting these parameters. Section 5 compares the proposed methodology with an approach that treats all parameters as calibration parameters.


Previous proposals for calibration have been primarily Bayesian. Their goal was to simulate the posterior distributions of the calibration parameters and to use this posterior distribution to better predict unknown responses and assess the uncertainty in the prediction. Craig et al. (1996, 2001) developed a Bayesian linear forecasting method, which can be seen as a pragmatic approximation to a fully Bayesian analysis. Kennedy and O’Hagan (2001) described a Bayesian calibration framework based on a model having a bias function and a (modular) Bayesian analysis. Higdon et al. (2004) proposed a fully Bayesian implementation for the model in Kennedy and O’Hagan (2001). Higdon et al. (2008) developed a methodology for calibration when the outputs are multivariate. Gattiker (2005) implemented the models in Higdon et al. (2004, 2008) in MATLAB. Loeppky, Bingham, and Welch (2006) proved that the previous procedures would lead to correct estimation of the calibration parameters asymptotically, provided the true values of the calibration parameters could reduce the bias of the computer simulation to zero. We discuss a lemma in Loeppky, Bingham, and Welch (2006) in Section 5.

Classical methods used to set tuning parameters, such as Park (1991) and Cox, Park, and Singer (1996), select these inputs to make the computer code output fit the physical observations as closely as possible in an integrated prediction error sense. These proposals for selecting tuning parameters do not provide estimated uncertainties associated with the selected values. However, several recent proposals for “validating” computer experiments are similar to the methodology for Bayesian calibration. For example, Bayarri et al. (2007b) presented a framework with a Bayesian model for validation and Bayarri et al. (2007a) used wavelet decomposition based on the same Bayesian model to validate computer experiments having multivariate outputs.

Because the methods above focus on either tuning or calibration, they may suffer limitations in applications where both tuning and calibration parameters are present. Specifically, if one applies classical tuning methodology to a problem involving both tuning and calibration parameters, one cannot quantify the uncertainties of the predicted calibration parameters. If one applies calibration methodology to such a problem and treats all parameters as calibration parameters, one can obtain biased parameter predictions and/or undesirable estimated uncertainties of the tuning and calibration parameters as we will show below. These misleading predictions and estimated uncertainties can lead to large prediction errors.

To demonstrate the limitations alluded to above, we used the Bayesian calibration program (gpmsa) described in Gattiker (2005) to set tuning and calibration parameters for an application in Rawlinson et al. (2006). Originally, the goal of this project was to compare the damage in two knee implants [the Insall-Burstein (IB) manufactured by Zimmer, Inc., and the Optetrak produced by Exactech, Inc.] under certain loading regimens.

Here, we consider only the IB knee implant and a physical experiment in which the measurements consist of kinematics and kinetics of knees tested in an Instron-Stanmore KC1 testing device, a “knee simulator.” The loading (magnitude, angle, and rate), joint stiffness, and knee design were modeled in a finite element analysis (FEA) computer code whose output included the kinematics, kinetics, and stresses experienced by the knee component. The output we considered was the anterior–posterior displacement (APD) of the femoral component with respect to the tibial tray, which is an important kinematic output of fore-aft motion during knee function. The APD was measured in both the knee simulator and the computer simulation; it is roughly proportional to the anterior–posterior force that, coupled with the vertical joint load, contributed to damage of the prosthesis. Thus this computer experiment had one control variable (the percentile in gait cycle), two tuning parameters (finite element mesh density and load discretization), and two calibration parameters (friction and initial position).

The designs of the computer and the physical experiments were roughly maximin Latin hypercube designs (McKay, Beckman, and Conover 1979). A pilot study of the APD for this design acquired 439 observations from the computer code and 36 observations from the physical experiment for the purpose of tuning and calibration. Figure 1 depicts the training data. A previous sensitivity analysis (Saltelli, Chan, and Scott 2000) found that APD was sensitive to neither friction nor mesh density. We therefore study load discretization (a tuning parameter) and initial position (a calibration parameter), treating the percentile in the gait cycle as a design input to both the physical and computer experiments.

Figure 1
Scatterplot of the measured APD against percentile in the gait cycle, from the knee simulator (triangles) and the FEA computer code (dots).

Using gpmsa (Gattiker 2005), we treated both load discretization and initial position as calibration parameters whose prior distributions were nearly uniform. We ran gpmsa with 8000 burn-in iterations and 2000 production runs, and based our inferences on 100 equally spaced values of each parameter taken from the production runs. Figure 2 shows the simulated posterior distributions of these two parameters. The posterior distribution of load discretization is bimodal, and that of initial position has its mode at a low value in its range but shows large variation. Thus, plausible values of the tuning and calibration parameters remain unclear. We demonstrate in Section 5.2 that our methodology can significantly improve the selection of the tuning and calibration parameters in this application.

Figure 2
Simulated posterior distributions of load discretization (the left panel) and initial position (the right panel).


We let x denote (the vector of) control inputs, t denote the vector of tuning parameters, and t* denote the "best" values of the tuning parameters, where t* will be defined in Section 4. We let c denote the vector of calibration parameters and c* denote the unknown true values of the calibration parameters. We assume x ∈ [0, 1]^{p_x}, t ∈ [0, 1]^{p_t}, and c ∈ [0, 1]^{p_c}, or that they can be so scaled. Information about the value of the unknown c* will be contained in its (marginal) prior, which will be denoted by [c*].

We let {(x_i^s, c_i, t_i), y^s(x_i^s, c_i, t_i); i = 1, 2, …, n_s} denote the training data from the computer experiment and {x_j^p, y^p(x_j^p); j = 1, 2, …, n_p} denote the training data from the physical experiment. Here n_s and n_p are the numbers of runs for each type of data. A more complete notation for the physical experiment response is y^p(x, c*); throughout this article, y^p(x) and y^p(x, c*) are equivalent. We let y^s be the n_s × 1 vector with ith element y^s(x_i^s, c_i, t_i) and y^p be the n_p × 1 vector with jth element y^p(x_j^p).

The model proposed here is similar in spirit to the ones in Kennedy and O'Hagan (2001) and Higdon et al. (2005), except that it explicitly incorporates both the tuning and calibration parameters. In other words, the response from the physical experiment is the sum of the "true response" and random noise, while the output from the computer experiment is the difference between the true response and the code bias.

Formally, we assume that the code output at (x, c, t), y^s(x, c, t), is a realization of the Gaussian stochastic process

Y^s(x, c, t) = f_Z^⊤(x, c, t) β_Z + Z(x, c, t),  (1)

where f_Z(x, c, t) is a vector of known regression functions and β_Z is a vector of unknown regression parameters. If Y^s(•) is assumed to have a constant mean, then f_Z^⊤(x, c, t) β_Z = β_{Z,0}. In Equation (1), Z(•) is a stationary Gaussian stochastic process with mean 0, variance σ_Z², and product Gaussian correlation

R_Z((x, c, t), (x′, c′, t′)) = ∏_{k=1}^{p_x} ρ_{Z,x,k}^{4(x_k − x′_k)²} ∏_{k=1}^{p_c} ρ_{Z,c,k}^{4(c_k − c′_k)²} ∏_{k=1}^{p_t} ρ_{Z,t,k}^{4(t_k − t′_k)²}  (2)

with correlation parameters ρ_Z = (ρ_{Z,x,1}, …, ρ_{Z,x,p_x}, ρ_{Z,c,1}, …, ρ_{Z,c,p_c}, ρ_{Z,t,1}, …, ρ_{Z,t,p_t})⊤. Thus y^s is a realization of the random vector Y^s whose ith element is Y^s(x_i^s, c_i, t_i).

We regard the output of the physical experiment at x, y^p(x), as a realization of

Y^p(x) = η(x, c*) + ε(x),  (3)

where η(x, c*) = E(Y^p(x)) is the expected response at x and ε is a white noise (Gaussian) process with mean 0 and variance σ_ε². Thus y^p is a realization of the random vector Y^p whose jth element is Y^p(x_j^p).

We define

δ(x, c, t) = η(x, c*) − y^s(x, c, t)  (4)

to be the code bias at (x, c, t). Our model assumes δ(•) is a draw from the Gaussian stochastic process

Δ(x, c, t) = f_D^⊤(x, c, t) β_D + D(x, c, t),  (5)

where f_D(x, c, t) is a vector of known regression functions and β_D is a vector of unknown regression parameters; if Δ(•) is assumed to have a constant mean, then f_D^⊤(x, c, t) β_D = β_{D,0}. Similar to Z(•), D(•) is defined to be a stationary Gaussian stochastic process with mean 0, variance σ_D², and correlation function

R_D((x, c, t), (x′, c′, t′)) = ∏_{k=1}^{p_x} ρ_{D,x,k}^{4(x_k − x′_k)²} ∏_{k=1}^{p_c} ρ_{D,c,k}^{4(c_k − c′_k)²} ∏_{k=1}^{p_t} ρ_{D,t,k}^{4(t_k − t′_k)²},  (6)

where ρ_D = (ρ_{D,x,1}, …, ρ_{D,x,p_x}, ρ_{D,c,1}, …, ρ_{D,c,p_c}, ρ_{D,t,1}, …, ρ_{D,t,p_t})⊤. Finally we assume that, given their parameters, Z(•) and D(•) [hence Y^s(•) and Δ(•)] are mutually independent processes. Some authors call δ(•) the "code discrepancy" or "code inadequacy." Throughout this article, δ(•) is referred to as code bias and the word "discrepancy" is used in Section 4.1 to designate an objective function used in the determination of the tuning parameters.

In conclusion, conditional on the parameters (β_Z, σ_Z², ρ_Z), our model assumes that for given inputs (x, c, t), the simulator output y^s(x, c, t) is a draw from

Y^s(x, c, t) = f_Z^⊤(x, c, t) β_Z + Z(x, c, t).  (7)

Similarly, conditional on unknown parameters, the physical experiment output at control input x, y^p(x), can be viewed as a draw from

Y^p(x) = Y^s(x, c*, t) + Δ(x, c*, t) + ε(x)  (8)

for any given vector t of tuning parameters. The unknown parameters being conditioned on in Equation (8) are those of Y^s(x, c*, t), Δ(x, c*, t), and ε, which are (c*, β_Z, σ_Z², ρ_Z), (c*, β_D, σ_D², ρ_D), and σ_ε², respectively. Thus all conditional statements given Y^s = y^s and Y^p = y^p are computed with respect to Equations (7) and (8). Appendix B lists the variance–covariance matrices for the vectors Y^s and Y^p and their cross-covariance matrix.

Next we introduce our prior model for the unknown model parameters and derive the posteriors given the tuning parameters t. The proposed priors are similar to the ones in Higdon et al. (2004, 2005), and Gattiker (2005). For a generic parameter (or a vector of parameters) v, we let [v] denote the (joint) prior distribution.

We assume that the priors for the calibration parameters, c*, the process means, β = (β_Z, β_D), the process variances, σ = (σ_Z², σ_D², σ_ε²), and the process correlations, ρ = (ρ_Z, ρ_D), are independent so that

[c*, β, σ, ρ] = [c*] × [β] × [σ] × [ρ].  (9)
If expert knowledge about the priors is available it should be used. For example, if the researcher knows (roughly) the magnitude of the noise term, it should be used to construct the prior for σε2. Lacking such knowledge, our methodology is based on the following priors.

The examples and software take f_Z^⊤(•)β_Z = β_{Z,0} and f_D^⊤(•)β_D = β_{D,0}. Our methodology for selecting (β_{Z,0}, β_{D,0}) is modular in the spirit of Bayarri et al. (2007a) in that these mean parameters are determined by non-Bayesian arguments in Section 4.2. All other parameters are determined from draws of the posterior.

Because the calibration parameters are scaled to [0, 1]^{p_c}, all the elements of c* are taken to have independent and identical truncated normal distributions with mean 0.5, standard deviation 2, and support [0, 1]. This prior is close to a uniform distribution on [0, 1]^{p_c}.
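The near-uniformity of this truncated normal prior is easy to check numerically. The sketch below is our own illustration (not part of the paper's software); it implements the truncated normal density with standard-library tools only.

```python
import math

def truncnorm_pdf(x, mu=0.5, sigma=2.0, lo=0.0, hi=1.0):
    """Density of a N(mu, sigma^2) distribution truncated to [lo, hi]."""
    if x < lo or x > hi:
        return 0.0
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF
    mass = Phi((hi - mu) / sigma) - Phi((lo - mu) / sigma)      # truncated mass
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi)) / mass

# With mu = 0.5 and sigma = 2, the density varies by only about 3% over [0, 1],
# so the prior is indeed close to uniform.
flatness = truncnorm_pdf(0.0) / truncnorm_pdf(0.5)
```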

The variance parameters are taken to have independent inverse Gamma prior distributions. Letting IG(α, γ) denote the inverse Gamma distribution with mean 1/[γ(α − 1)] and variance 1/[γ²(α − 1)²(α − 2)], we take σ_Z² ∼ IG(α_Z, γ_Z), σ_ε² ∼ IG(α_ε, γ_ε), and σ_D² ∼ IG(α_D, γ_D), where the parameters are set data-adaptively, as follows. We first compute the sample variances of y^s and y^p, which we denote by σ_s² and σ_p², respectively. We take σ_Z² to have a (mildly) informative prior distribution by setting (α_Z, γ_Z) = (10, 0.1/σ_s²). We select the σ_ε² prior to have a small magnitude and variation by taking (α_ε, γ_ε) = (1, 1000/σ_p²) as the default. The supplemental software package allows the user to specify a multiplier for σ_ε² so that the program can suit physical experiments with random noise of different magnitudes. We use the value of (σ_p² − σ_s²) to determine the prior for σ_D²: if σ_p² > σ_s², σ_D² is set to have a mildly informative prior by taking (α_D, γ_D) = (10, 0.1/(σ_p² − σ_s²)), while if σ_p² ≤ σ_s², σ_D² is set to have a prior with small magnitude and variation by selecting (α_D, γ_D) = (1, 100/σ_s²).
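The data-adaptive defaults can be collected in a small helper. This is an illustrative sketch of the rules just described (the function name and return structure are our own, and the user-specified noise multiplier is omitted).

```python
import statistics

def ig_hyperparameters(ys, yp):
    """Default inverse-Gamma hyperparameters (alpha, gamma) for the process
    variances, where IG(alpha, gamma) has mean 1/(gamma*(alpha - 1))."""
    var_s = statistics.variance(ys)  # sample variance of code outputs
    var_p = statistics.variance(yp)  # sample variance of physical responses
    priors = {
        "sigma2_Z": (10.0, 0.1 / var_s),      # mildly informative
        "sigma2_eps": (1.0, 1000.0 / var_p),  # small magnitude and variation
    }
    if var_p > var_s:                         # extra variability: informative bias prior
        priors["sigma2_D"] = (10.0, 0.1 / (var_p - var_s))
    else:                                     # otherwise keep the bias prior small
        priors["sigma2_D"] = (1.0, 100.0 / var_s)
    return priors
```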

Finally, we take the priors of the correlation parameters to be independently and identically distributed Beta(1, 0.5) distributions. This Beta distribution has mean 2/3, mode close to 1, and a rather diffuse support over (0, 1); it encourages dependence of the output on each input, but the broad support allows the data a strong voice in determining the posteriors (Linkletter et al. 2006 and Higdon et al. 2008).

Let φ = (β, σ, ρ) [or φ = (σ, ρ) when β is fixed as below] and [c*, φ] denote the joint prior of c* and φ. For a fixed t and given the data (y^s, y^p), the joint posterior distribution can be written as

[c*, φ | y^s, y^p, t] ∝ [y^s, y^p | c*, φ, t] × [c*, φ],  (10)

where the likelihood [y^s, y^p | c*, φ, t] is the multivariate normal density with the mean and covariance implied by Equations (7) and (8) (given in Appendix B).  (11)

(See Appendix A for details of the posterior distribution.) We simulate draws from [c*, φ | y^s, y^p, t] by implementing a Metropolis–Hastings (MH) algorithm. Appendix A sketches this algorithm (see Higdon et al. 2004, 2005 for details about related MH algorithms). Using this model and the relationship in Equations (10) and (11), draws can be made from the joint posterior distribution [c*, φ | y^s, y^p, t], which forms the basis for our proposed methodology for simultaneous tuning and calibration in Section 4.


4.1 The Discrepancy Function

Our definition of the ideal tuning parameters is based on a function that measures the discrepancy between the physical experiment response and the computer code output. The discrepancy is a function of (c, t) but can be viewed as a function of t alone for given c = c*. Three of the many possible discrepancy functions are the L2 discrepancy, ∫ (η(x, c*) − y^s(x, c, t))² dx, the L1 discrepancy, ∫ |η(x, c*) − y^s(x, c, t)| dx, and the L∞ discrepancy, max_x |η(x, c*) − y^s(x, c, t)|; here (and below) the integrals are over [0, 1] raised to an appropriate power. In this article we use the L2 discrepancy, but our methodology can be applied with any other discrepancy function.
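For intuition, the three discrepancies can be compared by a simple grid approximation when x is scalar. The helper below is purely illustrative; `eta` and `ys` stand in for the true response η(x, c*) and the code output y^s(x, c, t) at fixed (c, t).

```python
def discrepancies(eta, ys, n_grid=1001):
    """Grid approximations of the L2, L1, and L-infinity discrepancies
    between eta(x) and ys(x) for scalar x in [0, 1]."""
    xs = [i / (n_grid - 1) for i in range(n_grid)]
    diffs = [eta(x) - ys(x) for x in xs]
    l2 = sum(d * d for d in diffs) / n_grid   # approximates the integral
    l1 = sum(abs(d) for d in diffs) / n_grid
    linf = max(abs(d) for d in diffs)
    return l2, l1, linf
```

For example, a code with a constant offset of 0.1 from the truth has L2 discrepancy 0.01 and L1 and L∞ discrepancies 0.1.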

If c* is known, the L2 discrepancy is defined by

∫ δ²(x, c*, t) dx = ∫ (η(x, c*) − y^s(x, c*, t))² dx,  (12)

regarded as a function of t. If c* is unknown, but has a posterior distribution [c* | y^s, y^p, t], the L2 discrepancy becomes

∫ E(δ²(x, c*, t) | y^s, y^p, t) dx.  (13)

Our objective in tuning is to estimate

t* = argmin_t ∫ E(δ²(x, c*, t) | y^s, y^p, t) dx.

Our objective in calibration is to compute the posterior distribution [c* | y^s, y^p, t*]. We achieve the two objectives by minimizing an estimated squared discrepancy (defined in Section 4.2) and simulating the posterior distribution of c* given the minimizer.

4.2 Simultaneous Tuning and Calibration

To review, the main idea is to choose the most plausible values of the tuning parameters, so that the computer code best represents the physical experiment, and to use these tuning parameters in inferences about the calibration parameters. Next, we describe our method for estimating δ²(x, c*, t), t*, and [c* | y^s, y^p, t*], which we call the STaC (simultaneous tuning and calibration) procedure. We then summarize the methodology as a three-step procedure.

Our first of the three steps is to estimate δ²(x, c*, t) by the posterior mean E(Δ²(x, c*, t) | y^s, y^p, t), which is the Bayes predictor of δ²(x, c*, t) under squared error loss (Santner, Williams, and Notz 2003). Notice that E(Δ²(x, c*, t) | c*, φ, y^s, y^p, t) is equal to

[E(Δ(x, c*, t) | c*, φ, y^s, y^p, t)]² + Var(Δ(x, c*, t) | c*, φ, y^s, y^p, t).  (14)

Closed-form expressions for both E(Δ(x, c*, t) | c*, φ, y^s, y^p, t) and Var(Δ(x, c*, t) | c*, φ, y^s, y^p, t) can be derived analytically based on the probability density of the multivariate normal distribution (see Appendix B).

The two terms in Equation (14) can be interpreted intuitively, as follows. Large values of [E(Δ(x, c*, t) | c*, φ, y^s, y^p, t)]² imply that there is a large discrepancy between the computer and the physical experiments; thus this term can be interpreted as the squared prediction bias. The term Var(Δ(x, c*, t) | c*, φ, y^s, y^p, t) is a measure of the uncertainty of the predicted bias. Thus Equation (14) favors a t that minimizes the sum of the squared prediction bias and its predicted uncertainty.
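The decomposition in Equation (14) is the identity E(W²) = [E(W)]² + Var(W) applied to W = Δ(x, c*, t), conditional on the parameters. A quick numerical check of the identity (our own illustration, exact when the population variance is used):

```python
def second_moment_decomposition(draws):
    """Return (E[W^2], (E[W])^2 + Var[W]) computed from a sample; with the
    population variance the two quantities agree exactly."""
    n = len(draws)
    mean = sum(draws) / n
    var = sum((w - mean) ** 2 for w in draws) / n  # population variance
    second_moment = sum(w * w for w in draws) / n
    return second_moment, mean ** 2 + var
```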

The right-hand side of

E(Δ²(x, c*, t) | y^s, y^p, t) = E_{[c*, φ | y^s, y^p, t]}[E(Δ²(x, c*, t) | c*, φ, y^s, y^p, t)]

can be approximated using Equation (14) and draws (c_ℓ, φ_ℓ), ℓ = 1, …, N_MC, from [c*, φ | y^s, y^p, t], which can be obtained using the MH algorithm. The law of large numbers guarantees that

(1/N_MC) Σ_{ℓ=1}^{N_MC} E(Δ²(x, c_ℓ, t) | c_ℓ, φ_ℓ, y^s, y^p, t) ≈ E(Δ²(x, c*, t) | y^s, y^p, t)  (15)

for a sufficiently large N_MC. Thus, we can approximate E(Δ²(x, c*, t) | y^s, y^p, t) asymptotically.

The estimation of δ²(x, c*, t) using Equation (15) requires knowledge of (β_{Z,0}, β_{D,0}); these parameters are set as follows. The parameter β_{D,0} can be interpreted as the difference between the average of the physical experiment responses and the average of the computer experiment outputs when the calibration and tuning parameters are equal to (c*, t). The value of β_{D,0} can impact both calibration and tuning and thus must be carefully chosen. Assuming expert knowledge about the difference β_{D,0} is not available, we suggest setting β_{D,0} based on the assumption that, for any given t, the code inadequacy decreases as c gets closer to c*, so that β_{D,0} is minimized when c = c*. Hence we take β_{D,0} to be the minimized difference, over c values, between the sample mean of the n_p physical experiment responses and the sample mean of n_p predicted code outputs whose corresponding inputs are {(x_j^p, c, t)}_{j=1}^{n_p}, where {x_j^p}_{j=1}^{n_p} are the inputs to the physical experiment and t is given. The minimization must be over the calibration parameter input space. Operationally, this is accomplished by using a space-filling maximin Latin hypercube design to select a set of n_s values of c over which the minimization is carried out. For each choice of c, we predict the n_p outputs {y^s(x_j^p, c, t)}_{j=1}^{n_p} based on the training data from the computer experiment using a MATLAB version of the parametric empirical kriging (PErK) program in Santner, Williams, and Notz (2003) and Appendix C. We then compute the difference between the sample mean of {y^p(x_j^p)}_{j=1}^{n_p} and the sample mean of the predicted {y^s(x_j^p, c, t)}_{j=1}^{n_p}. Repeating this operation for each choice of c yields n_s differences. We set β_{D,0} to the smallest difference and then set β_{Z,0} so that β_{Z,0} + β_{D,0} equals the average of the n_p physical experiment responses.
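The modular choice of (β_{Z,0}, β_{D,0}) can be sketched as follows. Here `predict_ys` is a placeholder for a PErK-style emulator fitted to the computer-experiment data, and taking the difference smallest in absolute value is our reading of "smallest difference"; both are assumptions of this sketch.

```python
def set_mean_parameters(x_phys, y_phys, c_candidates, t, predict_ys):
    """For a given t, pick beta_D0 as the smallest-magnitude difference
    between the mean physical response and the mean emulated code output at
    the physical-design inputs; beta_Z0 then makes beta_Z0 + beta_D0 equal
    the mean physical response."""
    ybar_p = sum(y_phys) / len(y_phys)
    diffs = []
    for c in c_candidates:  # e.g. a maximin Latin hypercube over c-space
        preds = [predict_ys(x, c, t) for x in x_phys]
        diffs.append(ybar_p - sum(preds) / len(preds))
    beta_D0 = min(diffs, key=abs)
    beta_Z0 = ybar_p - beta_D0
    return beta_Z0, beta_D0
```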

In the second step, we estimate t* by

t̂ = argmin_t ∫ E(Δ²(x, c*, t) | y^s, y^p, t) dx.  (16)

The integral over x in Equation (16) can be approximated by a further Monte Carlo integration that averages Equation (15) over a grid of x inputs. In our examples, we took

t̂ = argmin_t (1/(N_x N_MC)) Σ_{i=1}^{N_x} Σ_{ℓ=1}^{N_MC} E(Δ²(x_i, c_ℓ, t) | c_ℓ, φ_ℓ, y^s, y^p, t)  (17)

with N_x = 101 and an equally spaced grid of values of x. By the law of large numbers, t̂ converges to t* almost surely as N_x and N_MC go to infinity. We regard t̂ in Equation (17) as the estimator of t* and the quantity Σ_{i=1}^{N_x} Σ_{ℓ=1}^{N_MC} E[Δ²(x_i, c_ℓ, t) | c_ℓ, φ_ℓ, y^s, y^p, t]/(N_x N_MC) as the estimated squared discrepancy.

In the third step, we estimate the posterior distribution [c* | y^s, y^p, t*]. Because t̂ converges to t* almost surely, [c* | y^s, y^p, t̂] converges to [c* | y^s, y^p, t*] in distribution. We therefore estimate the distribution [c* | y^s, y^p, t*] using the draws of c* from [c*, φ | y^s, y^p, t̂].

In summary, we conduct simultaneous tuning and calibration as follows:

  • Step 1. For each possible t in a grid of the tuning parameter vectors, set the values of (β_{Z,0}, β_{D,0}), make draws {(c_ℓ, φ_ℓ); ℓ = 1, …, N_MC} from [c*, φ | y^s, y^p, t], and compute the estimated squared discrepancy at a set of inputs {x_i; i = 1, …, N_x}.
  • Step 2. Compute t̂ = argmin_t (1/(N_x N_MC)) Σ_{i=1}^{N_x} Σ_{ℓ=1}^{N_MC} E(Δ²(x_i, c_ℓ, t) | c_ℓ, φ_ℓ, y^s, y^p, t).
  • Step 3. Estimate [c* | y^s, y^p, t̂].
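The first two steps can be sketched end to end. In this illustration `draw_posterior(t)` stands in for the MH sampler producing {(c_ℓ, φ_ℓ)} and `delta_mean_var(x, c, φ, t)` for the closed-form posterior mean and variance of Δ from Appendix B; both are placeholders, not the paper's actual software.

```python
def stac_tune(t_grid, x_grid, draw_posterior, delta_mean_var):
    """Steps 1-2 of STaC: for each t on a grid, average the estimated
    E(Delta^2) = bias^2 + variance over posterior draws and x inputs, then
    return the minimizing t and its estimated squared discrepancy."""
    best_t, best_disc = None, float("inf")
    for t in t_grid:
        draws = draw_posterior(t)              # [(c_1, phi_1), ...]
        total = 0.0
        for c, phi in draws:
            for x in x_grid:
                m, v = delta_mean_var(x, c, phi, t)
                total += m * m + v             # Equation (14)
        disc = total / (len(draws) * len(x_grid))
        if disc < best_disc:
            best_t, best_disc = t, disc
    return best_t, best_disc
```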

4.3 Prediction

In addition to tuning and calibration, we can use t̂ and the draws {(c_ℓ, φ_ℓ)} from [c*, φ | y^s, y^p, t̂] to predict y^s(•) and η(•) at arbitrary inputs and to construct predictive intervals for these quantities. Given (x, c, t), the best linear unbiased predictor (BLUP) of y^s(x, c, t) is E(Y^s(x, c, t) | y^s, y^p, t) = E_{[c*, φ | y^s, y^p, t]}[E(Y^s(x, c, t) | y^s, y^p, t, c*, φ)]. Thus, by the law of large numbers,

ŷ^s(x, c, t) = (1/N_MC) Σ_{ℓ=1}^{N_MC} E(Y^s(x, c, t) | y^s, y^p, t, c_ℓ, φ_ℓ)  (18)

is an approximation to the BLUP of y^s(x, c, t) for large N_MC. Similarly, given x,

η̂(x, c*) = (1/N_MC) Σ_{ℓ=1}^{N_MC} E(Y^s(x, c_ℓ, t̂) + Δ(x, c_ℓ, t̂) | y^s, y^p, t̂, c_ℓ, φ_ℓ)  (19)

is an approximation to the BLUP of η(x, c*) by Equations (3) through (5).

We quantify the uncertainty in the ŷ^s(x, c, t) predictor using

Var(Y^s(x, c, t) | y^s, y^p, t̂),  (20)

which can be approximated using draws from the MH algorithm. We provide details in Appendix C. Similarly, we quantify the uncertainty in the η̂(x, c*) predictor using Var(Y^s(x, c*, t̂) + Δ(x, c*, t̂) | y^s, y^p, t̂). Details of this computation are also given in Appendix C.

Based on Equations (18), (19), and (20) and the results in Appendix C, we can compute point-wise 100(1 − α)% prediction bands for y^s(x, c, t) and η(x, c*) as

ŷ^s(x, c, t) ± z_{α/2} [Var(Y^s(x, c, t) | y^s, y^p, t̂)]^{1/2}  (21)

and

η̂(x, c*) ± z_{α/2} [Var(Y^s(x, c*, t̂) + Δ(x, c*, t̂) | y^s, y^p, t̂)]^{1/2},  (22)

where z_{α/2} is the upper α/2 critical point of the standard normal distribution.
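Given a predictor and its variance, the bands above are ordinary normal-theory intervals. A minimal, dependency-free sketch (the bisection-based normal quantile is our own device, not from the paper's software):

```python
import math

def prediction_band(mean, var, alpha=0.01):
    """Point-wise 100(1 - alpha)% band: mean +/- z_{alpha/2} * sqrt(var)."""
    def z_upper(p):
        """Upper-p critical point of N(0, 1), found by bisection."""
        lo, hi = 0.0, 10.0
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            tail = 0.5 * (1.0 - math.erf(mid / math.sqrt(2.0)))  # 1 - Phi(mid)
            if tail > p:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    half = z_upper(alpha / 2.0) * math.sqrt(var)
    return mean - half, mean + half
```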


5.1 Introduction

In this section, we compare Bayesian calibration as implemented by the gpmsa software with STaC. Recall that the former performs a fully Bayesian calibration for the tuning as well as the calibration parameters, whereas STaC sets the tuning parameters to minimize a predicted L2 discrepancy between the computer and the physical experiment outputs. Lemma 1 in Loeppky, Bingham, and Welch (2006) shows, roughly, that for their Gaussian stochastic process model, under the conditions that (a) the discrepancy between the computer and the physical experiments can be minimized to zero and (b) infinitely many runs from the computer simulation and the physical experiment are available, the maximum likelihood estimator (MLE) of the discrepancy is zero and the MLE of the calibration parameters minimizes the discrepancy asymptotically. Based on this lemma, Bayesian calibration and STaC can give comparable estimates of t* in our model if (A) the discrepancy δ(•) can be made sufficiently close to zero, (B) the number of observations from the computer experiment is reasonably large, and (C) the posterior mode of every calibration parameter is close to the MLE of the parameter.

However, (A), (B), and (C) can be hard or impossible to verify or implement in many applications, for at least three reasons. It may not be possible to drive the discrepancy δ(•) to zero because of the simplified physics or biology used to construct the simulation code. The numbers of computer and physical experiment runs are typically limited. The posterior mode of a calibration parameter can differ from the MLE if either the prior for c* or Δ(•) is informative. Therefore, the Bayesian calibration approach may fail to minimize the difference between the computer and the physical experiments. Next we compare STaC with gpmsa in a numerical example where the bias can be minimized to zero, yet the training data are scarce.

5.2 An Illustrative Example

This example has three real-valued inputs to a computer code: one control variable x, one tuning parameter t, and one calibration parameter c. All three inputs have the support [0, 1]. The output from the computer code was generated as


and the response from the physical experiment as


where ε denotes noise generated from a normal distribution having mean zero and standard deviation 0.01. In this example the bias term can be driven to zero by setting c = 0.1 and t = 0.8. Thus any approach that suggests t* = 0.8 and c* = 0.1 is favorable.

We generated 30 inputs for the computer experiment and 5 for the physical experiment, with both designs being maximin Latin hypercube designs. The Bayesian calibration program gpmsa was run treating c and t as calibration parameters. Following 8000 burn-in iterations and 2000 production ones, we regarded 100 equally spaced samples drawn from the 2000 production runs as draws from the joint posterior distribution of (t*, c*). The estimated posterior distributions of t* and c* are shown in Figure 3. The estimated [t* | y^p, y^s] is not able to pinpoint that t* = 0.8, and the estimated [c* | y^p, y^s] fails to suggest values close to 0.1. We next used gpmsa to predict η(x, c*) over the grid x = 0, 0.02, …, 1 of equally spaced inputs. Figure 4 depicts the training data, the true response curve, and the predictions. We see that the predictions are biased when x is near 0. The root mean squared prediction error (RMSPE) was computed as a measure of predictive accuracy; the RMSPE of a generic predictor η̂(•) with respect to the x grid above is defined to be [(1/51) Σ_{i=1}^{51} (η̂(x_i, c*) − η(x_i, c*))²]^{1/2}. The RMSPE of the predictor obtained by gpmsa is 0.0693.
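The RMSPE used above is straightforward to compute; a minimal sketch:

```python
def rmspe(predictions, truths):
    """Root mean squared prediction error over a prediction grid."""
    n = len(predictions)
    sq = sum((p - t) ** 2 for p, t in zip(predictions, truths))
    return (sq / n) ** 0.5
```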

Figure 3
Simulated posterior distributions of t* (the left panel) and c* (the right panel) using the Bayesian calibration program.
Figure 4
The training data (solid circles), the true response curve (the solid line), and the predictions (pluses) obtained by gpmsa.

We applied STaC to this example with N_MC = 100 and N_x = 101. The number of burn-in iterations, the number of production runs, the sample spacing, and the prediction inputs were the same as used for gpmsa. The left panel in Figure 5 plots the estimated L2 discrepancy against t for t = 0, 0.1, …, 0.9, 1; STaC picks t̂ = 0.8. The right panel in Figure 5 shows the histogram of the simulated posterior distribution of c* when t* = 0.8. The posterior distribution has a center near 0.1. By running the program multiple times, we found that the smallest discrepancy occurred when t was between 0.7 and 0.9 and the posterior distribution of c* was centered at 0.1.

Figure 5
Results of tuning and calibration from the STaC program: the left panel plots the estimated L2 discrepancy against t for t = 0, 0.1, …, 0.9, 1; the right panel is a histogram of the draws from the posterior distribution of c*.

Figure 6 plots the predictions and the 99% two-sided prediction bands for the true input–output function η(x, c*) using STaC. We see that the predictions are close to the true responses and that the 99% prediction band contains the true curve for all x ∈ [0, 1]. The RMSPE of the predictor obtained by the STaC program is 0.0382. The relative improvement of STaC compared with the gpmsa-based predictor is 44.8% [= (0.0693 − 0.0382)/0.0693 × 100%].

Figure 6
The training data (solid circles), true response curve (the solid line), predictions (pluses), and 99% prediction bands (dashes) using the STaC program.

Thus, for this example, the inferences about the tuning and calibration parameters and the predictions from STaC are more informative and more accurate than those from the fully Bayesian procedure.

5.3 A Biomechanics Example

We apply STaC to the biomechanics application described in Section 2. The numbers of burn-in and production runs are identical to those used in our first example. The left panel of Figure 7 plots the estimated squared discrepancies for the tuning parameter equal to 15, 16, …, 30 (with NMC = 100 and Nx = 101); this figure shows that the computer simulation best matches the knee simulator when the load discretization is set to t = 19. The right panel shows the histogram of the simulated posterior distribution [c*|yp, ys, t = 19]. The simulated posterior distribution of the initial position has mode −1.7, with substantially smaller uncertainty than the Bayesian calibration shown in Figure 2 (right panel). Thus STaC indicates that the true initial position is at or below the lower bound of the current design space. These results can be used in future runs of the FEA code to study the anterior–posterior displacement.

Figure 7
Scatterplot of the estimated squared discrepancy versus the value of the tuning parameter (left panel); histogram of the simulated posterior distribution of the calibration parameter when t = 19 (right panel).

Figure 8 depicts the predictions and 99% prediction bands for the anterior–posterior displacement in the physical experiment using the current data. Once the initial position and load discretization are set using STaC, the predictions closely match those provided by the knee simulator.

Figure 8
The training data (triangles), predictions (pluses), and 99% prediction bands (dashes) using the STaC program.


This article introduces a modular Bayesian methodology for simultaneous tuning and calibration (STaC). It demonstrates, with examples, that STaC performs well for tuning, calibration, and prediction of the true input–output relationship. The examples also show that treating tuning and calibration parameters differently is beneficial whether or not the code bias can be reduced to zero.

There are several areas of continuing research interest concerning STaC. First, determining the tuning parameters in the first step of STaC can be time consuming; additional work on speeding up the computation is needed. Second, a measure of the uncertainty in the estimated tuning parameter t̂, perhaps based on the curvature of the integral ∫₀¹ ∫₀¹ δ²(x, c, t) [c|ys, yp, t] dc dx, is needed. Third, it is of interest to extend STaC to other settings, for example, to cases where outputs are multivariate and to applications where both the computer code and the physical experiment have mixed quantitative and qualitative inputs. Fourth, there are other estimates of δ that can produce superior results for tuning and thus deserve further investigation. For example, another estimate of δ(x, c*, t) is E(Yp(x)|c*, ϕ, ys, yp, t) − E(Ys(x, c*, t)|c*, ϕ, ys, t), where the second expectation does not condition on yp. If some of the tuning parameters are continuous, another extension of the current approach is to model the response surface of the estimated squared discrepancy and identify t using that response surface.

Lastly, we sketch two research areas related to calibration and tuning. One is to determine better experimental designs for the inputs to the computer and physical experiments when STaC is the objective, i.e., for the inputs {(x_i^s, c_i, t_i); i = 1, 2, …, ns}. We used two separate Latin hypercube designs for the physical and computer experiments in our examples. We believe nested Latin hypercube designs, for example, the orthogonal array-based nested Latin hypercube designs proposed by Qian, Ai, and Wu (2009), Qian (2009), and Qian and Wu (2009), may better determine tuning and calibration parameters. The second research area is the potential issue of nonidentifiability. Note that the tuning parameters, the calibration parameters, and the code bias δ may be confounded. For example, if yp(x) = x² + 1 + ε and ys(x, c, t) = x² + c + t, then c, t, and δ can be any three real values whose sum is 1. The detection and elimination of such nonidentifiability will be important for calibration, tuning, and validation of complex computer models.
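The toy nonidentifiability above is easy to verify numerically: every pair (c, t) with c + t = 1 yields exactly the same code output, so the data cannot separate c, t, and δ. A quick check (illustrative only):

```python
import numpy as np

# Toy example: physical mean y_p(x) = x^2 + 1, code output y_s(x, c, t) = x^2 + c + t.
x = np.linspace(0.0, 1.0, 11)
yp_mean = x ** 2 + 1.0
ys = lambda c, t: x ** 2 + c + t

# Any (c, t) with c + t = 1 reproduces the physical mean exactly (bias delta = 0),
# so (c, t, delta) is identified only up to the constraint c + t + delta = 1.
for c, t in [(0.0, 1.0), (0.3, 0.7), (-2.0, 3.0)]:
    assert np.allclose(ys(c, t), yp_mean)
```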

Supplementary Material

Computer Code


This research was sponsored, in part, by the National Science Foundation under Agreements DMS-0112069 (SAMSI/Duke University) and DMS-0406026 (The Ohio State University). Research support for the development of the biomechanics code and knee simulator used to produce the data for the example in Sections 2 and 5.3 was provided by NIH grant AR049793, Exactech, Inc., and the Clark and Kirby Foundations. The authors thank the editor, associate editor, and two reviewers for their constructive suggestions and valuable comments.


The posterior density is proportional to the product of the joint prior density [c*, ϕ|t] and the likelihood [ys, yp|(c*, ϕ, t)]. We can determine [c*, ϕ|t] by computing the prior densities of [c*], [β], [σ], and [ρ] and applying Equation (9). The density [ys, yp|(c*, ϕ, t)] is analytically derivable because ys and yp are viewed as observations from Gaussian stochastic processes. Specifically, our model is Ys(x_i^s, c_i, t_i) = f_Z(x_i^s, c_i, t_i)β_Z + Z(x_i^s, c_i, t_i) and Yp(x_j^p) = Ys(x_j^p, c*, t) + Δ(x_j^p, c*, t) + ε for all i = 1, …, ns and j = 1, …, np. The mean values of Yp(x_j^p) and Ys(x_i^s, c_i, t_i) are f_Z(x_j^p, c*, t)β_Z + f_D(x_j^p, c*, t)β_D and f_Z(x_i^s, c_i, t_i)β_Z, respectively. The covariance matrix between Ys and Yp given (c*, ϕ, t) is an ns × np matrix with (i, j)th element


which can be computed using Equation (2).

With the joint prior density and the likelihood, the Metropolis–Hastings algorithm updates each of the parameters using the specifications listed in Table 1. For all parameters, the proposal distributions are uniform, centered at the previous draw, with widths specified by the last column of Table 1. For the correlation parameters ρ, however, the proposal draws are made on the scale ξ = −4 log(ρ).
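A single-parameter update of the kind specified in Table 1 can be sketched as a uniform random-walk Metropolis step; for a correlation parameter ρ, the walk is carried out on ξ = −4 log(ρ). The target densities, widths, and bounds below are placeholders, not the paper's actual priors:

```python
import math
import random

def mh_step_uniform(current, log_post, width, lower=-math.inf, upper=math.inf):
    """One Metropolis step with a uniform proposal centered at the current draw."""
    proposal = current + random.uniform(-width / 2.0, width / 2.0)
    if not (lower <= proposal <= upper):
        return current  # proposals outside the parameter bounds are rejected
    log_ratio = log_post(proposal) - log_post(current)
    return proposal if math.log(random.random()) < log_ratio else current

def update_rho(rho, log_post_xi, width):
    """Update a correlation parameter rho in (0, 1] by a random walk on
    xi = -4 log(rho); log_post_xi must be the log posterior density of xi
    (including any Jacobian term arising from a prior placed on rho)."""
    xi = -4.0 * math.log(rho)
    xi_new = mh_step_uniform(xi, log_post_xi, width, lower=0.0)
    return math.exp(-xi_new / 4.0)
```

For instance, iterating `mh_step_uniform` with `log_post = lambda v: -v * v / 2` produces draws from a standard normal target.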

Table 1
Specification of the Metropolis–Hastings algorithm for each parameter. Columns 2–5, from left to right, correspond to the prior distributions, the lower and upper bounds of the parameters as the program iterates, the initial values of ...


Using Equations (5) and (6) we compute [E(Δ(x, c*, t)|c*, ϕ, ys, yp, t)]² = [f_D(x, c*, t)β_D + Σ′_DZ Σ_ZZ^{−1} D_t]² and Var(Δ(x, c*, t)|c*, ϕ, ys, yp, t) = σ_D² − Σ′_DZ Σ_ZZ^{−1} Σ_DZ, where Σ_DZ is the (np + ns) × 1 vector of covariances between Δ(x, c*, t) and (Yp, Ys), Σ_ZZ is the (np + ns) × (np + ns) variance–covariance matrix of (Yp, Ys), and D_t is the (np + ns) × 1 vector of differences between the observed data and their expectations; i.e., D_t = (yp, ys)′ − (E(Yp|t)′, E(Ys|t)′)′. We list the elements of Σ_DZ, Σ_ZZ, and D_t below.

  • (B.1) For i = 1, 2, …, np, the ith element of Σ_DZ is Cov(Δ(x, c*, t), Yp(x_i^p)|(c*, ϕ, t)) = Cov(Δ(x, c*, t), Δ(x_i^p, c*, t)|(c*, ϕ, t)). For i = np + 1, np + 2, …, np + ns, the ith element of Σ_DZ is Cov(Δ(x, c*, t), Ys(x_{i−np}^s, c_{i−np}, t_{i−np})|(c*, ϕ, t)) = 0.
  • (B.2)
    • Σ_PP is an np × np matrix whose (i1, i2)th entry is Cov(Yp(x_{i1}^p), Yp(x_{i2}^p)|(c*, ϕ, t)), for i1, i2 = 1, 2, …, np;
    • Σ_PS is an np × ns matrix whose (i, j)th entry is Cov(Yp(x_i^p), Ys(x_j^s, c_j, t_j)|(c*, ϕ, t)), for i = 1, 2, …, np and j = 1, 2, …, ns;
    • Σ_SS is an ns × ns matrix whose (j1, j2)th entry is Cov(Ys(x_{j1}^s, c_{j1}, t_{j1}), Ys(x_{j2}^s, c_{j2}, t_{j2})|(c*, ϕ, t)), for j1, j2 = 1, 2, …, ns.
  • (B.3) For i = 1, 2, …, np, the ith element of D_t is yp(x_i^p) − f_Z(x_i^p, c*, t)β_Z − f_D(x_i^p, c*, t)β_D. For i = np + 1, np + 2, …, np + ns, the ith element of D_t is ys(x_{i−np}^s, c_{i−np}, t_{i−np}) − f_Z(x_{i−np}^s, c_{i−np}, t_{i−np})β_Z.
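The quantities above are instances of the standard kriging (Gaussian conditioning) formulas: a conditional mean of the form prior mean + Σ′ Σ_ZZ^{−1} D_t and a conditional variance of the form prior variance − Σ′ Σ_ZZ^{−1} Σ. A generic numerical sketch, with all inputs illustrative:

```python
import numpy as np

def gp_conditional(prior_mean, cov_x_obs, cov_obs, resid, prior_var):
    """Conditional mean and variance of a Gaussian quantity given observed data.

    prior_mean: prior mean f(x)'beta at the new input;
    cov_x_obs:  vector of covariances with the observed vector (e.g., Sigma_DZ);
    cov_obs:    covariance matrix of the observations (Sigma_ZZ, symmetric);
    resid:      observed data minus its expectation (D_t);
    prior_var:  prior variance of the quantity at the new input."""
    w = np.linalg.solve(cov_obs, cov_x_obs)   # Sigma_ZZ^{-1} Sigma_x_obs
    cond_mean = prior_mean + w @ resid        # prior mean + Sigma' Sigma_ZZ^{-1} D_t
    cond_var = prior_var - cov_x_obs @ w      # prior var - Sigma' Sigma_ZZ^{-1} Sigma
    return cond_mean, cond_var

# Tiny illustration with two observations:
m, v = gp_conditional(0.0,
                      np.array([0.5, 0.5]),
                      np.array([[1.0, 0.5], [0.5, 1.0]]),
                      np.array([1.0, -1.0]),
                      1.0)
# m = 0.0 (the residuals cancel) and v = 2/3.
```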


By conditioning on (c*, ϕ) and applying the law of large numbers,




for large NMC. Using Equations (1), (2), (4), (5), and (6), together with the following formulas for E(Ys(x, c*, t)|ys, yp, t, c*, ϕ), E(Ys(x, c*, t) + Δ(x, c*, t)|ys, yp, t, c*, ϕ), Var(Ys(x, c*, t)|ys, yp, t, c*, ϕ), and Var(Ys(x, c*, t) + Δ(x, c*, t)|ys, yp, t, c*, ϕ), one can compute Equations (18), (19), (C.1), and (C.2) by taking (t, c*, ϕ) to be (t̂, ĉ*, φ̂).

  • (C.1) The conditional expectation E(Ys(x, c*, t)|ys, yp, t, c*, ϕ) is f_Z(x, c*, t)β_Z + Σ′_0Z Σ_ZZ^{−1} D_t, where Σ_0Z is the (np + ns) × 1 vector of covariances between Z(x, c*, t) and the entries of (Yp, Ys) given (t, c*, ϕ). For i = 1, …, np, the ith element of Σ_0Z is Cov(Z(x, c*, t), Z(x_i^p, c*, t)|t, c*, ϕ), while for i = np + 1, …, np + ns the ith element is Cov(Z(x, c*, t), Z(x_{i−np}^s, c_{i−np}, t_{i−np})|t, c*, ϕ). Similarly, Σ_ZZ is given by (B.2) and D_t by (B.3), both conditional on (t, c*, ϕ).
  • (C.2) The conditional variance Var(Ys(x, c*, t)|ys, yp, t, c*, ϕ) = σ_Z² − Σ′_0Z Σ_ZZ^{−1} Σ_0Z, where Σ_0Z and Σ_ZZ are defined in (C.1) and σ_Z² is taken from φ̂.
  • (C.3) The conditional mean of η(x, c*) is E(Ys(x, c*, t)|ys, yp, t, c*, ϕ) + E(Δ(x, c*, t)|ys, yp, t, c*, ϕ), where the two expectations are derived in (C.1) and Appendix B, respectively.
  • (C.4) The conditional variance Var(Ys(x, c*, t) + Δ(x, c*, t)|ys, yp, t, c*, ϕ) = σ_Z² + σ_D² − (Σ_0Z + Σ_DZ)′ Σ_ZZ^{−1} (Σ_0Z + Σ_DZ), where σ_Z² and σ_D² are taken from φ̂, Σ_0Z and Σ_ZZ are given in (C.1), and Σ_DZ is given by (B.1), all conditional on (t, c*, ϕ).
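The Monte Carlo step in this appendix averages conditional moments over the NMC posterior draws. A generic sketch follows; combining the draws via the law of total variance is an assumption of this sketch, not necessarily the paper's exact estimator:

```python
import numpy as np

def mc_predict(cond_mean, cond_var, draws):
    """Posterior predictive mean/variance at a fixed x by averaging the
    conditional moments over N_MC posterior draws of (c_star, phi).

    cond_mean(d), cond_var(d) give the moments conditional on one draw d."""
    means = np.array([cond_mean(d) for d in draws])
    variances = np.array([cond_var(d) for d in draws])
    # Law of total variance: E[Var] + Var[E] over the draws.
    return means.mean(), variances.mean() + means.var()

# Hypothetical draws where the conditional mean equals the draw itself and
# the conditional variance is constant:
draws = np.array([-1.0, 0.0, 1.0])
m, v = mc_predict(lambda d: d, lambda d: 1.0, draws)
# m = 0.0 and v = 1 + 2/3.
```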



A MATLAB Routine: The supplemental materials include a user guide to this MATLAB package (STaCguide.pdf), the software package (directory STaCprogram), and an example subdirectory containing the driver file for running the example in Section 5.3 (zip archive).

Contributor Information

Gang Han, H. Lee Moffitt Cancer Center & Research Institute, MRC/BIOSTAT, 12902 Magnolia Drive, Tampa, FL 33612 (gang.han@moffitt.org)

Thomas J. Santner, Department of Statistics, The Ohio State University, 415 Cockins Hall, 1958 Neil Avenue, Columbus, OH 43210 (tjs@stat.osu.edu)

Jeremy J. Rawlinson, C2-527 Clinical Programs Center, Department of Clinical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY 14853 (jjr4@cornell.edu)


  • Bayarri MJ, Berger JO, Cafeo J, Garcia-Donato G, Liu F, Palomo J, Parthasarathy RJ, Paulo R, Sacks J, Walsh D. Computer Model Validation With Functional Output. The Annals of Statistics. 2007a;35(5):1874–1906.
  • Bayarri MJ, Berger JO, Paulo R, Sacks J, Cafeo JA, Cavendish J, Lin C-H, Tu J. A Framework for Validation of Computer Models. Technometrics. 2007b;49(2):138–154.
  • Cox DD, Park JS, Singer CE. A Statistical Method for Tuning a Computer Code to a Database. Rice University, Dept. of Statistics; 1996. Technical Report 96-3.
  • Craig PC, Goldstein M, Rougier JC, Seheult AH. Bayesian Forecasting for Complex Systems Using Computer Simulators. Journal of the American Statistical Association. 2001;96:717–729.
  • Craig PS, Goldstein M, Seheult AH, Smith JA. Bayes Linear Strategies for History Matching of Hydrocarbon Reservoirs. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 5. Oxford: Oxford University Press; 1996. pp. 69–95.
  • Fang KT, Li R, Sudjianto A. Design and Modeling for Computer Experiments. London: Chapman & Hall; 2005.
  • Forrester A, Sobester A, Keane A. Engineering Design via Surrogate Modelling: A Practical Guide. Chichester, U.K: Wiley; 2008.
  • Gattiker J. Using the Gaussian Process Model for Simulation Analysis (GPM/SA) Code. Los Alamos National Laboratory; 2005. Technical report.
  • Higdon D, Gattiker J, Williams B, Rightley M. Computer Model Calibration Using High Dimensional Output. Journal of the American Statistical Association. 2008;103:570–583.
  • Higdon D, Kennedy M, Cavendish J, Cafeo J, Ryne R. Combining Field Data and Computer Simulations for Calibration and Prediction. SIAM Journal of Scientific Computing. 2004;26:448–466.
  • Higdon D, Williams B, Moore L, McKay M, Keller-McNulty S. Uncertainty Quantification for Combining Experimental Data and Computer Simulations. Los Alamos National Laboratory; 2005. Technical Report LA-UR 05-4951.
  • Kennedy MC, O’Hagan A. Bayesian Calibration of Computer Models. Journal of the Royal Statistical Society, Ser. B. 2001;63:425–464. (with discussion)
  • Linkletter C, Bingham D, Hengartner N, Higdon D, Ye KQ. Variable Selection for Gaussian Process Models in Computer Experiments. Technometrics. 2006;48:478–490.
  • Loeppky J, Bingham D, Welch W. Computer Model Calibration or Tuning in Practice. University of British Columbia; 2006. Technical report.
  • McKay MD, Beckman RJ, Conover WJ. A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code. Technometrics. 1979;21:239–245.
  • Park JS. Ph.D. thesis. Champaign/Urbana, IL: University of Illinois; 1991. Tuning Complex Computer Codes to Data and Optimal Designs.
  • Qian PZG. Nested Latin Hypercube Designs. Biometrika. 2009 to appear, doi: 10.1093/biomet/asp045.
  • Qian PZG, Wu CFJ. Sliced Space-Filling Designs. Biometrika. 2009 to appear, doi: 10.1093/biomet/asp044.
  • Qian PZG, Ai M, Wu CFJ. Construction of Nested Space-Filling Designs. The Annals of Statistics. 2009;37:3616–3643.
  • Rawlinson JJ, Furman BD, Li S, Wright TM, Bartel DL. Retrieval, Experimental, and Computational Assessment of the Performance of Total Knee Replacements. Journal of Orthopaedic Research. 2006;24(7):1384–1394. [PubMed]
  • Saltelli A, Chan K, Scott E. Sensitivity Analysis. Chichester: Wiley; 2000.
  • Santner TJ, Williams BJ, Notz WI. The Design and Analysis of Computer Experiments. New York: Springer Verlag; 2003.