|Home | About | Journals | Submit | Contact Us | Français|
Regression analysis is indispensible for quantitative understanding of biological systems and for developing accurate computational models. By applying regression analysis, one can validate models and quantify components of the system, including ones that cannot be observed directly. Global (simultaneous) analysis of all experimental data available for the system produces the most informative results. To quantify components of a complex system, the dataset needs to contain experiments of different types performed under a broad range of conditions. However, heterogeneity of such datasets complicates implementation of the global analysis. Computational models continuously evolve to include new knowledge and to account for novel experimental data, creating the demand for flexible and efficient analysis procedures. To address these problems, we have developed gfit software to globally analyze many types of experiments, to validate computational models, and to extract maximum information from the available experimental data.
Computational models play increasingly important roles in biology. Constructing a model that accurately represents the mechanism of a system, reliably simulates its behavior, and has well-defined parameter values is the ultimate goal of many research projects. Models are used for interpreting experimental observations, testing hypotheses, integrating knowledge, discovering components responsible for certain behavior, designing more informative experiments, and making quantitative predictions (1). Remarkably, computational models act both as tools for studying biology and as representations of the resulting knowledge. Indeed, quantitative mechanistic information incorporated into a model allows it to make predictions outside the domain of existing observations.
The focus of this chapter is on understanding experimental data and extracting useful information from it. The role of a model in this process is to postulate a relationship between conditions of experiments and the observed results. Using regression analysis, different models can be tested for their ability to explain the experimental observations, and their parameters can be estimated. Thus, regression analysis ties together models and data, validating the former and extracting information from the latter (2, 3).
Unfortunately, practical application of this procedure to biological systems can be complicated. As will be shown in this chapter, even relatively simple models may contain too many parameters to estimate based on a single experiment of any type. Therefore, to test whether the model is consistent with the data and to determine its parameters, data from multiple experiments need to be analyzed globally, while applying all known constraints to the values of parameters (4).
In this chapter, we discuss the challenges associated with practical application of regression analysis to biological systems. The problems we describe are exacerbated in complex models and experimental designs, and thus are especially frustrating for quantitative biologists. We describe our software, gfit, which helps to overcome these problems and illustrate its utility with three biological systems of increasing complexity.
Regression analysis includes a range of methods for establishing a model that accurately represents a system and makes accurate predictions of its behavior. The specific tasks include searching for optimal parameter values, testing whether the model agrees with experimental data, estimating parameter confidence intervals, testing whether more experimental data are needed, detecting outlier points, and selecting the preferred model from two possible ones. In regression analysis, model F is defined as a quantitative relationship between experimental measurements (dependent variables) and experiment conditions (independent variables) C
where x is a vector of model parameters (variables affecting behavior of the system that cannot be controlled or directly observed during experiment), and ε is a set of measurement errors (see Note 1) (5).
Goodness of fit, the closeness of model simulations to the measurements, is quantified by objective function S(x). The most commonly used objective function is a sum of squared residuals (see Note 2),
or, in case of nonuniformly distributed ε, a weighted sum of squared residuals
Curve fitting is a problem of finding parameters x that produce the best fit, that is minimize the objective function:
Curve fitting is an optimization problem, performed by optimization engines. Many tasks of regression analysis are based on curve fitting.
One common obstacle to broader application of regression analysis to biological problems is failure of many models to directly simulate the experimentally observed variable. For example, a typical system model may simulate concentrations of reacting species, values that are rarely observed in an experiment directly. One way of addressing this discrepancy is to convert measured values into the type simulated by the model. However, such conversions often introduce statistical errors and are not always possible. The better solution is to simulate exactly the same value type as measured in the experiment. To achieve that, separate experiment models may be required. Experiment models use the system model to simulate the system's response to manipulations and the experimentally measured signal (see Fig. 1). The approach of separating system models and experiment models is used in Virtual Cell software (6).
A curve fitting procedure for a heterogeneous dataset can be quite complex and require extensive communication between its entities, i.e., model, optimization engine, experiment conditions, measurements, parameters, and constraints (see Fig. 2A). Before a search for optimal parameter values can begin, the data for each experiment has to be examined:
Once the data have been examined, the optimization procedure can be initiated by passing a vector of starting parameter values to the optimization engine. Depending on the engine type, parameter constraints can be also provided. The engine conducts optimization by repeatedly changing parameters and recalculating the objective function on the basis of experimental measurements and simulations. To simulate each experiment, the input data for the model has to be assembled from applicable optimization parameters and experiment conditions. The input data also have to be checked against the constraints, since not all of them can be enforced by optimization engines. After simulating all experiments, the appropriate objective function can be computed and used by optimization engine to determine the direction of the search.
Curve fitting procedure follows complicated rules that depend on the computational model, experimental data, and optimization engine. In addition, parameter constraints need to reflect various considerations related to the research project. These factors make the analysis procedure not only complex, but also highly variable, making design and maintenance of project-specific software prohibitively expensive. Fortunately, the patterns of data flow during regression analysis are largely independent of the system under investigation. This fact allowed us to design software that solves the analysis problem generally and for any model type.
The purpose of gfit is connecting models with various types of experimental data. First, it simplifies the model's task of directly simulating experimentally observable variables. Second, during regression analysis, gfit maintains communications between the analysis components, acting as a mediator (see Fig. 2B). Third, by defining standard application interfaces for models, optimization engines, objective functions, and other entities, it facilitates customization of the analysis procedure.
Of all components, application interfaces of models represent the biggest problem. Almost every step of regression analysis procedure depends on what information is required and produced by the model. Yet, every model has different inputs and outputs. To be able to perform regression analysis with any kind of computational model, gfit uses a metadata approach. Any model used by gfit is expected to have an attached Model Description (see Note 3) defining its inputs and outputs as sets of variables (see Note 4). More information about Model Descriptions is provided later in this chapter. Once the rules for performing simulations with the model are known, the analysis process becomes more straightforward and independent of the model type (Fig. 3).
Regression analysis is a complicated process with many pitfalls. gfit strives to provide information that can help researchers avoid mistakes related to the analysis. In the protocols that follow, the reader will build simple models and use the existing models and experimental data for parameter estimation.
rsys library is used for solving ODEs for mass-action reaction systems, as described in Subheading 3.3.
A zip archive containing all model and data files mentioned in this chapter can be downloaded from http://gfit.sourceforge.net. The data files included are in tab/newline-delimited format. These files can be opened in a text editor, but it is more convenient to view them in a spreadsheet. Please check the readme.txt file for the most current information.
In this section, we will create a model for equilibrium binding of a protein, E, to a ligand, L, (Eq. 5) and use it for analysis of experimental data. This analysis is quite simple and can be accomplished with many existing programs (including commonly used spreadsheet applications). We will use it to illustrate the principles of data analysis and model validation used by gfit, and later apply them to more interesting examples.
If only total concentrations of E and L are known, ET = [E] + [EL] and [LT] = [L] + [EL], equilibrium concentration of the complex can be written as
Steps 1 and 2 create a MATLAB m-file function. Line 1 contains MATLAB keyword function followed by the name of the output variable signal, followed by the name of the function, eq_binding, and a list of input variables (Et, Lt, Kd). Line 2 is a comment, marked in MATLAB by the % character. Line 3 performs the calculation according to Eq. 6. Line 4 assigns the calculated, [EL], to the output variable.
The model we created has three input variables and one output variable. To perform a simulation, we need to supply values for input variables and store the result in the output variable. Variables in MATLAB (and in gfit) can contain a single value or arrays of numbers. The arrays may have 0, 1, or many dimensions. 0D array, scalar, stores a single number; 1D array, stores a vector of numbers; 2D array contains a matrix, etc. Models can take advantage of this fact. For example, our model can accept vectors of Et and Lt concentration and simulate an entire titration curve in one call. We will use the model in Listing 1 to simulate different experiments.
As well, mistakenly supplying negative values for the input parameters will produce a warning and a meaningless simulation result. Although we can keep track of correct variable sizes and values when manually executing a simple model, this task becomes tedious if many different experiments have to be simulated by a complex model. To create a connection between experimental data and a model that is practically useful, there has to be a method for automatically tracking the requirements of the model and for reconciling these requirements with existing experimental data. gfit learns about the requirement of the model and its expected output by reading the associated user generated Model Description (see Note 3).
|1||function signal = eq_binding(Et, Lt, Kd, signal)|
|2||%simulate equilibrium binding E + L <=> EL|
|14||EL = (Kd + Et + Lt - sqrt((Kd + Et + Lt). ^2 - 4 * Et.* Lt))/2;|
|15||signal = EL;|
The information in Model Description is organized in a tabular fashion. Any number of tables can be present. Tables in Model Description are space/tab/newline-delimited. Rows in each table should contain same number of elements. However, if an element needs to be empty (as in minVal of signal variable), an empty pair of brackets ((), , ‘’, or “”) can be used. Brackets should also surround multiword elements.
The first table (lines 4–8) should list all model I/O variables in the same order as in the input list of the function. The first column of every table contains variable names. The second column of the first table defines variable types. Type free for variable Et means that its value could either be known precisely and supplied with the experiment conditions, or not known and appear as one of the optimization parameters. Variable Lt is independent, meaning that its value should always be known exactly and appear in the experimental data. Variable Kd has para type and is sought as an optimization parameter. Variable signal has dependent type and therefore is expected to be calculated by the model.
The Model Description is also used to set bound limits on the input variables. Column three of the first table, (minVal) specifies the minimum value of zero for all variables except signal, which is dependent and cannot have its value constrained. This has been included in the first table but can also be placed in a separate table.
In this example, the size of variables is set in the second table of Model Description (lines 10–12). The length of Et variable vector is set equal to the length of Lt as required by the model. It also sets the length of signal produced by the model to have the length of Lt. Since the size of the output variable must be known by gfit, the latter expression guarantees that the results are equal to a known length. The table also states (line 12, in the third column-plotVs) that signal should be plotted versus Lt.
Once the data is acquired by gfit, it generates parameters for all para variables and for every free model variable in the model that is missing in the experimental data (Fig. 4B). Parameters are generated based on the input variables defined in Model Description.
During simulation of an experiment, input variables draw their values either from the supplied experimental data, or from current values of parameters. In gfit, parameters and input variables have a many-to-many relationship. When simulating different experiments, a variable containing an array of numbers can collect its values from many parameters. One parameter can be also connected to multiple variables as long as the variables have the same physical units (discussed below).
Linkage of parameters to different variables can be adjusted through gfit parameter table in the user interface (Fig. 4B). For every parameter, the table shows its name, optimization flag (pick), low bound constraint (smallest value allowed), start (current) value, and upper bound constraint. Parameters in the table can be sorted by several criteria. To switch between different sorting methods, click sort parameter table header button. All parameters can be selected or deselected for optimization (discussed below) by clicking on pick button. Bound constraints for each parameter are set based on the variable's constraints in the Model Description. The valid interval of constraints can be only reduced through the user interface. For example, minimal value for KD can be changed to 1.0, but not to −1.0.
The model we created can be conveniently used for simulating equilibrium concentration of EL complex under different experiment conditions. Unfortunately, concentration of a complex is seldom measured directly in an experiment. In the simplest case, experimentally observed values are proportional to [EL]. To avoid transformation of the data (even a linear one) prior to analysis, the model needs to simulate the measured signal.
|1||function signal = eq_binding(Et, Lt, Kd, signal, gain, c)|
|2||%binding equilibrium E + L < = > EL; simulate signal proportional to [EL]|
|5||% Et||()||0||1||uM||‘total concentration of E’|
|6||% Lt||independent||0||1||uM||‘total concentration of L’|
|7||% Kd||para||0||0.1||uM||‘dissociation constant’|
|8||% signal||dependent||()||()||()||‘complex formation’|
|9||% gain||para||()||1||()||‘signal gain’|
|10||% c||para||()||0||()||‘signal background’|
|16||EL = (Kd + Et + Lt - sqrt((Kd + Et + Lt). ^2 - 4 * Et.* Lt))/2;|
|17||signal = c + gain * EL;|
These changes make this model more flexible because it can now be used for any experiment that measures a value proportional to the equilibrium concentration of the complex. Parameters by default take more reasonable starting values. Human readable variable descriptions appear when mouse cursor hovers over a name in the parameter table. Units prevent mixing incompatible variables in the same parameter.
In this section, we have created a regular MATLAB m-file and added a gfit Model Description to it, which allowed us to connect it with experimental data to perform simulation and fitting. In the following sections, we will apply this technique to more complex biological systems.
The model created in the previous section can be used for studying relatively simple systems where binding properties are characterized only by the dissociation constant. However, binding processes are often more complex and characterized by multiple parameters. Binding of proteins to discrete positions on a linear lattice (DNA, RNA, microtubules, and microfilaments) plays important roles in many biological processes. In this section, we will discuss binding of the helicase from hepatitis C virus to single-stranded (ss) DNA substrates. The experimental data were obtained by titrating a constant concentration of the helicase with oligonucleotides of different lengths while monitoring the reduction of intrinsic fluorescence of the helicase caused by ssDNA binding (7). With this example we take regression analysis a step further and globally fit many titration curves to quantify helicase properties that are not apparent from any single experiment.
The model of equilibrium binding to a lattice, in addition to concentrations and the dissociation constant, uses parameters related to the geometry of the molecules, namely lattice length N, protein's minimal binding site (M, number of lattice units interacting with the protein), and protein's occlusion site (S, number of lattice units from which one protein molecule excludes the others) (Fig. 6A). The model assumes noncooperative and sequence-independent binding. Since the KD observed in lattice binding experiments changes with the lattice length, the model uses a more fundamental microscopic dissociation constant, , defined as the dissociation constant observed with lattice of length M. Thus, variables , M, and S keep same values in all experiments.
The model calculates concentration of bound protein, EB, from total protein concentration, ET, total concentration of lattice, LT, and N. Depending on relative values of M, S, and N, three cases can be considered. If lattice is shorter than the minimal binding site (Fig. 6B), N < M, and assuming equal contribution of each lattice unit to the binding free energy, the binding can be described by Eq. 6, where the observed dissociation constant
For longer lattices that can accommodate only one protein molecule (Fig. 6C), M ≤ N < 2S, the observed dissociation constant is inversely proportional to the number of distinct positions the protein can occupy.
If multiple proteins can bind to a single lattice molecule, Eq. 6 can no longer be used, and the extent of binding has to be calculated by numerically solving Eq. 9 (8). Alternatively, if the number of proteins bound to a lattice is large, a lower order equation can provide accurate results (9).
Although apparent dissociation constant, KD, can be estimated on the basis of a single titration experiment, true cannot be determined without prior knowledge of M and S. All three parameters can be determined simultaneously by globally fitting titration curves for many ssDNA substrates of different lengths.
The model for lattice binding lattice_binding_v1.m, provided in the data archive, follows a similar pattern as previously created eq_binding.m model. Both models start with Model Description. We will now test whether the model is consistent with the observed results, estimate the parameters, and determine their confidence intervals.
Using this model and dataset, gfit is expected to produce a good fit at the first attempt (Figs. 7 and and8).8). However, this result is not typical. In our experience with other models, local optimization algorithms seldom find the global minimum in the first attempt. This highlights the importance of global optimization methods (10, 11). Generally, finding a global minimum of a nonlinear problem is unattainable within finite time. Nevertheless, even if a good fit has been found, it is advisable to search the parameter space for alternative minima. Discovering distinct sets of parameters that produce “as good,” or “nearly as good” fits is important to diagnose overparameterized models or insufficient amount of experimental observations.
The method currently used by gfit for exploring parameter space is random restart. This simple method is implemented as a “globalizer” on top of the existing local optimization engine. Random restart repeatedly reinitiates local optimization with a randomly chosen set of parameters. The new starting parameter values are picked from a uniform distribution for doubly-constrained parameters, from a truncated normal distribution for singly-constrained parameters, and from a normal distribution for the unconstrained ones. Random restart procedure is implemented without a defined termination condition. It is supposed to be interrupted by the user.
More generally, the following rule-of-thumb procedure can be used to decide if the random restart search should be continued.
In this section, we describe modeling and analysis of a kinetic pathway responsible for loading sliding clamp proteins onto DNA during replication initiation. The clamp, PCNA, encircles DNA and binds to DNA polymerase, conferring processivity to the replication complex. PCNA is loaded onto DNA by the heteropentameric clamp loader protein, RFC, in a process that involves ATP binding and hydrolysis and conformational changes of the proteins (Fig. 9A) (12). Although the process has been extensively studied, understanding at the quantitative level is incomplete. Properties of individual species and reactions involved in this process are difficult to determine because almost any measurement technique is affected by a combination of many simultaneously occurring reactions. Computational modeling in conjunction with regression analysis provides a feasible solution to this complex problem.
Currently we are building, validating, and refining a mechanistic model of the process that can resolve many species and quantify the rates of individual reactions that may not be observed experimentally. A simplified reaction scheme, the first iteration of modeling process, is shown in Fig. 9C. The challenge of estimating the rates of many individual reactions can be addressed by monitoring the process from different perspectives. As noted earlier, each measurement is a function of many or all of the rates in the process; however, measurements from different perspectives are likely to be affected in distinct ways by individual reaction rates. Therefore, taken together, multiple measurements of presteady-state kinetics monitored by a few different methods can provide sufficient constraints to the parameters of the model.
Results of global regression analysis of data from two types of presteady-state experiments, one measuring ATP hydrolysis and the other phosphate (Pi) release by RFC protein, are shown in Figs. 10 and and11.11. ATP hydrolysis was monitored by a radiometric assay measuring formation of 32P-ADP over time from 32P-ATP, and Pi release was monitored by the change in fluorescence of a reporter, Pi Binding Protein, on binding the Pi released by RFC following ATP hydrolysis. Salient features of the experimental design in both cases are (Fig. 9B): (a) rapid mixing of a constant concentration of RFC (in the presence of excess PCNA clamp) with excess ATP; (b) incubation of RFC with ATP for varying times; (c) rapid mixing of the RFC, ATP, PCNA mix with excess DNA and measurement of product formation and release over time. Salient features of the model mechanism are (Fig. 9C): (a) RFC binding to ATP; (b) a proposed step that might account for the observed increase in ATPase activity with increasing RFC-ATP incubation time; (c) DNA binding to the RFC-ATP complex; (d) ATP hydrolysis; (e) release of ADP, Pi and clamp-DNA complex.
Simultaneous fitting of a large number of experiments requires efficient simulation. The most computationally intensive operation performed by RFC model is integration of ODE systems to simulate mass-action reactions. To accelerate this process, reaction kinetics was simulated using native rsys library. A combination of a flexible and user-friendly scripting language, such as MATLAB, with a native library for computationally intensive parts of simulation was found to be especially productive.
Given estimates of ATP and DNA-binding rate constants, the model produces a good fit for datasets from both ATP hydrolysis as well as Pi release experiments in a single seamless operation. Confidence in the parameters obtained from the fits is increased by using the random restart method, as described in Subheading 3.2.3. A highlight of the findings is that global fitting of the ATPase data validates the proposed step between ATP binding and hydrolysis (Fig. 9C), and reveals that it is a relatively slow, and thus mechanistically important, step in the clamp assembly reaction (rate constant: kRt_Act). We can now formulate specific, testable hypotheses regarding the nature of this step; e.g., an ATP binding-driven conformational change in RFC that enables productive interactions with the clamp and DNA.
It is clear that the reaction depicted in Fig. 9C represents a simplified model of the clamp assembly reaction, in which the number of possible species and steps are restricted. Such limitations are often introduced in models to focus on specific experimentally measured parameters, and thereby increase confidence in the fit. For example, this model omits RFC binding to the clamp, since the parameters defining this step are unknown and are not measured explicitly in the ATPase experiments. Under such circumstances, however, it is entirely possible that goodness of fit to the data becomes unrelated to the quality of the model. gfit enables model-based global analysis of a variety of experiments to find parameter values that are consistent across the board. Therefore, a more comprehensive model of clamp assembly can be developed from the start, and then continually validated and refined by data input from experiments measuring clamp binding, clamp opening, DNA binding, clamp-DNA release, etc., leading to discovery of the preferred mechanism of clamp assembly on DNA.
Authors thank members of the Center for Cell Analysis and Modeling: Pavel Kraykivski, Igor Novak, Jim Schaff, and Boris Slepchenko for their advice and critical discussions and Les Loew for his support. Experimental data for clamp loader protein was provided thanks to Siying Chen. This work was supported in part by grants NS15190 (NIH), RR13186 (NIH), RR022232 (NIH) and RR022624 (NIH) to J.H.C; GM55310 (NIH) to S.S.P; GM64514-01 (NIH) and MCB 0448379 (NSF) to M.M.H.
1In regression analysis literature, dependent variables may be referred to as response or observed variables; independent variables may be referred to as predictor or explanatory variables.
2Residual is the difference between the experimental measurement and the simulation produced by the model.
3Model Description is metadata attached to gfit models that defines their correct usage. It contains model name, version, general human-readable comments about the purpose of the model and its algorithm, and, most importantly, machine-readable descriptions of the model's input and output variables. For each variable, it specifies name, type, physical unit, dimensions, and a range of acceptable values. Variable dimensions are defined either as constants or in relationship to another variable dimension or index variable. Variables may change their size depending on experimental data and user input. Dimensions of each variable usually change in concert with dimensions of other variables.
4Variable (in gfit context) is an array of elements (numbers) defined in Model Description. A variable may contain a single element (scalar variable, 0D), a vector of elements (1D), a matrix (2D), etc. Variables are used for storing information about an experiment, for passing data to the model and for receiving data simulated by the model. Depending on the Model Description, each variable dimension may be fixed, or vary individually or in concert with other variable dimensions. This property of variables increases flexibility of gfit models.
5During calculations, all control elements of gfit interface are disabled with the exception of the button Cancel. Clicking this button prevents simulation of the next experiment. Simulation of the current experiment will not be aborted.
6During fitting, information about each iteration is displayed in MATLAB command window. If fitting is aborted, the last values of parameters appear in the right column of parameter table.
7Currently gfit uses an asymptotic method for determining confidence intervals of parameters. This method is known to be inaccurate for nonlinear models.
8In spreadsheet-arranged data, a colon following a variable name indicates that the variable is scalar and its value should appear in the cell to the right of the name.
9During a random restart run, starting parameter values and optimization results are accumulated in files Start_nnn.txt and Optim_nnn.txt, respectively, in folder C:/Mgfit/Temp/, where nnn are digits starting from 000 and incremented for each subsequent random restart search. The files contain tab/newline-delimited tables with each set of parameters occupying one row. The first row contains parameter names. In addition, the first column of Optim_nnn.txt file contains objective function values, S(x). To check the progress of the search without interrupting it, open either of the files in a simple text editor, select, and copy its entire contents, and paste it into a spreadsheet application.
10Column 6 of parameter table contains values produced by optimization (completed or interrupted). To be able to edit the values, to use them for simulation, or as starting values for fitting, copy them to column 4 by clicking table header button value.