Computational models play increasingly important roles in biology. Constructing a model that accurately represents the mechanism of a system, reliably simulates its behavior, and has well-defined parameter values is the ultimate goal of many research projects. Models are used for interpreting experimental observations, testing hypotheses, integrating knowledge, discovering components responsible for certain behavior, designing more informative experiments, and making quantitative predictions (1). Remarkably, computational models act both as tools for studying biology and as representations of the resulting knowledge. Indeed, quantitative mechanistic information incorporated into a model allows it to make predictions outside the domain of existing observations.
The focus of this chapter is on understanding experimental data and extracting useful information from it. The role of a model in this process is to postulate a relationship between conditions of experiments and the observed results. Using regression analysis, different models can be tested for their ability to explain the experimental observations, and their parameters can be estimated. Thus, regression analysis ties together models and data, validating the former and extracting information from the latter (2
Unfortunately, practical application of this procedure to biological systems can be complicated. As will be shown in this chapter, even relatively simple models may contain too many parameters to estimate based on a single experiment of any type. Therefore, to test whether the model is consistent with the data and to determine its parameters, data from multiple experiments need to be analyzed globally, while applying all known constraints to the values of parameters (4
In this chapter, we discuss the challenges associated with practical application of regression analysis to biological systems. The problems we describe are exacerbated in complex models and experimental designs, and thus are especially frustrating for quantitative biologists. We describe our software, gfit, which helps to overcome these problems, and illustrate its utility with three biological systems of increasing complexity.
1.1. Regression Analysis
Regression analysis includes a range of methods for establishing a model that accurately represents a system and makes accurate predictions of its behavior. The specific tasks include searching for optimal parameter values, testing whether the model agrees with experimental data, estimating parameter confidence intervals, testing whether more experimental data are needed, detecting outlier points, and selecting the preferred model from two possible ones. In regression analysis, model $F$ is defined as a quantitative relationship between experimental measurements (dependent variables) $Y$ and experiment conditions (independent variables) $C$:
$$Y = F(C, x) + \varepsilon$$
where $x$ is a vector of model parameters (variables affecting behavior of the system that cannot be controlled or directly observed during experiment), and $\varepsilon$ is a set of measurement errors (see Note 1).
Goodness of fit, the closeness of model simulations to the measurements, is quantified by objective function $S(x)$. The most commonly used objective function is a sum of squared residuals (see Note 2)
$$S(x) = \sum_i \left(Y_i - F(C_i, x)\right)^2$$
or, in the case of nonuniformly distributed $\varepsilon$, a weighted sum of squared residuals
$$S(x) = \sum_i w_i \left(Y_i - F(C_i, x)\right)^2$$
where $w_i$ is the weight assigned to measurement $i$ (commonly the inverse of the variance of $\varepsilon_i$).
Curve fitting is a problem of finding parameters $x$ that produce the best fit, that is, minimize the objective function:
$$\hat{x} = \arg\min_x S(x)$$
Curve fitting is an optimization problem, performed by optimization engines. Many tasks of regression analysis are based on curve fitting.
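As a rough illustration of curve fitting as an optimization problem, the following Python sketch minimizes a (weighted) sum of squared residuals. The exponential-decay model, the synthetic data, and the simple compass-search minimizer are all assumptions chosen for brevity; they stand in for a real model and optimization engine and are not part of gfit.

```python
import math

def model(c, x):
    """Hypothetical model F(C, x): exponential decay, x = [amplitude, rate]."""
    return x[0] * math.exp(-x[1] * c)

def objective(x, conditions, measurements, weights=None):
    """Weighted sum of squared residuals; unit weights give the plain sum."""
    if weights is None:
        weights = [1.0] * len(measurements)
    return sum(w * (y - model(c, x)) ** 2
               for c, y, w in zip(conditions, measurements, weights))

def compass_search(func, start, step=0.5, tol=1e-8, max_iter=100_000):
    """Derivative-free minimizer (illustrative stand-in for an optimization engine).

    Tries +/- step along each parameter axis and halves the step
    whenever no trial point improves the objective.
    """
    x, best = list(start), func(start)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                value = func(trial)
                if value < best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return x, best

# Synthetic, noise-free measurements generated with x = [2.0, 0.5] (illustrative only).
conditions = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
measurements = [model(c, [2.0, 0.5]) for c in conditions]

best_x, best_s = compass_search(
    lambda x: objective(x, conditions, measurements), start=[1.0, 1.0])
```

With real, noisy data the minimum of the objective function would not be zero, and a dedicated optimization engine would normally replace the compass search.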
1.2. Applying Regression Analysis to Experimental Data
One common obstacle to broader application of regression analysis to biological problems is failure of many models to directly simulate the experimentally observed variable. For example, a typical system model may simulate concentrations of reacting species, values that are rarely observed in an experiment directly. One way of addressing this discrepancy is to convert measured values into the type simulated by the model. However, such conversions often introduce statistical errors and are not always possible. The better solution is to simulate exactly the same value type as measured in the experiment. To achieve that, separate experiment models may be required. Experiment models use the system model to simulate the system's response to manipulations and the experimentally measured signal (see Fig. 1). The approach of separating system models and experiment models is used in Virtual Cell software (6
Fig. 1 Application of scientific method to quantitative biology. Mechanistic Hypothesis about a biological system leads to a System Model, a quantitative description of system components and their interactions. To test the Hypothesis, the system is treated in ...
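The separation of system and experiment models can be sketched in Python as follows. This is a minimal example under stated assumptions: the first-order reaction, the absorbance readout via the Beer-Lambert law, and all function and parameter names are hypothetical and are not taken from gfit or Virtual Cell.

```python
import math

def system_model(time, k, a0):
    """Hypothetical system model: first-order conversion A -> B.

    Returns concentrations of A and B, which are not observed directly.
    """
    a = a0 * math.exp(-k * time)
    return {"A": a, "B": a0 - a}

def absorbance_experiment(times, k, a0, eps_a, eps_b, path_length=1.0):
    """Hypothetical experiment model: simulates the measured signal.

    Uses the system model for concentrations, then converts them to
    absorbance with the Beer-Lambert law (signal = l * sum(eps_i * c_i)).
    """
    signal = []
    for t in times:
        conc = system_model(t, k, a0)
        signal.append(path_length * (eps_a * conc["A"] + eps_b * conc["B"]))
    return signal

# Simulated absorbance can be compared with measured absorbance directly.
simulated = absorbance_experiment(
    [0.0, 1.0, 2.0], k=0.7, a0=1.0e-3, eps_a=5000.0, eps_b=0.0)
```

Because the experiment model outputs the same value type as the instrument, the measurements need no prior conversion into concentrations.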
A curve fitting procedure for a heterogeneous dataset can be quite complex and require extensive communication between its entities, i.e., model, optimization engine, experiment conditions, measurements, parameters, and constraints (see Fig. 2). Before a search for optimal parameter values can begin, the data for each experiment have to be examined:
- To determine which variables need to be simulated and their sizes
- To check that the data required for the simulation has been provided
- To check against constraints on variable dimensions and values imposed by the model
- To determine what parameters can be estimated and to choose their starting values
Fig. 2 Components of regression analysis. Arrows indicate information flow between components. (A) Analysis procedure requires extensive interactions between components. (B) To streamline the procedure, gfit mediates all interactions between components.
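The four checks listed above might look like the following Python sketch. The dictionary layout of `description` and `experiment`, the function name, and the choice of the midpoint of the bounds as a starting value are all assumptions made for illustration; they do not reproduce how gfit stores or examines data.

```python
def examine_experiment(description, experiment, fixed):
    """Examine one experiment before fitting (hypothetical data layout).

    Returns the sizes of the variables to simulate and starting values
    for the parameters that remain to be estimated.
    """
    # 1. Which variables need to be simulated, and how many points each has.
    sizes = {name: len(values)
             for name, values in experiment["measurements"].items()}
    unknown = [name for name in sizes if name not in description["outputs"]]
    if unknown:
        raise ValueError(f"model cannot simulate: {unknown}")

    # 2. All independent variables required by the model must be provided.
    missing = [name for name in description["independent"]
               if name not in experiment["conditions"]]
    if missing:
        raise ValueError(f"missing experiment conditions: {missing}")

    # 3. Provided values must satisfy the constraints imposed by the model.
    supplied = {**experiment["conditions"], **fixed}
    for name, value in supplied.items():
        low, high = description["bounds"].get(
            name, (float("-inf"), float("inf")))
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")

    # 4. Parameters that are not fixed are estimated; as an assumption,
    #    each starts at the midpoint of its bounds.
    start = {}
    for name in description["parameters"]:
        if name not in fixed:
            low, high = description["bounds"][name]
            start[name] = (low + high) / 2.0
    return sizes, start

description = {
    "independent": ["a0"],
    "parameters": ["k", "eps_a"],
    "outputs": ["absorbance"],
    "bounds": {"a0": (0.0, 1.0), "k": (0.0, 10.0), "eps_a": (0.0, 1.0e5)},
}
experiment = {
    "conditions": {"a0": 1.0e-3},
    "measurements": {"absorbance": [5.0, 2.5, 1.2]},
}
sizes, start = examine_experiment(description, experiment,
                                  fixed={"eps_a": 5000.0})
```

Here `eps_a` is treated as known, so only `k` is left for the optimization engine to estimate.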
Once the data have been examined, the optimization procedure can be initiated by passing a vector of starting parameter values to the optimization engine. Depending on the engine type, parameter constraints can also be provided. The engine conducts optimization by repeatedly changing parameters and recalculating the objective function on the basis of experimental measurements and simulations. To simulate each experiment, the input data for the model have to be assembled from applicable optimization parameters and experiment conditions. The input data also have to be checked against the constraints, since not all of them can be enforced by optimization engines. After simulating all experiments, the appropriate objective function can be computed and used by the optimization engine to determine the direction of the search.
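One evaluation of the objective function during a global fit, as described above, can be sketched in Python. The model, the data layout, and the decision to reject infeasible points by returning infinity are assumptions for illustration only; other ways of handling constraint violations are possible.

```python
import math

def simulate(inputs, times):
    """Hypothetical model: exponential decay scaled by an experiment-specific dose."""
    return [inputs["dose"] * inputs["amplitude"] * math.exp(-inputs["rate"] * t)
            for t in times]

def global_objective(vector, names, experiments, bounds):
    """Sum of squared residuals over all experiments for one parameter vector."""
    parameters = dict(zip(names, vector))
    total = 0.0
    for exp in experiments:
        # Assemble model inputs from shared parameters and this experiment's conditions.
        inputs = {**parameters, **exp["conditions"]}
        # Check constraints that the optimization engine may not enforce itself.
        for name, (low, high) in bounds.items():
            if not low <= inputs[name] <= high:
                return math.inf  # assumption: infeasible points are rejected outright
        simulated = simulate(inputs, exp["times"])
        total += sum((y - f) ** 2
                     for y, f in zip(exp["measurements"], simulated))
    return total

# Two synthetic experiments that share parameters but differ in conditions.
names = ["amplitude", "rate"]
bounds = {"amplitude": (0.0, 10.0), "rate": (0.0, 5.0)}
times = [0.0, 1.0, 2.0]
experiments = []
for dose in (1.0, 2.0):
    truth = {"amplitude": 2.0, "rate": 0.5, "dose": dose}
    experiments.append({"conditions": {"dose": dose}, "times": times,
                        "measurements": simulate(truth, times)})

value_at_truth = global_objective([2.0, 0.5], names, experiments, bounds)
```

An optimization engine would call `global_objective` repeatedly with different parameter vectors, using the returned value to choose the next point to try.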
The curve fitting procedure follows complicated rules that depend on the computational model, experimental data, and optimization engine. In addition, parameter constraints need to reflect various considerations related to the research project. These factors make the analysis procedure not only complex but also highly variable, so designing and maintaining project-specific software becomes prohibitively expensive. Fortunately, the patterns of data flow during regression analysis are largely independent of the system under investigation. This fact allowed us to design software that solves the analysis problem generally and for any model type.
1.3. Design of gfit
The purpose of gfit is to connect models with various types of experimental data. First, it simplifies the model's task of directly simulating experimentally observable variables. Second, during regression analysis, gfit maintains communications between the analysis components, acting as a mediator (see Fig. 2). Third, by defining standard application interfaces for models, optimization engines, objective functions, and other entities, it facilitates customization of the analysis procedure.
Of all components, application interfaces of models represent the biggest problem. Almost every step of the regression analysis procedure depends on what information is required and produced by the model. Yet, every model has different inputs and outputs. To be able to perform regression analysis with any kind of computational model, gfit uses a metadata approach. Any model used by gfit is expected to have an attached Model Description (see Note 3) defining its inputs and outputs as sets of variables (see Note 4). More information about Model Descriptions is provided later in this chapter. Once the rules for performing simulations with the model are known, the analysis process becomes more straightforward and independent of the model type (Fig. 3).
Fig. 3 Flow of information through regression analysis components. Communications between the components are controlled by gfit according to Model Description. During simulation of each experiment, independent variables from the experiment conditions and parameters ...
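To give a feel for the metadata approach, the Python sketch below attaches a description of inputs and outputs to a model so that generic code can run it without knowing its internals. The `Variable` and `ModelDescription` classes, their fields, and `run_model` are hypothetical names invented for this illustration; the actual Model Description format used by gfit is covered later in the chapter and may differ substantially.

```python
import math
from dataclasses import dataclass

@dataclass
class Variable:
    """One named model input or output (hypothetical metadata record)."""
    name: str
    lower: float = -math.inf
    upper: float = math.inf

@dataclass
class ModelDescription:
    """Hypothetical metadata attached to a model: its inputs and outputs."""
    inputs: list
    outputs: list

def run_model(model, description, parameters, conditions):
    """Simulate any model using only the information in its description."""
    arguments = {}
    for variable in description.inputs:
        # Experiment conditions take precedence over optimization parameters.
        source = conditions if variable.name in conditions else parameters
        if variable.name not in source:
            raise KeyError(f"no value supplied for input '{variable.name}'")
        value = source[variable.name]
        if not variable.lower <= value <= variable.upper:
            raise ValueError(f"input '{variable.name}' violates its constraints")
        arguments[variable.name] = value
    result = model(**arguments)
    # Return only the outputs that the description declares.
    return {variable.name: result[variable.name]
            for variable in description.outputs}

def decay(amplitude, rate, time):
    """Example model; its signature matches the description below."""
    return {"signal": amplitude * math.exp(-rate * time)}

decay_description = ModelDescription(
    inputs=[Variable("amplitude", lower=0.0),
            Variable("rate", lower=0.0),
            Variable("time", lower=0.0)],
    outputs=[Variable("signal")],
)

outputs = run_model(decay, decay_description,
                    parameters={"amplitude": 2.0, "rate": 0.5},
                    conditions={"time": 0.0})
```

The same `run_model` function would work for any other model that comes with a matching description, which is the sense in which the analysis becomes independent of the model type.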
Regression analysis is a complicated process with many pitfalls. gfit strives to provide information that can help researchers avoid mistakes related to the analysis. In the protocols that follow, the reader will build simple models and use the existing models and experimental data for parameter estimation.