Comput Biol Chem. Author manuscript; available in PMC 2010 October 1.

Published in final edited form as: *Comput Biol Chem*. Published online 2009 August 6. doi: 10.1016/j.compbiolchem.2009.07.014

PMCID: PMC2753754

NIHMSID: NIHMS135718

Gheorghe Craciun, Department of Mathematics and Department of Biomolecular Chemistry, University of Wisconsin-Madison. Email: craciun@math.wisc.edu, Phone: (608) 265-3391, Fax: (608) 263-8891.

The publisher's final edited version of this article is available at Comput Biol Chem


**Abstract**

We present a novel method for identifying a biochemical reaction network based on multiple sets of estimated reaction rates in the corresponding reaction rate equations arising from various (possibly different) experiments. Unlike some of the graphical approaches proposed in the literature, the current method uses the values of the experimental measurements only relative to the geometry of the biochemical reactions, under the assumption that the underlying reaction network is the same for all the experiments. The proposed approach utilizes algebraic statistical methods in order to parametrize the set of possible reactions so as to identify the most likely network structure, and is easily scalable to very complicated biochemical systems involving a large number of species and reactions. The method is illustrated with a numerical example of a hypothetical network arising from a “mass transfer”-type model.

**1. Introduction**

In modern biological research, it is very common to collect detailed time-dependent chemical concentration data for large networks of biochemical reactions (see the survey papers Crampin et al., 2004; Maria, 2004). Often, the main purpose of collecting such data is to identify the exact structure of a network of chemical reactions for which the identity of the chemical species present in the network is known, but *a priori* no information is available on the species interactions (see e.g. Margolini and Califano, 2007). The problem is of interest both in classical theoretical chemistry and, more recently, in the context of molecular and systems biology, and as such has received a lot of attention in the literature over the last several decades (Bansal et al., 2007; Fay and Balogh, 1968; Himmeblau et al., 1967; Hosten, 1979; Karnaukhov et al., 2007; Rudakov, 1960, 1970; Schuster et al., 2002; Vajda et al., 1986).

In general, two very different reaction networks might generate identical mass-action dynamical system models, making it impossible to discriminate between them, even if one is given experimental data of perfect accuracy and unlimited temporal resolution. Sometimes this *lack of uniqueness* is referred to as the “fundamental dogma of chemical kinetics”, although it is actually not a well known fact in the biochemistry or chemical engineering communities (Crampin et al., 2004; Erdi and Toth, 1989; Epstein and Pojman, 2002). Necessary and sufficient conditions for two reaction networks to give rise to the same *deterministic* dynamical system model (i.e., the same *reaction rate equations*) are described in Craciun and Pantea (2008), where the problem of identifiability of reaction networks given high accuracy data was analyzed in detail. The key observation is that, if we think of reactions as vectors, it is possible for different sets of such vectors to span the same positive cones, or at least to span positive cones that have nonempty intersection (see Figure 1 for an example).
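
The cone observation can be made concrete computationally: deciding whether an estimated rate vector lies in the positive span of a candidate set of reaction vectors is a nonnegative least-squares problem. The sketch below is our own illustration (the helper name and the 2-D vectors are made up, not taken from the paper) of how two different reaction sets spanning the same positive cone are indistinguishable by such a membership test:

```python
import numpy as np
from scipy.optimize import nnls

def in_positive_cone(R, k, tol=1e-9):
    """True if k = R @ c for some c >= 0, i.e. k lies in cone(R).
    Columns of R are reaction vectors (hypothetical helper, ours)."""
    _, residual = nnls(R, k)
    return residual < tol

# Hypothetical 2-D illustration: two different reaction sets that
# span the same positive cone (the first quadrant).
R_a = np.array([[1.0, 0.0], [0.0, 1.0]]).T              # reactions (1,0), (0,1)
R_b = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]).T  # adds the redundant (1,1)

k = np.array([2.0, 3.0])  # a made-up estimated rate vector
print(in_positive_cone(R_a, k), in_positive_cone(R_b, k))  # both True
print(in_positive_cone(R_a, np.array([-1.0, 1.0])))        # False
```

Both candidate networks accept the same data points, which is exactly the non-uniqueness discussed above.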

**Figure 1.** Identifiability of reaction networks given experimental data. Note that, for a deterministic mass-action model, even if we can estimate the vector *K* of parameter values with great accuracy, we cannot determine if the “correct” reaction **...**

On the other hand, it is often the case that experimental measurements for the study of a specific reaction network or pathway are collected under many different experimental conditions, which affect the values of the reaction rate parameters. Almost always, the reactions of interest are not “elementary reactions”, for which the reaction rate parameters must be constant, but so-called “overall reactions” that summarize several elementary reaction steps. In that case the reaction rate parameters may reflect the concentrations of biochemical species which have not been included explicitly in the model. In such circumstances the reaction rate parameters are *not* constant, but rather depend on specific experimental conditions, such as concentrations of enzymes and other intermediate species. Therefore, the estimated vector of reaction rate parameters will not be the same for all experimental conditions; rather, each specific experimental setting will give rise to one such vector of parameters. However, the set of all these vectors should span a specific cone, whose extreme rays identify exactly the set of reactions that gave rise to the data.

The purpose of the current paper is to propose a statistical method, based on the above geometric considerations, which allows one to take advantage of the inherent stochasticity in the data in order to determine the *unique* reaction network that can best account for the results of *all* the available experiments pooled together. The idea is related to the notion of an *algebraic statistical model* (as described in Pachter and Sturmfels, 2005, Chapter 1), and relies on mapping the estimated reaction parameters into an appropriate convex region of the span of the reaction vectors of a network, using the underlying geometry to identify the reactions which are most likely to span that region. As shown below, this approach reduces the network identification problem to a statistical inference problem for the parameters of a multinomial distribution, which may then be solved using, for instance, classical likelihood methods.

**2. Maximum Likelihood Inference for a Biochemical Reaction Network**

In this section we develop a formal way of inferring a most likely subnetwork of a given conic network (i.e., a network represented by a cone like the one in Figure 1) of the minimal spanning dimension. For inference purposes, in a network of *m* reactions and *d* species we assume that the empirical data 𝒦 = {*K _{i}*, *i* = 1, … , *k*} consist of *k* estimated vectors of reaction rate parameters, one for each experimental setting.

Throughout this paper we will use a common abuse of notation, where we think of the species *A*_{1}, … , *A _{d}* as the formal generators of a *d*-dimensional real vector space, so that complexes and reaction vectors may be identified with elements of ℝ^{d}.

Consider *d* species, and *m* possible reactions with reaction vectors *R* = {*R*_{1}, … , *R _{m}*} ⊂ ℝ^{d}.

Denote by *cone*(*R*) the positive cone generated by the reaction vectors in *R*. Let *S* be the partition of *cone*(*R*) obtained from all possible intersections of the non-degenerate cones in ℛ_{d}, the set of cones generated by *d*-element subsets of *R*. Suppose *S* contains *n* full-dimensional regions *S*_{1}, … , *S _{n}*; throughout we shall refer to these regions as *building blocks*.

Let Δ_{m–1} be the probability simplex in ℝ^{m} and let *θ* ∈ Δ_{m–1} be a vector of probabilities associated with the reactions that give rise to *R*. We assume that these *m* reactions have the same source complex (i.e., form a conic network), since, as explained in Craciun and Pantea (2008), the identifiability of a network can be addressed one source complex at a time. Define the polynomial map

$$g:{\Delta}_{m-1}\to {\mathbb{R}}^{n}$$

where

$${g}_{i}\left(\theta \right)=\underset{C=\mathit{cone}({R}_{\sigma \left(1\right)},\dots ,{R}_{\sigma \left(d\right)})\in {\mathcal{R}}_{d}}{\Sigma}\frac{\mathit{vol}(C\cap {S}_{i})}{\mathit{vol}\left(C\right)}{\theta}_{\sigma \left(1\right)}\cdots {\theta}_{\sigma \left(d\right)}$$

(1)

for *i* = 1, … , *n*.

We take^{1} $\frac{\mathit{vol}(C\cap {S}_{i})}{\mathit{vol}\left(C\right)}=0\phantom{\rule{thickmathspace}{0ex}}\text{if}\phantom{\rule{thickmathspace}{0ex}}\mathit{vol}\left(C\right)=0$. Define $s\left(\theta \right)={\Sigma}_{\sigma}{\theta}_{\sigma \left(1\right)}\cdots {\theta}_{\sigma \left(d\right)}$ and

$$p\left(\theta \right)=({p}_{1}\left(\theta \right),\dots ,{p}_{n}\left(\theta \right))=({g}_{1}\left(\theta \right)/s\left(\theta \right),\dots ,{g}_{n}\left(\theta \right)/s\left(\theta \right)).$$

(2)

In this setting *p*(*θ*) ∈ Δ_{n–1} ⊂ ℝ^{n} is our statistical model for the data, after we substitute ${\theta}_{m}=1-{\Sigma}_{j=1}^{m-1}{\theta}_{j}$. Note that we may interpret the monomials *θ*_{σ(1)} ⋯ *θ*_{σ(d)} in (1) as the probabilities of a given data point being generated by the *d*-tuple of reactions σ(1), … , σ(*d*). With this interpretation the coordinate *p _{i}* of the map *p* is the probability that a data point falls into the building block *S _{i}*.
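
Evaluating the map *p* is mechanical once the volume fractions are known. The sketch below is our own illustration, with a made-up helper name and made-up volume fractions for a hypothetical configuration with *m* = 3 reactions, *d* = 2, and *n* = 2 building blocks; because the fractions for each non-degenerate *d*-tuple sum to one over the blocks, the computed probabilities sum to one:

```python
from itertools import combinations
import numpy as np

def model_probabilities(theta, coeffs, d):
    """Evaluate the rational map p = g / s of Eq. (2).

    coeffs[i] maps a d-tuple sigma of reaction indices to the fraction
    vol(C_sigma ∩ S_i) / vol(C_sigma).  The values below are invented
    for illustration; in practice they come from the cone geometry."""
    m = len(theta)
    g = np.array([sum(v * np.prod([theta[j] for j in sigma])
                      for sigma, v in ci.items())
                  for ci in coeffs])
    s = sum(np.prod([theta[j] for j in sigma])
            for sigma in combinations(range(m), d))
    return g / s

# Toy fractions: each pair's fractions sum to 1 across the two blocks.
coeffs = [{(0, 1): 1.0, (0, 2): 0.4},
          {(0, 2): 0.6, (1, 2): 1.0}]
theta = np.array([0.5, 0.3, 0.2])
p = model_probabilities(theta, coeffs, d=2)
print(p, p.sum())  # the p_i sum to 1 by construction
```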

Let *u _{i}* denote the number of data points in the building block *S _{i}*, for *i* = 1, … , *n*. The log-likelihood function of the data is then

$$l\left(\theta \right)=\underset{i=1}{\overset{n}{\Sigma}}{u}_{i}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{i}\left(\theta \right).$$

(3)

Our inference problem is to find

$$\widehat{\theta}={\mathrm{argmax}}_{\theta}l\left(\theta \right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{subject to}\phantom{\rule{1em}{0ex}}\underset{i=1}{\overset{m}{\Sigma}}{\theta}_{i}=1\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{\theta}_{i}\ge 0.$$

(4)

**Example 2.1.** Consider the two reaction networks described in Figure 1. The model has *d* = 3 species and a total of *m* = 5 possible reactions *R* = {*R*_{1}, … , *R*_{5}} = {*A*_{1} → 2*A*_{1}, *A*_{1} → *A*_{1} + *A*_{2}, *A*_{1} → 2*A*_{3}, *A*_{1} → 2*A*_{2}, *A*_{1} → *A*_{1} + *A*_{3}}. In this case there are *n* = 5 building blocks *S*_{1}, … , *S*_{5} defined by the intersections of all non-trivial reaction cones generated by reaction triples, as illustrated in Figure 2.

**Figure 2.** Configuration of the five building blocks for the possible reactions *R* = {*R*_{1}, … , *R*_{5}} = {*A*_{1} → 2*A*_{1}, *A*_{1} → *A*_{1} + *A*_{2}, *A*_{1} → 2*A*_{3}, *A*_{1} → 2*A*_{2}, *A*_{1} → *A*_{1} + *A*_{3}} of Example 2.1.

Thus, denoting *C _{jkl}* = *cone*(*R _{j}*, *R _{k}*, *R _{l}*), we have

$$\begin{array}{cc}\hfill {S}_{1}& ={C}_{134}\cap {C}_{234}\cap {C}_{345}\hfill \\ \hfill {S}_{2}& ={C}_{134}\cap {C}_{145}\cap {C}_{234}\cap {C}_{245}\hfill \\ \hfill {S}_{3}& ={C}_{123}\cap {C}_{134}\cap {C}_{235}\cap {C}_{345}\hfill \\ \hfill {S}_{4}& ={C}_{123}\cap {C}_{134}\cap {C}_{145}\cap {C}_{235}\cap {C}_{245}\hfill \\ \hfill {S}_{5}& ={C}_{123}\cap {C}_{125}\cap {C}_{134}\cap {C}_{145}.\hfill \end{array}$$

Note that the cones *C*_{124} and *C*_{135} are degenerate and are not involved in the definitions of the *S _{i}*'s. Denoting further ${v}_{\mathit{jkl}}^{\left(i\right)}$ = *vol*(*C _{jkl}* ∩ *S _{i}*)/*vol*(*C _{jkl}*), we obtain

$$\begin{array}{rl}{g}_{1}\left(\theta \right)&={v}_{134}^{\left(1\right)}{\theta}_{1}{\theta}_{3}{\theta}_{4}+{v}_{234}^{\left(1\right)}{\theta}_{2}{\theta}_{3}{\theta}_{4}+{v}_{345}^{\left(1\right)}{\theta}_{3}{\theta}_{4}{\theta}_{5}\\ {g}_{2}\left(\theta \right)&={v}_{134}^{\left(2\right)}{\theta}_{1}{\theta}_{3}{\theta}_{4}+{v}_{145}^{\left(2\right)}{\theta}_{1}{\theta}_{4}{\theta}_{5}+{v}_{234}^{\left(2\right)}{\theta}_{2}{\theta}_{3}{\theta}_{4}+{v}_{245}^{\left(2\right)}{\theta}_{2}{\theta}_{4}{\theta}_{5}\\ {g}_{3}\left(\theta \right)&={v}_{123}^{\left(3\right)}{\theta}_{1}{\theta}_{2}{\theta}_{3}+{v}_{134}^{\left(3\right)}{\theta}_{1}{\theta}_{3}{\theta}_{4}+{v}_{235}^{\left(3\right)}{\theta}_{2}{\theta}_{3}{\theta}_{5}+{v}_{345}^{\left(3\right)}{\theta}_{3}{\theta}_{4}{\theta}_{5}\\ {g}_{4}\left(\theta \right)&={v}_{123}^{\left(4\right)}{\theta}_{1}{\theta}_{2}{\theta}_{3}+{v}_{134}^{\left(4\right)}{\theta}_{1}{\theta}_{3}{\theta}_{4}+{v}_{145}^{\left(4\right)}{\theta}_{1}{\theta}_{4}{\theta}_{5}+{v}_{235}^{\left(4\right)}{\theta}_{2}{\theta}_{3}{\theta}_{5}+{v}_{245}^{\left(4\right)}{\theta}_{2}{\theta}_{4}{\theta}_{5}\\ {g}_{5}\left(\theta \right)&={v}_{123}^{\left(5\right)}{\theta}_{1}{\theta}_{2}{\theta}_{3}+{v}_{125}^{\left(5\right)}{\theta}_{1}{\theta}_{2}{\theta}_{5}+{v}_{134}^{\left(5\right)}{\theta}_{1}{\theta}_{3}{\theta}_{4}+{v}_{145}^{\left(5\right)}{\theta}_{1}{\theta}_{4}{\theta}_{5},\end{array}$$

where the coefficients satisfy ${\Sigma}_{i}{v}_{\mathit{jkl}}^{\left(i\right)}$ = 1 for any triple {*j, k, l*} appearing on the right-hand side in the formulas above. The rational map (2) is therefore given by

$$p=\frac{g}{\underset{jkl}{\Sigma}{\theta}_{j}{\theta}_{k}{\theta}_{l}}$$

where the sum in the denominator extends over all distinct triples {*j, k, l*} *excluding* {*1, 2, 4*} *and* {*1, 3, 5*}, *i.e., the ones corresponding to degenerate cones.*

The model representation via the rational map (2) may be equivalently described in terms of the simpler polynomial map (1), as follows. Let us substitute ${\stackrel{~}{\theta}}_{i}={\theta}_{i}/s{\left(\theta \right)}^{1/d}$ for *i* = 1, … , *m*. Since each *g _{i}* is a homogeneous polynomial of degree *d*, it follows that

$${\stackrel{~}{g}}_{i}\left(\stackrel{~}{\theta}\right)={p}_{i}\left(\theta \right).$$

Note that ${\stackrel{~}{g}}_{i}$ : ${\mathbb{R}}_{>0}^{m}$ → ℝ is a polynomial, and in the new variables the inference problem (4) may be equivalently re-written as

$$\begin{array}{cc}\hfill & \widehat{\theta}={\mathrm{argmax}}_{\stackrel{~}{\theta}}\phantom{\rule{thickmathspace}{0ex}}l\left(\stackrel{~}{\theta}\right)\hfill \\ \hfill & \text{subject to}\phantom{\rule{1em}{0ex}}\phantom{\rule{thickmathspace}{0ex}}\underset{\sigma}{\Sigma}{\stackrel{~}{\theta}}_{\sigma \left(1\right)}\cdots {\stackrel{~}{\theta}}_{\sigma \left(d\right)}=\underset{i}{\Sigma}{\stackrel{~}{g}}_{i}\left(\stackrel{~}{\theta}\right)=\underset{i=1}{\overset{n}{\Sigma}}{p}_{i}\left(\theta \right)=1,\phantom{\rule{1em}{0ex}}{\forall}_{i}\phantom{\rule{thinmathspace}{0ex}}{\stackrel{~}{\theta}}_{i}\ge 0.\hfill \end{array}$$

(*)


Consider a fixed *d*-tuple of reactions (say, *σ*_{1}) and, in the formulas for ${\stackrel{~}{g}}_{i}\left(\stackrel{~}{\theta}\right)$, fix the values of all coordinates of $\stackrel{~}{\theta}$ except one, say ${\stackrel{~}{\theta}}_{1}$. Since each monomial is squarefree, every coordinate of the model then becomes an affine function of ${\stackrel{~}{\theta}}_{1}$:

$${p}_{i}({\stackrel{~}{\theta}}_{1}\mid \cdot )={a}_{i}{\stackrel{~}{\theta}}_{1}+{b}_{i}\phantom{\rule{1em}{0ex}}i=1\dots ,n$$

where Σ_{i} *a _{i}* = 0 and Σ_{i} *b _{i}* = 1.

By Varchenko's theorem (see Pachter and Sturmfels, 2005, Chapter 1), the conditional, one-dimensional version of problem (*) may now be solved iteratively for each coordinate: among the bounded intervals determined by the roots of the affine functions *a _{i}* ${\stackrel{~}{\theta}}_{1}$ + *b _{i}*, exactly one is statistically meaningful (i.e., keeps all the *p _{i}* positive), and on it the conditional log-likelihood has a unique local maximum.

Due to the conditional convexity of the one-dimensional problems, the above considerations suggest that the following algorithm for (local) maximization of *l*($\stackrel{~}{\theta}$) should be valid (see also Pachter and Sturmfels, 2005, Example 1.7, page 11):

**Algorithm 2.1.**

*Pick initial vectors* $\stackrel{~}{\theta}$ *and* ${\stackrel{~}{\theta}}_{\mathit{old}}$ *in* ${\mathbb{R}}_{>0}^{m}$. *While* $|l(\stackrel{~}{\theta})-l({\stackrel{~}{\theta}}_{\mathit{old}})|>\epsilon$ *for a fixed tolerance* ε:

- ${\stackrel{~}{\theta}}_{\mathit{old}}\leftarrow \stackrel{~}{\theta}$
- for *k* = 1 to *m* do
  - compute *a _{i}*, *b _{i}* (as functions of ${\stackrel{~}{\theta}}_{j}$, *j* ≠ *k*)
  - identify the bounded interval determined by Varchenko's formula which is statistically meaningful (there is only one)
  - use a simple hill-climbing algorithm to find an optimal ${\stackrel{~}{\theta}}_{k}^{\mathit{opt}}$ in that interval
  - update ${\stackrel{~}{\theta}}_{k}\leftarrow {\stackrel{~}{\theta}}_{k}^{\mathit{opt}}$

*Recover* θ *from* $\stackrel{~}{\theta}$ *by taking* ${\theta}_{k}={\stackrel{~}{\theta}}_{k}/{\Sigma}_{i}{\stackrel{~}{\theta}}_{i}$.

The advantage of the algorithm above is that it reduces a potentially very complicated multivariate optimization problem, in which *d* and *m* are large, to the iterative solution of a simple univariate one. The disadvantage is that, due to its dimension-iterative character, the algorithm can be slow, and for smaller networks perhaps less efficient than some off-the-shelf optimization algorithms available in commercial software (e.g., some modified hill-climbing methods with random restarts). For that reason, in our numerical example below we used the standard Matlab optimization package rather than Algorithm 2.1.
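
The coordinate-wise scheme of Algorithm 2.1 can be sketched in a few lines. The version below is schematic and entirely ours: it replaces the statistically meaningful Varchenko interval with a simple fixed box, and, to keep the example self-contained, it maximizes the scale-invariant multinomial log-likelihood in unnormalized coordinates for the made-up counts u = (6, 4); normalizing the fixed point recovers the proportions (0.6, 0.4):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_ascent(loglik, theta0, tol=1e-10, max_sweeps=100, bound=50.0):
    """Schematic coordinate-wise maximization (sketch of Algorithm 2.1).
    The Varchenko interval is replaced here by the fixed box (1e-9, bound)."""
    theta = np.array(theta0, dtype=float)
    old = -np.inf
    for _ in range(max_sweeps):
        for k in range(len(theta)):
            def neg(x, k=k):
                t = theta.copy()
                t[k] = x
                return -loglik(t)
            theta[k] = minimize_scalar(neg, bounds=(1e-9, bound),
                                       method="bounded").x
        cur = loglik(theta)
        if abs(cur - old) < tol:
            break
        old = cur
    return theta

# Toy target: multinomial log-likelihood in unnormalized coordinates.
u = np.array([6.0, 4.0])
def loglik(t):
    return float(u @ np.log(t) - u.sum() * np.log(t.sum()))

theta_hat = coordinate_ascent(loglik, [1.0, 1.0])
print(theta_hat / theta_hat.sum())  # -> approximately [0.6, 0.4]
```

Each inner step is a one-dimensional bounded maximization, mirroring the univariate structure exploited by the algorithm.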

In the next section we revert to the original notation and problem (4). Based on (*), we may thus extend the map *g* to ${\mathbb{R}}_{>0}^{m}$, take *s*(*θ*) = 1 in (3), and re-cast the original likelihood maximization problem (4) as

$$\widehat{\theta}={\mathrm{argmax}}_{\theta}\underset{i}{\Sigma}{u}_{i}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{g}_{i}\left(\theta \right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\text{subject to}\phantom{\rule{1em}{0ex}}\underset{\sigma}{\Sigma}{\theta}_{\sigma \left(1\right)}\cdots {\theta}_{\sigma \left(d\right)}=1\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{\theta}_{i}\ge 0$$

(4′)

where the *g _{i}*'s are given by (1).

**3. Simulated Numerical Example**

In this section we illustrate the ideas discussed above by analyzing a specific numerical example in detail.

If we have *d* chemical species and data of the form 𝒦 = {*K _{i}*, *i* = 1, … , *k*}, the method of Section 2 applies directly. As an illustration, consider the “mass transfer”-type network

(5)

where *A _{i}*, *i* = 0, … , 3, denote the chemical species.

Note that the (deterministic) dynamics of the chemical reaction network (5) is governed by linear differential equations of the form

$${\mathit{dA}}_{i}/\mathit{dt}={\gamma}_{i}{A}_{0},\qquad i=0,\dots ,3,$$

(6)

where the parameters *γ _{i}*, *i* = 0, … , 3, are determined by the reaction rate constants.

**Figure 3.** An example of generation of the data points *K*_{i} for *i* = 1, 2, 3 via a two-step process of simulation and estimation. Three stochastic trajectories of the reaction network (5) were simulated via the Gillespie algorithm with propensity (reaction) **...**
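
The simulate-then-estimate step described above can be reproduced in outline with a standard Gillespie simulation. Since the display of network (5) did not survive in this version of the text, the sketch below uses a hypothetical two-channel first-order network A₀ → A₁, A₀ → A₂ as a stand-in (the channel list and rate values are ours, not the paper's); for such a pure branching process, the final branch fractions estimate γᵢ/Σⱼγⱼ:

```python
import numpy as np

rng = np.random.default_rng(1)

def gillespie_linear(n0, gammas, t_end):
    """Gillespie SSA for the hypothetical first-order network
    A0 -> A_i (i = 1, 2) with rate constants gammas; returns final counts.
    This is an illustrative stand-in, not the paper's exact network (5)."""
    counts = np.array([n0] + [0] * len(gammas))
    t = 0.0
    while t < t_end and counts[0] > 0:
        rates = np.asarray(gammas) * counts[0]  # first-order propensities
        total = rates.sum()
        t += rng.exponential(1.0 / total)       # time to next event
        if t >= t_end:
            break
        j = rng.choice(len(gammas), p=rates / total)  # which channel fires
        counts[0] -= 1
        counts[j + 1] += 1
    return counts

final = gillespie_linear(n0=10000, gammas=[0.3, 0.7], t_end=100.0)
print(final[1:] / final[1:].sum())  # branch fractions, roughly (0.3, 0.7)
```

Repeating such simulations under different rate settings and fitting the linear ODEs to each trajectory is what produces the data points *K _{i}*.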

We view the single resulting estimate ($\widehat{\gamma}_{0}$, $\widehat{\gamma}_{1}$, $\widehat{\gamma}_{2}$, $\widehat{\gamma}_{3}$) as coordinates of a point *K* in the species coordinate system [*A*_{0}, *A*_{1}, *A*_{2}, *A*_{3}]. This representation of data points is not related to a choice of the reactions; note, however, that if the estimation error is sufficiently small, each such point falls within the cone spanned by the true reaction vectors.

The data set 𝒦 = {*K _{i}*, *i* = 1, … , *k*} in our example consisted of *k* = 10 points obtained by repeating this simulation-and-estimation procedure.

In order to test our method, let us first add one incorrect reaction, *A*_{0} → *A*_{2}, and from this point on suppose we have no *a priori* knowledge of the true chemistry; therefore, the five possible reactions are as shown in (7).

(7)

Later in this section we also consider the case where we add not just one, but several incorrect reactions.

In order to obtain an estimate via (4′), one needs to be able to evaluate the map (1); i.e., in addition to the data counts vector *u* ∈ ℝ^{n} in (3), one also needs to know the values of the coefficients of the polynomial map. Whereas the calculation of their exact values is difficult for larger networks, they are easily approximated via Monte Carlo integration: sampling *N* points from each cone *C* and counting the proportion falling into each building block gives

$$\frac{\mathit{vol}(C\cap {S}_{i})}{\mathit{vol}\left(C\right)}\approx \frac{\#\{\text{points in }C\cap {S}_{i}\}}{N},\qquad i=1,\dots ,n.$$
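
This Monte Carlo step can be sketched as follows. The helper below is our own (the names and the 2-D cone-split-by-a-diagonal toy are made up); points are drawn from the cone using gamma-distributed coefficients, in the spirit of the gamma-density measure mentioned in footnote 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def volume_fractions(C_gens, block_tests, N=20000):
    """Monte Carlo estimate of vol(C ∩ S_i)/vol(C) for each block S_i.
    Columns of C_gens generate the cone; block_tests[i](x) returns True
    iff x lies in S_i.  (Illustrative sketch, names are ours.)"""
    counts = np.zeros(len(block_tests))
    for _ in range(N):
        # random point in the cone: nonnegative gamma-weighted combination
        x = C_gens @ rng.gamma(shape=1.0, size=C_gens.shape[1])
        for i, inside in enumerate(block_tests):
            if inside(x):
                counts[i] += 1
                break  # the blocks S_i partition the cone
    return counts / N

# 2-D toy: C = first quadrant, split by the diagonal into two blocks.
C = np.array([[1.0, 0.0], [0.0, 1.0]])
tests = [lambda x: x[0] >= x[1], lambda x: x[0] < x[1]]
fracs = volume_fractions(C, tests)
print(fracs)  # approximately [0.5, 0.5] by symmetry
```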

With the coefficient values determined as above, the coordinate polynomials *g _{i}* in (1) were now easily calculated by identifying the cones that contain the appropriate building block regions *S _{i}*.

The geometry in (7) can be visualized in the 3-dimensional subspace *W* ⊂ ℝ^{4} generated by {*A*_{1}, *A*_{2}, *A*_{3}}. This follows as all the reaction targets are in this subspace, and we can understand the configuration of the relevant four-dimensional cones by looking at their intersections with *W*. Each four-dimensional cone with vertex *X*_{0} intersects *W* along a tetrahedron. The intersections of all these tetrahedra cut out the building blocks corresponding to our example (7), as illustrated in Figures 4 and 5. There are five vertices labeled by numbers corresponding to the five target reactions in (7); they form a six-faced convex polyhedron. Let *C* be the intersection of the line passing through points 1 and 2, denoted (12), with the plane (345). Then all building blocks are tetrahedra with a vertex at *C* and the opposite face being one of the six faces of the polyhedron. For example, the building block (*C*245) is depicted in Figure 4.

Not surprisingly, the 10 data points generated in our example were distributed among the building blocks that compose the tetrahedron (1234) corresponding to the true reactions; 6 data points fell inside the building block (*C*234) and 4 inside (*C*134). The log-likelihood function was found in this case as

$$l\left(\theta \right)=6\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}(.706\cdot {\theta}_{1}{\theta}_{2}{\theta}_{3}{\theta}_{4}+.35\cdot {\theta}_{2}{\theta}_{3}{\theta}_{4}{\theta}_{5})+4\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}(.294\cdot {\theta}_{1}{\theta}_{2}{\theta}_{3}{\theta}_{4}+.339\cdot {\theta}_{1}{\theta}_{3}{\theta}_{4}{\theta}_{5}).$$

In order to maximize *l*(*θ*) or, equivalently, to minimize −*l*(*θ*), rather than using the conditional Algorithm 2.1 we used the Matlab function **fmincon** for constrained optimization. As in (4′), the constraint is given by the condition *s*(*θ*) = Σ_{σ} *θ*_{σ(1)} ⋯ *θ*_{σ(4)} = 1 and comes from the fact that there are 4 reactions in the true network. This constraint also ensures that *p* maps into the probability simplex Δ_{n–1}, i.e., defines an algebraic variety (polynomial map) which corresponds to a valid statistical model.
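
An open-source analogue of this **fmincon** call can be sketched with `scipy.optimize.minimize` (SLSQP). This is our sketch, not the authors' code: it uses the rounded volume coefficients from the log-likelihood displayed above, so the minimum comes out near 6.99 rather than the reported 6.94, and the maximizer is unique only up to the product θ₁⋯θ₄:

```python
import numpy as np
from itertools import combinations
from scipy.optimize import minimize

# Negative log-likelihood from the example: 6 points in (C234), 4 in (C134),
# with the rounded volume coefficients quoted in the text.
def neg_loglik(th):
    g1 = 0.706 * th[0]*th[1]*th[2]*th[3] + 0.35 * th[1]*th[2]*th[3]*th[4]
    g2 = 0.294 * th[0]*th[1]*th[2]*th[3] + 0.339 * th[0]*th[2]*th[3]*th[4]
    return -(6.0 * np.log(g1) + 4.0 * np.log(g2))

# Constraint s(theta) = 1: sum of the squarefree degree-4 monomials over
# all 4-subsets of the 5 reactions (degenerate quadruples, if any, would
# be dropped as in the text).
def s(th):
    return sum(np.prod(th[list(c)]) for c in combinations(range(5), 4))

res = minimize(neg_loglik, x0=np.full(5, 0.9), method="SLSQP",
               bounds=[(1e-8, None)] * 5,
               constraints=[{"type": "eq", "fun": lambda th: s(th) - 1.0}])
print(np.round(res.x, 3), round(res.fun, 2))
# The incorrect reaction is discarded: theta_5 -> 0 while theta_1*...*theta_4
# -> 1, matching the reported solution (1, 1, 1, 1, 0) up to this product.
```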

The optimization is repeated 2^{m–1} times (i.e., 16 times for the example network (7)) with random initial conditions satisfying the constraint. A list of (local) minima was created, and entries were merged if they were sufficiently close. The point *θ* that realized the smallest local minimum was reported, together with the percentage of times the algorithm ended up at that particular point (success rate). The output for example (7) given by the customized Matlab function was

- Minimum of negative log-likelihood: 6.94
- Theta: 1 1 1 1 0
- Hits: 16 out of 16, 100%.

As we may see from these results, in the notation of Figures 4 and 5 the algorithm identified the true reactions (targets) 1, 2, 3 and 4, and discarded the incorrect reaction 5.

We also ran example (5) with, respectively, two, three, four and five incorrect reactions added to the set of four correct ones. The true network was always identified, and the success rate (percentage of correct hits for various random initial guesses) was high. The results of these experiments are summarized in Table 2. Additionally, in order to investigate the robustness of our procedure against lack of precision in the estimates of the *γ*'s, we added a zero-mean Gaussian term with varying variance to multiple batches (3000) of *k* = 10 sets of estimators $\widehat{\gamma}$ in the network (7) with one erroneous reaction. The estimated probability of assigning the lowest probability to the extraneous reaction vector, as a function of the amount of noise added to the estimated values $\widehat{\gamma}$, is presented in Figure 6. As seen from the figure, with small to moderate random noise added to the values of the reaction constants, the likelihood method is still able to recover the correct set of reactions at a remarkably high rate. Not surprisingly, for very noisy data (above one SD) the recovery rate decreases significantly.

**Figure 6.** The rate of recovery of the correct reactions out of the network (7) as a function of the size of the Gaussian noise added to the least-squares estimates of the (*γ*_{i}) parameters. The solid graph is a smoothed representation of the empirical values **...**

**4. Summary and Discussion**

We have proposed herein a statistical method for inferring a biochemical reaction network given several sets of data that originate from “noisy” versions of the reaction rate equations associated with the network. As illustrated in the earlier work of some of the authors (Craciun and Pantea, 2008), in the usual deterministic sense such networks are in general unidentifiable, i.e., different chemical reaction networks may give rise to exactly the same reaction rate equations. In practice, the situation is further complicated since the coefficients of the reaction rate equations are estimated from available experimental data, and hence are subject to measurement error; moreover, their actual values may differ under different experimental conditions, i.e., at different data points.

The statistical approach described here is largely unaffected by these problems, as it relies only on the geometry of the network relative to the data distribution in order to identify the sets of most likely reactions. Hence, the method takes advantage of the algebraic and geometric representation of the network, rather than merely the observed experimental values of the network species, as is commonly the case in network inference models based on graphical methods, such as Bayesian or probabilistic Boolean networks (see Richardson and Spirtes, 2002). Still, in order to use the proposed multinomial parametrization of a biochemical network, the method does require a valid way of mapping the experimentally estimated rate coefficients into the network's appropriate convex regions, and with very large measurement errors it is likely to perform poorly, as illustrated in Figure 6. On the other hand, precisely because of the need for this data mapping, the method has the very attractive feature of being able to potentially combine a variety of different data sets, obtained by various methods, into one set of experimental points placed in the convex hull of the network building blocks.

This potential universality of the method's applicability requires further study, and possibly the development of additional statistical methodology beyond the scope of our present work. In the current paper our main goal was to present a proof-of-concept example based on simulated data, with a purposefully straightforward but non-trivial model discrimination problem. For the example provided in this paper the method was seen to perform very well, with almost perfect discrimination against incorrect models even as the complexity of the model selection problem increased.

Nonetheless, further study and development is needed to assess how well the method may perform on more challenging and realistic data sets. In particular, one of the aspects of the methodology which was not pursued here, and which could improve its computational scalability, is the utilization of techniques from computational algebra in order to increase the efficiency and further automate the proposed maximization algorithm.

**Acknowledgments**

The authors would like to thank Peter Huggins and Ruriko Yoshida for very helpful discussions, and additionally thank Peter Huggins for making available his Matlab script for volume calculations. The research was partially sponsored by FRG grant NSF–DMS 0840695 (Rempala) and NSF–DMS 0553687 (Craciun), as well as by the NIH grant 1R01DE019243-01 (Rempala) and the NIH grant 1R01GM086881 (Craciun).


^{1}In general, it may be beneficial to consider various measures *vol*(·) which are absolutely continuous w.r.t. the usual Lebesgue measure. For instance, in Section 3 we describe an example where this measure is defined via gamma densities.

^{2}For our current example, a brief inspection of Table 1 indicates a reasonably good agreement between the estimates and the true values of the parameters (*γ*_{0}, *γ*_{1}, *γ*_{2}, *γ*_{3}), both in terms of the actual values and the corresponding SEs.


Casian Pantea, Department of Mathematics, University of Wisconsin-Madison. Email: pantea@math.wisc.edu.

Grzegorz A. Rempala, Department of Biostatistics, Medical College of Georgia, Augusta, GA 30912. Email: grempala@mcg.edu.

**References**

- Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol. Syst. Biol. 2007;78(3). doi:10.1038/msb4100120.
- Craciun G, Pantea C. Identifiability of chemical reaction networks. J. Math. Chem. 2008;44(1):244–259.
- Crampin E, Schnell S, McSharry PE. Mathematical and computational techniques to deduce complex biochemical reaction mechanisms. Prog. Biophys. Mol. Biol. 2004;86:77–112.
- Epstein I, Pojman J. An Introduction to Nonlinear Chemical Dynamics: Oscillations, Waves, Patterns, and Chaos. Oxford University Press; Oxford, UK: 2002.
- Erdi P, Toth J. Mathematical Models of Chemical Reactions: Theory and Applications of Deterministic and Stochastic Models. Princeton University Press; Princeton, NJ: 1989.
- Ethier S, Kurtz T. Markov Processes. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons Inc.; New York, NY: 1986.
- Fay L, Balogh A. Determination of reaction order and rate constants on the basis of the parameter estimation of differential equations. Acta Chim. Acad. Sci. Hun. 1968;57(4):391–401.
- Himmeblau D, Jones C, Bischoff K. Determination of rate constants for complex kinetic models. Ind. Eng. Chem. Fundam. 1967;6(4):539–543.
- Hosten L. A comparative study of short cut procedures for parameter estimation in differential equations. Comput. Chem. Eng. 1979;3:117–126.
- Huggins P, Yoshida R. First steps toward the geometry of cophylogeny. arXiv:0809.1908.
- Karnaukhov A, Karnaukhova E, Williamson J. Numerical matrices method for nonlinear system identification and description of dynamics of biochemical reaction networks. Biophys. J. 2007;92:3459–3473.
- Margolini A, Califano A. Theory and limitations of genetic networks inference from microarray data. Ann. N.Y. Acad. Sci. 2007;1115:51–72.
- Maria G. A review of algorithms and trends in kinetic model identification for chemical and biochemical systems. Chem. Biochem. Eng. Q. 2004;18(3):195–222.
- Pachter L, Sturmfels B. Algebraic Statistics for Computational Biology. Cambridge University Press; USA: 2005.
- Rempala G, Ramos K, Kalbfleisch T. A stochastic model of gene transcription: An application to l1 retrotransposition events. J. Theor. Biol. 2006;242(1):101–116.
- Rudakov E. Differential methods of determination of rate constants of non-complicated chemical reactions. Kinet. Catal. 1960;1:177–187.
- Rudakov E. Determination of rate constants. method of support function. Kinet. Catal. 1970;11:228–234.
- Schuster S, Hilgetag C, Woods J, Fell D. Reaction routes in biochemical reaction systems: algebraic properties, validated calculation procedure and example from nucleotide metabolism. J. Math. Biol. 2002;45:153–181.
- Richardson T, Spirtes P. Ancestral graph Markov models. Ann. Statist. 2002;30:962–1030.
- Vajda S, Valko P, Yermakova A. A direct-indirect procedure for estimating kinetic parameters. Computers Chem. Eng. 1986;10(1):49–58.
