The proposed method offers a systematic strategy to extend DFE and to ameliorate its limitations. Just like DFE, the proposed method starts with an optional data preprocessing step, but without any assumption regarding the functional formats of the fluxes in the system. First, the experimental data are tested for mass conservation to make sure no mass is lost or gained during the observed time period. If the data do indicate losses or gains in mass, it is useful to locate possible branches off the main pathway(s) and to account for the changes in total mass of the metabolites in the pathway [19
]. Second, the time series data are smoothed as necessary, which makes it easier to estimate the slopes of all time courses at a given number of time points, using different numerical techniques. These established smoothing methods include splines, artificial neural networks, as well as different types of filters, such as the popular Kalman, Savitzky-Golay, Whittaker, or Eilers filter (see [2
] for applicable methods). In parallel to these data preprocessing steps, the pathway structure (the system topology) is used to generate a system of symbolic equations describing the dynamics of the system. The generic format for such a representation may be written as
denotes the concentration or amount of a variable or variable pool and n
is the total number of time-dependent variables in the system. The functions
represent reactions or fluxes entering or leaving the quantity Xi
, respectively. Substituting slope estimates for the differentials in this system of equations decouples the ordinary differential equations (ODEs) and results in a system of fluxes that is linear at each time point t
]. The algebraic equations may be represented in matrix format as
where s is a vector of slopes, N is the stoichiometric matrix, v is a vector of fluxes, and K is the number of time points t1, t2,…, tj,…, tK where measurements are available.
Next we check the rank of the linear set of algebraic equations in Eq. (2). The system can be easily solved at each time step to obtain dynamic profiles of all fluxes if the system has full rank. Over-determined systems may be solved by pooling fluxes, the use of pseudo-inverse methods, or regression. However, if the system is underdetermined, the solution space is infinite. To overcome this issue, some of the fluxes need to be estimated independently, until the system has full rank and can be solved uniquely. Elsewhere we showed that additional information maybe used to characterize selected fluxes [19
]. Here, the goal is to estimate some fluxes directly from the time series data, without evoking other sources of information.
As an introductory example, consider a linear part of a pathway with feedback inhibition as shown in Figure . The equations that describe the system in terms of fluxes are
The system could be part of a larger pathway system, but for this illustration the context is not relevant. For the illustration, fluxes were generated with a mix of power-law and Hill functions, namely
5 and KM
2. We use these settings to create artificial data, but subsequently assume no knowledge of the functions or parameters in Eq. (4).
Suppose time series data were measured and they are without noise (Figure ). Eq. (3) immediately indicates that the flux system is underdetermined and therefore has infinitely many solutions. A unique solution could be obtained if one of the fluxes could be determined independently. To achieve this independent determination, one may choose any one of the three equations in the system. For ease of computation, one will typically prefer an equation that contains as few fluxes and as few substrates and modulators as possible. In this linear system, all equations have two fluxes and each of them depends on only one metabolite, so that there is no advantage to choosing one equation rather than another.
Generically, we intend to solve the fluxes in the ith equation, which here happens to have only two fluxes, namely one influx (vin) going into the pool Xi, and one efflux (vout) leaving this pool. The flux vin depends only on the precursor Xin of Xi and vout depends only on Xi itself; to minimize confusion, we call this variable generically Xout. Extracting the ith equation from Eq. (1), we thus obtain, in general terms,
The functional form of neither flux is assumed to be known. Substitution of derivatives with slopes results in K equations of the type
As a specific illustration, consider the second equation
in Eq. (3), where v2
depends only on the precursor X1
depends only on X2
. We substitute the derivative
with slopes that can be measured directly from the time series data, possibly upon smoothing. For a total of 50 time points, one thus obtains 50 algebraic equations of the type
It is reasonable to assume that the in- and effluxes are true functions in a mathematical sense. Thus, since vin depends only on Xin, vin must have one and only one value for every given value of Xin. In particular, if Xin assumes the same value at two different time points, vin must have the same (so far unknown) value at both time points as well. In the illustration example, v2 depends only on X1. Thus, for every value of X1 there is one and only one value of v2. The proposed method therefore requires a screening of the available datasets with the goal of identifying different situations where Xin has some fixed value Xin_const. For all these situations, vin also has some fixed value vin_const. Since we do not know the functional form of vin, we cannot directly compute this value vin_const. However, we do know that this value is very similar for all situations where Xin ≈ Xin_const. Thus, for the set of all Xin ≈ Xin_const, Eq. (6) has the form
In the illustrative example, we screen the available data sets and search for different situations where X1 has the same fixed value X1c and, thus, v2 also has the same (yet unknown) value v2c. Thus, for the entire set of all X1 ≈ X1c the second system equation has the form
For instance, X1 has similar values (~0.26) at time points 4, 4.8, 8.8, and 9.2, while X2 has different values at these time points (Figure ).
Figure 2 (a) FixingX1within a narrow range (~0.26), four instances ofX1are found (solid red circles). Fixing X1 within another narrow range (~0.6) provides three instances of X1 (solid orange circles). Similarly, two instances of X1 are found for X1 ~1.26 (solid (more ...)
We repeat this type of screening for different sets of the same or very similar values of Xin. The result is a set of sets with equal Xin_const values within each set but different Xin_const values for different sets. These sets form a histogram with a bin for each Xin_const. If the range of each bin is small enough, we can assume every Xin in the same bin to have very similar values, so that their corresponding vin_const are also very similar. Henceforth, we only retain bins with at least two entries. An example in the illustrative example consists of time points 3.4 and 9.6, where X1 has again similar values. In this case, the value is ~1.26, which is different from the value we screened before. Similarly, for time points 1, 8.4, and 9.4, X1 has a value of ~0.6 (Figure ). Figure shows many situations in the dataset where X1 has approximately some fixed value, and these sets of X1 are reflected in a “bin database of values.” Within each bin, the corresponding value of v2c is very similar as well.
Suppose we have identified P bins that contain at least two Xin. For these bins we determine the corresponding Xout values, which are typically different from each other. Suppose that bin p contains q values. Thus, we obtain q equations of the type
) always has the same value, but Si
) and vout
) have different values. For our illustration we specify nine bins (P
9) (which have at least two X1
(Figure ), and their corresponding values of X2
at the same time points are shown in Figure . The 6th
of the nine bins (shown as the orange bin in Figure ) contains three instances of X1
. Therefore, we obtain three equations of the type
Equation (10) is formulated analogously for each bin p =1, …, P. In each case, vout (binp) can be represented as at least two equations of the type
Since we do not know the functional form of vin
, we do not know the numerical value of vin_const
). However, since vin_const
) is a constant for each bin, the relative positions of a group of values of vout
) depend on each value –Si
) within a given bin, and the slope values can be measured directly from the time series data. In addition, since vout
) is solely determined by Xout
), we can characterize the relative positions of a set of Xout
) and their corresponding values –Si
). Collecting these relationships, we can establish a plot of Xout
) versus –Si
). If the bin contains only two points of Xout
, we consider them as a pair and link them with a connecting line. If the bin contains q
points of Xout
2), we sort Xout
based on their values and connect every two adjacent points as a pair to form a total of q
-1 pairs. In order to address these pairs, we use an additional index for the position of each pair in each bin, such as (Xout
)(1)) for the first point and (Xout
) (2), –Si
) (2)) for the second point, where r
1, …, q
To continue the illustration, the 8th
bin contains two instances of X1
~1.26. The corresponding values of X2
are 1.54 and 2.93, and the –S2
values are −1.35 and 0.93, respectively. The points in the plot of X2
) versus –S2
) are therefore represented as (1.54,–1.35) and (2.93, 0.93). We consider these two points as a pair and link them using a red line (Figure ). Similarly, the 5th
bin contains four instances of X1
~0.26. Their corresponding values of X2
are 1.20, 1.37, 1.66, and 1.99, and the –S2
values are 0.05, 0.35, 1.02, and 1.65, respectively. The points in the plot of X2
) versus –S2
) are therefore represented as (1.20, 0.05), (1.37, 0.35), (1.66, 1.02), and (1.99, 1.65). Two points each are considered a pair and linked with a red line (Figure ). After the pairs of points are determined, we prune the set by neglecting pairs where the distance between Xout
)(1) and Xout
)(2) is below some threshold
. The reason is that small line segments tend to lead to unduly high estimation errors. The default value for dr
is set as 0.2 in the examples shown in this article, but it will generally depend on the accuracy and quantity of the data. The higher the value is, the fewer pairs will remain after filtration. However, as long as the remaining pairs cover most of the spectrum in the X
axis, an increase in dr
might be preferable. Suppose s
pairs remain after this filtering. Figure shows the collective result for the illustration example.
Figure 3 (a) The 8thbin in Figure(b) contains two differentX2values corresponding to the “blue” instances in Figure(a) forX1~1.26. The corresponding values of X2 and –S2, obtained from the plot of X2 versus –S2, are (1.54, (more ...)
Figure 4 (a) Pairs of points satisfying a threshold value ofd(seeMethods) greater then 0.2. Seven pairs (s=7; connected by blue lines) are selected for the following steps. The green line is the true functional representation of X2 versus v3 (more ...)
Equation (12) indicates that Xout (binp) and –Si (binp) differ by a constant, since we do not know the value of vin_const (binp). This fact translates into a constant vertical shift in the y direction for each pair of points. In other words, the relative y positions of the pair are preserved and the pair has to be shifted together by a yet unknown amount. While we do not know the size of the shift for each individual pair of points, all points collectively represent the graph of Xout versus vout, and it is reasonable to assume that this graph is continuous and usually even monotonic. Therefore, the next step is to merge the individual pairs by determining a proper shift for each pair.
Intuitively, it is easy to see how to shift all pairs so that they are close to one continuous line. Automation of the process requires an algorithm that is not quite straightforward, but can be facilitated with a graphical user interface; technical details of a possible merging process are presented in Figure S1 of the Additional file 1
. A pseudo-code of the merging is the following:
SET each pair of points as a node
SET each node as a subgraph
WHILE the graph is not connected
each subgraph in the graph
each node in the current subgraph
END FOR FIND
the shortest distance and its corresponding nodes
these two nodes
When the merging is completed, all pairs of points are close to a relatively smooth line, but the overall shift of the group of pairs is not known. We do know that essentially all metabolic fluxes will have values close to zero when their substrate concentration approaches zero. Thus, if sufficiently small substrate values are available in one of the bins, one easily estimates a reasonable shift. Should the flux value associated with some substrate concentration be known, the shift can be determined from this information. A further alternative is the following. If the inferred trend line suggests that the flux follows some rate law, such as a Hill function, the parameters of this function, together with the appropriate shift, can be obtained in a single optimization step.
Figure shows, for the illustrative example, the process of merging and shifting. The human eye has no problem accomplishing this task intuitively. In the automated process (see Additional file 1
), one connects each “node” (pair of points) with its closest-neighbor node and positioning the pair of points. This process creates two sub-groups of points. We recalculate the distance between each node in a sub-group with the nodes in the other sub-group, determine the closest pair of nodes, connect them, and shift the corresponding pairs into one sub-group as shown in Figure . Suppose the value of v3
is known for X2
1. If so, we ultimately shift the entire trend accordingly. The result is shown in Figure . A shift based on the association between zero flux and zero substrate concentration is an alternative, although it does not uniquely prescribe a solution in this case.
Finally, based on the numerical or graphical flux profile thus determined, one may test candidate functions that capture the flux-substrate relationship. For instance, the result in the illustrative example shows that the functional relationship of X2vs. v3 is s-shaped. It could thus be consistent with the (true) Hill function in Eq. (4), although the computed result itself certainly would not prove that this format is correct. If one assumes, based on the results, that a Hill function is appropriate, one may fit this functional form to data to find the optimal parameter values of the flux-metabolite dependency. Without making such an assumption, one may alternatively connect the dots in Figure with a continuous line and interpolate the values of fluxes using a spline or another smoothing method. The resulting trend line can be used as a “look-up” plot.
Now that we have determined v3, it is easy to compute v2 from the measured slopes of X2. The plot of v3 is slightly curved, which is consistent with its power-law function in Eq. (4), although again, there is no proof. The Results section discusses further examples.
The parameters of any candidate functional form are easily estimated, because no differential equations are involved and the problem is of low dimension; they represent a fully parameterized kinetic model for the flux term itself and, subsequently for the differential equation. Due to this simplicity, it is even possible to scan a variety of candidate functions and assess their appropriateness. If a suitable functional format can be determined with appropriate parameter values, the task is completed. If not, one may represent the flux-substrate plot with a piecewise-polynomial function, such as cubic spline. Even in this non-explicit, numerical format, the result is sufficient to reduce one or two degrees of freedom in the overall DFE task. Figure presents the overall flow and concept of the method.
Figure 5 Flowchart of the proposed method. Starting with experimental time series, the data are smoothed and balanced for mass conservation, if necessary. The slopes of the time series at each time point are estimated. Combined with the knowledge of the system (more ...)
The procedure described above has generated one or two additional flux estimates. For the example in Eq. (3), the determination of v2 and v3 “fills” the rank, and the system can be uniquely solved. In fact, only one of the two is needed. For examples where one or two additional fluxes are not sufficient for a unique solution, the same procedure has to be performed with other equations until enough fluxes are determined to make the flux system full rank. DFE subsequently identifies all other fluxes as plots against time or against their substrates and modulators.
In cases where fluxes contain more than one variable, the time courses have to be screened for combinations where the contributing variables have the same values. The concepts of the procedure are exactly the same as for the univariate case, but the implementation is obviously more involved (see Additional file 1
). Also, such combinations are rarer than in the cases described above, so that these situations require more diverse datasets for structure identification.