Metabolism is the process of synthesis and degradation of molecules occurring in living organisms. Metabolism is generally represented as a network where metabolites are interconnected by reactions. In order to give a functional description of metabolism, metabolic networks are often decomposed into separated parts, called metabolic pathways. The description of metabolism through metabolic pathways is useful, even though any division in pathways is arbitrary, because it helps in modeling and understanding the behavior of the full network. A metabolic pathway can be defined as a coherent set of enzyme-catalyzed biochemical reactions by which a living organism transforms a set of source compounds into a set of target compounds. By regulating enzyme and protein synthesis, living organisms can adapt to different environments. This model of metabolism as composed by independent metabolic pathways is simplistic, since pathways are nested and interdependent. In fact, metabolism is a complex system and pathways interact with each other.
The aim of the work presented here is to find all the viable sets of heterologous enzymes, which can produce a predefined target compound when added to the pool of endogenous enzymes of a given organism. Our method enables a metabolic engineer to find all heterologous metabolic pathways producing a target compound, for instance liquiritigenin (a plant secondary metabolite with therapeutic applications), from the endogenous metabolites of E. coli K-12, as shown in Figure .
As depicted in Figure , our problem can be formulated as searching for all possible heterologous pathways linking a target compound to the endogenous metabolites of an organism. To this purpose we provide software tools that enable the discovery of potential pathways producing a target chosen by the user [
1]. More precisely the user enters a target compound, a chassis organism, and our software tools return a ranked list of pathways (each list being composed of enzymes) to be engineered into the chassis organism. To achieve this task we have developed an approach composed of three steps. In the first step, using a retrosynthesis software, reactions producing a target compound are iteratively searched backwards until the set of needed precursors only contains source metabolites. This first step returns a retrosynthetic network connecting a target compound to the endogenous metabolites of an organism. There may be several pathways in the retrosynthetic network linking the source metabolites to the target compounds and there is thus a need to enumerate all the possibilities. Pathway enumeration is performed by in the second step. Once the pathways have been enumerated, we evaluate in the third step the possibility to insert each pathway and its associated heterologous enzymes in the host organism. This step consist of determining the catalytic efficacy of the enzymes, the toxicity of the products and the coproducts [
2], and the easiness of inserting the enzymes into the host. The efficiency of the pathways can then be further estimated by flux models for the cell metabolism such as flux balance analysis [
3].
We have already discuss elsewhere the first and third step [
1,
2], i.e. methods to generate retrosynthetic networks and methods to rank pathway efficiency. To apply these methods in the context of heterologous target production, we need a computationally fast method to enumerate all possible pathways. We address the enumeration problem in the current paper.
Different mathematical models that describe metabolism have been proposed (cf. [
4] for a review of the different models). We distinguish two main families of approaches: the ones computing steady states of the fluxes of reactions (one well-known application being the flux balance analysis) and the ones based only the topology of the network. Typically, steady states are studied and simulated by generating the flux space. Of particular interest are the extreme pathways and the elementary modes, they both represent the smallest (minimal) generating set of the flux space and they both are composed of independent non-decomposable pathways in the network [
4]. The differences between extreme pathways and elementary modes have already been discussed in details [
5] and these differences arise when dealing with reversible reactions. In the present paper we consider all reaction irreversible, and networks comprising reversible reactions are modeled by doubling each reversible reaction into a forward reaction and a reverse reaction. Algorithms have been developed to enumerate both extreme pathways [
6] and elementary modes [
7] and these algorithm are all variants of the double description method [
8], which enumerates all extreme rays of a polyhedral cone. The algorithms use as input a stoichiometric matrix (
S) representing the network (cf. [
3] for definition of stoichiometric matrix) and output sets of fluxes (
v) satisfying
Sv = 0. One notices that extreme pathways and elementary modes while representing pathways (to each flux verifying
Sv = 0 correspond a stoichiometrically balanced pathway) do not directly enumerate all pathways linking a source set to a target set of compounds. However as shown in the subsection "Enumerating pathways using the steady state approach" one can construct stoichiometric matrices where input fluxes are added to the set of source compounds and outgoing fluxes are associated to the target and heterologous coproducts such that the extreme pathways and elementary modes enumerated from these matrices do correspond to all pathways linking the source set to the target.
While as mentioned above, the problem of systematically enumerating pathways for heterologous production in chassis organisms has not yet been addressed, there are methods based on the steady state approach to search for heterologous pathways optimizing target productions [
9], and methods to search for shortest pathways between source and target sets of compounds [
10] and [
11]. All these methods are based on optimization and make use of integer linear programming. Precisely, the method of Pharkya
et al. [
9], is aimed at redesigning microbial chassis organisms through heterologous reaction addition and native reaction deletion for the overproduction of a target compound. The addition and deletion are parameterized using binary variables attached to each reaction. A mixed integer linear program (MILP) is then set up to maximize the target yield while minimizing the number of added reactions. The methods of de Figueiredo
et al. [
10], and Pey
et al. [
11] are both aimed at searching for the
k shortest pathways. In de Figueiredo
et al. [
10] the
k first shortest pathways are searched in entire metabolic networks, while in Pey
et al. [
11] the pathways are searched between a source metabolite and a target metabolite. Both methods solve the problem at steady state and search for fluxes,
v, verifying
Sv = 0, while minimizing the number of reactions turned on (using a binary variable). Aside from the fact that integer linear programs suffer from computational complexity (MILP is an NP-hard problem) all the above methods search for at most
k optimized (shortest) pathways and do not guarantee a full enumeration of the possibilities. In our methods the optimal pathways are computed in a post process by ranking the pathways that have been enumerated. Our approach allows one to decouple enumeration from optimization, and thus to plug any optimization criteria, including nonlinear functions and not only target yield or pathway length (cf. page 3 and Carbonell
et al. [
1] for a list of criteria entering our metabolic engineering optimization problem).
Aside from using extreme pathways and elementary modes, we also present in this paper a topological model which directly enumerates all the possible heterologous pathways linking target compounds to a source set of compounds. The main advantage of the topological approach compared to the stationary state approach is computational speed. Speed is in fact an important aspect when searching for the best pathways to engineer, as there are generally a combinatorial number of pathways between given source sets and target sets. As an illustration of this combinatorial complexity, the work of Hatzimanikatis et
al. [
12], which provides a list of 75,000 novel biochemical routes from chorismate to phenylalanine, and the work of Cho et
al. [
13], which enumerates 107,272 reaction routes to produce isobutanol.
There exist standard graph-based methods to search and eventually enumerate pathways in metabolic networks, but these methods including PathFinding [
14-
16] and Pathway Hunter Tool [
17] are computing pathways and shortest pathways in graphs instead of hypergraphs. The particularity of these techniques is that only main substrates and main products are taken into account when constructing pathways, and consequently these main compounds must be differentiated from the cofactors (i.e. co-substrates and co-products). In the work of Croes
et al. [
14,
15] cofactors are filtered out based on their connectivity in the network. Indeed, compounds highly connected such as ATP, NADP, or H
2O are cofactors of most reactions as they do not share carbon atoms with the products of the reactions. In a more recent work [
16], the main compounds in the pathways linking source metabolites to target metabolites are detected using the Kegg RPAIR annotation [
18,
19], which enables one to follow the fate of atoms when going from a set of substrates to a set of products. Another approach to search for main substrates and main product is the one developed with the Pathway Hunter Tool, which consists of mapping substrates to products using cheminformatics fingerprints. While all the above techniques are computationally efficient, their main shortcoming is that they are not able to encompass reactions when a main product is formed from two main substrates. There are plenty of such reactions in metabolic networks, consider for instance the formation of guanidinoacetate from arginine and glycine through a glycine amidinotransferase (EC 2.1.4.1), or the formation of glutathione from
γ-L-glutamyl-L-cysteine and glycine catalyzed by a glutathione synthase (EC 6.3.2.3). Recently, some of the above topological methods have been benchmarked against the integer linear programming technique mentioned above [
11] to search for the shortest pathways linking various compounds, the recovery ratio for a set of 40 predefined reference pathways could not reach 100% with the graph based approach, exemplifying the shortcoming of that approach.
As reviewed above, while there are methods and theoretical results to enumerate elementary modes or extreme pathways and graph based techniques to search for pathways in a given metabolic network, to the best of our knowledge there is no known methods to directly enumerate pathways in the context of metabolic engineering, that is, to enumerate all the pathways encompassing all substrates and products necessary and sufficient to produce a given set of target compounds from a given set of source compounds. In the present paper we address that specific problem and present two methods one based on elementary modes (steady state approach) and one based on a direct enumeration algorithm (topological approach). In order to address this problem we need, in addition, to consider the problem of determining supplement molecules, i.e. metabolites that the organism cannot synthesize, but which can be added to the growth media in order to increase the number of viable pathways; and bootstrap molecules, i.e. metabolites which are required fist in order to be produced [
20]. While in the general context of metabolic network analysis, finding elementary modes does not require to first search for bootstrap molecules, in the context of metabolic engineering however any heterologous pathway solution that comprises a compound that is first consumed before being produced is valid only when the compound is added to the growth medium. Therefore, in our study in the context of metabolic engineering, there is a need to first compute the bootstrap compounds prior to elementary modes.
The paper is divided as follows. In the Methods section we first provide some definitions, then outline our algorithms to solve the pathway enumeration problem with both the steady state approach and the topological approach. The problem of finding and enumerating all the pathways going from a large source (as for instance al the metabolites of an organism) to a target chosen by the user is considered. All the algorithms presented for the topological approach (with the exception of the algorithm for enumeration) have polynomial worst-case running time, the algorithm for enumeration is a polynomial time per output algorithm on some classes of hypergraphs. We also provide algorithms to determine supplements, which are metabolites that the organism cannot synthesize, but which can be added to the growth media in order to increase the number of viable pathways. Furthermore, an analysis of pathways containing supplements allows finding out pathways that contain bootstrap molecules. In the Results and Discussion section we illustrate our algorithms with the enumeration of the possible pathways to synthesize more than 5000 compounds in E. coli. We "experimentally" probe the computational complexity of the steady state and topological approaches for a series of networks of growing sizes and discuss the theoretical complexity results of the topological approach, which are provided in Appendices A and B.
While we illustrate our two methods for the production of heterologous compounds using as a source set all the endogenous metabolites of E. coli, our methods can be applied to any chassis organism and more generally to any source set (for instance a set of nutrients or a set of abundant currency metabolites).