In a number of separate studies, genome-scale flux–balance analysis (FBA) modeling has been shown to be useful for the
in silico design of engineered strains of microbes that overproduce diverse targets. These engineered strains include
Escherichia coli that overproduce lycopene (
Alper et al, 2005a,
2005b), lactic acid (
Fong et al, 2005), succinic acid (
Lee et al, 2005;
Wang et al, 2006),
L-valine (
Park et al, 2007), and
L-threonine (
Lee et al, 2007), and strains of
Saccharomyces cerevisiae that overproduce ethanol (
Bro et al, 2006;
Hjersted et al, 2007). FBA models allow the result of various genetic manipulations strategies to be predicted. As a result, the space of possible genetic manipulations can be computationally searched for the strategy that results in the desired metabolic network state. This space is vast, and algorithms must be designed to search the space efficiently.
Transforming bi-level optimization of FBA models to single-level mixed-integer linear programming (MILP) problems (
Burgard et al, 2003;
Pharkya et al, 2004;
Pharkya and Maranas, 2006) has resulted in computational methods that efficiently search the space of genetic manipulations. This approach is much more efficient than exhaustive, brute-force search, but it is nevertheless very computationally intensive. The runtimes scale exponentially as the number of manipulations allowed in the final design increases. For large models—such as the latest genome-scale model of
E. coli K-12 MG1655 (
Feist et al, 2007),
iAF1260—we have found that this runtime generally proves prohibitive for designs involving more than a few manipulations. Given that useful metabolically engineered strains often require many genetic manipulations (such as the artemisinic acid-producing strain of
S. cerevisiae by
Ro et al (2006), which required the addition of three genes and the up- or downregulation of four genes) and that the number of reactions, metabolites, and genes in metabolic models continues to grow (
Feist and Palsson, 2008), it is clear that more efficient computational search techniques are required for effective
in silico design.
We present a heuristic algorithmic method, which we call Genetic Design through Local Search (GDLS), that is capable of handling large models and allows for a much larger number of genetic manipulations in the final design, with runtime scaling only linearly with the total number of manipulations ‘
T'. GDLS employs a local search approach with multiple search paths (
Anderson, 1989;
Pudil et al, 1994;
Bertsimas and Weismantel, 2005) to find a set of locally optimal strategies. In brief, GDLS functions as follows. It begins by taking a certain strategy as a starting point, which can be a user-defined set of manipulations, a randomly chosen set of genetic manipulations, or no manipulations. The GDLS method then uses an MILP approach to search for the best strategies that differ from the starting point by, at most, ‘
k' additional manipulations. The population of strategies GDLS maintains is limited to a maximum size ‘
M', which is also the number of unique search paths maintained. These
M best strategies are then used as the starting point for the next round of MILP search, resulting in a new set of
M best strategies, each of which differ from the one of the starting
M best strategies by, at most,
k additional manipulations. This procedure is iterated until no better strategy can be found. Through this search method, GDLS is able to find strategies in which the total number of manipulations
T is large using computationally feasible increments (determined by
k and
M). As a result, GDLS can discover strategies that have a larger value of
T and of the desired flux than can feasibly be found by a global search.
A heuristic, sequential approach to finding genetic design strategies was also employed by
Alper et al (2005a). The approach of Alper
et al can be considered as a special case of the approach of GDLS, in which we take no genetic manipulations as the starting point and, at each iteration, search by brute-force for the best strategy that involves only one additional genetic manipulation. GDLS combines the strengths of such a sequential approach with those of the bi-level FBA approach previously mentioned (
Burgard et al, 2003;
Pharkya et al, 2004;
Pharkya and Maranas, 2006), allowing for a graceful trade-off between the good complexity properties of the former and the good optimality properties of the latter.
Other heuristic approaches include the application of meta-heuristics such as evolutionary algorithms and simulated annealing, both of which are explored in OptFlux (
Patil et al, 2005;
Rocha et al, 2008). These meta-heuristics apply random local perturbations to a solution or a population of solutions and propagate those with better performance. Although such meta-heuristics can perform well on difficult global optimization problems, we find that, for genetic design, it is frequently the case that a number of manipulations have to be performed simultaneously to have an effect, whereas each carried out on its own has no effect. In such a case, if one of the manipulations is randomly chosen on its own, then it is no more likely to be propagated, as it has no effect, and the probability of choosing all the manipulations simultaneously in a random perturbation is very small. Thus, it is difficult for these meta-heuristics to identify such manipulations. GDLS overcomes this problem by systematically searching the local neighborhood. We compare the performance of GDLS with the evolutionary algorithm and simulated annealing approaches implemented in OptFlux and, using the meta-heuristics, we indeed are able only to find solutions that compare with those found by GDLS when
k=1.
In this paper, we limit our consideration of bi-level FBA frameworks to one for gene knockouts for simplicity, but any bi-level FBA framework that can be transformed to an MILP, such as one for up- or downregulation of genes using the approach of OptReg (
Pharkya and Maranas, 2006), is compatible. We do not consider FBA extensions such as energy–balance analysis (EBA) (
Beard et al, 2004;
Yang et al, 2005) or thermodynamics-based metabolic flux analysis (TMFA) (
Henry et al, 2007), which account for thermodynamic constraints, or multi-objective optimization (
Sendín et al, 2009), which allows the simultaneous optimization of multiple cellular functions to be studied. These extensions are important for accurate prediction of flux in eukaryotic cells, but are less so for prokaryotic cells (
Vo et al, 2004;
Nagrath et al, 2007), and developing computational methods for genetic design that incorporate such extensions is likely to be an important next step. Multi-objective optimization can also be used as an alternative to bi-level optimization to find genetic design strategies (
Xu et al, 2009), but it is more suggestive, than prescriptive, and has received relatively little attention to date.
Beyond the local search approach, GDLS implements a number of reductions that decrease the size of the FBA model without changing its properties. In addition, to predict network changes from genetic level manipulations, GDLS employs gene–protein reaction (GPR) mappings, which provide a complete many-to-many mapping between genes and the reactions that depend on them. These relationships can be complex, making it difficult to realize the effects of genetic manipulations without computational analysis. Compared with the common approach of allowing direct manipulation of the reaction network, employing GPR mappings not only models more accurately the way in which engineering is carried out, but also potentially reduces the manipulation search space, which reduces search complexity and hence runtime.