Codon optimization formulation

To investigate the relative importance of ICU and CC in designing sequences for high protein expression, we implemented three computational procedures: the individual codon usage optimization (ICO) method generates a sequence with optimal ICU only; the codon context optimization (CCO) method optimizes sequences with regard to codon context only; and the multi-objective codon optimization (MOCO) method simultaneously considers both ICU and CC. A resultant sequence is ICU-/CC-optimal when its ICU/CC distribution is closest to the organism's reference ICU/CC distribution calculated from the sequences of native high-expression genes. Based on the mathematical formulation presented in Methods, the ICO problem can be described as the maximization of ICU fitness, *Ψ*_{ICU} (see Eqn. 23), subject to the constraint that the codon sequence can be translated into the target protein (see Eqns. 3, 4 and 11). Due to the discrete codon variables and the nonlinear fitness expression of *Ψ*_{ICU}, ICO is classified as a mixed-integer nonlinear programming (MINLP) problem. Nonetheless, it can be linearized using a strategy shown in an earlier study by decomposing the nonlinear |*p*_{0}^{k}−*p*_{1}^{k}| term (see Eqn. 23) into a series of linear and integer constraints consisting of binary and positive real variables [24]. The resultant mixed-integer linear programming (MILP) problem can be solved using well-established computational methods such as branch-and-bound and branch-and-cut [25]. However, due to the large, discrete search space containing all possible DNA sequences that can encode the target protein, solving the MILP using these methods may require a long computational time. Thus, alternative methods, such as GASCO [26] and QPSOBT [27], have been proposed for solving ICO using a genetic algorithm and particle swarm optimization, respectively. Although these heuristic methods are more efficient than conventional MILP solving procedures, they still require a significant amount of computational resources due to the iterative nature of the algorithms. To circumvent the high computational costs, we developed a non-iterative method for solving ICO using the following steps:

I1. Calculate the host's individual codon usage distribution, *p*_{0}^{k}.

I2. Calculate the subject's amino acid counts, *θ*_{AA,1}^{j}.

I3. Calculate the optimal codon counts for the subject using the expression: *θ*_{C,opt}^{k} = round(*p*_{0}^{k} × *θ*_{AA,1}^{j}), where amino acid *j* is the one encoded by codon *κ*^{k}.

I4. For each *τ*_{i} in the subject's sequence, randomly assign a codon *κ*^{k} if *θ*_{C,opt}^{k} > 0, and decrement *θ*_{C,opt}^{k} by one.

I5. Repeat step I4 for all amino acids of the target protein from *τ*_{1,1} to *τ*_{n,1}.
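The non-iterative assignment in steps I1–I5 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the reference ICU table is a hypothetical three-amino-acid subset, and the rounding adjustment is an assumption about how fractional counts are reconciled.

```python
import random
from collections import Counter

# Hypothetical host ICU reference (p0^k): per amino acid, codon -> usage fraction.
HOST_ICU = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "R": {"CGT": 0.60, "CGC": 0.40},
}

def ico_assign(protein, host_icu=HOST_ICU, seed=0):
    """Non-iterative ICO: pre-compute optimal codon counts, then assign randomly."""
    rng = random.Random(seed)
    aa_counts = Counter(protein)                      # step I2: theta_AA counts
    budget = {}                                       # step I3: theta_C,opt counts
    for aa, n in aa_counts.items():
        fractions = host_icu[aa]
        counts = {c: int(round(p * n)) for c, p in fractions.items()}
        # Rounding may not sum to n; adjust the most-used codon to compensate.
        top = max(fractions, key=fractions.get)
        counts[top] += n - sum(counts.values())
        budget[aa] = counts
    seq = []
    for aa in protein:                                # steps I4-I5: walk the protein
        choices = [c for c, k in budget[aa].items() if k > 0]
        codon = rng.choice(choices)                   # random pick among remaining
        budget[aa][codon] -= 1                        # decrement theta_C,opt
        seq.append(codon)
    return "".join(seq)
```

Because the codon budget is fixed up front, the resulting sequence matches the reference ICU as closely as rounding allows, with no iterative search.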

Similarly, CCO can be formulated as the maximization of CC fitness, *Ψ*_{CC} (see Eqn. 26), subject to the constraint that the codon pair sequence can be translated into the target protein (see Eqns. 7, 8 and 12). The ICO procedure is not applicable to CCO because of the computational complexity arising from the dependency between adjacent codon pairs. For example, given a codon pair "AUG-AGA" in the 5'-3' direction, the following codon pair must start with "AGA". Therefore, if we had adopted the ICO procedure to directly identify the codon pairs and randomly assign them to the respective amino acid pairs, there could be conflicting codon pair assignments in certain parts of the sequence. Since the independence property that was exploited to develop a simple solution procedure for ICO is absent in the CCO problem, we resort to a more sophisticated computational approach.

The CCO problem can be conceptualized in a similar way to the well-known traveling salesman problem, whereby traversing from one codon to the next adjacent codon is analogous to the salesman traveling from one city to the next [28]. Since a "cost" is incurred by taking a particular "codon path", the CCO problem aims to minimize the total cost of traveling a codon path that encodes the desired protein sequence. However, the CCO problem is more complex than the traveling salesman problem due to the nonlinear cost function, which is evaluated based on the frequency of codon pair occurrence (see Materials and Methods). For an average-sized protein consisting of 300 amino acids, the total number of codon paths can be as large as 10^{100}. Finding an optimal solution for such a large-scale combinatorial problem within an acceptable computation time can only be achieved via heuristic optimization methods. The genetic algorithm [29] provides an intuitive framework whereby codon path candidates are "evolved" towards optimal CC through techniques mimicking natural evolutionary processes such as selection, crossover (recombination) and mutation. Thus, the procedure for solving CCO is as follows:

C1. Randomly initialize a population of coding sequences for target protein.

C2. Evaluate the CC fitness of each sequence in the population.

C3. Rank the sequences by CC fitness and check termination criterion.

C4. If the termination criterion is not satisfied, select the "fittest" sequences (top 50% of the population) as the parents for creation of offspring via recombination and mutation.

C5. Combine the parents and offspring to form a new population.

C6. Repeat steps C2 to C5 until termination criterion is satisfied.
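Steps C1–C6 form a standard elitist GA loop over codon lists. The sketch below is illustrative only: the synonymous-codon table covers just two amino acids, and `toy_cc_fitness` is a stand-in for the real *Ψ*_{CC} of Eqn. 26 that simply rewards one assumed "preferred" codon pair.

```python
import random

# Assumed synonymous-codon table (subset for illustration).
SYN = {"K": ["AAA", "AAG"], "N": ["AAT", "AAC"]}

def toy_cc_fitness(seq):
    # Stand-in for Psi_CC: count occurrences of the (assumed) preferred pair.
    return sum(1 for a, b in zip(seq, seq[1:]) if (a, b) == ("AAG", "AAC"))

def cco_ga(protein, pop_size=20, generations=200, seed=0):
    rng = random.Random(seed)
    # C1: random initial population of codon lists encoding the protein.
    pop = [[rng.choice(SYN[aa]) for aa in protein] for _ in range(pop_size)]
    for _ in range(generations):
        # C2-C3: evaluate and rank by CC fitness (convergence check omitted).
        pop.sort(key=toy_cc_fitness, reverse=True)
        parents = pop[: pop_size // 2]               # C4: elitist top 50%
        offspring = []
        for _ in range(len(parents)):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, len(protein))     # crossover at codon boundary
            child = p1[:cut] + p2[cut:]
            i = rng.randrange(len(protein))          # synonymous point mutation
            child[i] = rng.choice(SYN[protein[i]])
            offspring.append(child)
        pop = parents + offspring                    # C5 combine; C6 loop
    return max(pop, key=toy_cc_fitness)
```

Because crossover points fall on codon boundaries and mutations draw from the synonymous table, every individual in every generation still encodes the target protein.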

In step C3, the termination criterion depends on the degree of improvement in the best CC fitness values across consecutive generations of the genetic algorithm. If the improvement in CC fitness across many generations is not significant, the algorithm is said to have converged. In this study, the CC optimization algorithm is set to terminate when there is less than a 0.5% increase in CC fitness across 100 generations, i.e. *Ψ*_{CC}^{(r+100)}/*Ψ*_{CC}^{(r)} < 1.005, where *r* refers to the *r*^{th} generation of the genetic algorithm. When the termination criterion is not satisfied, the subsequent step C4 performs an elitist selection such that the fittest 50% of the population are always selected for reproduction of offspring through recombination and mutation. During recombination, a pair of parents is chosen at random and a crossover is carried out at a randomly selected position in the parents' sequences to create two new individuals as offspring. The offspring subsequently undergo a random point mutation before they are combined with the parents to form the new generation.

Unlike traditional implementations of the genetic algorithm, where individuals in the population are represented as 0–1 bit strings, the presented CC optimization algorithm represents each individual as a sequential list of character triplets denoting the respective codons. The codons can therefore be manipulated directly with reference to a hash table which defines the synonymous codons for each amino acid. As a result, the protein encoded by the coding sequences is invariant throughout the genetic algorithm, since crossovers only occur at the boundaries of the codon triplets and mutation is always performed with reference to the hash table of synonymous codons for the respective amino acid.

Based on the formulations for ICU and CC optimization, the MOCO problem, which integrates both, can be described as maximizing both ICU and CC fitness, i.e. max(*Ψ*_{ICU}, *Ψ*_{CC}), subject to the constraints that both the codon and codon pair sequences can be translated into the target protein sequence. Due to the complexity attributed to CC optimization, the solution to MOCO also requires a heuristic method. In this case, the nondominated sorting genetic algorithm-II (NSGA-II) is used to solve the multi-objective optimization problem [30]. The procedure for NSGA-II is similar to that presented for CC optimization except for the additional steps required to identify the nondominated solution sets and to rank these sets to identify the Pareto optimal front. The NSGA-II procedure for solving the MOCO problem is as follows:

M1. Randomly initialize a population of coding sequences for target protein.

M2. Evaluate ICU and CC fitness of each sequence in the population.

M3. Group the sequences into nondominated sets and rank the sets.

M4. Check termination criterion.

M5. If the termination criterion is not satisfied, select the "fittest" sequences (top 50% of the population) as the parents for creation of offspring via recombination and mutation.

M6. Combine the parents and offspring to form a new population.

M7. Repeat steps M2 to M6 until the termination criterion is satisfied.

The identification and ranking of nondominated sets in step M3 is performed via pair-wise comparison of the sequences' ICU and CC fitness. For a given pair of sequences with fitness values expressed as (*Ψ*_{ICU}^{1}, *Ψ*_{CC}^{1}) and (*Ψ*_{ICU}^{2}, *Ψ*_{CC}^{2}), the domination status can be evaluated using the following rules:

• If (*Ψ*_{ICU}^{1} > *Ψ*_{ICU}^{2}) and (*Ψ*_{CC}^{1} ≥ *Ψ*_{CC}^{2}), sequence 1 dominates sequence 2.

• If (*Ψ*_{ICU}^{1} ≥ *Ψ*_{ICU}^{2}) and (*Ψ*_{CC}^{1} > *Ψ*_{CC}^{2}), sequence 1 dominates sequence 2.

• If (*Ψ*_{ICU}^{1} < *Ψ*_{ICU}^{2}) and (*Ψ*_{CC}^{1} ≤ *Ψ*_{CC}^{2}), sequence 2 dominates sequence 1.

• If (*Ψ*_{ICU}^{1} ≤ *Ψ*_{ICU}^{2}) and (*Ψ*_{CC}^{1} < *Ψ*_{CC}^{2}), sequence 2 dominates sequence 1.
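The four rules amount to the standard Pareto-dominance test: one sequence dominates another when it is at least as good in both objectives and strictly better in one. A minimal sketch (the function name is illustrative, not from the original tool):

```python
def dominates(fitness_a, fitness_b):
    """Return True if solution a Pareto-dominates solution b.

    fitness_a and fitness_b are (psi_icu, psi_cc) tuples; a dominates b
    when it is >= b in every objective and > b in at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(fitness_a, fitness_b))
    strictly_better = any(x > y for x, y in zip(fitness_a, fitness_b))
    return at_least_as_good and strictly_better
```

Note that two sequences can be mutually nondominated (each better in one objective), which is what places them in the same nondominated set.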

Whenever a particular sequence is found to be dominated by another sequence, the domination rank of the former sequence is lowered. As such, the grouping and sorting of the nondominated sets are performed simultaneously in step M3 (Figure ). In the original nondominated sorting algorithm [30], the set of individuals dominated by each individual is stored in memory; for a total population of *n*, the total storage requirement is therefore *O*(*n*^{2}). For the abovementioned algorithm, however, only *O*(*n*) storage is required to store the domination value of each individual. In terms of computational complexity, both the original and the modified algorithm require at most *O*(*mn*^{2}) computations for *m* objective values, since all *n* individuals have to be compared pair-wise for every objective to be optimized. Therefore, the nondominated sorting algorithm presented in this study is superior on the whole, especially with regard to the storage requirement, which can become an important issue when dealing with long coding sequences.

The output of multi-objective optimization is a set of solutions known as the Pareto optimal front. Since the aim of MOCO is to examine the relative effects of ICU and CC optimization, it is not necessary to analyze all the sequences in the Pareto optimal front. Instead, the solution nearest to the ideal point represents the sequence with balanced ICU and CC optimality. As such, the solutions of ICO, CCO and MOCO will subsequently be referred to as *x*_{ICO}, *x*_{CCO} and *x*_{MOCO}, respectively (Figure ). The program for performing ICO, CCO and MOCO can be downloaded from the following link: http://bioinfo.bti.a-star.edu.sg/tool/CodonOptimization/.
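One common way to pick the balanced solution is to take the point on the front with the smallest Euclidean distance to the ideal point (the per-objective maxima over the front). The sketch below assumes this distance metric, which the text does not specify:

```python
import math

def nearest_to_ideal(front):
    """front: list of (psi_icu, psi_cc) tuples on the Pareto front.

    The ideal point combines the best value of each objective over the
    front; the returned solution minimizes Euclidean distance to it.
    """
    ideal = (max(p[0] for p in front), max(p[1] for p in front))
    return min(front, key=lambda p: math.dist(p, ideal))
```

For example, on the front [(1.0, 0.2), (0.8, 0.6), (0.3, 0.9)] the ideal point is (1.0, 0.9), and the middle solution is selected as the balanced compromise.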

Finding the codon preference

The entire workflow for codon optimization of a target protein sequence begins with the identification of the host's preferred ICU and CC distributions as the reference (Figure ). These ICU and CC distributions should ideally capture codon usage patterns that correspond to efficient translation of mRNA to protein. Therefore, the first step of codon optimization identifies the reference ICU and CC distributions by characterizing the underlying mechanisms of efficient translation, which can be achieved through transcriptome, translatome and proteome profiling as demonstrated in earlier studies [31,32]. However, such large-scale experimental data are not readily available for the extraction of codon usage preference information in all the expression hosts considered in this study. Alternatively, it is assumed that organisms have evolved to conserve resources by producing high amounts of transcripts for genes that are also efficiently translated. As such, the widely available transcriptome data from microarray experiments can be used to identify the highly expressed and efficiently translated genes. Thus, the codon pattern of the host's native high-expression genes is a suitable reference point for codon optimization.

The step of selecting the codon pattern of high-expression genes for codon optimization is only relevant if the following two conditions are true: (1) the ICU and CC distributions of high-expression genes are significantly biased and nonrandom; and (2) there is a significant difference in ICU and CC distributions between highly expressed genes and all the genes in the host organism's genome. If the first condition is false, there is no codon (pair) bias and codons can be assigned randomly based on a uniform distribution; if the second condition is false, the computation of ICU and CC distributions based on all the genes in the genome will be sufficient to characterize the ICU and CC preference of the organism, without the need for selecting high-expression genes.

To determine the significance of ICU and CC biases, we applied Pearson's chi-squared test (see Materials and Methods). Using a p-value cut-off of 0.05, the ICU and CC distributions of at least 80% of the amino acids (pairs) amenable to the chi-squared test were found to be significantly biased in the micro-organisms (Table ). In the high-expression genes, aspartate was the only amino acid exhibiting an ICU distribution that is not significantly different from the unbiased distribution for *E. coli*, *P. pastoris* and *S. cerevisiae*. Similarly, more than 80% of the amino acids (pairs) show a significant difference in ICU and CC distributions between high-expression genes and all genes in the genomes of these three microbes. In contrast, 80% of the amino acids did not show a significant difference in CC distributions between high-expression genes and all genes in *L. lactis*, suggesting that the selection of highly expressed genes may not be required to establish the CC preference of *L. lactis*. By applying principal component analysis, we can observe that the ICU and CC distributions for all types of genes in *L. lactis* are close to one another when compared to genes from other organisms (Figure ). This indicates that the shortlisting of highly expressed genes may not be necessary for organisms like *L. lactis*. Nonetheless, we recommend the identification of high-expression genes to characterize the ICU and CC preference of any host, such that there is a better level of confidence that the optimized recombinant gene can be efficiently expressed.
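For a two-codon amino acid, the chi-squared test against an unbiased (uniform) codon distribution has a closed form, since the survival function for one degree of freedom reduces to erfc(√(χ²/2)). A minimal stdlib sketch with hypothetical codon counts (the 0.05 cut-off is the one used in the text):

```python
import math

def icu_bias_pvalue(observed):
    """Pearson chi-squared test of two codon counts against a uniform
    (unbiased) distribution. Returns (statistic, p-value)."""
    assert len(observed) == 2, "closed-form p-value shown for 1 d.o.f. only"
    expected = sum(observed) / len(observed)
    stat = sum((o - expected) ** 2 / expected for o in observed)
    # For chi-squared with 1 degree of freedom: P(X > x) = erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts of AAA vs AAG for lysine in high-expression genes.
stat, p = icu_bias_pvalue([320, 180])
is_biased = p < 0.05
```

Amino acids with more than two synonymous codons need the general chi-squared survival function (e.g. from a statistics library) with k−1 degrees of freedom.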

**Table 1.** ICU and CC bias analysis

Performance of codon optimization methods

The performance of each optimization approach was evaluated using a leave-one-out cross-validation, where a gene is randomly selected from the entire set of high-expression genes for sequence optimization while the remaining genes are used as the training set to calculate the reference ICU and CC distributions (Figure ). The predicted optimum sequences are compared with the original native sequences to evaluate the performance of each codon optimization approach (see Additional file 1 for the sequences of the wild-type and optimized genes). As the degree of similarity to the wild-type high-expression genes indicates the gene expressivity potential of the optimized sequences, the quality of each optimized sequence was measured as the percentage of codons matching the corresponding native sequence, denoted by *P*_{M}. From the results, the *x*_{ICO}, *x*_{CCO} and *x*_{MOCO} solutions were generally found to be more similar to the native genes than the random sequences generated by RCA, indicating that all the optimization approaches are indeed capable of improving the codon usage pattern compared to the control (Figure ). The *P*_{M} values of the *x*_{ICO}, *x*_{CCO}, *x*_{MOCO} and *x*_{RCA} sequences for each gene were further compared in a "tournament" style to show the relative performance of each optimization method. In the tournament matrix (Table ), each cell shows the number of wins by the method in the left-most column against that in the top row. Whenever the numbers of wins and losses (i.e. cells diagonally opposite each other) do not sum to 100, the shortfall equals the number of draws.
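The *P*_{M} metric can be computed directly from two equal-length coding sequences by comparing codon positions; a minimal sketch (the function name is illustrative):

```python
def percent_codon_match(optimized, native):
    """P_M: percentage of codon positions where the optimized sequence
    uses the same codon as the native high-expression gene."""
    assert len(optimized) == len(native) and len(native) % 3 == 0
    starts = range(0, len(native), 3)                 # codon start positions
    matches = sum(optimized[i:i + 3] == native[i:i + 3] for i in starts)
    return 100.0 * matches / (len(native) // 3)
```

For instance, comparing "ATG AAA AGA" against native "ATG AAG AGA" gives two matching codons out of three, i.e. *P*_{M} ≈ 66.7%.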

Through the comparison of ICO and CCO, the *x*_{CCO} solutions have a higher average percentage of codon matches than the *x*_{ICO} sequences for all four microbes (Figure ), with at least 90% of the *x*_{CCO} sequences matching the corresponding native sequences better than those generated by ICO (Table ). This result indicates that CC fitness can be a more important design parameter for sequence optimization than ICU fitness, which has been the conventional design criterion implemented in several software tools. While it appears likely that the integration of CCO with ICO under a multi-objective optimization framework could lead to even better sequence design, results from our MOCO analysis suggest otherwise. The average *P*_{M} value of *x*_{MOCO} was observed to be lower than that of *x*_{CCO}, indicating that considering ICU fitness in addition to CC fitness can be detrimental to the sequence design. To the best of our knowledge, no such formal evaluation of the relative impact of ICU and CC fitness on synthetic gene design has been presented to date. Hence, based on the promising *in silico* validation results, which implicate CC as an important design parameter for optimizing sequences, the newly developed CCO procedure can potentially supersede the ICU optimization techniques currently implemented in gene design software tools. It is noted that similar observations on the relative performance of ICO, CCO and MOCO were made when we performed the *in silico* leave-one-out cross-validation on the set of 27 high-expression genes of *E. coli* reported in an earlier study [33]. Details of this analysis can be found in Additional file 2.