Department of Chemical and Biomolecular Engineering, National University of Singapore, 4 Engineering Drive 4, Singapore, 117576, Singapore

NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, 28 Medical Drive, #05-01, Singapore, 117456, Singapore

Bioprocessing Technology Institute, Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01 Centros, Singapore, 138668, Singapore

Abstract

Background

The construction of customized nucleic acid sequences allows us to have greater flexibility in gene design for recombinant protein expression. Among the various parameters considered for such DNA sequence design, individual codon usage (ICU) has been implicated as one of the most crucial factors affecting mRNA translational efficiency. However, previous works have also reported the significant influence of codon pair usage, also known as codon context (CC), on the level of protein expression.

Results

In this study, we have developed novel computational procedures for evaluating the relative importance of optimizing ICU and CC for enhancing protein expression. By formulating appropriate mathematical expressions to quantify the ICU and CC fitness of a coding sequence, optimization procedures based on a genetic algorithm were employed to maximize its ICU and/or CC fitness. Surprisingly, the

Conclusions

The proposed CC optimization framework can complement and enhance the capabilities of current gene design tools, with potential applications to heterologous protein production and even vaccine development in synthetic biotechnology.

Background

Recent developments in artificial gene synthesis have enabled the construction of synthetic gene circuits

Specifically, the degeneracy of the genetic code, reflected by the use of sixty-four codons to encode twenty amino acids and the translation termination signal, means that all amino acids except methionine and tryptophan can be encoded by two to six synonymous codons. Notably, the synonymous codons are not used equally to encode the amino acids, resulting in the phenomenon of codon usage bias, which was first reported in a study that examined the frequencies of the 61 amino acid codons (i.e. termination codons excluded) in 90 genes

In this study, we applied novel computational procedures to generate DNA sequences exhibiting optimal ICU and CC in

Results

Codon optimization formulation

To investigate the relative importance of ICU and CC towards designing sequences for high protein expression, we implemented three computational procedures: the individual codon usage optimization (ICO) method generates a sequence with optimal ICU only; the codon context optimization (CCO) method optimizes sequences with regard to codon context only; and the multi-objective codon optimization (MOCO) method simultaneously considers both ICU and CC. Here, the resultant sequence is ICU-/CC-optimal when its ICU/CC distribution is closest to the organism’s reference ICU/CC distribution calculated from the sequences of native high-expression genes. Based on the mathematical formulation presented in Methods, the ICO problem can be described as the maximization of ICU fitness, $\Psi_{ICU}$ (see Eqn. 23), subject to the constraint that the codon sequence can be translated into the target protein (see Eqns. 3, 4 and 11). Due to the discrete codon variables and the nonlinear fitness expression of $\Psi_{ICU}$, ICO is classified as a mixed-integer nonlinear programming (MINLP) problem. Nonetheless, it can be linearized using a strategy shown in an earlier study by decomposing the nonlinear $|p_0^k - p_1^k|$ term (see Equation 23) into a series of linear and integer constraints which consist of binary and positive real variables

I1. Calculate the host’s individual codon usage distribution, $p_0^k$.

I2. Calculate the subject’s amino acid counts, $\theta_{AA,1}^j$.

I3. Calculate the optimal codon counts for the subject using the expression:

I4. For each $a_i$ in the subject’s sequence, randomly assign a codon $\kappa^k$ if $\theta_C^k > 0$, and decrement $\theta_{C,opt}^k$ by one.

I5. Repeat step I4 for all amino acids of the target protein from $a_{1,1}$ to $a_{n,1}$.
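Steps I1 to I5 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the toy `SYNONYMS` table, the function name `ico`, and in particular the rule used for step I3 (sharing each amino acid's occurrences among its synonymous codons in proportion to the host frequencies, with largest-remainder rounding) are assumptions made for this example.

```python
import random
from collections import Counter

# Hypothetical toy subset of the genetic code (amino acid -> synonymous codons).
SYNONYMS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
}

def ico(protein, host_freq, rng=None):
    """Assign codons so that codon counts track the host distribution (I1-I5).

    host_freq plays the role of the host ICU distribution from step I1.
    """
    rng = rng or random.Random(0)
    aa_counts = Counter(protein)                 # I2: amino acid counts
    budget = {}                                  # I3: target count per codon
    for aa, n in aa_counts.items():
        codons = SYNONYMS[aa]
        total = sum(host_freq[c] for c in codons)
        # Assumed rule: proportional share of the n occurrences, then give the
        # rounding remainder to the codons with the largest fractional parts.
        exact = {c: n * host_freq[c] / total for c in codons}
        counts = {c: int(exact[c]) for c in codons}
        leftover = n - sum(counts.values())
        by_fraction = sorted(codons, key=lambda c: exact[c] - counts[c], reverse=True)
        for c in by_fraction[:leftover]:
            counts[c] += 1
        budget.update(counts)
    sequence = []
    for aa in protein:                           # I4-I5: random assignment
        available = [c for c in SYNONYMS[aa] if budget[c] > 0]
        codon = rng.choice(available)
        budget[codon] -= 1                       # decrement the remaining count
        sequence.append(codon)
    return sequence
```

Because codon positions are filled independently, a single pass over the protein is sufficient; only the order of the codons is random, not their counts.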

Similarly, CCO can be formulated as the maximization of CC fitness, $\Psi_{CC}$ (see Eqn. 26), subject to the constraint that the codon pair sequence can be translated into the target protein (see Eqns. 7, 8 and 12). The procedure used for ICO is not applicable to CCO because adjacent codon pairs depend on each other, which adds computational complexity. For example, given a codon pair “AUG-AGA” in the 5’-3’ direction, the following codon pair must start with “AGA”. Therefore, if we had adopted the ICO procedure to directly identify the codon pairs and randomly assign them to the respective amino acid pairs, there could be conflicting codon pair assignments in certain parts of the sequence. Since the independence between positions, which was exploited to develop a simple solution procedure for ICO, is absent in the CCO problem, we resort to a more sophisticated computational approach.

The CCO problem can be conceptualized in a similar way as the well-known traveling salesman problem whereby the traversing from one codon to the next adjacent codon is analogous to the salesman traveling from one city to the next ^{100}. Finding an optimal solution for such a large-scale combinatorial problem within an acceptable period of computation time can only be achieved via heuristic optimization methods. Incidentally, the use of genetic algorithm

C1. Randomly initialize a population of coding sequences for the target protein.

C2. Evaluate the CC fitness of each sequence in the population.

C3. Rank the sequences by CC fitness and check the termination criterion.

C4. If the termination criterion is not satisfied, select the “fittest” sequences (top 50% of the population) as the parents for creation of offspring via recombination and mutation.

C5. Combine the parents and offspring to form a new population.

C6. Repeat steps C2 to C5 until the termination criterion is satisfied.

In step C3, the termination criterion depends on the degree of improvement in the best CC fitness values over consecutive generations of the genetic algorithm. If the improvement in CC fitness across many generations is not significant, the algorithm is said to have converged. In this study, the CC optimization algorithm is set to terminate when there is less than a 0.5% increase in CC fitness across 100 generations, i.e. $|\Psi_{CC}^{(r+100)} - \Psi_{CC}^{(r)}| / |\Psi_{CC}^{(r)}| < 0.005$, where $r$ denotes the $r$th generation of the genetic algorithm. When the termination criterion is not satisfied, the subsequent step C4 performs an elitist selection such that the fittest 50% of the population are always selected to produce offspring through recombination and mutation. During recombination, a pair of parents is chosen at random and a crossover is carried out at a randomly selected position in the parents’ sequences to create two new individuals as offspring. The offspring subsequently undergo a random point mutation before they are combined with the parents to form the new generation.

Unlike traditional implementations of the genetic algorithm, where individuals in the population are represented as 0–1 bit strings, the presented CC optimization algorithm represents each individual as a sequential list of character triplets indicating the respective codons. Therefore, the codons can be manipulated directly with reference to a hash table which defines the synonymous codons for each amino acid. As a result, the protein encoded by the coding sequences is always the same in the genetic algorithm, since crossovers only occur at the boundaries of the codon triplets and mutation is always performed with reference to the hash table of synonymous codons for the respective amino acid.
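The genetic algorithm in steps C1 to C6, with codon-level crossover and synonymous mutation, can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the toy `SYNONYMS` table, the population size, the generation cap `max_gen`, and the form of `cc_fitness` (negative mean absolute difference between host and subject codon-pair frequencies, averaged over the pairs seen) are all hypothetical choices, and the protein is assumed to have at least two residues.

```python
import random

# Hypothetical toy hash table of synonymous codons per amino acid.
SYNONYMS = {"M": ["ATG"], "K": ["AAA", "AAG"], "F": ["TTT", "TTC"], "L": ["CTG", "TTA"]}

def cc_fitness(codons, host_pair_freq):
    """Assumed CC fitness: negative mean absolute gap in codon-pair frequency."""
    pairs = list(zip(codons, codons[1:]))
    keys = set(pairs) | set(host_pair_freq)
    gap = sum(abs(host_pair_freq.get(k, 0.0) - pairs.count(k) / len(pairs)) for k in keys)
    return -gap / len(keys)

def cco(protein, host_pair_freq, pop_size=20, window=100, tol=0.005, max_gen=2000, seed=0):
    rng = random.Random(seed)
    score = lambda s: cc_fitness(s, host_pair_freq)
    # C1: random population; each individual is a list of codon triplets.
    population = [[rng.choice(SYNONYMS[aa]) for aa in protein] for _ in range(pop_size)]
    history = []
    for _ in range(max_gen):
        population.sort(key=score, reverse=True)          # C2-C3: evaluate and rank
        history.append(score(population[0]))
        if len(history) > window:
            old, new = history[-window - 1], history[-1]
            if abs(new - old) <= tol * abs(old):          # < 0.5% change in 100 generations
                break
        parents = population[: pop_size // 2]             # C4: elitist selection (top 50%)
        offspring = []
        while len(offspring) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, len(protein))          # crossover at a codon boundary
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                pos = rng.randrange(len(protein))         # synonymous point mutation
                child[pos] = rng.choice(SYNONYMS[protein[pos]])
                offspring.append(child)
        population = parents + offspring[: pop_size - len(parents)]   # C5
    return max(population, key=score)
```

Since both operators work on whole codons drawn from the synonym table, every individual translates to the same protein throughout the run, which is the property the paragraph above relies on.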

Based on the formulations for ICU and CC optimization, the MOCO problem, which is an integration of both, can be described as maximizing both ICU and CC fitness, i.e. $\max(\Psi_{ICU}, \Psi_{CC})$, subject to the constraints that both the codon and codon pair sequences can be translated into the target protein sequence. As such, due to the complexity attributed to CC optimization, the solution to MOCO also requires a heuristic method. In this case, the nondominated sorting genetic algorithm-II (NSGA-II) is used to solve the multi-objective optimization problem

M1. Randomly initialize a population of coding sequences for the target protein.

M2. Evaluate the ICU and CC fitness of each sequence in the population.

M3. Group the sequences into nondominated sets and rank the sets.

M4. Check the termination criterion.

M5. If the termination criterion is not satisfied, select the “fittest” sequences (top 50% of the population) as the parents for creation of offspring via recombination and mutation.

M6. Combine the parents and offspring to form a new population.

M7. Repeat steps M2 to M6 until the termination criterion is satisfied.

The identification and ranking of nondominated sets in step M3 is performed via pair-wise comparison of the sequences’ ICU and CC fitness. For a given pair of sequences with fitness values expressed as $(\Psi_{ICU}^{1}, \Psi_{CC}^{1})$ and $(\Psi_{ICU}^{2}, \Psi_{CC}^{2})$, the domination status can be evaluated using the following rules:

• If $(\Psi_{ICU}^{1} > \Psi_{ICU}^{2})$ and $(\Psi_{CC}^{1} \geq \Psi_{CC}^{2})$, sequence 1 dominates sequence 2.

• If $(\Psi_{ICU}^{1} \geq \Psi_{ICU}^{2})$ and $(\Psi_{CC}^{1} > \Psi_{CC}^{2})$, sequence 1 dominates sequence 2.

• If $(\Psi_{ICU}^{1} < \Psi_{ICU}^{2})$ and $(\Psi_{CC}^{1} \leq \Psi_{CC}^{2})$, sequence 2 dominates sequence 1.

• If $(\Psi_{ICU}^{1} \leq \Psi_{ICU}^{2})$ and $(\Psi_{CC}^{1} < \Psi_{CC}^{2})$, sequence 2 dominates sequence 1.
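The four domination rules above reduce to one check: a sequence dominates another if it is at least as fit in both objectives and strictly fitter in at least one. A minimal sketch follows; the function names are hypothetical, and the front-peeling loop is a simple illustration of ranking nondominated sets, not the NSGA-II bookkeeping used in the study.

```python
def dominates(f1, f2):
    """True if fitness pair f1 = (icu, cc) dominates f2 under the rules above."""
    no_worse = all(a >= b for a, b in zip(f1, f2))
    strictly_better = any(a > b for a, b in zip(f1, f2))
    return no_worse and strictly_better

def nondominated_rank(fitnesses):
    """Return a rank per sequence: 0 for the nondominated set, 1 for the next, and so on."""
    remaining = set(range(len(fitnesses)))
    ranks = [None] * len(fitnesses)
    level = 0
    while remaining:
        # A sequence belongs to the current front if nothing left dominates it.
        front = {
            i for i in remaining
            if not any(dominates(fitnesses[j], fitnesses[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = level
        remaining -= front
        level += 1
    return ranks
```

For example, with fitness pairs `(-0.1, -0.3)`, `(-0.3, -0.1)` and `(-0.2, -0.2)`, none dominates another, so all three share rank 0.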

Whenever a particular sequence is found to be dominated by another sequence, the domination rank of the former sequence is lowered. As such, the grouping and sorting of the nondominated sets are performed simultaneously in step M3 (Figure ^{2}). However, for the abovementioned algorithm, only ^{2}) computations for

**Multi-objective codon optimization solution.** The optimal solutions generated by MOCO lie on the Pareto front (region in yellow).

The output of multi-objective optimization is a set of solutions, also known as the Pareto-optimal front. Since the aim of MOCO is to examine the relative effects of ICU and CC optimization, it is not necessary to analyze all the sequences in the Pareto-optimal front. Instead, the solution nearest to the ideal point is taken to represent the sequence with balanced ICU and CC optimality. The solutions of ICO, CCO and MOCO will subsequently be referred to as $x_{ICO}$, $x_{CCO}$ and $x_{MOCO}$, respectively (Figure
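Selecting the Pareto solution nearest to the ideal point can be illustrated with a short sketch. Two assumptions are made here that the text does not specify: the ideal point is taken as the best ICU fitness and the best CC fitness found anywhere on the front, and "nearest" is measured by unscaled Euclidean distance. The function name is hypothetical.

```python
import math

def nearest_to_ideal(front):
    """Pick the (icu_fitness, cc_fitness) pair closest to the assumed ideal point."""
    # Assumed ideal point: best value of each objective over the front.
    ideal = (max(f[0] for f in front), max(f[1] for f in front))
    return min(front, key=lambda f: math.dist(f, ideal))
```

If the two fitness values differ widely in scale, normalizing each objective before measuring distance may be preferable; that choice is left open here.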

Finding the codon preference

The entire workflow for codon optimization of a target protein sequence begins with the identification of the host’s preferred ICU and CC distributions as the reference (Figure

**General codon optimization workflow.** In the codon optimization step, either ICO, CCO or MOCO can be used to optimize the sequence.

The step of selecting the codon pattern of high-expression genes for codon optimization is only relevant if the following two conditions are true: (1) the ICU and CC distributions of high-expression genes are significantly biased and nonrandom; and (2) there is a significant difference in ICU and CC distribution between highly expressed genes and all the genes in the host organism’s genome. If the first condition is false, there is no codon (pair) bias and codons can be assigned randomly based on a uniform distribution; if the second condition is false, the ICU and CC distributions computed from all the genes in the genome are sufficient to characterize the ICU and CC preference of the organism, without the need for selecting high-expression genes.

To determine the significance of ICU and CC biases, we applied the Pearson’s chi-squared test (see Materials and Methods). Using a p-value cut-off of 0.05, the ICU and CC distributions of at least 80% of the amino acids (pairs) amenable to the chi-squared test were found to be significantly biased in the micro-organisms (Table

| | **E. coli** | | **L. lactis** | | **P. pastoris** | | **S. cerevisiae** | |
|---|---|---|---|---|---|---|---|---|
| Null hypothesis ($H_0$) | H = uniform | H = A | H = uniform | H = A | H = uniform | H = A | H = uniform | H = A |
| Alternative hypothesis ($H_1$) | H ≠ uniform | H ≠ A | H ≠ uniform | H ≠ A | H ≠ uniform | H ≠ A | H ≠ uniform | H ≠ A |
| No. of biased amino acids (P-value < 0.05) | 18 | 17 | 19 | 17 | 18 | 19 | 18 | 19 |
| No. of unbiased amino acids (P-value ≥ 0.05) | 1 | 2 | 0 | 2 | 1 | 0 | 1 | 0 |
| No. of singular amino acids | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| No. of unevaluated amino acids (Expected count < 5) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total no. of amino acids | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
| No. of biased amino acid pairs (P-value < 0.05) | 314 | 99 | 327 | 15 | 354 | 259 | 372 | 282 |
| No. of unbiased amino acid pairs (P-value ≥ 0.05) | 26 | 23 | 12 | 65 | 38 | 36 | 19 | 9 |
| No. of singular amino acid pairs | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| No. of unevaluated amino acid pairs (Expected count < 5) | 76 | 294 | 77 | 336 | 24 | 121 | 25 | 125 |
| Total no. of amino acid pairs | 420 | 420 | 420 | 420 | 420 | 420 | 420 | 420 |

The chi-squared statistic is computed based on the observed occurrence of each codon (pair) and the expected occurrence under the null hypothesis of uniform distribution. Any amino acid (pair) with p-value < 0.05 is considered to exhibit significantly biased codon (pair) usage. Singular amino acids (methionine and tryptophan) and singular amino acid pairs (pairs only consisting of methionine and/or tryptophan) are not amenable to the biasness analysis since they are not encoded by more than one synonymous codon (pair). Chi-squared statistic and p-value are not calculated for amino acid (pair) with expected counts less than 5 (see Materials and Methods for details). Abbreviations: A, codon (pair) distribution of all genes in the genome; H, codon (pair) distribution of high-expression genes.

**PCA of ICU and CC distributions.** The first and second principal components (PC1 and PC2) are plotted to show the differences in the ICU and CC distributions of (top 5%) high-expression genes (H), (bottom 5%) low-expression genes (L) and all genes (A) found in the genomes of

Performance of codon optimization methods

The performance of each optimization approach was evaluated using a leave-one-out cross-validation, where a gene is randomly selected from the entire set of high-expression genes for sequence optimization while the rest of the genes will be used as the training set to calculate the reference ICU and CC distribution (Figure _{
M
}. From the results, the _{ICO}, _{CCO} and _{MOCO} solutions were generally found to be more similar to the native genes than the random sequences generated by RCA indicating that all the optimization approaches are indeed capable of improving the codon usage pattern compared to the control (Figure _{
M
} values of _{ICO}, _{CCO}, _{MOCO} and _{RCA} sequences for each gene are further compared in a “tournament” style to show the relative performance of each optimization method. In the tournament matrix (Table

**List of Sequences.** Excel file contains DNA sequences of wild-type high- and low-expression genes; the optimized genes generated by the


**Codon optimization validation.** The

| | $x_{ICO}$ | $x_{CCO}$ | $x_{MOCO}$ | $x_{RCA}$ |
|---|---|---|---|---|
| $x_{ICO}$ | | 7 | 19 | 95 |
| | | 2 | 18 | 99 |
| | | 4 | 15 | 93 |
| | | 5 | 22 | 99 |
| $x_{CCO}$ | 92 | | 82 | 97 |
| | 96 | | 93 | 100 |
| | 96 | | 86 | 100 |
| | 93 | | 89 | 99 |
| $x_{MOCO}$ | 78 | 15 | | 97 |
| | 74 | 5 | | 100 |
| | 83 | 12 | | 99 |
| | 75 | 9 | | 99 |
| $x_{RCA}$ | 5 | 2 | 3 | |
| | 0 | 0 | 0 | |
| | 6 | 0 | 1 | |
| | 1 | 0 | 0 | |

For every gene, the $p_M$ value (percentage of codon matches) of the optimal sequences generated by the respective optimization approaches is compared pair-wise for each expression host. The numbers of tournament wins/losses by each approach for all the genes in each expression host are added up. The sequences generated by ICO, CCO, MOCO and RCA are indicated as $x_{ICO}$, $x_{CCO}$, $x_{MOCO}$ and $x_{RCA}$ respectively. In each cell, the numbers from top-most to bottom-most correspond to the data for
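The tournament-style comparison can be sketched as follows, assuming the match score of a sequence is the percentage of positions at which its codon equals that of the native gene, and that ties are counted as a win for neither method. The function names and the input layout (one score per gene for each method) are hypothetical.

```python
from itertools import combinations

def percent_match(candidate, native):
    """Percentage of codon positions at which candidate equals the native gene."""
    matches = sum(a == b for a, b in zip(candidate, native))
    return 100.0 * matches / len(native)

def tournament(scores):
    """scores: {method: [score per gene]} -> {(a, b): number of genes where a beats b}."""
    wins = {}
    for a, b in combinations(scores, 2):
        wins[(a, b)] = sum(x > y for x, y in zip(scores[a], scores[b]))
        wins[(b, a)] = sum(y > x for x, y in zip(scores[a], scores[b]))
    return wins
```

Each entry `wins[(a, b)]` then corresponds to one number in a cell of the tournament matrix for a single expression host.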

Through the comparison of ICO and CCO, the _{CCO} solutions have a higher average percentage of codon matches than _{ICO} sequences for all four microbes (Figure _{CCO} sequences matching the native corresponding sequences better than those generated by ICO (Table _{
M
} value of _{MOCO} were observed to be lower than that of _{CCO}, indicating that the consideration of ICU fitness in addition to CC fitness can be detrimental to the sequence design. To our best knowledge, no such formal evaluation of the relative impact of ICU and CC fitness on synthetic gene design has been presented to date. Hence, based on the promising

**Codon optimization of another set of high-expression genes in E. coli.**


Discussion

Capturing the preferred codon usage patterns

Earlier codon optimization studies have recommended the usage of high expression genes to design the recombinant gene for efficient heterologous expression

Several options are available for quantifying the codon usage patterns. In this study, we have adopted the method of treating the ICU and CC distributions as a vector of frequency values to capture the relative abundance of individual codons and codon pairs. An earlier well-known method for quantifying codon usage bias is the codon adaptation index (CAI). The CAI has been widely used for codon optimization due to its observed correlation with gene expressivity

Therefore, the proposed approach of optimizing codons according to the complete ICU and CC distributions of highly expressed genes will be suitable to alleviate the problem of tRNA pool imbalance when the cell is induced to overexpress the target gene. As such, the concept of CAI was not considered in this study as this single value does not capture the details in ICU and CC distributions.

Other potential issues in efficacy of CCO

Codon usage has been shown to affect the accuracy and speed of translation

On the other hand, translation initiation can also be affected by the mRNA structure of the initiation site. At the primary structure level, Shine-Dalgarno sequence and Kozak sequence should be added to the 5’ end of the coding sequence since previous studies have shown that they are required for recognition of the AUG start codon to initiate translation in prokaryotes and eukaryotes, respectively,

CCO tool for synthetic biology

To further develop CCO into a software tool for designing synthetic genes, several other factors may have to be considered. From the experimental aspect, the gene optimization should take into consideration the restriction enzymes used for vector construction, such that the corresponding restriction-site DNA motifs are avoided to prevent unintended cleavage of the coding sequence. In cases where the optimized coding sequence tends to have nucleotide repeats, additional steps may be required to avoid repeats or inverted repeats, which may lead to DNA recombination or formation of mRNA hairpin loops, respectively, and thereby reduce the heterologous expressivity of the target protein
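As an illustration of the restriction-site consideration, the sketch below scans a codon list for recognition motifs. The three enzymes listed are only an example set (their recognition sequences are standard), and the function name is hypothetical; how a detected site would be repaired, for example by a synonymous substitution, is not addressed here.

```python
# Example recognition sequences; the set to check depends on the cloning strategy.
RESTRICTION_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_restriction_sites(codons, sites=RESTRICTION_SITES):
    """Return (enzyme, 0-based nucleotide position) for every motif occurrence."""
    dna = "".join(codons)
    hits = []
    for name, motif in sites.items():
        start = dna.find(motif)
        while start != -1:
            hits.append((name, start))
            start = dna.find(motif, start + 1)   # allow overlapping occurrences
    return hits
```

A motif can span a codon boundary, which is why the scan runs on the joined nucleotide string rather than codon by codon.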

The optimal sequences generated by CCO are not found in any natural organism. Thus, the CCO software tool should also consider challenges involved in the synthesis of these artificial genes. The current technology for

Potential applications of CCO

The motivation behind codon optimization is usually to enhance the expression of foreign genes in expression hosts such as

Apart from biotechnological applications, codon optimization can also be used in biomedical research where modulation of protein expression is required to alter physiological response. For example, in the development of vaccines against viruses, one approach is to genetically manipulate the virus to obtain a “live attenuated” strain as the vaccine. Such a vaccine, when administered to the host, will elicit an immune response for the host to develop immunologic memory and specific immunity against the virus without severe disruption to the overall physiology. Some conventional methods of developing live attenuated vaccines include laboratory adaptation of virus in non-human hosts and random/site-directed mutagenesis

Conclusions

Through novel implementations of ICO, CCO and MOCO, the high-expression genes of four microbial hosts were optimized and cross-validated to compare the performance of the optimized sequences. Amongst all the optimization approaches, CCO was found to generate the sequences that are most similar to the native high-expression genes, indicating a greater potential for high

Methods

Identifying highly expressed genes

Provided that highly expressed genes have evolved to adopt optimal codon patterns, information on ICU and CC preference of any organism can be extracted from the DNA sequences of the high-expression genes. In this sense, we used published microarray data of

ICU and CC biasness

To compute the significance of codon (pair) usage bias, we resort to Pearson’s chi-squared test. Based on the null hypothesis that “the ICU (CC) of high-expression genes follows the uniform/unbiased distribution”, the chi-square statistic for amino acid (pair) $j$ is calculated as:

$$\chi_j^2 = \sum_{i=1}^{k_j^H} \frac{\left(O_{ij}^H - E_{ij}^H\right)^2}{E_{ij}^H}$$

where $E_{ij}^H$ and $O_{ij}^H$ are the expected and observed numbers of synonymous codon (pair) $i$ for amino acid (pair) $j$, and $k_j^H$ refers to the number of unique synonymous codons (pairs) encoding the amino acid (pair) $j$. For example, the $k_j^H$ values for asparagine, glycine and leucine are 2, 4, and 6 respectively. The superscript “H” indicates the high-expression genes. Under the uniform null hypothesis, $E_{ij}^H$ can be calculated as $N_j^H / k_j^H$, where $N_j^H$ refers to the total number of amino acid (pair) $j$. The p-value is obtained by comparing $\chi_j^2$ against the $\chi^2$ distribution with $(k_j^H - 1)$ degrees of freedom, since the reduction in degrees of freedom is one due to the constraint $\sum_i O_{ij}^H = N_j^H$. This test is referred to as “$\chi^2$ Test 1”. To ensure the statistical adequacy of this chi-squared test, any amino acid (pair) with low expected occurrence (i.e. $E_{ij}^H < 5$) is omitted from this analysis, as recommended in an earlier study

The presented Pearson’s chi-squared formulation is slightly modified to determine whether the ICU (CC) is significantly different between high-expression genes and all genes in the genome. Based on the null hypothesis that “ICU (CC) of high-expression genes is the same as that of all genes in the genome”, the expected number of codon (pair) $i$ for amino acid (pair) $j$ in the high-expression genes is calculated as:

$$E_{ij}^H = N_j^H \, \frac{O_{ij}^A}{N_j^A}$$

where $O_{ij}^A$ refers to the observed number of codon (pair) $i$ and $N_j^A$ refers to the total number of amino acid (pair) $j$ among all genes in the genome. By using this $E_{ij}^H$ in the expression for $\chi_j^2$, the chi-squared statistic to test the difference in ICU (CC) distribution between high-expression genes and all genes in the host’s genome can be calculated.
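The two chi-squared tests described in this section can be sketched for a single amino acid (pair) as follows. This is a plain-Python illustration with hypothetical function names; it returns the statistic and degrees of freedom only, leaving the p-value lookup to a statistics library, and it assumes the expected-count rule of at least 5 is applied per codon (pair).

```python
def chi_squared_uniform(observed, min_expected=5):
    """Test against a uniform null: observed counts of each synonymous codon (pair).

    Returns (statistic, degrees_of_freedom), or None if the test is not applicable.
    """
    k = len(observed)
    if k < 2:
        return None                      # singular amino acid (pair)
    expected = sum(observed) / k         # uniform expectation per codon (pair)
    if expected < min_expected:
        return None                      # too few counts to evaluate
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, k - 1

def chi_squared_vs_genome(observed_high, observed_all, min_expected=5):
    """Test whether high-expression usage differs from usage over all genes."""
    if len(observed_high) < 2:
        return None
    n_high, n_all = sum(observed_high), sum(observed_all)
    # Expected counts scale the genome-wide proportions to the high-expression total.
    expected = [n_high * o / n_all for o in observed_all]
    if min(expected) < min_expected:
        return None
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed_high, expected))
    return statistic, len(expected) - 1
```

For a two-codon amino acid observed 30 and 10 times, the uniform test gives a statistic of 10.0 with 1 degree of freedom, which exceeds the 0.05 critical value of 3.841 and would therefore be counted as biased.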

ICU and CC fitness evaluation

In this study, the target gene, subsequently known as the “subject”, is optimized such that the final synthetic sequence design will exhibit ICU and/or CC distributions that are as similar as possible to those preferred by the host’s organism. The ICU and CC fitness values can be used to quantify the degree of similarity in ICU and CC distributions between the subject and the host. Before formulating the ICU and CC fitness, we present the mathematical expression of the coding sequence and amino acid sequence as follows:

where $a_{i,1}$ refers to the amino acid occupying the $i$th position of the amino acid sequence $S_{A,1}$, with the subscript 1 indicating the target protein; $a_{i,1}$ also belongs to the set A of 21 unique amino acids $\alpha^j$. Similarly, $c_{i,1}$, a codon from the set K of 64 unique codons $\kappa^k$, represents the codon variable in the $i$th position of the target coding sequence $S_{C,1}$. It is noted that the coding sequence is expressed as a sequence of codons instead of nucleotides since the codon usage pattern is the key concern. As codon context is another key issue to be examined, we also include the following mathematical expressions for amino acid pairs and codon pairs:

By defining a function to _{
i,1}, _{
i,1}, _{
i,1} and _{
i,1}:

The ICU distribution can be defined as the frequency of each unique codon based on its total number of occurrences in the sequence(s). Based on the mathematical formulation presented hitherto, the required mathematical expressions to calculate the ICU distribution are as follows:

where 1{·} is an indicator function such that

The count variables $\theta_{AA}^{j}$ and $\theta_{C}^{k}$ refer to the numbers of occurrences of amino acid $\alpha^j$ and codon $\kappa^k$, respectively, while $p^{k}$ represents the frequency of occurrence of codon $\kappa^k$.

The ICU fitness, $\Psi_{ICU}$, was divided by 64 such that the numerical value reflects the average fitness over all codons. In a similar way, if we denote the frequency of occurrence of codon pair $k$ as $q^{k}$, the CC fitness can be calculated as:

(See Additional file

**Formulation of codon optimization methods.** Detailed mathematical formulation of ICO, CCO and MOCO.
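The ICU and CC distributions and fitness values described in this section can be sketched as below. Several details are assumptions, since the exact expressions are given in the equations and the additional file: frequencies are normalized over all codons (pairs) in the supplied sequences, fitness is taken as the negative sum of absolute differences between host and subject frequencies, and the divisor `n_categories` is 64 for ICU as stated in the text while its value for CC is left to the caller. All function names are hypothetical.

```python
from collections import Counter

def icu_distribution(sequences):
    """Frequency of each codon over a collection of codon lists."""
    counts = Counter(codon for seq in sequences for codon in seq)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def cc_distribution(sequences):
    """Frequency of each adjacent codon pair over a collection of codon lists."""
    counts = Counter(pair for seq in sequences for pair in zip(seq, seq[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def fitness(host, subject, n_categories):
    """Assumed fitness: negative summed absolute frequency gap, averaged over categories."""
    keys = set(host) | set(subject)
    gap = sum(abs(host.get(k, 0.0) - subject.get(k, 0.0)) for k in keys)
    return -gap / n_categories
```

With this sign convention a fitness of zero means the subject's distribution matches the host's exactly, and more negative values indicate a larger mismatch.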


Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BKSC and DL conceived the codon optimization idea. BKSC developed the algorithm, performed the computational simulations and drafted the manuscript. DL revised the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

This work was supported by the National University of Singapore, Biomedical Research Council of A*STAR (Agency for Science, Technology and Research), Singapore and a grant from the Next-Generation BioGreen 21 Program (SSAC, No. PJ008184), Rural Development Administration, Republic of Korea. We would like to thank Dr. Jungoh Ahn and the anonymous reviewers for their invaluable suggestions and feedback.