Abstract
Background
DNA shuffling generates combinatorial libraries of chimeric genes by stochastically recombining parent genes. The resulting libraries are subjected to largescale genetic selection or screening to identify those chimeras with favorable properties (e.g., enhanced stability or enzymatic activity). While DNA shuffling has been applied quite successfully, it is limited by its homologydependent, stochastic nature. Consequently, it is used only with parents of sufficient overall sequence identity, and provides no control over the resulting chimeric library.
Results
This paper presents efficient methods to extend the scope of DNA shuffling to handle significantly more diverse parents and to generate more predictable, optimized libraries. Our CODNS (crossover optimization for DNA shuffling) approach employs polynomialtime dynamic programming algorithms to select codons for the parental amino acids, allowing for zero or a fixed number of conservative substitutions. We first present efficient algorithms to optimize the local sequence identity or the nearestneighbor approximation of the change in free energy upon annealing, objectives that were previously optimized by computationallyexpensive integer programming methods. We then present efficient algorithms for more powerful objectives that seek to localize and enhance the frequency of recombination by producing "runs" of common nucleotides either overall or according to the sequence diversity of the resulting chimeras. We demonstrate the effectiveness of CODNS in choosing codons and allocating substitutions to promote recombination between parents targeted in earlier studies: two GAR transformylases (41% amino acid sequence identity), two very distantly related DNA polymerases, Pol X and β (15%), and betalactamases of varying identity (2647%).
Conclusions
Our methods provide the protein engineer with a new approach to DNA shuffling that supports substantially more diverse parents, is more deterministic, and generates more predictable and more diverse chimeric libraries.
Background
The harnessing of DNA recombination in vitro has transformed protein engineering by enabling engineers, like nature, to sample sequence space more broadly than is allowed by point mutagenesis at individual residues. Recombination produces chimeras comprised of sequential fragments from parent genes, thereby bringing together sets of sequences that were previously active in the parental background, and are thus likely to be less disruptive than random ones. Chimeragenesis typically produces combinatorial libraries, and those chimeras with beneficial properties can be identified by largescale genetic screening and selection.
DNA shuffling [1,2], the progenitor of recombinationbased protein engineering, works by randomly digesting the parent genes into fragments and reassembling the fragments into new chimeric genes (Figure 1). Recombination occurs when fragments from different parents are sufficiently complementary to anneal and prime synthesis from the 3' end. DNA shuffling has been called by its developer Pim Stemmer "the most dangerous thing you can do in biology" [3], due to its power in generating novel proteins. Indeed, it has been the basis both for commercial success (Affymax, Maxygen) and the development of effective protein variants [47].
Figure 1. Basic steps in gene shuffling protocol (following [30]). (1) Parental genes are stochastically fragmented. (2) The fragments are denatured, and strands with sufficient complementarity are annealed and extended. Crossovers are formed when the complementary strands are from different parents and can be extended to complete fragments. The process is repeated for multiple rounds, generating additional fragments and crossovers. (3) Ultimately a chimeric library is generated, some of whose members represent fulllength genes, as shown.
DNA shuffling is both homologydependent (recombination can occur only in runs of similar DNA sequence), and stochastic (the engineer does not control the recombination sites). Due to dependence on sequence similarity, DNA shuffling may fail to generate desirable chimeras (or any chimeras at all) for diverse parents, as they have only a few, small regions of DNA similarity, insufficient to generate many crossovers. Homologyindependent stochastic methods (e.g., ITCHY [8] and SHIPREC [9]) mitigate the need for such parental sequence similarity, but at the cost of generating many more nonviable chimeras.
In contrast with stochastic methods, sitedirected methods enable the engineer to explicitly choose breakpoint locations so as to optimize expected library quality (e.g., by employing structural information [10], or by minimizing predicted disruption [11,12], library diversity [13], or both factors [14]). We have developed a sitedirected method employing planned ligation of parental fragments by short overhangs [15]. We have coupled this approach to robotic implementation in order to generate specific chimeras in defined experimental vessels [16]. Such highlydirected methods of chimera generation are most useful when screening represents a significant effort. In those situations where screening or genetic selection is readily available, then stochastic approaches, with less overall cost, might prove preferable.
We present here methods for extending stochastic experiments by optimizing DNA shuffling (Figure 2), yielding an approach that is less dependent on parental DNA sequence similarity (parents can be more diverse) and more deterministic (crossovers are more predictable), and which is amenable to library optimization. Our approach, which we call CODNS (
 c
 o
 DN
 s
Figure 2. Schematic overview of CODNS. Here our method CODNS is applied to choose codons in a portion of two parental genes so as to produce a 9 nt "run" of common nucleotides, likely to be sufficient for a crossover between complementary strands. To achieve the run, we use a combination of silent DNA substitutions (underlined) as well as a conservative amino acid substitution (boxed). The implications of these choices can be global; e.g., TTC for F and CTC for L would end the current run at 7 nt, but provide the first 2 nt of a new one. Our dynamic programming algorithm finds the globally optimal solution.
We demonstrate the effectiveness of CODNS in several case studies. We first optimize the GAR transformylases previously optimized by eCodonOpt [17]. We then show that CODNS can optimize two DNA polymerases (Pol X and Pol β) that are sufficiently diverse (15% amino acid sequence identity) to previously require the development and application of the SCOPE method [10], instead of direct application of DNA shuffling. Finally, we study the impact of parental sequence identity by considering pairs of betalactamase parents of differing diversity levels.
Methods
We take as input the amino acid sequences of the parent proteins to be shuffled, aligned to a length of n (amino acids and gaps) based on sequence and/or structure. For simplicity of exposition, we present our methods for the most common case of shuffling two parents, a_{1 }and a_{2}. Our methods readily extend to creating equivalent sites for recombination in multiple parents, and it remains interesting future work to allow for nonuniform shuffling (i.e., where different crossovers are possible between different pairs of parents).
To optimize the shuffling experiment, we select a codon for each amino acid for each parent, yielding DNA sequences d_{1 }and d_{2 }of length 3n (maintaining gaps for those in the amino acid sequences). To expand the pool of codons being considered at a particular position, we may choose to make an amino acid substitution. Thus we take as additional input a specification of the allowed substitutions for each residue position for each parent, along with a number m of them to make. The allowed substitution specification may be derived from sequence and/or structural analysis of the parents, including general amino acid substitution matrices [18], positionspecific amino acid statistics from related proteins [19], and fold predictions for possible substitutions [20]. The results presented below determine allowed substitutions under the BLOSUM62 substitution matrix, considering only "conservative" substitutions which score no more than 4 worse than wildtype [15].
In describing the algorithms, we use possible codon sets representing the codons allowed at each position in the wildtype and under the allowed substitutions. For position i, set C_{1}[i] contains the possible codons for a_{1}[i], pairing each with an indication of whether or not it requires a substitution, e.g., {(TTT, 0), (TTC, 0), (TGG, 1)} for an F that could potentially be mutated to W. Set C_{2}[i] is defined similarly for the second parent. We note that these may readily be used to restrict where to employ mutations (e.g., masking based on structural analysis, as discussed by Moore and Maranas [17]), by allowing only wildtype codons (or amino acids) in some positions.
We consider four types of objective function, targeting common nucleotides (at aligned positions), nearestneighbor approximation to change in free energy of annealing (from dinucleotide pairs), common nucleotide runs (in contiguous strings), or library diversity (among resulting chimeras). We develop increasingly more complex dynamic programming algorithms to optimize these objectives.
Common nucleotide optimization
In this most basic optimization for DNA shuffling, the goal is to maximize the number of identical nucleotides at common positions:
where I is the indicator function (1 for true, 0 for false).
With no substitutions allowed, each residue position is independent of each other one. Thus we simply select for each position a pair of codons (one for each parent) with a maximal number of common nucleotides. When substitutions are allowed, we need to allocate them for optimal impact. While several approaches are possible, we develop here one based on dynamic programming, to serve as the basis for the more complex objective functions we pursue in subsequent subsections.
In our dynamic programming matrix, one dimension represents an aligned residue position (i.e., we have optimized the sequences up to that point), and the other represents a number of substitutions (i.e., we have made that many thus far). Let N[i, s] denote the number of common nucleotides within the first i residues, using exactly s substitutions. The value of N[i, s] extends the value of N[i  1, s  (t_{1 }+ t_{2})] with the additional number of common nucleotides obtained by selecting a pair of codons for position i while making t_{1 }+ t_{2 }additional substitutions (0 or 1 for each parent). Optimal substructure holds, since the optimal value of N[i, s] depends on the optimal value of N[i  1, s  (t_{1 }+ t_{2})]. The recurrence is
where g gives the number (03) of common nucleotides for a pair of codons.
After filling in the dynamic programming table, we trace back from N[n, m] to generate an optimal pair of DNA sequences. The matrix is of size n * m and each cell takes constant time to compute.
optimization
While it is hard to directly model the process of DNA shuffling [2123], it is driven by the change in free energy upon annealing of the different parental nucleotide strands (coding of one with noncoding of the other; see Figure 1). We want to minimize the free energy change, so that it is favorable to cross over. Since the free energy change is very hard to compute, a common approach is to approximate it by decomposing the free energy into the sum of contributions from pairs of dinucleotides, the nearestneighbor approximation [24]:
where is the free energy change associated with annealing dinucleotide d_{1}[i]·d_{1}[i + 1] with dinucleotide d_{2}[i]·d_{2}[i + 1]. These values can be computed from enthalpic ΔH (kcal/mol) and entropic ΔS (cal/mol·K) nearestneighbor parameters compiled at 37°C and [Na^{+}] = 1.0 M [24], including both pairs of complementary strands. To actually estimate the change in free energy, there are additional constant terms such as the average initiation energy contribution; we omit them as they do not affect the optimization. While the underlying parameters are defined on pairs of dinucleotides, we abuse notation a bit in our formulation below and use for 4mers to mean the sum over the constituent dinucleotides.
We now develop a dynamic programming formulation to optimize this objective more efficiently (in polynomial time) than the integer linear programming of eCodonOpt [17], while still ensuring global optimality. In order to compute the contributions from a selected codon, we must also know the final nucleotide of the previous codon, as it forms a dinucleotide with the first nucleotide of the current codon. Thus we extend the common nucleotide dynamic programming table to keep track of this information. Figure 3(left) illustrates the dependency; the recurrence is
where b·c indicates the concatenation of base b onto codon c, and estimates the change in free energy, as described in the text. Cell A[i, s, b_{1}, b_{2}] holds the best score for the first i positions, using exactly s substitutions, with third nucleotides b_{1 }(first parent) and b_{2 }(second parent) for position i. As with common nucleotide optimization, if a codon pair makes t_{1 }+ t_{2 }substitutions at position i, then A[i, s, b_{1}, b_{2}] extends the solution to a cell for position i  1 with s  (t_{1 }+ t_{2}) substitutions, considering any of the third nucleotides and at position i  1.
Figure 3. Examples of dynamic programming recurrences. The top part of each example shows relationships in the table and the bottom part the score differences. (left) optimization, a nearestneighbor approximation to the change in free energy of annealing, summing adjacent nucleotide pairs. We must keep track of the third nucleotides of the codons for position i  1 in order to compute the contribution. (right) Run optimization cases: common codon; common codon after substitution; continue and end a run; continue and end one run and start a new one; end one run and start a new one. The run score contribution is computed based on the pattern of nucleotide matches.
The table is of size n × m × 5^{2 }for 2 parents, since there are only (4 + 1)^{2 }combinations of single nucleotide pairs for two parents (four nucleotides and a gap each). Each cell can be computed in constant time. In practice, we construct a 2D table (over i and s), with each cell maintaining a list of scores for the (b_{1}, b_{2}) pairs that actually occur.
Run optimization
Moore and Maranas argued that the nearestneighbor approximation to change in annealing free energy is a better objective for shuffling optimization than the number of common nucleotides [17]. Intuitively, since the nearestneighbor approximation considers adjacent nucleotides together rather than treating them independently, it is more likely to yield sufficient complementarity between fragments and thereby promote recombination. Here we go even further and explicitly optimize for contiguous complementary regions, since annealing is driven by sufficiently long (anecdotally 6 nt or more) such regions.
We define a common nucleotide run as a maximallength substring appearing at aligned positions in the DNA sequences d_{1 }and d_{2}, and use as our objective function:
where f, which must be nondecreasing, indicates the value for DNA shuffling of a run of length R, and the sum is taken over all runs. We have implemented and tested several different scoring functions; the results use the following two functions:
In f_{1}, we count the total number of nucleotides in a run, but only if the run exceeds a given length (we empirically evaluated several thresholds). This assumes that crossovers are impossible for runs with fewer than θ common nucleotides, and become increasingly likely with additional nucleotides beyond θ. In f_{2}, we consider crossovers impossible for fewer than 6 nucleotides and very likely for 9 nucleotides or more (scoring the total number of nucleotides as in f_{1}), and we ramp up from the impossible score of 0 at 5 nt to the likely score of 9 at 9 nt, thereby counting the partial benefit that may be provided by runs between 6 and 9 nucleotides.
We must extend our dynamic programming table with an additional dimension to keep track of the current run length. Thus we have a table in which cell R[i, s, r] holds the best score for the first i positions, using exactly s substitutions, such that the final nucleotide in the codons chosen for position i is the rth in a run (0 if mismatch). Again, if we make t substitutions at position i, then R[i, s, r] extends the solution to a cell for position i  1 with s  (t_{1 }+ t_{2}) substitutions. Now we must also account for the preceding run length; there are several cases (Figure 3, right): the codons chosen for the current amino acid position may continue a run from the previous position, may end that run, and may start a new run. In any case, the current r and possible codon pair determines the preceding r' at which to look, and optimal substructure still holds. The recurrence is thus
where a(c_{1}, c_{2}) and z(c_{1}, c_{2}) give the lengths of the longest common prefix and suffix, respectively, of a pair of codons. The first case handles a common codon, while the second case handles an unequal codon pair, which may end and/or begin a run. The score depends on that from the related cell, with an increment in f(·) accounting for any extension in run length and initiation of a new run. (See again Figure 3, right.) When there is a tie, we prefer the codon pair with the most common nucleotides, even if that has no impact on run score. This choice increases overall sequence identity, to promote better annealing of strands from different parents.
The matrix is of size n * m * (3n + 1), since the run length potentially ranges from 0 to the entire DNA sequence length (3n). However, in practical cases, most run lengths are not attainable. Furthermore, for r_{1 }< r_{2}, if R[i, s, r_{1}] + f(r_{2})  f(r_{1}) < R[i, s, r_{2}], then the r_{2 }cell "dominates" the r_{1 }onethe r_{1 }one cannot be part of the optimal solution. Thus we modify the usual dynamic programming algorithm slightly, to avoid filling in cells with unattainable or dominated run lengths. We perform the standard nested loop over i (residue position) and s (number of substitutions). Then for each i and s we determine which run lengths are attainable and undominated and fill in only those entries. Rather than keeping a 3D table, we keep a 2D table in which each cell has a list of run lengths and their scores. Note from the structure of Eq. 14 that we can determine the run lengths for i, s from the possible codons at i and the run lengths that were attained and undominated for i  1 and s, s  1, and s  2 (depending on the numbers of substitutions required for the codons).
Diversity optimization
We have previously developed methods for optimizing the diversity of libraries of chimeras produced by sitedirected recombination [13,14]. We showed that the total number of mutations in a library is a constant determined only by the parents, but that by assessing the squareddifferences in the numbers, we can optimize for a relatively uniform sampling of sequence space. In the case of two parents, we define the diversity variance over a library as:
where λ is the number of fragments, m(H_{i}, H_{j}) is the mutation level (number of amino acid differences) between a pair of chimeras H_{i }and H_{j}, and is the average of m over the library. (We drop a constant factor of 2, which doesn't affect the optimization.) To mitigate the effect of neutral mutations, rather than using literal equality we measure m using one of the standard sets of amino acid classes. The goal is to minimize the variance, seeking to sample sequence space as uniformly as possible.
The objective function is defined in terms of the chimeras in the library. In the context of DNA shuffling, we assume that a sufficiently large run of common nucleotides (with respect to a threshold θ as in Eq. 10) results in a breakpoint, and thus that the (fulllength) chimeras are welldefined as all combinations of fragments between the breakpoints. Breakpoints resulting from smaller runs only add to the diversity of the resulting library.
For an efficient algorithm, we must be able to compute the objective function during the optimization, without enumerating the exponential number of chimeras. In our previous sitedirected work [13], we developed a recursive formulation relating the diversity variance for a library to that of a sublibrary with one fewer breakpoint. That formulation took as given the total number of breakpoints, which isn't available in the DNA shuffling context. However, similar algebraic manipulations (omitted due to lack of space) yield a related formula without requiring preknowledge of the number of breakpoints.
Claim 1 The diversity variance d(l, k) of a library from parent sequences P_{a }and P_{b }with kth breakpoint is at residue l can be computed from the diversity variance d(l', k  1) for a library with (k  1) st breakpoint at residue l' < l by the following formula:
where
and we use notation P[i, j] to indicate the substring from position i to j, inclusive.
Based on Eq. 16, we further extend our runlength optimization dynamic programming recurrence to optimize for diversity:
We add two more dimensions, to keep track of k, how many runs of length θ we have seen (i.e., confidently yielding breakpoints), and l, where the last one was, as in the claim. Intuitively these two additional dimensions are necessary since the number of breakpoints affects the size of the library and thus the diversity variance, and since the additional diversity induced by a run depends on the nucleotides between the previous breakpoint and the new one. Note that in Eq. 16, k is the number of breakpoints, with the last breakpoint always at the end of the current position l; however, in Eq. 19, k is the number of previous runs, or k + 1 when substituted into Eq. 16. As with run optimization, our implementation avoids filling in the table for run lengths that are unattainable (though the notion of dominated entries does not carry over).
Codon usage
In order to promote better protein expression, we follow the GeneDesigner protocol [25] in employing organismspecific codon usage tables. A codon usage table for an organism [26] encodes the frequency with which each codon has been observed in a sequence database; different organisms display different "preferences" [27]. In a preprocessing step, we disallow rare codons that make up less than 10% of the occurrences for their amino acid. Then when computing one of the recurrences, we use the codon usage table to resolve cases where multiple possible codons give the same score (i.e., they have the same implications for continuing, ending, and beginning runs). In such cases, we selecting among the possible codons with probability according to their usage frequency.
Results and discussion
We use three case studies to demonstrate the effectiveness of CODNS in optimizing DNA shuffling experiments. The first two case studies are a pair of glycinamide ribonucleotide (GAR) transformylases (previously optimized by eCodonOpt [17]) and a pair of distantly related DNA polymerases (previously recombined by SCOPE [10]). We optimize shuffling plans using from 0 to 10 mutations under each of the objective functions, abbreviated in the figures as cn (common nucleotides), ΔG (nearestneighbor approximation to change in free energy of annealing), f_{1 }(runs under f_{1 }scoring), f_{2 }(runs under f_{2 }scoring), and dv (library diversity). We examine particular plans optimized under different objectives, in order to see how they differ in allocating mutations and producing homologous runs suitable for crossovers. We then study the overall trends in optimizing the objectives and in producing runs. We also consider the diversity of the chimeras that would result by recombination under different runoptimal plans. Comparisons with what would result eCodonOpt [17] can be made by noting that it optimizes cn and ΔG (though we use an efficient dynamic programming algorithm to do so). In a third case study, we evaluate the effects of wildtype sequence identity on the optimization, using different pairs of betalactamases.
GAR transformylases
The parents for our first case study are a GAR transformylase from E. coli and one from humans. Previous work showed that DNA shuffling crossovers are extremely rare without codon optimization [17]. We obtained the (gapless) alignment from the supplementary material of [17], and transcribed it to 201 amino acids with 82 (40.8%) in common. The wildtype DNA sequences had 47% nucleotides in common [17], with only two runs of length 7 and no runs longer than 7 nt.
Figure 4 illustrates some optimal plans, showing runs of length ≥ 9, a relatively confident threshold for crossovers (analogous observations can be made with other thresholds, not shown to save space). In some cases, many plans may be tied for optimal under the objective. We extended our dynamic programming backtrace to generate the tied solutions [28]; for common nucleotides, we enumerated all, but for , we stopped after 1000 were generated. The figure then shows the tiedforoptimal plans with the most nucleotides in runs (at a threshold of 9). All objectives yield the same run patterns with 0 mutations. However, with more mutations, run and diversity optimization methods focus their efforts on regions that are sufficiently similar to allow formation of runs with wellplaced mutations and wellchosen codons, while choices made by common nucleotide and optimization are not as productive. Diversity optimization produces the same number of runs as f_{1 }optimization, but places the runs more evenly throughout the entire sequence so that crossing over at those sites would yield chimeras comprised of more uniformlysized fragments better sampling the sequence space spanned by the parents. ("Size" in diversity optimization refers to residues at which the parents differ, not just the total number of amino acids [13].)
Figure 4. GAR transformylase detailed plans. Shown are the locations of selected mutations (vertical lines) and resulting runs of length ≥ 9 (boxes; staggered at some locations to avoid overlap), optimized under different objectives. For f_{2}, runs are colored by length, with 6:green, 7:black, 8:magenta, ≥ 9:blue. For 0 mutations, all methods yielded the same run pattern, so only the f_{2 }version is shown.
We next analyzed the overall ability of CODNS to select codons and allocate mutations to meet the different optimization goals. Figure 5 illustrates the objective score trends with increasing numbers of mutations. All of the plots are quite linear (recall that is to be minimized), demonstrating that there is sufficient freedom within these two parents to enable effective optimization for the objectives, and that the algorithms are successfully exploiting the available freedom. Since the f_{2 }metric gives "partial credit" for run lengths of 6, 7, and 8, we break out those contributions to its score. We see most of the optimization still focuses on full 9 nt and larger runs, which is natural given the reduced score contribution for shorter runs (since they are believed to be less productive in promoting recombination). The trends for diversity optimization are not shown here since scores are not directly comparable for libraries of different sizes (resulting from different numbers of runs yielded by different patterns of codons and mutations).
Figure 5. GAR transformylase scoring trends. The trends are given for each objective, from 0 to 10 mutations.
We introduced novel runbased objective functions in order to more directly target the sufficient stretches of parental homology required for annealing, only indirectly optimized by the objectives of common nucleotides and employed by eCodonOpt. To assess the impact of this more direct objective functions, we determined the number of runs produced by plans under the different objectives. We varied the threshold to consider a homologous region as a "run" from 7 (lower confidence) to 12 (higher confidence). As discussed above, among the plans tied for optimal for a particular objective, we sought the one with the best run score. We evaluated both the number of runs and the number of nucleotides in those runs. For the sake of space, Figure 6 presents only the results for a threshold of 9; the trends at the other thresholds are very similar. We see that, as mutations are introduced, optimizing directly for runs is indeed much more effective at producing runs than either of the "proxies" of common nucleotides or . We do not show trends for f_{2}, as it also optimizes for "partial credit" runs (of lengths 6, 7, and 8).
Figure 6. GAR transformylase run trends. Shown are the run trends (left: # nts in runs; right: # runs) in GAR transformylase plans employing from 0 to 10 mutations, optimized for common nucleotides (red), (blue), and runs under f_{1 }score (magenta).
The final question regards diversityhow much control we can exert over the level of diversity we introduce (how different the resulting chimeras in a plan are from the parents and from each other). Here we deem a 9 nt run as sufficient for a breakpoint, and evaluate the ability of CODNS to minimize our library diversity variance objective (Eq. 15) while maximizing the number of runs. Figure 7 illustrates some plans with the same (optimal) number of runs but different library diversity scores. As discussed above, mutations in diversityoptimal plans are optimally allocated so as to create runs more evenly distributed throughout the entire sequence (counting positions with different amino acids in the parents). However, plans with larger diversity variance scores place runs closer together and leave the Cterminal portion without any runs, thereby generating no diversity there (the final 100+ residues will be from one parent or the other, rather than a hybrid).
DNA polymerases
Our second case study involves two distantlyrelated members from the Xfamily of DNA polymerases: African swine fever virus DNA polymerase X (Pol X) and Rattus norvegicus DNA polymerase beta (Pol β). While these two proteins share a similar fold, they have very low sequence identity. The sitedirected SCOPE method [10] was developed due to the difficulty in producing viable Pol X  Pol β chimeras by other methods. We obtained the published structurebased sequence alignment of the two parents, in which the full Pol X and the palm and finger domains of Pol β were aligned to a length of 214 residues and gaps, with only 32 residues (15%) in common. The wildtype DNA sequences had only 158/642 (24%) nucleotides in common, with no common nucleotide runs of length greater than 5. Thus standard DNA shuffling techniques are unlikely to produce any crossovers.
We optimized these parents under each of our objective functions, using from 0 to 10 mutations. Figure 8 illustrates an optimal plan for each objective with 0, 4, or 8 mutations, showing runs of length ≥ 9. These particular parents are so diverse that only a few such runs can be produced by codon selection alone (no mutations). We do see, however, that the runoptimization methods form one more run (positions 148157) than do the common nucleotide and methods, and f_{2 }forms some potentially productive shorter runs. The difference increases with more mutations, as run optimization directly allocates them so as to produce more runs, while, due to the parental diversity, the indirect choices made to optimize common nucleotides and are unlikely to lead to runs. With less freedom, it is harder to optimize diversity. We do see that while the f_{1 }plan is diversityoptimal for 0 and 4 mutations, it is not for 8 mutations, and the diversityoptimal plan spreads mutations out more. We also observe that the Nterminal region is so diverse that no run is produced there even with 8 mutations. Diversity optimization thus tends to create runs that are more evenly distributed in the large Cterminal region.
Figure 9 illustrates the effectiveness of allocated codons and mutations in terms of the optimizing the different objectives. We again see linear trends for the four objectives (as discussed with GAR transformylases, diversity is not directly comparable over different library sizes). Thus even with these two extremely diverse parents, it is possible to select mutations according to a specified objective, and our algorithm does so.
Figure 10 illustrates how well the different objectives do at producing runs, and how many nucleotides comprise those runs. For the sake of space (and as with GAR transformylases), we only illustrate under a 9 nt threshold for a homologous region to count as a run, but we found exactly the same trends with other thresholds from 7 to 12. Once again, explicitly optimizing for runs proves to be much more effective at producing runs (and more nucleotides in them) than does indirectly optimizing for runs by overall nucleotide identity or the nearestneighbor approximation to change in free energy of annealing. With these diverse parents, the indirect objectives do not happen to produce many runs or nucleotides in runs; in fact, they produce almost no runs of longer lengths (e.g., 12, not shown), even with 10 mutations.
Figure 10. DNA polymerase run trends. The run trends for the DNA polymerases are shown as in Figure 6.
Even with such diverse parents, there is sufficient freedom in the codon and mutation choices that different runoptimal plans yield different levels of chimera diversity. Figure 11 illustrates some such plans, at two mutation levels. The three plans at the same mutation level have the same number of runs but increasing diversity (and increasingly more even distributions of run) from top to bottom.
Figure 11. DNA polymerase plans with different diversity values. The plans for DNA polymerases with different diversity values (as in Figure 7).
Betalactamases
Our third case study examines the effect of wildtype sequence identity. Betalactamases, which hydrolyze the betalactam found in certain antibiotics (e.g., penicillin), have been the object of much chimeragenesis work, including DNA shuffling [1] and sitedirected methods [11]. We previously developed a multiple sequence alignment (272 residues and gaps) of diverse betalactamases [15]. For the present study, we considered (a) the common betalactamase targets TEM1 (E. coli) and PSE4 (Pseudomonas aeruginosa) (42% amino acid identity); (b) the even more diverse pair from P. aeruginosa and Bacillus licheniformis (26% id); (c) the more similar pair from E. coli and Proteus mirabilis (47% id).
Optimizing the wildtype amino acid sequences for common nucleotides yields DNA identity of (a) 70%, (b) 61%, and (c) 73%. These numbers are somewhat borderline for standard DNA shuffling. They do result in some runs, though generally fewer than when directly optimizing for runs, a trend that widens with more mutations. Optimizing for runs yields the same behavior as observed for the previous two cases; due to lack of space, we only present the free energy score and the number of nucleotides in runs under the f_{1 }metric (Figure 12). We again see the linear tread for both objectives with increasing mutations from 0 to 10. The actual energy score and run score both depend on parental sequence identity, with the same ranking on both metrics.
Figure 12. Betalactamase objective function trends. Shown are the objective function trends (left: ; right: f_{1 }at θ = 9) for betalactamase plans with 0 to 10 mutations, for pairs of betalactamase parents from (red) E. coli and P. aeruginosa, (blue) P. aeruginosa and B. licheniformis, and (magenta) E. coli and P. mirabilis.
Conclusion
DNA shuffling is a staple of protein engineering, and we have demonstrated that our new algorithms can substantially improve the expected productivity of an experiment. Even without performing any mutations, we are able to allocate codons to better form runs. By performing a small number of conservative substitutions, not expected to significantly affect stability or activity, we generally are able to increase the number of runs and the number of nucleotides in runs, linearly with the number of substitutions. Finally, since we are establishing runs whose lengths are sufficient to promote regular recombination, we can enhance our optimization to account for properties of the resulting chimeric library. Future directions include extending run optimization to incorporate the type of potential underlying (i.e., accounting for differences in nucleotide content), to optimize multiple parents simultaneously, and to integrate CODNS within our Paretooptimization framework [29] in order to optimize productivity of shuffling in concert with other properties. While both extensions will increase the computational expense, the resulting gain in experimental efficiency could be well worth it. In summary, our methods yield a new approach to DNA shuffling that supports substantially more diverse parents, is more deterministic, and generates more predictable and more diverse chimeric libraries.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LH, AMF, and CBK developed the approach; LH, and CBK designed the algorithms, LH implemented the algorithms and collected the results; LH, AMF, and CBK analyzed the results and wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
This work is supported in part by US NSF grant IIS1017231.
References

Stemmer WPC: Rapid evolution of a protein in vitro by DNA shuffling.
Nature 1994, 370:389391. PubMed Abstract  Publisher Full Text

Stemmer WPC: DNA shuffling by random fragmentation and reassembly: in vitro recombination for molecular evolution.
Proc Natl Acad Sci USA 1994, 91:1074710751. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Littlehales C: Profile: Willem 'Pim' Stemmer.
Nat Biotechnol 2009, 27:220. PubMed Abstract  Publisher Full Text

Crameri A, Raillard SA, Bermudez E, Stemmer W: DNA shuffling of a family of genes from diverse species accelerates directed evolution.
Nature 1998, 391:288291. PubMed Abstract  Publisher Full Text

Chang C, Chen T, Cox B, Dawes G, Stemmer W, Punnonen J, Patten P: Evolution of a cytokine using DNA family shuffling.
Nat Biotechnol 1999, 17:793797. PubMed Abstract  Publisher Full Text

Ness J, Welch M, Giver L, Bueno M, Cherry J, Borchert T, Stemmer W, Minshull J: DNA shuffling of subgenomic sequences of subtilisin.
Nat Biotechnol 1999, 17:893896. PubMed Abstract  Publisher Full Text

Christians F, Scapozza L, Crameri A, Folkers G, Stemmer W: Directed evolution of a thymidine kinase for AZT phophorylation using DNA family shuffling.
Nat Biotechnol 1999, 17:259264. PubMed Abstract  Publisher Full Text

Ostermeier M, Shim JH, Benkovic SJ: A combinatorial approach to hybrid enzymes independent of DNA homology.
Nat Biotechnol 1999, 17:12051209. PubMed Abstract  Publisher Full Text

Sieber V, Martinez CA, Arnold FH: Libraries of hybrid proteins from distantly related sequences.
Nat Biotechnol 2001, 19:456460. PubMed Abstract  Publisher Full Text

O'Maille PE, Bakhtina M, Tsai MD: Structurebased combinatorial protein engineering (SCOPE).
J Mol Biol 2002, 321:677691. PubMed Abstract  Publisher Full Text

Meyer MM, Silberg JJ, Voigt CA, Endelman JB, Mayo SL, Wang ZG, Arnold FH: Library analysis of SCHEMAguided protein recombination.
Protein Sci 2003, 12:16861693. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ye X, Friedman A, BaileyKellogg C: Hypergraph model of multiresidue interactions in proteins: sequentiallyconstrained partitioning algorithms for optimization of sitedirected protein recombination.
J Comput Biol 2007, 14:777790.
Conference version: Proc. RECOMB, 2006, pp. 1529
PubMed Abstract  Publisher Full Text 
Zheng W, Ye X, Friedman AM, BaileyKellogg C: Algorithms for selecting breakpoint locations to optimize diversity in protein engineering by sitedirected protein recombination.
Comput Syst Bioinformatics Conf 2007, 6:3140. PubMed Abstract  Publisher Full Text

Zheng W, Friedman AM, BaileyKellogg C: Algorithms for joint optimization of stability and diversity planning combinatorial libraries of chimeric proteins.
J Comput Biol 2009, 16:11511168.
Conference version: Proc. RECOMB, 2008, pp. 300314
PubMed Abstract  Publisher Full Text 
Saftalov L, Smith P, Friedman A, BaileyKellogg C: Sitedirected combinatorial construction of chimaeric genes: general method for optimizing assembly of gene fragments.
Proteins 2006, 64(3):629642. PubMed Abstract  Publisher Full Text

Avramova L, Desai J, Weaver S, Friedman A, BaileyKellogg C: Robotic hierarchical mixing for the production of combinatorial libraries of proteins and small molecules.
J Comb Chem 2008, 10:6368. PubMed Abstract  Publisher Full Text

Moore G, Maranas C: eCodonOpt: a systematic computational framework for optimizing codon usage in directed evolution experiments.
Nucleic Acids Res 2002, 30:24072416. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Henikoff S, Henikoff JG: Amino acid substitutions from protein blocks.
Proc Natl Acad Sci USA 1992, 89:1091510919. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; 1998.

Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations.
J Mol Biol 2002, 320:369387. PubMed Abstract  Publisher Full Text

Sun F: Modeling DNA shuffling.
J Comput Biol 1999, 6:7790. PubMed Abstract  Publisher Full Text

Moore GL, Maranas CD, Lutz S, Benkovic SJ: Predicting crossover generation in DNA shuffling.
Proc Natl Acad Sci USA 2001, 98:32263231. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Maheshri N, Schaffer D: Computational and experimental analysis of DNA shuffling.
Proc Natl Acad Sci USA 2003, 100:30713076. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

SantaLucia J Jr: A unified view of polymer, dumbbell, and oligonucleotide DNA nearestneighbor thermodynamics.
Proc Natl Acad Sci USA 1998, 95:14601465. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Villalobos A, Ness J, Gustafsson C, Minshull J, Govindarajan S: Gene Designer: a synthetic biology tool for constructing artificial DNA segments.
BMC Bioinformatics 2006, 7:285. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Nakamura Y, Gojobori T, Ikemura T: Codon usage tabulated from the international DNA sequence databases: status for the year 2000. [http://www.kazusa.or.jp/codon/] webcite
Nucleic Acids Res 2000, 28:292. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Guoy M, Gautier C: Codon usage in bacteria: correlation with gene expressivity.
Nucleic Acids Res 1982, 10:70557074. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Waterman MS, Byers TH: A dynamic programming algorithm to find all solutions in a neighborhood of the optimum.
Math Biosci 1985, 77:179188. Publisher Full Text

He L, Friedman AM, BaileyKellogg C: A divide and conquer approach to determine the Pareto frontier for optimization of protein engineering experiments.
Proteins 2012, 80(3):790806. PubMed Abstract  Publisher Full Text

Joern J: DNA shuffling. In Directed Evolution Library Creation: Methods and Protocols. Humana Press; 2003:8589. [Methods Mol Biol, vol 1.]