Department of Biochemistry, University of Western Ontario, London, Ontario, Canada

Department of Applied Mathematics, University of Western Ontario, London, Ontario, Canada

Abstract

Background

Unigenic evolution is a powerful genetic strategy involving random mutagenesis of a single gene product to delineate functionally important domains of a protein. This method involves selection of variants of the protein which retain function, followed by statistical analysis comparing expected and observed mutation frequencies of each residue. Resultant mutability indices for each residue are averaged across a specified window of codons to identify hypomutable regions of the protein. As originally described, the effect of changes to the length of this averaging window was not fully elucidated. In addition, it was unclear when sufficient functional variants had been examined to conclude that residues conserved in all variants have important functional roles.

Results

We demonstrate that the length of averaging window dramatically affects identification of individual hypomutable regions and delineation of region boundaries. Accordingly, we devised a region-independent chi-square analysis that eliminates loss of information incurred during window averaging and removes the arbitrary assignment of window length. We also present a method to estimate the probability that conserved residues have not been mutated simply by chance. In addition, we describe an improved estimation of the expected mutation frequency.

Conclusion

Overall, these methods significantly extend the analysis of unigenic evolution data over existing methods to allow comprehensive, unbiased identification of domains and possibly even individual residues that are essential for protein function.

Background

The completion of genome sequencing projects has led to the identification of novel proteins at an unprecedented rate

One innovative experimental approach with the capacity to identify domains and possibly even specific amino acid residues that are required for function is a genetic strategy known as unigenic evolution, developed by Deminoff ^{2 }.

The results of the statistical analysis described by Deminoff

Results

Since the goal of unigenic evolution is to identify residues that are critical to protein function

Since the mutational data generated by unigenic evolution contains both missense and silent nucleotide substitutions, an observed frequency of missense mutations (_{obs. missense}) for each codon in the protein can be calculated. Following Deminoff

This observed frequency of missense mutations can then be compared to the expected frequency of missense mutations for each corresponding codon. The expected frequency of missense mutation is calculated by observing that each codon has a characteristic potential for producing a silent or missense mutation given one nucleotide change. Looking at all possible single base changes, the expected frequency of missense mutations can be easily calculated. This "first pass" technique assumes that all single nucleotide substitutions are equally likely.
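As a minimal sketch of this "first pass" calculation, the Python snippet below enumerates all nine single-nucleotide changes of a codon and reports the fraction that alter the encoded amino acid. It assumes the standard genetic code and counts a change to a stop codon as non-silent; the name `first_pass_missense` is illustrative and does not come from the original analysis.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def first_pass_missense(codon: str) -> float:
    """Fraction of the 9 single-base changes that alter the amino acid,
    treating every substitution as equally likely."""
    changed = 0
    for pos, old in enumerate(codon):
        for new in BASES:
            if new == old:
                continue
            mutant = codon[:pos] + new + codon[pos + 1:]
            # Stop codons ('*') differ from any amino acid, so they count as changes.
            if CODON_TABLE[mutant] != CODON_TABLE[codon]:
                changed += 1
    return changed / 9


for codon in ("ATG", "GCC", "TGC"):
    print(codon, round(first_pass_missense(codon), 2))
```

Because every substitution is weighted equally here, the result depends only on the codon's position in the genetic code, which is the limitation the weighted approaches described next try to address.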

Deminoff _{exp. missense }for each codon can therefore be calculated by determining the frequencies of missense substitutions created by either transitions or transversions, and weighting these expectations by the observed frequency of transitions and transversions in the database

Normalizing the frequency of expected missense mutations (based on codon sequence) to the transition/transversion ratios observed in functional clones greatly improves the accuracy of the expected value. However, this technique assumes i) that all transitions (or all transversions) are equally likely and ii) that substitutions observed in functional clones are representative of all substitutions that occur in unigenic evolution. Since selection for viable mutants results in a bias toward mutations that are tolerated by functional molecules, we expect that each of these assumptions may result in some substitution frequencies that are over- or under-represented. We therefore extended the analysis presented in Deminoff

Mutation frequencies from a random pool

To remove any possible bias caused by considering only the mutations in functional clones, we analyzed a stratified random sample of protein clones (6 clones from each of three mutant libraries), where the libraries contained all functional and non-functional clones. We then determined the frequency of each substitution for each nucleotide in this random pool. The distribution of observed nucleotide substitutions in the random pool is given in Table

Nucleotide Substitutions in a Random Sample of 18 Unscreened Pin1 Clones.

| Nucleotide Substitution | # Observed |
|---|---|
| A to G | 38 |
| T to C | 35 |
| G to A | 25 |
| C to T | 28 |
| T to A | 11 |
| A to T | 16 |
| C to A | 8 |
| G to T | 5 |
| A to C | 4 |
| T to G | 0 |
| C to G | 0 |
| G to C | 1 |

Using the nucleotide substitution data observed within this random sample of clones, equations were formulated to calculate the expected frequency of missense mutations (_{exp. missense}) for each codon. The first step in this analysis is to estimate the underlying mutation rates for each base. The probability that a base, B, is replaced by substitution in one run through PCR is defined as m_{B}. Since the substitution may actually have occurred on the complementary strand, we treat a mutation from A to G, for example, as equivalent to a substitution from T to C. Thus the probability of mutating an adenine to any other nucleotide is given by:

where (#A-G) represents the number of A to G substitutions observed in the random pool of functional and non-functional clones (Table

Similarly, probabilities for individual nucleotide substitutions (denoted m_{A-G}, m_{A-C}, etc.) can be calculated using the same mutational data from the random pool. As a sample dataset, a summary of the nucleotide substitution rates observed in the random pool of clones is given in Table

Estimated mutation probabilities based on 18 unscreened Pin1 clones.

| Nucleotide | Mutation Probability |
|---|---|
| **m_{A} and m_{T}** | **0.0318** |
| m_{A-C} and m_{T-G} | 0.0012 |
| m_{A-G} and m_{T-C} | 0.0223 |
| m_{A-T} and m_{T-A} | 0.0082 |
| **m_{C} and m_{G}** | **0.0120** |
| m_{C-A} and m_{G-T} | 0.0023 |
| m_{C-G} and m_{G-C} | 0.0002 |
| m_{C-T} and m_{G-A} | 0.0095 |
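The pooling of complementary substitutions described above can be sketched in Python as follows. The counts are those observed in the random pool; the denominator `sites_screened` (the number of positions of the relevant base pair examined across all sequenced clones) is an assumption of this sketch, since the exact normalisation is not reproduced here, and the value 3270 used in the example call is purely illustrative.

```python
# Substitution counts observed in the random pool of 18 unscreened clones.
COUNTS = {
    "AG": 38, "TC": 35, "GA": 25, "CT": 28,
    "TA": 11, "AT": 16, "CA": 8, "GT": 5,
    "AC": 4, "TG": 0, "CG": 0, "GC": 1,
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pooled_count(old: str, new: str) -> int:
    """Count old->new together with its complementary-strand equivalent,
    e.g. A->G is pooled with T->C."""
    return COUNTS[old + new] + COUNTS[COMPLEMENT[old] + COMPLEMENT[new]]


def substitution_probability(old: str, new: str, sites_screened: int) -> float:
    """Estimated probability of one specific substitution (e.g. m_{A-G}).
    `sites_screened` is a hypothetical denominator supplied by the caller."""
    return pooled_count(old, new) / sites_screened


def base_probability(old: str, sites_screened: int) -> float:
    """Estimated probability that `old` is replaced by any other base (e.g. m_{A})."""
    return sum(
        substitution_probability(old, new, sites_screened)
        for new in "ACGT"
        if new != old
    )


# HYPOTHETICAL denominator, chosen only to illustrate the call.
at_sites = 3270
print(round(base_probability("A", at_sites), 4))
print(round(substitution_probability("A", "G", at_sites), 4))
```

Because complementary substitutions share one pooled count, `substitution_probability("A", "G", n)` and `substitution_probability("T", "C", n)` are identical by construction.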

Based on these mutation probabilities, the frequency of expected missense substitutions for each codon can be calculated. An amino acid with more than one codon such as Cys (TGC and TGT) will exhibit distinct _{exp. missense }values for each codon. For example, the frequency of expected missense substitutions for Cys (TGC) was calculated using the following equation:

Note that it is possible that some missense mutations may actually be nonsense (e.g. TGA). To avoid

Again, as a sample dataset, calculated values of _{exp. missense }for each codon in Pin1 are given in Table

Comparison of _{expected missense }values for Pin1 codons.

| Codon | _{expected missense }this study | _{expected missense }Deminoff et al. |
|---|---|---|
| Met (ATG) | 1.00 | 1.00 |
| Trp (TGG) | 1.00 | 1.00 |
| **Cys (TGC)** | **0.83** | **0.76** |
| **Asp (GAC)** | **0.83** | **0.76** |
| **Asp (GAT)** | **0.70** | **0.76** |
| **Glu (GAA)** | **0.70** | **0.76** |
| **Glu (GAG)** | **0.83** | **0.76** |
| **Phe (TTC)** | **0.87** | **0.76** |
| Phe (TTT) | 0.77 | 0.76 |
| **His (CAC)** | **0.83** | **0.76** |
| Lys (AAA) | 0.77 | 0.76 |
| **Lys (AAG)** | **0.87** | **0.76** |
| **Asn (AAC)** | **0.87** | **0.76** |
| **Gln (CAG)** | **0.83** | **0.76** |
| **Tyr (TAC)** | **0.87** | **0.76** |
| **Ile (ATC)** | **0.90** | **0.72** |
| Ala (GCC) | 0.67 | 0.67 |
| Ala (GCG) | 0.67 | 0.67 |
| **Gly (GGA)** | **0.43** | **0.67** |
| Gly (GGC) | 0.67 | 0.67 |
| Gly (GGG) | 0.67 | 0.67 |
| **Gly (GGT)** | **0.43** | **0.67** |
| **Pro (CCA)** | **0.43** | **0.67** |
| Pro (CCC) | 0.67 | 0.67 |
| Pro (CCG) | 0.67 | 0.67 |
| **Pro (CCT)** | **0.43** | **0.67** |
| **Val (GTC)** | **0.78** | **0.67** |
| **Val (GTG)** | **0.78** | **0.67** |
| **Thr (ACC)** | **0.78** | **0.67** |
| **Thr (ACG)** | **0.78** | **0.67** |
| **Thr (ACT)** | **0.58** | **0.67** |
| **Ser (AGC)** | **0.83** | **0.76** |
| **Ser (AGT)** | **0.70** | **0.76** |
| **Ser (TCC)** | **0.78** | **0.67** |
| **Ser (TCT)** | **0.58** | **0.67** |
| **Ser (TCA)** | **0.58** | **0.67** |
| **Ser (TCG)** | **0.78** | **0.67** |
| **Leu (CTG)** | **0.61** | **0.43** |
| **Leu (CTC)** | **0.78** | **0.67** |
| Arg (AGA) | 0.69 | 0.72 |
| **Arg (AGG)** | **0.81** | **0.72** |
| **Arg (CGA)** | **0.39** | **0.62** |

Boldface indicates rows in which the two methods differ by 5% or more.
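The following Python sketch shows one way the per-codon expected missense frequency could be computed from the substitution probabilities tabulated earlier. It assumes that the expected frequency is the summed probability of amino-acid-changing substitutions divided by the summed probability of all single-base substitutions in the codon, with changes to a stop codon counted as non-silent. This weighting is inferred from the tabulated values rather than taken from a stated formula; it reproduces several "this study" entries (for example Cys TGC, Gly GGA and Leu CTG) but is not guaranteed to match every row.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Per-substitution probabilities from the random pool; complementary
# substitutions (e.g. A->G and T->C) share one value.
M = {
    ("A", "C"): 0.0012, ("T", "G"): 0.0012,
    ("A", "G"): 0.0223, ("T", "C"): 0.0223,
    ("A", "T"): 0.0082, ("T", "A"): 0.0082,
    ("C", "A"): 0.0023, ("G", "T"): 0.0023,
    ("C", "G"): 0.0002, ("G", "C"): 0.0002,
    ("C", "T"): 0.0095, ("G", "A"): 0.0095,
}


def expected_missense(codon: str) -> float:
    """ASSUMED form: P(amino-acid-changing substitution) / P(any substitution)."""
    changing = total = 0.0
    for pos, old in enumerate(codon):
        for new in BASES:
            if new == old:
                continue
            p = M[(old, new)]
            mutant = codon[:pos] + new + codon[pos + 1:]
            total += p
            if CODON_TABLE[mutant] != CODON_TABLE[codon]:
                changing += p
    return changing / total


for codon in ("TGC", "GGA", "CTG", "GAT"):
    print(codon, round(expected_missense(codon), 2))
```

Under this reading, synonymous codons such as GGA and GGC receive different expectations because the base at the third position has a different underlying mutation probability.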

Identifying hypomutable regions

The observed frequency of missense mutations for each codon in the protein calculated from equation (1) can be compared to the expected frequency (Table

to determine the hypomutability of the residue. The value of H will range between 0 and -1, where -1 reflects maximal hypomutability and occurs when we observe zero missense mutations; zero occurs when the observed frequency equals the expected frequency.

When the observed missense frequency is greater than the expected value, however, H ranges between zero and M = (1-_{exp. missense})/_{exp. missense}. Since M is a (possibly large) number that varies from one codon to the next, we normalize hypermutability in this case by dividing H by the theoretical maximum, M, for that residue. This normalized hypermutability has a minimum of zero and a maximum value of +1, which only occurs if all mutations observed are missense mutations. To plot these results, the mutability of individual residues is averaged over a window of a specified number of codons. The average hypo- or hypermutability is then plotted in the center of the specified window, and the window is shifted downstream one codon at a time. Note that this normalization and plotting technique, although described in different terms, is equivalent to the method previously described by Deminoff
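The normalisation and window averaging described above can be sketched in Python as follows. The sketch assumes H = (observed − expected) / expected, which is the form consistent with the quoted ranges (−1 when no missense mutations are observed, 0 when observed equals expected, and M at the upper limit); the function names are illustrative only.

```python
def mutability(f_obs: float, f_exp: float) -> float:
    """Normalised mutability of one codon, in the range [-1, +1].

    ASSUMES H = (f_obs - f_exp) / f_exp. Hypomutable values are returned
    unchanged; hypermutable values are divided by M = (1 - f_exp) / f_exp.
    """
    h = (f_obs - f_exp) / f_exp
    if f_obs <= f_exp:
        return h
    m = (1.0 - f_exp) / f_exp  # f_obs > f_exp implies f_exp < 1, so m > 0
    return h / m


def windowed_mutability(values, window: int):
    """Average per-codon mutability over a centred window of odd length.

    Returns (centre_index, mean) pairs for every position where the full
    window fits, shifting one codon at a time.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number of codons")
    half = window // 2
    return [
        (centre, sum(values[centre - half:centre + half + 1]) / window)
        for centre in range(half, len(values) - half)
    ]


per_codon = [mutability(o, e) for o, e in [(0.0, 0.76), (0.76, 0.76), (1.0, 0.76)]]
print(per_codon)
print(windowed_mutability(per_codon, 3))
```

A window of length 1 simply returns the per-codon values, which makes it easy to compare the raw and smoothed profiles for different window lengths.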

Since no objective means for choosing the length of the averaging window have been established, we investigated the dependence of our results on this length. Accordingly, we applied the procedure described above with window lengths ranging from 1–25 codons to the sample dataset (Figure

**Mutability plots**. Mutability plots were determined as described in the text (grey bars). The mutability of individual residues was averaged over a window of 1, 5, 11 or 25 codons. The hypo- or hypermutability was then plotted as a bar in the center of the specified window and the window was shifted downstream one codon at a time. Individual hypomutable regions, designated A, B, C, and D are indicated on the plot for the 11 codon window. For comparison, the difference between mutability calculated by previous methods (5) and mutability as described in this manuscript is also shown (circles).

Determining region boundaries and significance

In addition to the influence of the window size on the number of hypomutable regions, it is important to recognize that the boundaries of each hypomutable region are not always clearly defined. To determine the significance of putative hypomutable regions and to define their boundaries statistically, a chi-square (χ^{2}) analysis of each region is performed. For a series of residues corresponding to each apparently hypomutable region, χ^{2 }can be determined using the expression:

In this equation, the total number of expected missense mutations for each hypomutable region is calculated by multiplying the total number of mutations (silent + missense) observed in a hypomutable region by the average value of _{exp. missense }for all codons within the region; the expected number of silent mutations in the region is calculated similarly. The Yates correction has been used
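A minimal Python sketch of this calculation is given below. It assumes the usual Yates-corrected form, in which 0.5 is subtracted from each absolute difference between observed and expected counts before squaring, summed over the missense and silent categories; the function name and the clamping of the corrected difference at zero are choices made for this sketch.

```python
def yates_chi_square(obs_missense: int, obs_silent: int, mean_f_exp: float) -> float:
    """Yates-corrected chi-square for one candidate region (1 degree of freedom).

    Expected counts follow the description in the text: the total number of
    observed mutations multiplied by the mean expected missense frequency of
    the codons in the region (and by its complement for silent mutations).
    """
    total = obs_missense + obs_silent
    exp_missense = total * mean_f_exp
    exp_silent = total * (1.0 - mean_f_exp)
    if exp_missense <= 0 or exp_silent <= 0:
        raise ValueError("both expected counts must be positive")
    chi2 = 0.0
    for obs, exp in ((obs_missense, exp_missense), (obs_silent, exp_silent)):
        # Clamp at zero so that tiny deviations are not inflated by the correction.
        chi2 += max(0.0, abs(obs - exp) - 0.5) ** 2 / exp
    return chi2


# Example: 20 mutations in a region, only 5 of them missense, mean expectation 0.75.
print(round(yates_chi_square(5, 15, 0.75), 2))
```

The statistic alone does not say whether a region is hypo- or hypermutable, so the direction of the deviation (observed missense below or above expected) has to be checked separately.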

Deminoff ^{2 }value for each series of residues until

Using this technique, however, we found that delineation of the boundaries of each region was somewhat arbitrary. For example, a region of 8 residues might be significant. If the 9^{th }residue is added to the region, significance is lost. However, if both the 9^{th }and 10^{th }residues are added, significance is regained. Similarly, we found that region boundaries were sensitive to the initial choice of "central" residues, and to the direction in which the region was first expanded.

We therefore developed a region-independent method for identifying significant hypomutable regions. In this method, we consider region lengths of up to 50 residues. For each region length, we calculate χ^{2 }for every possible region of that length in our sequence. This again corresponds to sliding a window of the appropriate length along the sequence; in this case, however, we are not averaging across the window but computing the significance of the region within the window as a whole. Thus, for example, computing the χ^{2 }value for an 11-codon region is not equivalent to computing the average hypomutability in a window of the same length.
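The exhaustive scan over region lengths and start positions can be sketched in Python as follows. The sketch assumes per-codon counts of missense and silent mutations are available as lists, uses a Yates-corrected chi-square with one degree of freedom, and keeps only regions whose observed missense count falls below expectation. The threshold 7.879 is the standard critical value for α = 0.005 at one degree of freedom; the function name and data layout are illustrative.

```python
CRITICAL_0_005 = 7.879  # chi-square critical value, 1 d.f., alpha = 0.005


def scan_regions(missense, silent, f_exp, max_len=50, critical=CRITICAL_0_005):
    """Return {(start, length): chi2} for every significantly hypomutable region.

    missense, silent : per-codon observed mutation counts
    f_exp            : per-codon expected missense frequency
    """
    n = len(f_exp)
    hits = {}
    for length in range(1, min(max_len, n) + 1):
        for start in range(n - length + 1):
            end = start + length
            n_mis = sum(missense[start:end])
            n_sil = sum(silent[start:end])
            total = n_mis + n_sil
            mean_exp = sum(f_exp[start:end]) / length
            exp_mis = total * mean_exp
            exp_sil = total - exp_mis
            if exp_mis <= 0 or exp_sil <= 0:
                continue  # chi-square undefined for this region
            if n_mis >= exp_mis:
                continue  # not hypomutable
            chi2 = sum(
                max(0.0, abs(o - e) - 0.5) ** 2 / e
                for o, e in ((n_mis, exp_mis), (n_sil, exp_sil))
            )
            if chi2 > critical:
                hits[(start, length)] = chi2
    return hits


# Toy data: codons 2 and 3 carry only silent mutations.
hits = scan_regions([3, 3, 0, 0, 3, 3], [1, 1, 4, 4, 1, 1], [0.75] * 6)
for (start, length), chi2 in sorted(hits.items()):
    print(start, length, round(chi2, 2))
```

Because each region is tested as a whole, a given residue can belong to significant regions of several lengths, which is what makes a three-dimensional display of χ^{2 }against position and region length informative.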

With these values in hand, we produce a 3-dimensional plot of the χ^{2 }value for each region of every length (Figure ^{2 }value exceeded the ^{2 }does not distinguish between significantly hypo- or hypermutable regions, thus we only plot χ^{2 }for significantly hypomutable regions (i.e. if the observed number of missense mutations is less than expected). The figure reveals four hypomutable regions, corresponding relatively closely to regions A through D described above. We find that region A is only significant for fairly short region lengths. Region B is significant over a wider range of lengths, and the χ^{2 }value associated with the region changes depending on region length. In region C, we observe the effect described previously: regions of length 17 and 18 are significant, but significance is lost for regions of length 19 through 24. If the region is expanded to lengths of 25 through 27, however, significance is regained. Region D is significant for almost any region length, reaching its highest χ^{2 }values for region lengths of about 25 residues.

**Region-independent chi-square analysis**. Three-dimensional plot illustrating χ^{2 }of each significantly hypomutable region plotted against region length and amino acid residue number. Calculations were performed as described in the text. Only regions that are significant at the 0.005 level are plotted; the whole window is plotted whenever this significance level is achieved. If a residue is involved in more than one significant region of the same length, the region with the highest χ^{2 }value is plotted. Colours indicate χ^{2 }of the region and range from deep red (χ^{2 }> 15, corresponding to α < 0.0001) to pale green (χ^{2 }> 8, α < 0.005). Four hypomutable regions approximating regions A-D (Figure 1) are evident.

Although a plot such as Figure

Significance of non-mutated codons

As stated previously, the observed hypomutability of residues or regions within a protein could result from selection against mutations located within essential regions, through differences in the mutation frequency of various codons, or simply by chance. The overall goal of unigenic evolution is to identify residues for which the first of these factors is important ^{F}. Thus

As with the expected mutation rates calculated in previous sections,

_{Met }= (1- m_{A})(1- m_{T})(1- m_{G})

For an amino acid with multiple codons such as cysteine (TGC and TGT)

_{Cys (TGC) }= (1- m_{T})(1- m_{G})(1- (m_{C-G }+ m_{C-A}))

The latter equation follows from the fact that a mutation of the first two nucleotides (T and G) to any other nucleotide results in an amino acid change, while a mutation of the third base results in a missense substitution only if mutated to a G or A.
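The per-codon products above generalise to any codon, as in the Python sketch below. For each position it sums the probabilities of the substitutions that would change the amino acid (a change to a stop codon is counted as a change, as in the cysteine example) and multiplies the complements. The extension to `n_clones` independent clones, by raising the per-clone probability to that power, is an assumption of this sketch, and `n_clones` is a hypothetical parameter.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Per-substitution probabilities estimated from the random pool.
M = {
    ("A", "C"): 0.0012, ("T", "G"): 0.0012,
    ("A", "G"): 0.0223, ("T", "C"): 0.0223,
    ("A", "T"): 0.0082, ("T", "A"): 0.0082,
    ("C", "A"): 0.0023, ("G", "T"): 0.0023,
    ("C", "G"): 0.0002, ("G", "C"): 0.0002,
    ("C", "T"): 0.0095, ("G", "A"): 0.0095,
}


def prob_no_missense(codon: str) -> float:
    """Probability that one clone carries no amino-acid-changing substitution
    in this codon, treating the three positions as independent."""
    prob = 1.0
    for pos, old in enumerate(codon):
        p_change = 0.0
        for new in BASES:
            if new == old:
                continue
            mutant = codon[:pos] + new + codon[pos + 1:]
            if CODON_TABLE[mutant] != CODON_TABLE[codon]:
                p_change += M[(old, new)]
        prob *= 1.0 - p_change
    return prob


def prob_conserved_by_chance(codon: str, n_clones: int) -> float:
    """ASSUMPTION: clones are independent, so the codon escapes missense
    mutation in all of them with probability prob_no_missense ** n_clones."""
    return prob_no_missense(codon) ** n_clones


for codon in ("ATG", "TGC", "GGC"):
    print(codon, round(prob_no_missense(codon), 4))
```

Codons whose third position is fully degenerate, such as GGC, have a higher per-clone probability of escaping missense mutation, so conservation of such residues is less informative than conservation of Met or Trp.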

Once again we provide a sample dataset from the unigenic evolution of Pin1 in Table

Summary of non-mutated residues and corresponding

| Non-mutated residue | Value |
|---|---|
| Gly 10 (GGC) | 0.135* |
| Trp 11 (TGG) | 0.010 |
| Arg 21 (CGA) | 0.013 |
| Asn 30 (AAC) | 0.004 |
| Ala 31 (GCC) | 0.135* |
| Ser 32 (AGC) | 0.020 |
| Pro 52 (CCT) | 0.135* |
| Val 55 (GTC) | 0.026 |
| Cys 57 (TGC) | 0.021 |
| His 59 (CAC) | 0.020 |
| Leu 60 (CTG) | 0.058* |
| Leu 61 (CTG) | 0.058* |
| Lys 63 (AAG) | 0.004 |
| His 64 (CAC) | 0.020 |
| Ser 67 (TCA) | 0.026 |
| Trp 73 (TGG) | 0.010 |
| Arg 74 (CGG) | 0.164* |
| Arg 80 (CGG) | 0.164* |
| Glu 84 (GAG) | 0.020 |
| Ala 85 (GCC) | 0.135* |
| Tyr 92 (TAC) | 0.004 |
| Gly 99 (GGA) | 0.135* |
| Leu 106 (CTG) | 0.058* |
| Ser 108 (TCA) | 0.026 |
| Ser 111 (AGC) | 0.026 |
| Asp 112 (GAC) | 0.020 |
| Cys 113 (TGC) | 0.021 |
| Ser 115 (TCA) | 0.026 |
| Ala 116 (GCC) | 0.135* |
| Gly 120 (GGA) | 0.135* |
| Leu 122 (CTG) | 0.058* |
| Ala 137 (GCC) | 0.135* |
| Glu 145 (GAG) | 0.020 |
| Met 146 (ATG) | 0.002 |
| Ser 147 (AGC) | 0.026 |
| Val 150 (GTG) | 0.026 |
| Gly 155 (GGC) | 0.135* |
| His 157 (CAC) | 0.020 |
| Thr 162 (ACT) | 0.025 |

*

For the 24 residues with

Discussion

Building on previous analytical work in this area

To refine the analysis of Deminoff

Comparison of _{exp. missense }values for all codons in our sample data set revealed that this was indeed the case (Table _{exp. missense }values of most codons differed. Specifically, the expected frequency of missense mutations differed by 5% or more for 31 of 42 codons. This indicates that mutation rates in the population of functional clones were not necessarily representative of mutation rates within the entire library of mutant alleles (both functional and non-functional).

Interestingly, an initial comparison of mutability plots generated using data from the random pool of clones (this study) to a plot generated using the transition/transversion ratio in functional clones _{exp. missense }values were calculated using data from the random pool. The

The data presented in Figure

We believe that another major practical contribution of this work is the derivation of ^{F}, where _{C-G }and m_{G-C }in Table

Throughout this study, we used data obtained from the unigenic evolution of the peptidyl-prolyl isomerase Pin1 to illustrate our techniques. We used unigenic evolution in this case because the enzyme is highly conserved in eukaryotic organisms, and it was therefore difficult to identify functionally critical residues from a sequence alignment. The unigenic evolution strategy represents an unbiased approach that makes no a priori assumptions about which residues should be subjected to mutagenesis; furthermore, because residues other than alanine can be substituted in non-critical positions, new information about the amino acid chemistry required at each position is obtained. In the case of Pin1, unigenic evolution revealed four hypomutable regions, defined using the methods outlined in the current manuscript. Two of these functionally critical regions were subjected to saturating mutagenesis using random oligonucleotides, and functional clones were selected

Conclusion

Based on the results that we obtained in our experimental dataset, it can be readily envisaged that unigenic evolution together with the statistical methods that are described in this paper will be a powerful strategy for elucidating functional domains and, in some cases, specific residues that are essential for protein function.

Methods

Construction of libraries of Pin1 variants

Three independent libraries encoding variants of the human Pin1 cDNA

Isolation of functional Pin1 variants

Following mutagenesis, each of the three libraries was cloned into a yeast expression vector (pY204) to allow for selection of functional variants of Pin1 in the yeast strain YKH100 (

Authors' contributions

CDB carried out the experimental work for the unigenic evolution and participated in statistical analysis and drafting the manuscript. CJB, DWL and BHS jointly designed the experimental study and participated in improving the statistical analysis and drafting the manuscript. LMW designed the statistical method, participated in statistical analysis and drafted the manuscript. All authors have read and approved the manuscript.

Acknowledgements

This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (to L.M.W.) and a Collaborative Health Research Grant from the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research as well as the Ontario Cancer Research Network (to D.W.L., C.J.B. and B.H.S.). We thank Dr. Steven Hanes (Wadsworth Center, Albany NY) for providing the yeast strain harboring the Ess1 disruption.