Department of Genome Sciences, University of Washington, Seattle, WA, USA

School of Mathematics and Statistics, University of Sydney, Sydney, Australia

Department of Biology, Ithaca College, Ithaca, NY, USA

Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA

Abstract

Background

The characterization of DNA replication origins in yeast has shed much light on the mechanisms of initiation of DNA replication. However, very little is known about the evolution of origins or the evolution of mechanisms through which origins are recognized by the initiation machinery. This lack of understanding is largely due to the vast evolutionary distances between model organisms in which origins have been examined.

Results

In this study we have isolated and characterized autonomously replicating sequences (ARSs) in

Conclusions

Our findings demonstrate a replication initiation system with novel features and underscore the functional diversity within the budding yeasts. Furthermore, we have developed new approaches for analyzing biologically functional DNA sequences with ill-defined motifs.

Background

Eukaryotic DNA replication initiates at loci known as origins of replication. In budding and fission yeast origins are short sequences (< 1 kb) that allow autonomous replication of episomal plasmids

In

Due to the large evolutionary distance between

To further explore the mechanisms of DNA replication initiation among the budding yeast species, we have investigated the structure of ARSs from another pre-WGD yeast -

Results

The Isolation of

In order to identify functional ARSs in


**Screen to isolate L. kluyveri ARSs**. (A)

Previous work in

**plasmids were transformed into S. cerevisiae, L. kluyveri, and K. lactis and assayed for ARS function in the different species**. The 'ARS source' column denotes the origin of the ARS, while the 'functions in' column denotes proportion of ARSs that are functional in the listed species.

**Supplementary Tables**. The supplementary tables associated with this study.


Identification of the

Our findings raise questions about the molecular determinants of ARS function in these three species. Are the observed differences in the level of ARS selectivity among the three yeast species reflected in the


**Identification of the LkACS motif**. (A) Position Weighted Matrix logos of putative ACS motifs for

Since the longer motifs are largely T-rich, we tested the sufficiency of A/T rich DNA for

We note that all the related

Truncation and Mutagenesis Analysis

In


**Truncation of LkARSs to narrow down essential functional regions**. (A-C) Sub-fragments of three

**Supplementary Figures**. The supplementary figures associated with this study. **Figure S1**. Additional **Figure S2**. The reduced predictive power of the 11 bp **Figure S3**. **Figure S4**. Nucleotide distributions surrounding functionally relevant ACS motifs in


We isolated functional fragments of three of the


**Mutagenesis of LkARSs to identify sequences necessary for LkARS function**. (A-C) the shortest functional fragments of the three

If the putative

We found that our putative

The selectivity of the ACS PWM models

Going back to our observation of the different levels of selectivity of the replication machinery in our three yeast species (Figure

As an aside we note that our definition of selectivity of a PWM is highly correlated with its information content

The Predictive Power of ACS Motifs for ARS Function

To assess the effectiveness of our ACS models in predicting ARS functionality in their respective species we use an extended version of our ARS interchangeability data presented earlier. Specifically, for each ACS model we test how well it predicts the host species functionality (functional, non-functional, or weak) of the set of "foreign" ARSs. The latter set consists of all the ARSs that were originally screened in one of the other two species. For example, in the case of

We used several correlated measures to gauge the predictive power of each species PWM model: a 2-class aROC (area under the ROC curve) that measures how well the model can distinguish between functional and non-functional (foreign) ARSs, a 3-class aROC that measures how well it can distinguish between all three functional categories (non-, weak and functional) and a measure of how well the weak ARSs are placed between the non-functional and functional ones (see Methods for details of all three measures).

All three measurements gave the same consistent answer: the predictive power of the ACS motif is highest in

The rightmost column of Additional File

The last observation as well as its generally low predictive value raise the question of whether our 9 bp

Analysis of auxiliary sequence elements

Another possible explanation for the weak predictivity of our

Nevertheless, we tested whether an augmented sequence model that combines any of these auxiliary motifs with our ACS model could offer a significantly improved predictivity of


**Extended sequence models**. Graphical representation of the three linear weight models we studied that factor in sequence information beyond the ACS. The paired linear model (A) uses an auxiliary motif in addition to the ACS PWM: the overall score is the weighted sum of the individual (disjoint) match scores. The contextual PWM model (B) consists of the weighted sum of the ACS match and the adjacent matches to the contextual PWMs. The latter are learned from the sites flanking the ACS sites in the alignment of the native ARSs. The Markov contextual model (C) combines the ACS match with the log of the Markov likelihood of the adjacent segments (normalized by an iid background model). The contextual Markov models are learned from the alignment of the native ARSs.

The estimated aROCs in Additional File

Alternative approaches to utilizing extended sequence information for predicting origin locations were introduced by Breier

PWM contextual model

Breier

We compared the extent to which the patterns we observe in these extended profiles can improve our prediction of foreign ARS function in each of our three yeast species. For each species, we divided each of the extended sequence profiles into segments based on visual inspection, with one segment reserved for the ACS itself, and estimated a PWM for each segment. A candidate match is then scored as a weighted sum of PWM match scores, one for each segment (Figure

We again used cross validation to learn the optimal weights for each species and evaluate the corresponding predictivity of the resulting combined model (see Methods for details). The results of this analysis are presented in Additional File

Unlike Breier

To test whether the overall span of the

Taken together our contextual PWM model provides statistically sound evidence that ARS activity in

Markov contextual model

Recently MacAlpine

We begin as in the PWM contextual model by partitioning the alignment of the native ARSs into segments, with one segment reserved for the ACS. We then learn a low order Markov chain (3rd order was most commonly used) for each of the alignment segments instead of the PWM we previously used to model that segment. Consequently, each non-ACS segment is scored using its Markov chain likelihood and the score of the whole match is again the weighted sum of the scores of all the segments including the ACS segment, which still uses a PWM (Figure

Discussion/Conclusions

The study of eukaryotic replication origins in budding yeast has been largely limited to the well-studied

We used the classic ARS screen

Despite this similarity at the ACS level, replication origins in

We have evidence that supports this flat response curve hypothesis. First, the same computational models that show very high accuracy in predicting ARS function in

This behavior of

A difference in DNA sequence requirements must correspond to a difference in the protein machinery that interacts with origin DNA. The stochastic mechanism used by

Finally, we caution that while we consistently used a 9 bp model for the ^{1 }and it has the highest predictive power among those models. Moreover, we constructed several contextual models based on other putative ACS motifs (including length 11, 13 and 16) and none of these models had a higher predictive power than the 9 bp based model.

Methods

Construction of vector pIL07

As described

Construction of

Genomic DNA from the sequenced diploid FM479 strain of

Screening of

The plasmid libraries were used to transform a

**ARSs used in this study**. A list of coordinates and functional information of the


Cloning, Truncation and Mutagenesis of ARS sequences

As described

Analysis of ARS interchangeability data

We categorized the functionality of each

We assessed the statistical significance of the number of

The hypergeometric test measures the level of surprise of the size of the intersection between two sets, in our case the set of

We repeated this test for evaluating the size of the set of

Identification of ACS position weight matrix (PWM)

The

We used the given annotations to generate a file of

When searching for the

The "

To define the

Assessing the selectivity of a PWM

We define the selectivity of a motif as its ability to distinguish between sites generated according to the motif model (a PWM in this case) and sites occurring in "random DNA". Specifically, given a list of "real sites" of length l generated by the model and a list of N null-generated l-mers, we can ask how many null sites versus real sites score above a threshold, which we vary. This is a special case of an ROC (receiver operating characteristic) curve and as such it is summarized by the area under the curve (aROC). A maximally selective PWM will have an aROC of 1 (all real sites score higher than all null sites) whereas a non-selective PWM will have an aROC of 0.5 (a real site is equally likely to score higher or lower than a null site).

For each species putative ACS motif (we used the 9 bp ^{2}.

We then generated 200,000 "null sites" by sampling a sequence of length L (where L = 100,000 + width of motif - 1) from a background 4th order Markov chain trained on the species' set of intergenic sequences. The null sites were defined by the list of all the words in this string and its reverse complement. We then scored the 200 sampled "real sites" as well as these 200,000 sampled "null sites" using an LLR (log-likelihood ratio) score: a site score is the log of the ratio of the likelihood of the site under the PWM model over the likelihood of the site under the background Markov model. The slight twist we introduced here is to average the background likelihood over the two "strands", i.e., over the site and its reverse complement.
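The strand-averaged LLR score described above can be illustrated with a minimal sketch. It is not the code used in the study: the function names are hypothetical, the PWM is assumed to have strictly positive entries (for example after adding pseudocounts), and a simple iid background stands in for the 4th order Markov chain purely to keep the example short.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(site):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(site))

def pwm_likelihood(site, pwm):
    """Likelihood of a site under a PWM given as a list of
    {base: probability} dicts, one dict per motif position."""
    likelihood = 1.0
    for base, column in zip(site, pwm):
        likelihood *= column[base]
    return likelihood

def iid_likelihood(site, base_freqs):
    """Likelihood under an iid background (assumption: used here in place
    of the higher-order Markov background described in the text)."""
    likelihood = 1.0
    for base in site:
        likelihood *= base_freqs[base]
    return likelihood

def llr_score(site, pwm, background_likelihood):
    """Log of PWM likelihood over the background likelihood, where the
    background likelihood is averaged over the site and its reverse
    complement."""
    forward = background_likelihood(site)
    reverse = background_likelihood(reverse_complement(site))
    return math.log(pwm_likelihood(site, pwm) / (0.5 * (forward + reverse)))
```

Any callable that maps a site to its background likelihood can be passed as `background_likelihood`, so a higher-order Markov background could be substituted without changing `llr_score`.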

We then used the canonical measure of aROC to gauge how well the PWM distinguishes between the null and real sites.
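For illustration, the aROC of two score lists can be computed directly from its rank interpretation. The sketch below is an assumption-laden example rather than the study's code: the function name is hypothetical, and ties between a real and a null score are counted as half a win, a common convention that the text does not specify.

```python
from bisect import bisect_left, bisect_right

def aroc(positive_scores, negative_scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive scores higher.
    Ties count as 0.5 (assumed convention)."""
    negatives = sorted(negative_scores)
    wins = 0.0
    for score in positive_scores:
        below = bisect_left(negatives, score)            # negatives strictly lower
        tied = bisect_right(negatives, score) - below    # negatives with equal score
        wins += below + 0.5 * tied
    return wins / (len(positive_scores) * len(negatives))
```

Sorting the null scores once keeps the cost near O((n + m) log m), which matters when the null list holds hundreds of thousands of sites.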

Estimating the predictive power of an ACS PWM

Using each host species ACS PWM we assign each foreign ARS a score that is the score of the best match to the ACS (assigned by SADMAMA). We then rank each foreign ARS according to its score and evaluate, using the standard measure of aROC, how well this ranking agrees with the ARS functionality of these sequences in the host species. A perfectly predictive classifier (or PWM in this case) would give an aROC value of 1 while a random classifier would give an aROC of ~0.5. When using the aROC we can only define 2 classes so we use only the functional and non-functional set of ARSs in this evaluation, leaving out all weak ones.

To utilize the set of weak ARSs we note that the aROC has an equivalent probabilistic formulation. Namely, if you imagine randomly drawing one sequence from the functional and one from the non-functional set of foreign ARSs, then the aROC is the probability that the (classifier) score of the functional sequence will be higher than the score of the non-functional sequence. One advantage of this latter formulation is that it suggests an obvious generalization for more than 2 classes. For example, when we have 3 linearly ordered classes, as in our example (non-functional < weak < functional), we can define the generalized aROC as the probability that the scores of a randomly drawn triplet of sequences, one from each class, will be ordered correctly: the score of the non-functional sequence is the smallest and the score of the functional sequence is the highest. Note that while a perfect classifier would still have a generalized aROC of 1, the generalized aROC of a random classifier on 3 classes would be roughly 1/6 ~ 0.167 as there are 6 different permutations or ways to order the scores of the 3 sequences.
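The generalized 3-class aROC defined above can be sketched as follows. This is an illustrative example with a hypothetical function name; it assumes that a triplet counts as correctly ordered only under strict inequalities, so ties count as incorrect.

```python
from bisect import bisect_left, bisect_right

def three_class_aroc(non_scores, weak_scores, func_scores):
    """Probability that a random (non-functional, weak, functional)
    triplet of scores is strictly ordered non < weak < functional."""
    non_sorted = sorted(non_scores)
    func_sorted = sorted(func_scores)
    correct = 0
    for weak in weak_scores:
        n_below = bisect_left(non_sorted, weak)                       # non-functional < weak
        n_above = len(func_sorted) - bisect_right(func_sorted, weak)  # functional > weak
        correct += n_below * n_above
    return correct / (len(non_scores) * len(weak_scores) * len(func_scores))
```

Counting, for each weak score, the non-functional scores below it and the functional scores above it avoids enumerating all triplets explicitly.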

The summary of our analysis of the 3-class, or generalized, aROC is presented in Additional File

Combining the information from the 3-class and the 2-class aROC allows us to gauge the predictive power of our models on a finer scale than we can when using the standard 2-class aROC. For example, we can test whether the scores our models assign to the weak ARSs are correctly placed between the functional and the non-functional foreign ARSs. More precisely, we compute the probability that a randomly drawn weak foreign ARS is correctly placed between a randomly drawn pair of functional and non-functional foreign ARSs,

Confidence intervals for aROC of predicting functionality of foreign ARSs

To account for the random effects in evaluating the aROC we used the bootstrap to construct approximate 95% confidence intervals, as described next. Each foreign ARS is assigned a score, corresponding to the best match to the host species ACS, as well as a label describing its ARS functionality in the host species: functional, non-functional, or weakly functional. Thus, for each host species we have 3 lists of foreign ARS sequence scores: functional, non-functional and weak. We sample with replacement from each of the three lists separately to generate 10,000 bootstrapped score lists of the same size as the original. We then compute the aROC for the bootstrapped functional and non-functional lists and compute the 3-class aROC using all three bootstrapped lists. This provides us with an empirical sample of 10,000 aROCs from which we generate approximate confidence intervals as described next.
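The resampling step can be sketched for the 2-class case as shown below. The function names, the fixed seed, and the use of simple sample quantiles for the interval are assumptions made for the example, not a description of the study's implementation.

```python
import random

def pairwise_aroc(positive_scores, negative_scores):
    """2-class aROC as the fraction of correctly ordered pairs
    (ties count as 0.5, an assumed convention)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))

def bootstrap_aroc_interval(func_scores, non_scores,
                            n_boot=10000, alpha=0.05, seed=0):
    """Resample each score list separately, with replacement and at its
    original size, and return quantile-based bounds on the aROC."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        func_sample = rng.choices(func_scores, k=len(func_scores))
        non_sample = rng.choices(non_scores, k=len(non_scores))
        samples.append(pairwise_aroc(func_sample, non_sample))
    samples.sort()
    lower = samples[int((alpha / 2) * (n_boot - 1))]
    upper = samples[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper
```

The same loop extends to the 3-class aROC by resampling the weak list as well and swapping in a 3-class statistic.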

When the host species was

When

Searching for an auxiliary motif

GIMSAN was applied to the set of 84 native

Evaluating the paired linear model

We considered paired linear models defined by the 9 bp putative

The three auxiliary motifs we considered were selected based on the fact that two of them were assigned by GIMSAN the best overall p-values: the 14 bp motif found using a ZOOPS model and a 25 bp motif found using an OOPS model. In addition we tested a 6 bp motif reported by GIMSAN using a ZOOPS model (all 3 motifs can be inspected in Additional File

Given a training set of ARSs for which we know the labels ('yes', 'no', or 'weak') we define the optimal pair of weights as the one that maximizes either the 2-class aROC (in which case we ignore all 'weak' ARSs) or the 3-class aROC, depending on which one we are trying to optimize. The optimization is achieved using the general Powell minimization^{3} method implemented in the Python SciPy package.

At the core of the optimization is the function that computes the score of each sequence given the current value of the pair of weights. In principle, computing this score for a given sequence involves considering every pair of sites in the sequence, one per each PWM and taking the maximum of all the corresponding weighted sums. However, we can rank the matches to each PWM and use those ranks to identify each pair of matches with a point in the 2-d integer lattice.

It is easy to see that, ignoring the non-overlap constraint, the maximal weighted sum will always coincide with the (1,1) point in the lattice, that is, the best match to each PWM (recall the weights are positive). However, in general this point, as well as others on this lattice, might not be feasible due to overlap between the corresponding sequence matches. It is not difficult to see that in this general case the maximum can only be attained on the maximal lattice points among the set of feasible lattice points. Finding the latter points can readily be done in a preprocessing step by going over the 2 lists of ranked matches, one for each PWM. This preprocessing significantly reduces the amount of computation required for evaluating the aROC associated with each pair of weights.
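One way to realize this preprocessing is sketched below. It is an illustration under stated assumptions, not the study's code: matches are represented as hypothetical `(start, end, score)` tuples with half-open coordinates, each list is sorted by descending score, ranks are 0-based, and "maximal" is taken to mean that no other feasible pair has ranks at least as good in both lists.

```python
def overlaps(match_a, match_b):
    """True if two half-open intervals [start, end) share a position."""
    return match_a[0] < match_b[1] and match_b[0] < match_a[1]

def maximal_feasible_pairs(matches_a, matches_b):
    """Return rank pairs (i, j) of non-overlapping matches that are not
    dominated by any other feasible pair. Both inputs are lists of
    (start, end, score) sorted by descending score."""
    frontier = []
    best_j = len(matches_b)  # only ranks better than this can still be maximal
    for i, match_a in enumerate(matches_a):
        for j in range(best_j):
            if not overlaps(match_a, matches_b[j]):
                frontier.append((i, j))
                best_j = j
                break
        if best_j == 0:
            break  # no later match in list A can improve on rank 0 in list B
    return frontier

def paired_score(matches_a, matches_b, frontier, weight_a, weight_b):
    """Best weighted sum over the precomputed frontier (weights positive).
    Raises ValueError if the frontier is empty."""
    return max(weight_a * matches_a[i][2] + weight_b * matches_b[j][2]
               for i, j in frontier)
```

Because the frontier depends only on the match positions and ranks, it can be computed once per sequence and reused for every pair of weights tried by the optimizer.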

To evaluate the predictive power of our model we use cross-validation. We randomly partition the set of

Constructing approximate confidence intervals for the cross-validation procedure

In this work we used cross-validation on a number of occasions to estimate a model's aROC. As usual with such point estimates there is randomness in its exact value. In this case, the average aROC depends on the arbitrary assignment of the sequences into the

We randomly partition the data into

A variant of this procedure allows us to determine whether one method for predicting ARS function is statistically significantly better than another method. Specifically, we evaluate the difference between the two methods' aROC on each bootstrapped sample generated as above. If the 95% confidence interval of the difference lies entirely to the right/left of 0 then we say the first/second method is significantly better. The 95% confidence intervals are constructed using the normal method if that normal assumption is supported by the Lilliefors test (at the standard 5% level). Otherwise, the 0.025 and 0.975 sample quantiles are used to define the approximate 95% confidence interval.

PWM contextual model

The alignment of the native ARSs in each species (Additional File

For each scanned sequence (a foreign ARS) the top 50 matches of the ACS PWM were found by SADMAMA (-pwmPC 0.01 -m 4 both_strands -siteNullScore avg_strands). A Python script was written to parse the SADMAMA output and add to it the weighted scores of the neighboring segments using the contextual PWMs mentioned above. A pseudocount of 0 was used for each contextual PWM and the background model for the LLR score was a 0-th order Markov chain learned from the host intergenic sequences.

Given a training set, the weights are optimized for the 2-class aROC (or the 3-class aROC depending on the optimized target). The training sets are determined through a cross-validation scheme applied to the set of foreign ARSs. Specifically, for each species we divided its set of foreign ARSs into

We also examined the effects of allowing some flexibility at the seams between the PWMs. Specifically, we allowed some slack, a small gap or overlap, between the end of the current segment and the start of the next one. The slack in the offset from the ACS PWM was no more than

The

• 1-100 (T-rich region); 101-133 (ACS); 134-216 (A-rich region)

The

• 51-100 (T-rich region); 101-150 (ACS); 151-200 (A-rich region)

The

• 51-100 (T-rich region); 101-109 (ACS); 110-150 (AT-rich region); 151-209 (A-rich region)

We also tested two "shorter"

Markov contextual model

The models that were tested were based on the same segmentation used in the PWM contextual model above.
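As a rough illustration of how a non-ACS segment could be scored under such a model, the sketch below trains a k-th order Markov chain on aligned segment sequences and returns a log-likelihood ratio against an iid background. All names are hypothetical, and the add-one pseudocount and the decision to skip the first k bases of each segment are assumptions made for the example rather than details taken from the study.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_markov(sequences, order, pseudocount=1.0):
    """Estimate P(base | preceding `order` bases) from training segments.
    Returns a function prob(context, base). The pseudocount is assumed."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        context_counts = counts.get(context, {})
        total = sum(context_counts.values()) + pseudocount * len(BASES)
        return (context_counts.get(base, 0.0) + pseudocount) / total

    return prob

def markov_llr(segment, prob, order, base_freqs):
    """Log-likelihood of a segment under the Markov chain, normalized by
    an iid background. The first `order` bases are skipped (assumption)."""
    score = 0.0
    for i in range(order, len(segment)):
        base = segment[i]
        score += math.log(prob(segment[i - order:i], base) / base_freqs[base])
    return score

def contextual_score(acs_llr, segment_llrs, weights):
    """Weighted sum of the ACS PWM score and the segment scores;
    weights[0] applies to the ACS, the rest to the segments in order."""
    return weights[0] * acs_llr + sum(
        w * s for w, s in zip(weights[1:], segment_llrs))
```

The weights passed to `contextual_score` would be the quantities tuned by cross-validation in the procedure described for the PWM contextual model.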

Authors' contributions

Conceived and designed the experiments: IL, MI, BKT, UK. Performed the experiments: IL, KL, SCCC, LY, AC, LH, EC, GK, HP, JB. Analyzed the data: IL, ET, BKT, UK. Wrote the paper: IL, MI, BKT, UK. All authors read and approved the final manuscript.

Endnotes

^{1}As did the extended

^{2}These flanks influence the sites scores as will become clear below.

^{3}We could have used a 1-dimensional optimization here but the code was written for a more general case allowing more than one auxiliary PWM.

Acknowledgements and Funding

This study was supported by NIH72557 and NSF MCB-0453773 to BK and NSF 0644136 to UK. IL is supported by NIH award 1F32GM090561-01. We would like to thank Mark Johnston for providing strain FM628, Gregory Kuzmik for technical assistance, and Maitreya J. Dunham for reading the manuscript and helpful discussions. Portions of the data presented here were generated as part of the undergraduate research course BioBM399 conducted at Cornell University.