Department of Mathematics and Statistics, University of Ottawa, Ottawa K1N 6N5, Canada

Abstract

Background

Paralog reduction, the loss of duplicate genes after whole genome duplication (WGD), is a pervasive process. Whether this loss proceeds gene by gene or through deletion of multi-gene DNA segments is controversial, as is the question of fractionation bias, namely whether one homeologous chromosome is more vulnerable to gene deletion than the other.

Results

As a null hypothesis, we first assume that deletion events, on either homeolog, excise a geometrically distributed number of genes with unknown mean

Conclusions

The recurrence for

Background

Whole genome doubling (WGD) creates two identical copies (

When a duplicate gene is lost, it may be lost from one copy (

The study of fractionation is basically a study of runs: runs of duplicate genes on two homeologous chromosomes alternating with runs of single-copy genes on one or both of these chromosomes. Because of the way these runs are generated biologically, and because they involve two chromosomes evolving in a non-independent way, standard statistical or combinatorial run analyses are not directly applicable.

In this paper, we present a detailed version of the excision model of fractionation with geometrically distributed deletion lengths, for which we previously analyzed a tractable, but biologically unrealistic, special case

A further complication arises from the way deletion events accumulate into longer runs of single-copy genes. The deletion of a certain number of duplicate genes may overlap the site of a previous deletion event on the

Another biologically important question is to determine

It is not difficult to simulate the fractionation process, but this gives little insight into its mathematical structure. Given that it is unlikely that any closed form of π, or even any simple computing formula, exists, our goal here is to develop a recurrence for the distribution of π(

This work is an attempt at creating a rigorous "null" model of duplicate loss, based on parameters

The models

The structure of the data

The data on paralog reduction are of the form (G, **ℤ**, satisfying the condition that **ℤ**.

The sequence **ℤ** we denote by

The use of **ℤ** instead of a finite interval is consistent with our goal of getting to the mathematical essence of the process, without any complicating parameters such as interval length. In practice, we use long intervals of at least 100,000 so that any edge effects will be negligible. See

The deletion events

Let

• We start (

• At any **y **with parameter 1/

• Then with probability

• If the deletion is on

• One type of collision,

• The second type of collision,

Skippable collisions are a natural way to model the excision process, since deletion of duplicates and the subsequent rejoining of the DNA directly before and directly after the excised fragment means that this fragment is no longer "visible" to the deletion process. Observationally, however, we know deletion has occurred because we have access to the sequence
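
The two collision types can be made concrete with a small sketch, using names and a 0/1-list representation of our own (not the paper's implementation): positions already excised on the same chromosome are skipped over because the rejoined DNA is invisible to the event, while a position whose copy on the other homeolog is already gone blocks the event.

```python
import random

def geometric(mu, rng):
    """Sample a deletion length on {1, 2, ...} with mean mu."""
    length = 1
    while rng.random() > 1.0 / mu:   # continue with probability 1 - 1/mu
        length += 1
    return length

def deletion_event(chrom, other, anchor, length):
    """Excise up to `length` genes from `chrom` (a 0/1 list), starting at
    `anchor` and moving right.  Positions already excised on `chrom` are
    skipped (skippable collision: the excised DNA is invisible); a position
    whose copy on the homeolog `other` is already gone blocks the event
    (the second collision type).  Returns the number of genes excised."""
    pos, excised = anchor, 0
    while excised < length and pos < len(chrom):
        if chrom[pos] == 0:          # skippable collision
            pos += 1
            continue
        if other[pos] == 0:          # blocking collision: last surviving copy
            break
        chrom[pos] = 0
        excised += 1
        pos += 1
    return excised
```

For example, an event of length 2 anchored just left of a previous deletion skips over it and excises the next visible duplicate beyond it.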

When the deletion event has to skip over previous 0s, this hides the anchor. Denote by **r** the random variable indicating the total number of deletion events responsible for a run. Then, given **r**, **z** is distributed as the sum of **r** geometrically distributed variables
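
Since each deletion event excises a geometrically distributed number of genes, the combined length of a run produced by r events is, conditioning on r and ignoring the truncation caused by blocking, given by a standard identity as a negative binomial. As a sketch, writing the geometric success probability as 1/μ so that each event removes at least one gene:

```latex
% Sum of r i.i.d. Geometric(1/\mu) variables on {1, 2, ...}:
\Pr(z = n \mid r) = \binom{n-1}{r-1}
  \left(\frac{1}{\mu}\right)^{r}
  \left(1-\frac{1}{\mu}\right)^{n-r}, \qquad n \ge r,
```

with conditional mean E[z | r] = rμ.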

If we observe

Deletions with skipping and blocking

[Table: columns give the event number, **i**, **a**, the state of gene positions −7 through 8 on the two homeologous chromosomes, and **r**, the number of deletion events in each run; successive rows trace five deletion events from an all-duplicate start (all 1s), with "- 0" marking genes excised by the current event and 0 marking previously deleted genes.]

Five deletion events affecting two homeologous chromosomes, leading to two runs of single-copy genes. The fourth step illustrates the "skip" process, at

Results

Simulations to determine

We carried out simulations on an interval of **ℤ** of length 100,000. This enabled us to use a discrete-time process instead of the continuous-time process on **ℤ**. The "anchors" for the deletion events were chosen at random among the currently undeleted genes. The remaining steps were carried out as described in the previous section and Table
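
A minimal discrete-time sketch of such a simulation follows. The code and parameter names (for instance `phi1` for the chromosome-choice bias and `mu` for the mean deletion length) are our own illustration, not the paper's implementation.

```python
import random

def simulate(n=100_000, mu=5.0, phi1=0.7, target_deleted=0.3, seed=1):
    """Fractionate two homeologous chromosomes of n genes each until the
    target proportion of positions has lost one copy."""
    rng = random.Random(seed)
    chrom = [[1] * n, [1] * n]          # 1 = gene present, 0 = excised
    duplicated = set(range(n))          # positions where both copies survive
    while len(duplicated) > (1 - target_deleted) * n:
        c = 0 if rng.random() < phi1 else 1   # choose a homeolog
        a = rng.randrange(n)                  # anchor at a still-duplicated gene
        while a not in duplicated:
            a = rng.randrange(n)
        length = 1                            # geometric deletion length, mean mu
        while rng.random() > 1.0 / mu:
            length += 1
        pos = a
        while length > 0 and pos < n:
            if chrom[c][pos] == 0:            # skip previously excised DNA
                pos += 1
                continue
            if chrom[1 - c][pos] == 0:        # blocked: last surviving copy
                break
            chrom[c][pos] = 0
            duplicated.discard(pos)
            length -= 1
            pos += 1
    return chrom

def run_lengths(chrom):
    """Lengths of maximal runs of single-copy genes."""
    runs, cur = [], 0
    for g1, g2 in zip(*chrom):
        if g1 == 0 or g2 == 0:
            cur += 1
        elif cur:
            runs.append(cur)
            cur = 0
    if cur:
        runs.append(cur)
    return runs
```

On an interval of length 100,000 this runs in seconds and lets the run-length distribution be tabulated directly from `run_lengths`.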

The top row of Figure

Simulations of events per run and run length

**Simulations of events per run and run length.** Distribution of number of deletion events

We mention that any edge effects in our simulation are negligible. Whether we work with **ℤ** of length 100,000 or, as previously

Figure

Dependence of run length on deletion parameters

**Dependence of run length on deletion parameters.** Average length of runs of single-copy genes for

A recurrence for

We are interested in inferring

As

1. new runs (

Types of event

**Types of event.** Types of deletion event affecting less than three pre-existing runs. Red and blue shading distinguishes between deletions from the two homeologous chromosomes. Grey areas represent previous deletions from either chromosome. White area indicates run of undeleted terms. Lightly shaded area indicates run of previously deleted terms. Darker area represents current deletion event. Hatched striped area above lightly shaded area indicates either previous deletions from both homeologous chromosomes, or only from the homeolog not affected by the current deletion. A: creates one new run with

2. runs that touch, overlap or entirely engulf exactly one previous run of deleted terms with r_{1} events, creating a new run of r_{1} + 1 events

3. runs that touch, overlap or engulf, by the skipping process, two previous runs of r_{1} and r_{2} events respectively, creating a new run of r_{1} + r_{2} + 1 events, and diminishing the total number of runs by 1, including types D and E in Figure

4. runs that touch, overlap or engulf, by the skipping process, k previous runs of r_{1}, ⋯, r_{k} events respectively, creating a new run of r_{1} + ⋯ + r_{k} + 1 events, and diminishing the total number of runs by k − 1
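
The run-count bookkeeping in cases 2 to 4 can be checked mechanically: when a deletion event touches or engulfs k existing runs of deleted terms, the result is a single merged run, so the total number of runs drops by k − 1. A toy check, using our own illustrative representation of deleted positions as a set:

```python
def count_runs(deleted, n):
    """Count maximal runs of deleted positions within [0, n)."""
    runs, prev = 0, False
    for i in range(n):
        cur = i in deleted
        if cur and not prev:
            runs += 1
        prev = cur
    return runs

deleted = {2, 3, 6}          # two pre-existing runs: {2, 3} and {6}
before = count_runs(deleted, 10)
deleted |= {4, 5}            # a new event bridges both runs (k = 2)
after = count_runs(deleted, 10)
# after == before - (2 - 1): the run count drops by k - 1
```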

The first process, involving a deletion event of length

The distribution

Let _{1 }and _{2 }be the proportion of deletion events affecting homeologous chromosomes 1 and 2, respectively, so that _{1 }+ _{2 }= 1. Let _{i }and the same for the terms at the extreme right.

The proportion of undeleted terms in runs of length _{ρ}, where _{ρ }= ∑_{l}_{>0 }

where

Events of type _{i }create runs of deleted terms with

The probability

We define the contribution to mean run length of

Events of type _{ii }turn a deleted run with _{if}, with

The probability

We define the contribution to mean run length of _{ii }events to be

which can be calculated using an expansion such as that in (6). Events of type _{ii }turn a deleted run with

The probability

We define the contribution to mean run length of _{if }events to be

Events of type _{if}, with

The probability

in which the reduction of the number of nested summations is key to the computability of the entire calculation.

We define the contribution to mean run length of _{iii }events to be

which can be calculated using an expansion such as that in (10). Events of type _{iii }turn two deleted runs with

The probability

and the contribution to mean run length is

Events of type _{iif}, with

The probability

where

The probability

and

The probability

and

The probability

and

Events of type _{iii }turn two deleted runs with _{iif}, _{ifi }and E_{iff}, with

We reiterate here that the last lines of each of (2), (6) and (10) include the collection of terms, significantly cutting down on computing time when these formulae are implemented, especially in the case of (10).

In this initial model, we neglect the merger of three or more runs of deletions. There is no conceptual difficulty in including such mergers, but the proliferation of nested summations would require excessive computation. Thus we should expect the model to be adequate until

Let _{B}, ⋯, _{E }be the sums of their respective subscripted terms (with all combinations of _{π}(

For

In an implementation on a finite interval of **ℤ**, the number of runs of deleted terms will change from some value

The distribution of number of events per run will also change from

and where the mean of the number of deleted genes per run increases from

The mean

The new proportion

In the same interval of **ℤ**, we define the change Δ_{τ}(

For

In the implementation, the number of runs of deleted terms with genes on both chromosomes will change from

The proportions of runs with deletion events from both chromosomes will also change from

We implement equations (1) to (33) as a recurrence with a step size parameter Λ to control the number of events, using the same _{A}, _{B}, _{C}, _{D}, _{E}, Δ_{π}(·) and Δ_{τ}(·) between successive normalizations, and using ΛΔ_{π}(·) and ΛΔ_{τ}(·) instead of Δ_{π}(·) and Δ_{τ}(·) in (25)-(33). The choice of Λ determines the trade-off between computing speed and accuracy.

Figure

Comparison of event frequencies in simulations and model

**Comparison of event frequencies in simulations and model.** Changes in rates of different event types as calculated by recurrence (dashed lines), compared with simulation results (solid lines). Horizontal axis: Proportion of duplicates deleted = 1 -

Biased fractionation with large deletion sizes leads to slow initial growth in the proportions of events of types D and E and "other".

There are at least two reasons for the discrepancies between the simulations and the recurrences observed in Figure _{A }and slower increase in _{B }+ _{C}. Later discrepancies are partially due to not accounting for the merger of three or more runs. These can be estimated and are summarized as "other" in the diagram, but the quantities involved are not fed back to the recurrence through (26).

Other possible sources of error might be due to the cutoffs in

Conclusions

We have developed a model for the fractionation process based on deletion events excising a geometrically distributed number of contiguous paralogs from either one of a pair of homeologous chromosomes. The data prompting this model exist because of a functional biological constraint against deleting both copies of a duplicate pair of genes.

The mathematical framework we propose should eventually serve for testing the geometric excision hypothesis against alternatives such as single gene-by-gene inactivations, although we have not developed such a test in this paper. In addition, further developments could treat the gene-by-gene inactivation model as the null hypothesis, and the geometric excision model, with mean greater than 1, as the alternative hypothesis.

Simulations of these models indicate the feasibility of estimating the mean

The main question we have explored is the exact derivation of

In order to validate our fractionation model empirically, we will have to expand it to incorporate the rearrangement events that are pervasive in genome evolution. Our previous work on this problem shows that the effect of rearrangement is to seriously bias the observable, credible instances of fractionation towards smaller runs of deleted genes

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DS, CZ and BW formulated the problem, carried out the calculations and simulations, and wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

Research funded in part by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada.

This article has been published as part of