This article is part of the supplement: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics
Department of Mathematics and Statistics, University of Ottawa, Ottawa, Canada K1N 6N5
BMC Bioinformatics 2011, 12(Suppl 9):S5 doi:10.1186/1471-2105-12-S9-S5
Published: 5 October 2011
Paralog reduction, the loss of duplicate genes after whole genome duplication (WGD), is a pervasive process. Whether this loss proceeds gene by gene or through deletion of multi-gene DNA segments is controversial, as is the question of fractionation bias, namely whether one homeologous chromosome is more vulnerable to gene deletion than the other.
As a null hypothesis, we first assume that deletion events, occurring on one homeolog only, each excise a geometrically distributed number of genes with unknown mean µ, and that these events combine to produce deleted runs of length l, distributed approximately as a negative binomial with unknown parameter r, where r is itself a random variable with distribution π(·). A more realistic model requires deletion events on both homeologs, with lengths distributed as a truncated geometric. We simulate the distribution of run lengths l in both models, as well as the underlying π(r), as a function of µ, and show how sampling l allows us to estimate µ. We apply this to data on a total of 15 genomes descended from 6 distinct WGD events and show how to correct the bias towards shorter runs caused by genome rearrangements. Because of the difficulty in deriving π(·) analytically, we develop a deterministic recurrence to calculate each π(r) as a function of µ and the proportion of unreduced paralog pairs.
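The one-homeolog null model can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation, and it rests on several assumptions that the abstract does not specify: the chromosome is linear, each event starts at a uniformly chosen remaining gene, an event skips over positions that are already deleted, and the final event is cut short once the target proportion of retained genes is reached. The names `simulate_deletions`, `deleted_run_lengths`, `n_genes` and `theta` are hypothetical, with `theta` standing in for the proportion of unreduced paralog pairs.

```python
import random


def geometric(mean, rng):
    """Draw from a geometric distribution on 1, 2, ... with the given mean."""
    p = 1.0 / mean
    k = 1
    while rng.random() > p:
        k += 1
    return k


def simulate_deletions(n_genes, mu, theta, seed=0):
    """Delete genes on one homeolog until a fraction theta remains.

    Assumptions (ours, for illustration): linear chromosome, events start
    at a uniformly chosen remaining gene, and already-deleted positions
    are skipped when counting the genes an event removes.
    """
    rng = random.Random(seed)
    deleted = [False] * n_genes
    remaining = n_genes
    target = int(theta * n_genes)
    while remaining > target:
        start = rng.randrange(n_genes)
        if deleted[start]:
            continue  # rejection step: events start only at remaining genes
        to_remove = geometric(mu, rng)
        pos = start
        # The last event is truncated so that exactly `target` genes remain.
        while to_remove > 0 and pos < n_genes and remaining > target:
            if not deleted[pos]:
                deleted[pos] = True
                remaining -= 1
                to_remove -= 1
            pos += 1
    return deleted


def deleted_run_lengths(deleted):
    """Return the lengths l of maximal runs of consecutive deleted genes."""
    runs, current = [], 0
    for is_deleted in deleted:
        if is_deleted:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


if __name__ == "__main__":
    for mu in (1.0, 2.0, 5.0):
        runs = deleted_run_lengths(simulate_deletions(5000, mu, 0.5, seed=1))
        print(mu, sum(runs) / len(runs))
```

Comparing the empirical distribution of `deleted_run_lengths` across a grid of µ values with the run lengths observed in a genome is one way to see how sampling l carries information about µ. The two-homeolog model with truncated geometric lengths would need additional bookkeeping that is not attempted here.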
The parameter µ can be estimated based on run lengths of single-copy regions. Estimates of µ in real data do not exclude the possibility that duplicate gene deletion is largely gene by gene, although it may sometimes involve longer segments.