Abstract
Background
The size distribution of gene families in a broad range of genomes is well approximated by a generalized Pareto function. Evolution of ensembles of gene families can be described with Birth, Death, and Innovation Models (BDIMs). Analysis of the properties of different versions of BDIMs has the potential of revealing important features of genome evolution.
Results
In this work, we extend our previous analysis of stochastic BDIMs.
In addition to the previously examined rational BDIMs, we introduce potentially more realistic logistic BDIMs, in which birth/death rates are limited for the largest families, and show that their properties are similar to those of models that include no such limitation. We show that the mean time required for the formation of the largest gene families detected in eukaryotic genomes is limited by the mean number of duplications per gene and does not increase indefinitely with the model degree. Instead, this time reaches a minimum value, which corresponds to a nonlinear rational BDIM with the degree of approximately 2.7. Even for this BDIM, the mean time of the largest family formation is orders of magnitude greater than any realistic estimates based on the timescale of life's evolution. We employed the embedding chains technique to estimate the expected number of elementary evolutionary events (gene duplications and deletions) preceding the formation of gene families of the observed size and found that the mean number of events exceeds the family size by orders of magnitude, suggesting a highly dynamic process of genome evolution. The variance of the time required for the formation of the largest families was found to be extremely large, with the coefficient of variation >> 1. This indicates that some gene families might grow much faster than the mean rate such that the minimal time required for family formation is more relevant for a realistic representation of genome evolution than the mean time. We determined this minimal time using Monte Carlo simulations of family growth from an ensemble of simultaneously evolving singletons. In these simulations, the time elapsed before the formation of the largest family was much shorter than the estimated mean time and was compatible with the timescale of evolution of eukaryotes.
Conclusions
The analysis of stochastic BDIMs presented here shows that nonlinear versions of such models can well approximate not only the size distribution of gene families but also the dynamics of their formation during genome evolution. The fact that only higher degree BDIMs are compatible with the observed characteristics of genome evolution suggests that the growth of gene families is selfaccelerating, which might reflect differential selective pressure acting on different genes.
Background
An extremely broad variety of phenomena in physics, biology, and the social sphere is described by power law distributions. The power laws apply to the distribution of the number of links between documents in the Internet, the population of towns, the number of species that become extinct within a year, the number of sexual and other contacts between people, and numerous other quantities [14]. In the field of genomics, the "dominance by a selected few" [5] encapsulated in the power laws applies to the distribution of the number of transcripts per gene, the number of interactions per protein, the number of genes in coexpressed gene sets, the number of genes or pseudogenes in paralogous families, the number of connections per node in metabolic networks, and other quantities that can be obtained by genome analysis [59].
Mathematically, these distributions are described by the formula: P(i) ≈ ci^{γ }where P(i) is the frequency of nodes with exactly i connections or sets with exactly i members, γ is a parameter which typically assumes values between 1 and 3, and c is a normalization constant. Obviously, in doublelogarithmic coordinates, the plot of P as a function of i is close to a straight line with a negative slope. Recently, it has been shown that the distributions of several genomerelated quantities are best described by the socalled generalized Pareto function: P(i) = c(i + a)^{γ }where γ > 0, a are parameters [1013]. At large i (i >>a), this distribution is indistinguishable from a power law, but at small i, it deviates substantially, with the magnitude of the deviation depending on a.
Power law distributions and the associated scalefree networks are compatible with the intuitively plausible mechanism of evolution by preferential attachment although other modes of evolution are also possible [9,14]. Under preferential attachment, a network or a mathematically analogous object, such as an ensemble of gene families, grows via attachment of new nodes to the preexisting ones with a probability that is proportional to the degree (number of connections) of the latter.
However, preferential attachment or other general evolutionary principles associated with power law type distributions and scalefree phenomena do not actually explain the emergence of these phenomena in biologically meaningful terms. A biological explanation involves, at a minimum, identifying the elementary events underlying the evolutionary process and the simplest models of evolution that include these events and are compatible with the observations. Under this logic, families of paralogous genes represent a perfect object for evolutionary modeling. Indeed, for these families, elementary evolutionary processes are defined naturally. By definition, paralogous families evolve by gene duplication. It has been long suspected and, with the advent of genomics, established beyond reasonable doubt that genome evolution proceeds largely by duplication of genes or portions thereof, and even long genomic segments or entire genomes [1520]. All sequenced genomes contain numerous paralogous genes, and in more complex genomes, the majority of genes have at least one paralog [21,22]. Duplication is followed by mutational diversification and gradually leads to functional differentiation of the paralogs. It is thought that such differentiation occurs via the routes of neofunctionalization (emergence, in one of the paralogs, of a new function nonexistent in the ancestral gene) [16] and, probably most often, subfunctionalization, i.e., partitioning of subfunctions of the ancestral gene among the paralogs [23,24]. Hence, duplication obviously is the first elementary process of genome evolution. Genomes and gene families not only grow but often shrink or, probably most of the time, persist in equilibrium. Therefore, duplication must be counterbalanced by the opposite elementary process, gene loss. Again, comparative genomics has shown that gene loss occurs in all species and seems to be extensive in certain lineages, particularly in parasites [2527]. Finally, genes new to a given lineage may emerge either as a result of a dramatic change after duplication obliterating all "memories" of a gene's origin, or via horizontal gene transfer, or by evolution of a proteincoding gene from a noncoding sequence (rare as this latter process might be). Collectively, the contribution of these processes to genome evolution may be termed innovation. Gene duplication, gene loss, and innovation seem to comprise a reasonable minimal set of elementary events for modeling genome evolution. The only potential major addition could be rearrangement of the gene structure whereby genes accrete or lose domains. However, at least for first approximation modeling, these changes could be covered either by duplication, if they do not yield new genes without detectable relationships to preexisting families, or by innovation if they do. We should further note that evolutionary analysis of paralogous gene families can be reasonably viewed as a study of the evolution of genomes themselves if all genes are viewed as members of paralogous families, ranging in size (number of members) from 1 to N (the size of the largest family). Of course, one must keep in mind that describing genome evolution in terms of gene duplication, loss, and innovation represents a high level of abstraction, whereby a gene is considered an atomic unit of evolution, and mutation processes occurring within a gene are ignored. However, numerous comparativegenomic studies have shown the utility of the genelevel abstraction both for systematic prediction of the functions of uncharacterized genes using the patterns of their distribution in diverse genomes [2831] and for understanding general evolutionary trends. A striking recent example of the latter type of achievement is the demonstration that different functional categories of genes scale differently with genome size, with the steepest ascent of regulatory genes offering a plausible explanation for the observed limits of genome size in prokaryotes [32].
A natural framework for modeling evolution of gene families is a birthanddeath process, a concept well explored in many physical and chemical contexts [33]. Duplication constitutes a gene birth, and gene loss is a death event; innovation also can be readily incorporated in this context. The birthanddeath approach has been applied to modeling the evolution of paralogous genome family sizes [6,12,34], the distribution of folds and families in the entire protein universe [35], and proteinprotein interaction networks [36,37]. For over a century since the publication of Darwin's seminal work [38], biologists believed that evolution at all levels is largely driven by natural selection [39]. However, the advent of molecular evolution shifted the perspective by demonstrating, largely through the work of Kimura and his school, that many, if not most, of the fixed nucleotide substitutions are effectively neutral [40]. Recent comparative analyses of gene expression led to the expansion of the neutral evolution concept beyond the genome sequence, at least to the level of the transcriptome [41,42]. Perhaps the principal importance of the neutral theory is that it leads to a change of the prevailing null hypothesis of evolutionary biology: neutrality should be taken as the null hypothesis, and selection should be invoked only when this hypothesis can be rejected. Birth and death models naturally fit this paradigm because they do not include the notion of selection (at least not explicitly). It is therefore of considerable interest to determine whether or not simple models of this class can be rejected as the explanation for various observed features of genomes.
In the previous work [12], we examined in detail simple deterministic models of genome evolution, which we dubbed BDIMs, after
 b
 d
 i
Subsequently, we expanded the BDIM framework by introducing stochastic BDIMs, which account not only for the stationary state of the gene ensemble but also for the characteristics of evolution of the system, such as the probability of the formation of a family of the given size before extinction and the mean times of formation and extinction of a family of a given size [43]. We first investigated these issues for the linear 2^{nd }order balanced stochastic BDIM. Given the published estimates of the rates of gene duplication and loss [24], we found that this version of BDIM, which gives a good approximation of the stationary distributions of family sizes for different genomes, predicts completely unrealistic mean times for reaching the observed sizes of the largest domain families. In computer simulations with a large ensemble of genes, even the minimum time required for the formation of the largest family was shown to be unrealistically long. Thus, the linear BDIM is incompatible with the estimates of the rate of genome size growth derived from the empirical data. Therefore we performed a preliminary examination of nonlinear, higher degree BDIMs and showed that the rate of genome size growth increases with the degree of the model, rendering nonlinear BDIMs more realistic models of genome evolution [43].
Here, we present a detailed analysis of the properties of different nonlinear stochastic BDIMs, including polynomial, rational, and logistic ones, which were obtained by the appropriate transformations of the original linear model. These models generated the same stationary family size distribution, but the stochastic properties of the higher order models were dramatically different from those of the linear BDIM. The mean number of elementary events, duplications and deletions, which are required for the formation of the largest family, decrease monotonically with the increase of the model degree. By contrast, the mean time of formation of a gene family of the given size under a fixed average duplication rate went through a minimum depending on the model degree; typically, the model degree corresponding to this minimum was between 2 and 3. However, even with this optimal degree, the mean times of formation of the largest families in different genomes were unrealistically long.
The times of formation and extinction of gene families are random variables with unknown distributions. Therefore it was important to determine the variance of these times and the number of elementary events preceding the formation and extinction of the largest families. We found that the coefficients of variation were very large such that the extreme values of the formation times for the largest family could differ from the mean time by at least two orders of magnitude. Thus, for assessing the feasibility of the formation of the largest families under a given model, the relevant value is not the mean but the minimal time of family formation over the entire ensemble of genes. Using Monte Carlo simulations, we show that the minimal time required for the formation of families of the expected size under BDIMs of the orders between 2–3 is compatible with the timescale of genome evolution.
Results and discussion
1. Definitions and empirical data
The basic BDIM definitions and assumptions
We treat a genome as a "bag" of genes (or, more precisely, portions of genes) encoding protein domains (or simply domains for brevity; see [12] for details). Domains are treated as independent evolving units disregarding cooccurrence of domains in multidomain proteins. Each domain is considered to be a member of a family, which may have one or more members. In this work, we interchangeably refer to domain families or gene families. Three types of elementary events are postulated: i) birth, which yields a new member in the same domain family as a result of gene duplication, ii) death, i.e., inactivation and/or deletion of a domain, and iii) innovation, which generates a new, singlemember family. Innovation may occur via domain evolution from a noncoding sequence or a nonglobular protein sequence, via horizontal gene transfer from another species, or via radical change of a domain after duplication. The rates of elementary events are defined as the probabilities of the respective events during an infinitesimally short time interval [44] and is postulated to be independent of time (all analyzed models are homogeneous) and of the structure, biological function, and other features of individual domain families. Clearly, these assumptions are simplifications made in order to avoid prohibitively complex models; the justification is that, over large (genomewide) ensembles of families and long time intervals, the existing nonhomogeneities are likely to cancel out, making homogeneous models realistic. It may be useful to emphasize that homogeneity of the models does not imply constancy of the number of events during any finite time interval, which is a random variable.
The data on the size of domain families in sequenced genomes were obtained as described previously [12]. Briefly, the domains were identified by comparing the CDD library of positionspecific scoring matrices (PSSMs) for domains extracted from the Pfam and SMART databases, to the protein sequences from completely sequenced eukaryotic and prokaryotic genomes http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db =Genome webcite using the RPSBLAST program [45].
In a finite genome, the maximum number of domains in a family cannot exceed the total number of domains and, in reality, is probably much smaller. Let N be the maximum possible number of domain family members (this limit is introduced for technical reasons; however, this should not be perceived as a biologically unrealistic assumption because N can be made extremely large, e.g., to exceed the number of genes in the largest known genome by several orders of magnitude; furthermore, almost all of the results below are valid with N = ∞ under certain well defined conditions, which ensure the existence of the ergodic distribution of the birthanddeath process). We also consider virtual, "empty" families that consist of 0 domains. In the stochastic BDIMs, newborn domains are extracted from this class and dead domains return to it. Originally, we examined exclusively the deterministic version of the BDIMs [12]. Introduction of the 0 class "closes" the model and allows us to transform it into a Markov process, which provides for the possibility to explore the stochastic properties of the system [43]. In these stochastic models, innovation was not introduced explicitly as it was in the deterministic models, but was implied in the emergence of domains from the 0 class.
Let p_{i}(t) be the frequency of a domain family of size i. Then p_{i}(t) satisfy a system of forward Kolmogorov equations for birthanddeath process (e.g., [44,46]):
dp_{0}(t)/dt = λ_{0}p_{0}(t) + δ_{1}p_{1}(t),
dp_{i}(t)/dt = λ_{i1}p_{i1}(t)  (λ_{i }+ δ_{i})p_{i}(t) + δ_{i+1}p_{i+1}(t) for 0 <i <N, (1.1)
dp_{N}(t)/dt = λ_{N1}p_{N1}(t)  δ_{N}p_{N}(t).
Mathematically, (1.1) defines the state probabilities of a birthanddeath process with the finite number of states {0,1,...N} and reflecting boundaries in 0 and N. The evolution of individual trajectories of the birthanddeath process X(t), whose state probabilities satisfy the system (1.1), can be described as follows. At the starting time, the system is situated in some initial state x_{0}. The time axis {t ≥ 0} can be divided into intervals [0,τ_{1}), [τ_{1}, τ_{2}), [τ_{2}, τ_{3}) ... such that X(t) is a constant on each interval. If, at the moment τ_{n}, the system was situated in the point x_{n }= i, then, in the moment τ_{n+1}, the system moves either into the state i+1 with the probability β_{i }= λ_{i}/(λ_{i }+ δ_{i}) or into the state i1 with the probability μ_{i }= δ_{i}/(λ_{i }+ δ _{i}). The sojourn time t_{i }= τ_{n+1 } τ_{n }between the arrival at the point x_{n }= i and the exit from this point is a random variable independent of the previous history of the system and is distributed exponentially: P{t_{i }≥ x} = exp((λ_{i }+ δ_{i})x). Note that the random variables t_{i }are independent, and the mean sojourn time, E(t_{i}), in the state i is E(t_{i}) = 1/(λ_{i }+ δ_{i}).
Process (1.1) has a unique stationary ergodic distribution p_{0},...,p_{N }defined by the equalities dp_{i}(t)/dt = 0 for 0 ≤ i ≤ N. Let J(i, t) = δ_{i}p_{i}(t)  λ_{i1}p_{i1}(t) be the current through the state i in t time moment, J(i) = δ_{i}p_{i } λ_{i1}p_{i1 }be the current in the stationary state. Then the equation for the stationary distribution can be written as J(i+1)  J(i) = 0. As the system is closed, J(0) = 0 and hence J(i) = 0 for all i, such that
p_{i }/ p_{i1 }= λ_{i1}/δ_{i}.
We will consider also the variant of this model with states {1,...N} and reflecting boundaries in states 1 and N:
dp_{1}(t)/dt = λ_{1}p_{1}(t) + δ_{2}p_{2}(t),
dp_{i}(t)/dt = λ_{i1}p_{i1}(t)  (λ_{i }+ δ_{i})p_{i}(t) + δ_{i+1}p_{i+1}(t) for 1 <i <N, (1.3)
dp_{N}(t)/dt = λ_{N1}p_{N1}(t)  δ_{N}p_{N}(t).
This model describes the evolution of the size of a domain family that includes an indispensable (essential) gene and is not allowed to go extinct. Similarly, for model (1.3), the ergodic distribution is:
The ergodic distribution (1.2) (or 1.4) is globally stable and is approached exponentially with respect to time from any initial state. The asymptotic of the ergodic distribution is completely defined by the asymptotic behavior of the function χ(i) ≡ λ_{i1}/δ_{i}. Let us suppose that, for large i, the following expansion is valid:
χ(i) ≡ λ_{i1}/δ_{i }= i^{s }θ (1γ/i + O(1/i^{2})) (1.5)
Then, the asymptotical behavior of the stationary distribution of model (1.1) is completely defined by three parameters: s, θ and γ ([12]). In particular, if the birthanddeath process is the 1^{st }order balanced, i.e. if, by definition, s = 0 in (1.5), then, asymptotically, p_{i }~ θ^{i}i^{γ }. If the process is 2^{nd }order balanced, i.e. s = 0 and θ = 1, then p_{i }~ i^{γ}.
The complete description of all possible asymptotics of the ergodic distributions of model (1.1) under condition (1.5) is given in Mathematical Appendix, Theorem 1 (hereinafter all references of the form (A.m.n) refer to the corresponding formula in the Mathematical Appendix [see 1]). It asserts that a large class of models, namely the second order balanced BDIMs, provide any given power asymptotic of the stationary frequency distributions of family sizes.
Additional File 1. This additional file includes proofs of some of the mathematical statements contained in the main text as well as accessory mathematical formulations.
Format: PDF Size: 203KB Download file
This file can be viewed with: Adobe Acrobat Reader
2. Classification of BDIMs
Linear BDIM
The simplest model that shows the generalized Pareto distribution is the linear BDIM with
λ_{i }= λ(i+a), δ_{i }= δ(i + b) for i > 0, λ, δ, a and b are constants. (2.1)
The equilibrium distribution of domain family sizes is defined by:
So, if λ = δ (θ = 1), the resulting 2^{nd }order balanced linear BDIM has a power asymptotics with γ = 1 + b  a.
Polynomial BDIM
Informally, polynomial BDIMs can be introduced as follows. Under the linear BDIM, the dependence of the birth and death rates on family size is very weak; although each gene "senses" the size of the family (as reflected in the nonzero parameters a and b), this dependence cannot be interpreted as a specific form of interaction between family members. If such interactions are postulated, λ_{i }~ P_{n}(i) and/or δ_{i }~ Q_{m}(i), where P_{n}(i), Q_{m}(i) are polynomials on i of the nth and mth degrees. The ergodic distribution of the stochastic polynomial BDIM of the form (1.1) and (1.3) is asymptotically the same as that of the originally described deterministic polynomial BDIM [12], see Appendix (A.1.4), (A.1.5) [see 1] and Proposition 2 for details. We show here that nonlinear polynomial 2^{nd }order balanced BDIM predict evolution rates that are dramatically greater than those for the linear BDIM.
As an example, let us consider the quadratic BDIM in more detail. It takes into account the simplest, pairwise interaction between family members, which leads to λ_{i }~ i^{2 }and/or δ_{i }~ i^{2}, i.e., one or both rates are polynomials on i of the second degree. If the polynomial degrees of the birth and death rates are different (e.g., λ_{i }~ i and δ_{i }~ i^{2}), the corresponding BDIM is nonbalanced, and equilibrium frequencies have no power asymptotics. Thus, let
λ_{i }= λ (i^{2 }+ r_{1}i + r_{2}), δ_{i }= δ(i^{2 }+ q_{1}i + q_{2}), (2.3)
where λ, δ, r_{k}, q_{k}, k = 1,2 are constants (such that λ_{i}, δ_{i }are positive for all i); equivalently,
λ_{i }= λ (i + a)(i + a_{2}), δ_{i }= δ (i + b)(i + b_{2}).
Then, r_{1 }= a + a_{2}, q_{1 }= b + b_{2}, and
χ(i) = λ_{i1}/δ_{i }= θ (1 + (r_{1 } q_{1 } 2)/i + O(1/i^{2})), where θ = λ/δ.
The quadratic BDIM has equilibrium sizes of domain families (see A.1.6)
p_{i }≈ c_{2}p_{0 }λ_{0}/λθ^{i}i^{ρ2}
where ρ = r_{1 } q_{1}, c_{2 }= p_{0 }[(Γ (1 + b) Γ (1 + b_{2})] / [Γ(1 + a) Γ (1 + a_{2})], and
Thus, if the quadratic BDIM is 2^{nd }order balanced, then p_{i }~ i^{ρ2}. Note that the asymptotic behavior frequencies p_{i }do not depend on free coefficients r_{2 }and q_{2 }in (2.3), but only on θ and r_{1 } q_{1}, although the constant c_{2 }could depend on the free coefficients r_{2}, q_{2}.
Rational BDIM
Rational models comprise a rather general class of BDIMs, for which the asymptotic behavior of the equilibrium frequencies and equilibrium sizes of domain families are fully tractable. The ergodic distribution of the stochastic rational BDIM is asymptotically the same as that of the deterministic rational BDIM [12]. In particular, if the model is 2^{nd }order balanced, then p_{i }~ i^{γ}, (see A.1.2 and Proposition 1 in the Appendix for details [see 1]).
The rational BDIMs can describe a substantially wider class of birth and death rates compared to polynomial models. In particular, birth rate can have a maximum at some specific value of family size and then decrease with further growth of the size, e.g., as shown in Fig. 1. This dependence of rates on family size can be described by the rational model with λ_{i }= λ(i + a_{1})/(i + a_{2})^{2}, δ_{i }= δ(i + b_{1})/(i + b_{2})^{2}.
Figure 1. Dependence of the birth rate (λ_{i }= (i + c_{1})/(i + c_{2})^{2}) on the family size.
Logistic BDIM
Evidently, the number of size classes of protein families, N, should be finite, although intrinsic features that could determine the value of N so far have not been considered (the impossibility of an infinite genome is selfevident but one would expect a much tighter bound based, e.g., on the limited time and resources available for genome replication and expression). Under the BDIMs described above, birth rate grows monotonically as the family size increases from 1 to N and then abruptly drops to 0 (since families of size N+1 or greater are not allowed). However, this behavior is an arbitrary simplification of the model and hardly can reflect the actual process of genome evolution.
In population dynamics models, the finiteness of a population size typically results from the "saturation type" of growth: the growth rate tends to 0 as the population size tends to the maximal possible value (see, e.g., [47]). It seems likely that, during genome evolution, gene duplication (and death) rate also tends to 0 as duplications leading to increase in gene number become deleterious when the size of some paralogous families becomes prohibitively large. The simplest formalism, which yields this type of population growth, is the logistic form of the birth rate. Logisticlike stochastic models have been investigated in various applications (e.g., [48,49]), which considered a birthanddeath process with the rates
λ (i) = c_{3}(c_{1 }+ i)(Ni), δ(i) = c_{3}i(c_{2}i), c_{k }> 0, k = 1,2,3, c_{2 }>N.
This model produces lognormal and logseries distributions; with the appropriate values of parameters, power low distributions of frequencies also appear, but only for intermediate values of i, namely, 1 <<i <<N and N >> 1.
Nonlinear transformation of BDIM
We have shown previously [43] that the following modification of any form of BDIM:
λ*_{i }= λ_{i}g(i), δ*_{i }= δ_{i}g(i1) (2.4)
where g(i), i = 0,...N, is a positive function, g(0) = 1, results in a BDIM with the same ergodic distribution of the family sizes as the original one. In particular, modifications of a linear BDIM with g(i) = (i + 1)^{d1 }or g(i) = (i + 1)^{d1}(1  i/(N + c)) define, respectively, wide classes of rational or logistic BDIMs with the same stationary distribution as the original linear BDIM, but with manifestly different dynamic properties.
3. Probability of formation of a family of the given size before extinction and mean and variance of extinction time
In is known [44] that the probability for the birthanddeath process to reach state n before reaching state 0 from an initial state i> 0 is given by formula (A.2.2). In terms of BDIM (1.1), this means that the probability of formation of a family of size n starting from a family of size i before getting to extinction is given by (A.2.2).
The random birthanddeath process (1.1) certainly visits state 0 in the course of time; this means that any domain family will eventually get extinct (and then, formally, can be "reborn", returning from the 0class). Below we compute the mean time to extinction of a family of the given size for different versions of BDIM; the mean time to extinction of the largest family in the given genome is of particular interest.
Let us denote S(n)=inf{t:X(t) = 0X(0)=n} the time to the first passage of state 0 from the initial state n; S(n) is a random variable for each n. The mean time to extinction of the family of initial size n, E(S(n)), is given by the general formula A.3.2.
Linear BDIM
We have shown previously that, for the linear 2^{nd }order balanced BDIM, the probability that a singleton expands to a family of size n before dying, P^{(1)}(1,n) has the power asymptotics for large n (A.2.5). The values of probabilities P^{(1)}(1,n) for different species are shown in Table 1; these probabilities are no greater than ~10^{4 } 10^{5}. The mean time to extinction, E(S(n)), can be calculated using the relation E(S(n)) = 1/λE^{(1)}_{n}, where E^{(1)}_{n}, the mean time to extinction expressed in the 1/λ time units, is given by formula A.3.3 (see Table 1 for some numerical data and Figs. 1,2 in [43]).
Table 1. Family evolution under the linear BDIM (d=1)
Figure 2. Coefficient of variation of the extinction time versus the family size for the linear BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. sapiens (red), A. thaliana (green) (Table 1 in [43]).
The variance of extinction time Var(S(n)) for the linear 2^{nd }order balanced BDIM is Var(S(n)) = 1/λ^{2}W^{(1)}_{n}, where W^{(1)}_{n }can be calculated using the formula (A.3.7). The plot of the coefficient of variation s^{(1)}_{n }= (W^{(1)}_{n})^{1/2}/E^{(1)}_{n }versus n for different species is shown in Fig. 2 (see also Table 1 for some numerical data). Clearly, the extinction time can vary within an extremely broad range of values.
Nonlinear polynomial and rational BDIM
The stochastic behavior of the system and its characteristics also can be investigated within the broader framework of rational BDIMs. We will examine models represented as transformed linear BDIM (2.1), with
λ_{i }= λ(i + a)(i + 1)^{d1}, δ_{i }= λ(i + b)i^{d1}, (3.1)
where d ≥ 1 is the model degree. Let us recall that Theorem 1 (Mathematical Appendix [see 1 ]) shows that the highest degrees and the corresponding coefficients of the birth and death rates at i^{d }must be equal to provide for the power asymptotics of the stationary distribution, P(i) ~ i^{γ}. The power γ of this distribution is completely determined by the degree d and the coefficients at i^{d1}. Thus, the model (1.1), (3.1) is representative of all rational BDIMs of the degree d with a given power asymptotic (γ = b  a + 1) of the stationary distribution. Besides, according to Proposition 1, this distribution for model (3.1) is exactly the same as for the corresponding linear model with λ_{i }= λ (i + a), δ_{i }= λ (i + b), which was studied in detail in [12].
We applied formula (A.2.6)
with g(i) = (i + 1)^{d1}, to calculate the probability of formation of a family of the given size from a singleton before getting to extinction for the BDIM of degree d, P^{(d)}(1,n). For example, the probabilities P^{(2)}(1,n) and P^{(3)}(1,n) for the quadratic and cubic BDIMs, respectively, are given by this formula with g(i) = i + 1 and g(i) = (i + 1)^{2}, respectively. Figures 3 and 4 show the dependence of the probabilities P^{(2)}(1,n) and P^{(3)}(1,n) on the family size n for different species. The dependence of the probability P^{(d)}(1,N) of the formation of the largest family on the model degree is shown in Fig. 5.
Figure 3. Probability of family formation starting from a singleton, P^{(2)}(1,n), versus the family size (n) for the quadratic BDIM (in double logarithmic scale). The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 4. Probability of family formation from a singleton, P^{(3)}(1,n), versus the family size (n) for the cubic BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 5. Probability of formation of the largest family starting from a singleton, P^{(d)}(1,N), for rational BDIMs depending on the model degree d. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
The mean time to extinction for the rational BDIM (1.1), (3.1) with a fixed d is calculated using the formula (A.3.4) where
Here E^{*}_{n }is the mean time to extinction in the 1/λ time units. Figures 6 and 7 show the dependence of E^{(2)}_{n }and E^{(3)}_{n }on n for the quadratic and cubic BDIMs, respectively. Fig. 8 shows the mean times of extinction of the largest family, E^{(d)}_{N}, for different species, depending on the model degree d. Some numerical values of the mean time to extinction for quadratic and cubic BDIMs and different species are given in Tables 2 and 3. The variance of the extinction time of a family of size n, Var(S(n))= 1/λ^{2}W^{(d)}(n), d = 2, 3 for the quadratic and cubic BDIMs, and the coefficient of variation s^{(d)}_{n }= (W^{(d)}_{n})^{1/2}/E^{(d)}_{n }are calculated using the formulas (A.3.8). The results are shown in Figs. 9 and 10. Some numerical values of the coefficient of variation of the extinction time for different species are given in Table 4.
Figure 6. Mean time to extinction (in 1/λ units) depending on the family size for the quadratic BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 7. Mean time to extinction (in 1/λ units) depending on the family size for the cubic BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 8. Mean time to extinction (in 1/λ units) of the largest family for the rational BDIM depending on the model degree d. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 9. Coefficient of variation of the time to extinction depending on the family size for the quadratic BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 10. The coefficient of variation of extinction time versus family size for the cubic BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Table 2. Family evolution under the linear BDIM (d = 1)
Table 3. Family evolution under the cubic BDIM (d = 3).
Table 4. Coefficients of variation of the extinction and formation times for the BDIMs of different degrees
Logistic BDIM
Let us consider the logistic modification of the rational BDIM; specifically, we will examine models with the birth and death rates of the form
λ_{i }= λ(i + a)(i + 1)^{d1}(1  i/(N + c)), δ_{i }= δ (i + b)i^{d1}(1  (i  1)/(N + c)). (3.2)
We will refer to the parameter c as saturation boundary. The shape of λ_{i }essentially depends on the value of c (Fig. 11).
Figure 11. Dependence of λ_{i }(3.2) at d = 2 on i with different boundary value, c = 1, c = 100, c = 1000 (from bottom to top). The model parameters are for Drosophila melanogaster.
The logistic model (1.1), (3.2) is a transformation (2.9) of the linear BDIM using the function:
g(i) = (i + 1)^{d1}(1  i/(N + c), c = const ≥ 0. (3.3)
The stationary distribution of family size frequencies for the logistic model (1.1), (3.2) is exactly the same as that for corresponding linear BDIM but the stochastic properties are different and close to the rational models, and essentially depend on the boundary c. With a large c, the model is very close to the corresponding rational model with λ_{i }= λ(i + a)(i + 1)^{d1}, δ_{i }= δ (i + b)i^{d1}, but with small c, we can observe some new effects when the family size approaches N.
The probability of formation of a family of a given size from a singleton before getting to extinction for the logistic BDIM is calculated using the general formula (A.2.6) where the function g(i) is given by (3.3). The dependence of this probability on the model degree d under a fixed large value of the boundary c~N is similar to that for the corresponding rational models but differs under a small c; Fig. 12 shows this dependence for c = 1.
Figure 12. Dependence of the probability P^{(d)}(1,n) on the family size n for the logistic model with c = 1 for d = 1, 2 and 3 (from bottom to top). The model parameters are for Drosophila melanogaster.
The mean times of extinction for the logistic BDIMs are calculated using formula (A.3.4). Fig. 13 shows the mean times of extinction of the largest family, E^{(d)}_{N}, depending on the model degree d for different values of saturation boundary c. Fig. 14 shows the dependence of E^{(d)}_{N }on the saturation boundary c for different values of d.
Figure 13. Mean time to extinction (in 1/λ units) of the largest families for the logistic BDIM depending on the model degree d for c = 1,c = 100 and c = 1000 (from top to bottom, in double logarithmic scale). The model parameters are for Drosophila melanogaster.
Figure 14. Mean time to extinction (in 1/λ units) of the largest families for the logistic BDIM, dependently on the boundary value c for d = 1 (left) and d = 2 (right). The model parameters are for Drosophila melanogaster.
4. Mean and variance of formation time for a family of the given size
Let us denote T(j, n) = inf{t: X(t) = nX(0) = j} the time to the first passage of state n from the initial state j; T(j, n) is a random variable for each j, n. The mean time to the first passage for BDIM (1.1), m(j, n) = E(T(j, n)), can be calculated using the formula m(j, n) = m_{0}(j, n) + m_{1}(j, n). Here the term m_{0}(j,n) is the mean time elapsed before the system leaves the 0 state for the last time, and the term m_{1}(j,n) is the mean time of formation of a family of size n from a singleton after its last "resurrection" (see formulas (A.4.1) for details). Below we examine only the mean family formation time from an essential singleton (model (1.3)).
Linear BDIM
Previously, we determined the mean time of formation of a family of size n from a singleton for different species [43]. For the linear BDIM, the value of the mean formation time from an essential singleton is given by formula M^{(1)}(1,n) = 1/λ M^{(1)}_{n}, where M^{(1)}_{n}, the mean formation time in 1/λ units is calculated using the formula (A.4.6)
The transition from the 1/λ time units to years is considered in s.6 of the Mathematical Appendix [see 1 ]. The mean formation time E(T(1, n)) in years is calculated using the formula (A.6.4) and the current empirical estimates of the gene duplication rate [24]. Plots of E(T(1, n)) for different species are shown in Fig. 15.
Figure 15. Mean time to formation (in years, Ga, with r_{du }= 2*10^{8}) depending on family size for the linear BDIM (double logarithmic scale). The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Once we computed the mean time of formation of a family of size n for different species, the question arises how accurately is the time T(i, n) of the first random passage through the threshold n predicted by the mean value. To address this problem, we estimated the variance of the time of family formation, Var(T(i, n)) using the general formulas (A.5.2) for model (1.1) and (A.5.3) for model (1.3), respectively. For the linear BDIM, the variance of the formation time for a family of size n from an essential singleton, V^{(1)}_{n}, is given by the formula (A.5.5). A more important and informative characteristic, which is independent on the model parameter λ, is the coefficient of variation, which is equal to . The coefficient of variation of the formation time of a family of size n from a singleton, σ^{(d)}_{n }= (V^{(d)}_{n})^{1/2}/M^{(d)}(1;n) for the BDIM of degree d is the most relevant value. The plots of σ^{(1)}_{n }versus n for the linear model and for different species are shown in Fig. 16.
Figure 16. The coefficient of variation σ^{(1)}_{n }of family formation time depending on n for the linear BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
The coefficients of variation were very large for all species (see numerical values in Table 4). To summarize the results obtained for the stochastic characteristics of the linear BDIM, we found that: i) under this model, the mean time to extinction of the largest families in most genomes was much shorter than the mean time of formation of these families, and ii) using the current estimates of duplication rates in eukaryotic genomes (r_{du }≈ 2 × 10^{8 }duplications/gene/year [24]) to express the mean family formation times in real time units instead of the dimensionless 1/λ units, we obtain M^{(1)}(1;N) ~ 10^{13 } 10^{14 }yrs, a completely unrealistic time estimate. The mean family formation times given by the linear BDIM would become realistic only if the recent analyses underestimated the gene duplication rate by a factor of ~10^{4}, which does not seem plausible. Thus, the linear BDIM cannot provide an adequate description of genome evolution, at least when only the mean time of family formation is considered. The variance of the family formation time is extremely large (the coefficient of variation is ~10^{2}), and, accordingly, large deviations from the mean time, more to two orders of magnitude, are possible. However, even taking this into account, the family formation times predicted by the linear BDIM are far longer than the time allotted for life's evolution of earth. In the next section, we consider nonlinear, higher order models that have the potential to yield shorter mean times of family formation.
Polynomial BDIMs
The mean time of formation of families from an essential singleton (or after the last "resurrection" of a family) depending on the family size n for the polynomial BDIMs is E(T(1,n)) = 1/λ M*_{n }where M*_{n}, the mean time of formation in 1/λ units can be calculated using the formulas (A.4.9)
Fig. 17 shows the dependence of the mean time of family formation on the family size for the quadratic BDIM in years, calculated using the formula (A.6.4). The values of mean times of formation for this BDIM are given in Table 2.
Figure 17. Mean time of formation (in years, Ga, with r_{du }= 2*10^{8}) depending on family size n for the quadratic BDIM (in double logarithmic scale). The model parameters are for D. melanogaster (blue),C. elegans (purple),H. Sapiens (red),Arabidopsis thaliana (green).
The variance of formation time of a family of the size n can be calculated using the formula (A.5.6), with g(j)=j+1 for the quadratic BDIM and g(j) = (j + 1)^{2 }for the cubic BDIM, respectively. The dependence of the coefficient of variation σ^{(2)}_{n }= (V^{(2)}(1,n))^{1/2}/M^{(2)}(1;n) on the family size for the quadratic BDIM is shown in Fig. 18, and some numerical data are given in Table 4.
Figure 18. The coefficient of variation σ^{(2)}_{n }of formation time versus family size for the quadratic BDIM and species. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Although the variance of family formation times for the quadratic BDIM is approximately 5 orders of magnitude less than that for the linear BDIM, the values of the coefficient of variation for quadratic BDIM are about 1.3–1.5 times greater than those for the linear BDIM. Thus, the actual formation time for the largest family could differ from the mean value by several orders of magnitude with a high probability. Figures 19 and 20 show the dependence of the mean and the coefficients of variation of family formation time on family size for the cubic BDIM.
Figure 19. Mean time of formation (in years, Ga, with r_{du }= 2*10^{8}) depending on family size n for the cubic BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 20. The coefficient of variation σ^{(3)}_{n }of formation time versus family size for the cubic BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
We have shown previously that the cubic model shows extremely high evolution rate comparatively with the linear and even quadratic models under the same value of the parameter λ [43]. On the contrary, the mean formation times in years for the quadratic and cubic models are of the same order (Tables 2 and 3). The polynomial models bring the mean time required for the formation of families of the observed size closer to realistic values but these times still remain far too long. Specifically, with the empirical estimates of the duplication rates used above for the linear BDIM, the quadratic model gives the mean family formation times ~10^{11 }yrs. This value is close to the minimum possible time of family formation that can be calculated using the duplication rate estimates of Lynch and Conery [24] and nonlinear rational BDIMs.
Nonlinear rational BDIMs
Let us investigate the dependence of the dynamics of the mean time of family formation on the model degree and the family size. The mean time of formation of a family of size n from a singleton under a fixed model degree d, M^{(d)}(1;n), for the rational BDIM (1.1),(3.1), is calculated using the formula (A.4.9). A comparison of the mean time of formation and extinction for rational BDIMs reveals an interesting property of nonlinear BDIMs: for any given family size n, there exists such a model degree that the times of family formation and extinction are equal (as it becomes obvious from the intersection of the respective curves in Fig. 21). Accordingly, at higher model degrees, the mean time of formation becomes shorter than the mean time to extinction. The model degree that corresponds to the point of intersection in Fig. 21 obviously depends on the size of the considered family. Tables 2 and 3 show that the mean time of formation is about 100 times more than the mean time to extinction for the largest families of different species for the quadratic BDIM and only about 10 times more for the cubic model.
Figure 21. Mean times (in 1/λ units) of formation (upper curve before the point of intersection) and extinction (upper curve after the point of intersection) of the largest family depending on the model degree (semilogarithmic scale). The model parameters are for Homo sapiens.
As shown previously, increasing the degree (the "order of interaction") d results in indefinite decrease of the family formation time expressed in 1/λ time units ([43] and Fig. 22). However, we have also shown that this effect is offset by the rapid increase of the average duplication rate in the model. Assuming the gene duplication rate of ~2*10^{8 }year^{1 }[24], the evolution time in years, calculated according to the formula (A.6.5), does not decrease indefinitely, but has a minimum at the model degree between 2 and 3 (Fig. 23). Even the minimum mean time of the largest family formation achievable with the rational BDIMs is on the order of 10^{11 }years (see Table 6), which is incompatible with the age of life on Earth [43]. Thus, a rational BDIM of any degree cannot provide an adequate description of genome evolution, at least when only the mean time of family formation is considered. Accordingly, for assessing the feasibility of the formation of the largest families under a given model, the variance of the formation time should be investigated.
Figure 22. Mean time of formation of the largest family (in 1/λ units), M^{(d)}_{N}, for the rational BDIM depending on the model degree d (double logarithmic scale). The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 23. Dependence of the time (in years, Ga) required for the formation of the largest family on the model degree d for the rational BDIM (semilogarithmic scale). The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Table 6. Rational BDIM yielding the shortest mean time of family formation
Generally, the variance of the formation time of the family of the given size is given by the formulas (A.5.3) and (A.5.6). Although the variance of formation times for the quadratic and, especially, for the cubic BDIM is several orders of magnitude less than that for the linear BDIM, the coefficients of variation for both formation and extinction time increase with the model degree (Table 4). These coefficients are so large that the actual formation time of the largest family could differ from its mean value by several orders of magnitude with a high probability.
Logistic BDIM
The mean time of formation (in 1/λ units) of a family of size n from an essential singleton for the logistic BDIM (1.3), (3.2) under fixed d is calculated using formula (A.4.9). Fig. 24 shows the dependence of mean times of family formation, M^{(d)}(1;n), on the family size n for different model degrees d under the fixed saturation boundary c = 1, and Fig. 25 shows the dependence of mean times of family formation on the boundary value (see Tables 7 and 8 for some numerical data). Similarly to the rational BDIM, increasing the degree (the "order of interaction") of the logistic model results in faster family evolution under a fixed value of the parameter λ. However, when this inner model parameter is excluded and the mean time of family formation is expressed in years according to formula (A.6.5), then we again face a restriction that does not allow indefinite shortening of the family formation time, T^{(d)}_{N}. Specifically, T^{(d)}_{N }for the logistic model with a fixed N has a minimum over d. We identified the model degrees yielding the minimum mean time of formation of the largest family for the logisticrational BDIM. Fig. 26 and Table 9 show the dependence of T^{(d)}_{N }on d for the logistic model with fixed saturation boundary.
Figure 24. Mean time of formation (in 1/λ units) of a family of the given size depending on the size for the logistic BDIM with the boundary value c = 1 for d = 1, d = 2, d = 3 (from top to bottom, semilogarithmic scale). The model parameters are for Drosophila melanogaster..
Figure 25. Mean time of formation (in 1/λ units) of the largest family for the logistic BDIM, depending on the model degree d for c = 1, c = 100 and c = 1000 (from top to bottom, double logarithmic scale). The model parameters are for Drosophila melanogaster.
Table 7. Evolution of gene families under the logistic BDIM with c = 1 and different d.
Table 8. Evolution of gene families under the logistic BDIM with c = 100 and different d.
Table 9. Logistic BDIM yielding the shortest mean time of family formation under c = 1
Figure 26. Dependence of the mean time (in years, Ga) required for the formation of the largest family for the logistic BDIM under fixed saturation boundary c = 1 on the model degree d (semilogarithmic scale). The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Thus, as in the case of rational BDIMs, increase of the degree of logistic BDIMs under a fixed value of average duplication rate r_{du }cannot yield mean family formation times < 10^{11 }years. Furthermore, the "saturation effect" seen in the logistic models increases the mean time of family formation compared to the corresponding rational models (compare Tables 5 and 7).
Table 5. Coefficients of variation of the number of events before formation of the largest family for the BDIMs of different degrees
5. The mean number of elementary events before family extinction and formation
Comparing the mean family formation and extinction times predicted by BDIMs with the actual evolutionary timescale allow us to choose the most appropriate version from the examined class of models. The number of elementary evolutionary events namely, duplication and deletion of domains, predicted by these models is of potential interest in itself as an approximation of an important characteristic of genome evolution.
To calculate the mean number of elementary events during evolution of gene families, we employed the socalled embedding chains {Y(n)} instead of the original BDIM. The embedding chain {Y_{n}} for a particular BDIM is a random walk with discrete time on the same set of states and transition probabilities p_{i,i+1 }= β_{i }= λ_{i}/(λ_{i }+ δ_{i}), p_{i,i1 }= μ_{i }= δ_{i}/(λ_{i }+ δ_{i}) and p_{ij }= 0 for all other cases (see s.7 of Mathematical Appendix for details [see 1 ]).
The transition from the state i to the state i+1 (or i1) corresponds to the duplication (or deletion) of a domain in a family of size i. The only difference between the original birthanddeath process and the embedding chain is that the sojourn time for the embedding chain is equal to 1 for any state i instead of 1/(λ_{i }+ δ_{i}). The ratio β_{i}/μ_{i }(= λ_{i}/δ_{i}) characterizes the trend of family evolution from the state i, i.e., is the family more likely to grow or to shrink; for a symmetric random walk, β_{i}/μ_{i }= 1 for all i. The dependence of the ratio β_{i}/μ_{i }on i for different rational and logistic embedded chains is shown in Figures 27 and 28. For the rational models, β_{i}/μ_{i }≈ 1 for large i; for the logistic models, β_{i}/μ_{i }≈ 1 for 0 <<i <<N (however, this ratio significantly deviates from 1 at both ends of the interval of states). Thus, the behavior of the embedding chain is similar to the behavior of the symmetric random walk in the corresponding subsets of states. Informally, the plots in Figures 27 and 28 indicate that small families may preferentially grow (under higher degree models) or shrink (under low degree models) whereas the evolution of large families tends to a symmetrical random walk.
Figure 27. The ratio β _{i}/μ _{i }against family size i for the rational BDIM depending on the model degree d:d = 1, d = 1.6, d = 2 (from bottom to top), in double logarithmic scale. The model parameters are for Drosophila melanogaster.
Figure 28. The ratio β _{i}/μ _{i }against family size i for the logistic BDIM (3.2) with c = 1 depending on the model degree d:d = 1, d = 1.6, d = 2 (from bottom to top). The model parameters are for Drosophila melanogaster
The mean number of elementary events before the formation of a family of the given size, f_{n}, is computed using formulas (A.7.5)(A.7.7). The plots in Figures 29 and 30 show the dependence of f_{n }on the family size for different species for the linear and quadratic models, respectively. The mean number of elementary events before the extinction of a family of the given size, e_{n}, is computed using formulas (A.7.13)(A.7.15) and Figures 31 and 32 show the corresponding dependences for family extinction. Some numerical data for the mean number of elementary events for polynomial BDIMs are shown in Tables 1,2,3 and, for coefficients of variation, in Table 5. Given that all the analyzed BDIMs are balanced, i.e., the birth and death rates are asymptotically equal, it was not unexpected that the mean number of events required for the formation of a large family (or the number of events preceding the extinction of such a family) was orders of magnitude greater than the size of the family. This suggests a highly dynamic picture of genome evolution whereby numerous duplications counterbalanced by gene losses are typically involved in the evolution of large families. However, the number of events required for the formation of a family of the given size quickly drops with the increase of a model degree (Fig. 33), which may be construed as reflection of positive selection leading to amplification of family members.
Figure 29. Mean number of events before the formation of a family of the given size for the linear BDIM (double logarithmic scale). The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 30. Mean number of events before the formation of a family of the given size for the quadratic BDIM (double logarithmic scale). The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 31. Mean number of events before extinction of a family of the given size for the linear BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 32. Mean number of events before extinction of a family of the given size for the quadratic BDIM. The model parameters are for D. melanogaster (blue), C. elegans (purple), H. Sapiens (red), Arabidopsis thaliana (green).
Figure 33. Mean number of events before the formation of the largest family against the model degree for the rational BDIM (double logarithmic scale). The model parameters are for Drosophila melanogaster
6. Monte Carlo simulation of evolution of gene family ensembles under BDIMs of different degrees
As noticed previously [43], it is the minimum rather than the mean evolution time that is important for modeling the dynamics of evolution of genomes consisting of many gene families. Due to the large variance of the family formation time estimates (see the detailed discussion above), this value is likely to be much less than the mean. Although an analytical solution to this problem is hard to obtain, it can be examined in detail by Monte Carlo simulation analysis. As described previously [43], we employed for this analysis model parameters estimated for the human proteome. The simulated evolution started from 3000 families of size one (singletons) and continued until the largest family reached 1024 members (a convenient arbitrary number to approximate the size of the largest family in eukaryotic genomes); the simulation was run from 10 to several hundred times depending on the model degree (the time required for the simulation showed a complex, nonlinear dependence on the model degree). In the course of the simulation, the number of families fluctuated due to stochastic births, deaths, and innovations of genes but, generally, tended toward the equilibrium number of ~1700, which is close to the empirically determined number of families in the human genome and is predetermined by the choice of model parameters (the initial number of singletons did not have much impact on the model's dynamics). The time scale was adjusted such that r_{du }= 2 × 10^{8 }duplications/gene/year [24]. A series of simulations was performed for nonlinear rational BDIMs with different degrees d.
As shown in Fig. 34, the time at which the family size of 1024 members is reached for the first time depends on d in a similar fashion as the mean time for a single family, i.e., there is clear minimum at a particular value of d. At the optimal value of d ≈ 2.2, the model reaches this family size in 2.2 ± 0.5 Ga, which is comparable to the time of evolution of eukaryotes. Compared to the minimal evolution time predicted by BDIMs of different degrees for a single family, the genomesize ensemble of gene families reached the threshold size much faster (by 1.5–2.5 orders of magnitude), and the optimum values of d was lower by ~0.5 (Fig. 35). The much faster formation of large families from an ensemble of singletons was predictable due to the large variation coefficient of the family formation and extinction times, but the simulation was necessary in the absence of knowledge of the exact distribution of these values.
Figure 34. The time required for the formation of a first family with 1024 members determined by Monte Carlo simulation starting from an ensemble of 3000 singletons. The model parameters are for Homo sapiens.
Figure 35. The time required for the formation of a first family with 1024 members starting from an ensemble of 3000 singletons (blue) compared to the mean time predicted by BDIMs of different orders (magenta). The model parameters are for Homo sapiens.
7. General discussion
Here and in the previous publications [12,43,50], we describe a general class of models, which are based on the classical concept of a birthanddeath process and seem to naturally apply to the genome evolution process. Similar, although not identical and apparently less general, modeling approaches have been considered by others [6,34,51]. Even earlier, evolution of gene families has been modeled within the distinct mathematical framework of multiplicative processes [52]. The utility of birthanddeath type models in evolutionary genomics in itself is not a trivial matter but rather stems from fundamental features of genome evolution. As captured in the title of Ohno's famous book [16], although foreseen even in the early days of genetics [15,53], gene duplication probably is the principal mechanism of genome evolution. Of course, genomes cannot grow ad infinitum and, through most of the evolutionary history, the number of genes within a given phylogenetic lineage probably remains roughly constant. Hence duplication is intrinsically coupled to gene loss. The results of comparative genomics further show that many genes in each lineage cannot be obviously linked to other genes through duplication. Without necessarily specifying the biological mechanisms (these could involve rapid change after duplication, gene acquisition via horizontal transfer, and possibly, birth of genes from noncoding sequences), it is reasonable to view these unique genes as resulting from innovation. For genomes to maintain equilibrium, the combined rates of duplication and innovation over the entire ensemble of gene families should equal the rate of gene loss, at least when averaged over long time spans. The observed distribution of family sizes, which asymptotically tends to a power law, dictates a much more specific connection between the gene birth and death rates, namely, the second order balance. It should be noted that this form of balance does not amount to particularly fine tuning of the gene birth and death rates. The only requirement is that these rates tend to the same value when the family size tends to infinity according to the condition (1.5). In contrast, for small families, the rates may substantially differ, without significantly changing the shape of equilibrium distribution.
The incentive to examine BDIMs in detail stems from at least two fundamental questions: i) are the above elementary evolutionary mechanisms sufficient to account for the empirically observed characteristics of genomes, ii) what is the contribution of natural selection to the general, quantifiable features of genomes, such as the size distribution of gene families. The analysis of BDIMs starts to provide some answers, albeit preliminary ones. The critical observation made in the course of BDIM analysis was that different versions of these models could be readily distinguished on the basis of goodness of fit to the empirical data. This being the case, we found that the simplest possible model, in which all paralogs are considered independent, is incompatible with the data. Thus, turning to the first of the above questions, we had to conclude that, in addition to the three elementary processes, "something else" was required to model genome evolution. This "something" is the dependence or "interaction" between gene family members which results in selfaccelerating family growth. In order to account for the observed stationary distribution of family sizes, it is sufficient to introduce a very weak dependence as embodied in the linear BDIM. However, when we switched from the deterministic to the stochastic version of BDIMs, which provide for the possibility of analysis of the dynamics of the systems evolution, we found that evolution under the linear BDIM was much too slow to account for the emergence of the large families of paralogs found in all genomes during the time of life's evolution. Only higher order BDIMs, with degrees between 2 and 3, i.e., with "strong interactions" between family members were found to provide for sufficiently fast evolution to be compatible with the real biological timescale.
Obviously, these findings beg the question: what is the nature of the mysterious "interactions" between paralogs? Although, on some occasions, paralogous protein do form physical complexes or interact functionally, the situation when such interaction does not exist is much more common. Therefore, the "interactions" in our models should not be perceived literally. This brings us to the second of the above major problems. BDIMs do not explicitly include the notion of selection. However, the simplest interpretation of the virtual interactions implied by the higher order BDIMs seems to be that these reflect differential tendencies of genes to form paralogous families of different sizes depending on the intensity of selection. Recent studies have shown that evolutionary fixation of gene duplications is linked to the evolutionary rates of genes. Specifically, duplications of slowly evolving genes, i.e., those that are subject to stronger purifying selection, are fixed more often [54,55]. The strong dependence of per gene duplication rates on family size in higher order BDIMs could be an abstraction of this trend. Should that be the case, we are justified to conclude that very weak selection would suffice to explain the stationary distribution of family sizes, but much stronger selective pressure is needed to account for the dynamics of genome evolution. However, the interpretation of BDIM degree as a manifestation of selection is, at this point, no more than a guess. One of the further developments of genome evolution modeling involves introducing selection explicitly and determining whether the resulting more sophisticated models will be equivalent to the higher order BDIMs explored here.
Conclusions
In this work, we extended our analysis of stochastic Birth, Death and Innovation Models (BDIMs) of gene family evolution and showed that:
• the behavior of logistic BDIMs models, in which birth/death rates are limited for the largest families, is essentially the same as that of previously investigated BDIMs that included no such limitation
• the mean time required for the growth of large families is limited by the overall number of duplications and does not increase indefinitely with the increase of the model degree but instead passes through a minimum; even under the bestcase scenario, which corresponds to a nonlinear rational BDIM with d ≈ 2.7, the mean time of the largest family formation is orders of magnitude greater than any realistic estimates based on the timescale of life's evolution;
• using the embedding chains technique, we estimated the expected number of elementary evolutionary events (gene duplications and deletions) preceding the formation of gene families of the observed size; the mean number of events exceeds the family size by orders of magnitude, suggesting a highly dynamic process of genome evolution;
• the variance of the time required for the formation of the largest families is large (coefficient of variation >> 1), which means that some families might grow much faster than the mean rate; thus, the minimal time required for family formation is more relevant for a realistic representation of genome evolution than the mean time;
• Monte Carlo simulations of family growth from an ensemble of simultaneously evolving singletons show that the time elapsed before the formation of the largest family was much shorter than the estimated mean time and approached realistic values (2.2 ± 0.5 Ga for the nonlinear rational BDIM with d ≈ 2.2).
Contributions of individual authors
GPK developed most of the mathematical formalism and wrote the draft of the mathematical part of the manuscript; YIW performed the imitation modeling and wrote the draft of the corresponding part of the manuscript; FSB derived some of the mathematical statements; EVK contributed to the inception of the work and the formulation of the models, gave the biological interpretation of the results, wrote the background and discussion sections and extensively edited the entire manuscript.
Acknowledgements
The authors thank B.Shraiman and other members of the Computational Biology Program at the Kavli Institute for Theoretical Physics, University of California, Santa Barbara, for helpful discussions. The work of F. Berezovskaya was supported by NSF Grant #634156.
References

Gisiger T: Scale invariance in biology: coincidence or footprint of a universal mechanism?
Biol Rev Camb Philos Soc 2001, 76:161209. PubMed Abstract  Publisher Full Text

Barabasi AL: Linked: The New Science of Networks. New York, Perseus Pr; 2002.

Mendes JF, Dorogovtsev SN: Evolution of Networks: From Biological Nets to the Internet and Www. Oxford, Oxford University Press; 2003.

Reed WJ, Hughes BD: From gene families and genera to incomes and internet file sizes: why power laws are so common in nature.
Phys Rev E Stat Nonlin Soft Matter Phys 2002, 66:67103. Publisher Full Text

Luscombe N,, Qian J, Zhang Z, Johnson T, Gerstein M: The dominance of the population by a selected few: powerlaw behaviour applies to a wide variety of genomic properties.
Genome Biol 2002, 3:research0040.140.7. BioMed Central Full Text

Qian J, Luscombe NM, Gerstein M: Protein family and fold occurrence in genomes: powerlaw behaviour and evolutionary model.
J Mol Biol 2001, 313:673681. PubMed Abstract  Publisher Full Text

Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks.
Nature 2001, 411:4142. PubMed Abstract  Publisher Full Text

Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The largescale organization of metabolic networks.
Nature 2000, 407:651654. PubMed Abstract  Publisher Full Text

Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization.
Nat Rev Genet 2004, 5:101113. PubMed Abstract  Publisher Full Text

Kuznetsov VA: Distribution associated with stochastic processes of gene expression in a single eukariotic cell.
EUROSIP Journal on Applied Signal Processing 2001, 4:285296. Publisher Full Text

Kuznetsov VA: Statistics of the numbers of transcripts and protein sequences encoded in the genome. In `Computational and Statistical Approaches to Genomics. Edited by Zhang W and Shmulevich I. Boston, Kluwer; 2002:125171.

Karev GP, Wolf YI, Rzhetsky AY, Berezovskaya FS, Koonin EV: Birth and death of protein domains: A simple model of evolution explains power law behavior.
BMC Evol Biol 2002, 2:18. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Koonin EV, Wolf YI, Karev GP: The structure of the protein universe and genome evolution.
Nature 2002, 420:218223. PubMed Abstract  Publisher Full Text

Barabasi AL, Albert R: Emergence of scaling in random networks.
Science 1999, 286:509512. PubMed Abstract  Publisher Full Text

Fisher RA: The possible modification of the response of the wild type to recurrent mutations.
Am Nat 1928, 62:115126. Publisher Full Text

Ohno S: Evolution by gene duplication. BerlinHeidelbergNew York, SpringerVerlag; 1970.

Henikoff S, Greene EA, Pietrokovski S, Bork P, Attwood TK, Hood L: Gene families: the taxonomy of protein paralogs and chimeras.
Science 1997, 278:609614. PubMed Abstract  Publisher Full Text

Jordan IK, Makarova KS, Spouge JL, Wolf YI, Koonin EV: Lineagespecific gene expansions in bacterial and archaeal genomes.
Genome Res 2001, 11:555565. PubMed Abstract  Publisher Full Text

Lespinet O, Wolf YI, Koonin EV, Aravind L: The role of lineagespecific gene family expansion in the evolution of eukaryotes.
Genome Res 2002, 12:10481059. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae.
Nature 2004, 428:617624. PubMed Abstract  Publisher Full Text

Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, Weng S, Cherry JM, Botstein D: Comparison of the complete protein sets of worm and yeast: orthology and divergence [In Process Citation].
Science 1998, 282:20222028. PubMed Abstract  Publisher Full Text

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, StangeThomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, DoucetteStamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, ThierryMieg D, ThierryMieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, Szustakowki J: Initial sequencing and analysis of the human genome.
Nature 2001, 409:860921. PubMed Abstract  Publisher Full Text

Lynch M, Force A: The probability of duplicate gene preservation by subfunctionalization.
Genetics 2000, 154:459473. PubMed Abstract  Publisher Full Text

Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes.
Science 2000, 290:11511155. PubMed Abstract  Publisher Full Text

Aravind L, Watanabe H, Lipman DJ, Koonin EV: Lineagespecific loss and divergence of functionally linked genes in eukaryotes.
Proc Natl Acad Sci U S A 2000, 97:1131911324. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Katinka MD, Duprat S, Cornillot E, Metenier G, Thomarat F, Prensier G, Barbe V, Peyretaillade E, Brottier P, Wincker P, Delbac F, El Alaoui H, Peyret P, Saurin W, Gouy M, Weissenbach J, Vivares CP: Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi.
Nature 2001, 414:450453. PubMed Abstract  Publisher Full Text

Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes.
Genome Biol 2004, 5:R7. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.
Proc Natl Acad Sci U S A 1999, 96:42854288. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Galperin MY, Koonin EV: Who's your neighbor? new computational approaches for functional genomics.
Nat Biotechnol 2000, 18:609613. PubMed Abstract  Publisher Full Text

Huynen M, Snel B, Lathe W., 3rd, Bork P: Predicting protein function by genomic context: quantitative evaluation and qualitative inferences.
Genome Res 2000, 10:12041210. PubMed Abstract  Publisher Full Text

Koonin EV, Galperin MY: Sequence  Evolution Function. Computational Approaches in Comparative Genomics. New York, Kluwer Acad Publ; 2002.

van Nimwegen E: Scaling laws in the functional content of genomes.
Trends Genet 2003, 19:479484. PubMed Abstract  Publisher Full Text

Gardiner CW: Handbook fo Stochastic Models for Physics, Chemistry and the Natural Sciences. Berlin, SpringerVerlag; 1985.

Rzhetsky A, Gomez SM: Birth of scalefree molecular networks and the number of distinct DNA and protein domains per genome.
Bioinformatics 2001, 17:988996. PubMed Abstract  Publisher Full Text

Dokholyan NV, Shakhnovich B, Shakhnovich EI: Expanding protein universe and its origin from the biological Big Bang.
Proc Natl Acad Sci U S A 2002, 99:1413214136. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

PastorSatorras R, Smith E, Sole RV: Evolving protein interaction networks through gene duplication.
J Theor Biol 2003, 222:199210. PubMed Abstract  Publisher Full Text

Wagner A: How the global structure of protein interaction networks evolves.
Proc R Soc Lond B Biol Sci 2003, 270:457466. PubMed Abstract  Publisher Full Text

Darwin C: On the Origin of Species.
1859.

Gould SJ: The Structure of Evolutionary Theory. Cambrdige, MA, Harvard Univ. Press; 2002.

Kimura M: The Neutral Theory of Molecular Evolution. Cambridge, Cambridge University Press; 1983.

Khaitovich P, Weiss G, Lachmann M, Hellmann I, Enard W, Muetzel B, Wirkner U, Ansorge W, Paabo S: A neutral model of transcriptome evolution.
PLoS Biol 2004, 2:E132. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yanai I, Graur D, Ophir R: Incongruent expression profiles between human and mouse orthologous genes suggest widespread neutral evolution of transcription control.
Omics 2004, 8:1524. PubMed Abstract  Publisher Full Text

Karev GP, Wolf YI, Koonin EV: Simple stochastic birth and death models of genome evolution: was there enough time for us to evolve?
Bioinformatics 2003, 19:18891900. PubMed Abstract  Publisher Full Text

Bhattacharya R, Waymire E: Stochastic processes with applications. New York, Wiley; 1990.

MarchlerBauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH: CDD: a database of conserved domain alignments with links to domain threedimensional structure.
Nucleic Acids Res 2002, 30:281283. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Anderson W: Continuous time Markov chains: an applicationsoriented approach. New York, Springer Verlag; 1991.

Murray JD: Mathematical Biology, 2nd Ed. Berlin, Springer; 1993.

McKane A, Alonso D, Sole RV: Meanfield stochastic theory for speciesrich assembled communities.
Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 2000, 62:84668484. PubMed Abstract  Publisher Full Text

Alonso D, McKane A: Extinction dynamics in mainlandisland metapopulations: an Npatch stochastic model.
Bull Math Biol 2002, 64:913958. PubMed Abstract  Publisher Full Text

Karev GP, Wolf YI, Koonin EV: Mathematical modeling of the evolution of domain composition of proteomes: A birthanddeath process with innovation. In Computational genomics: from sequence to function. Volume 3. Edited by Galperin M Y Koonin E V. Amsterdam, Horizon Press; 2002::261314.

Reed WJ, Hughes BD: A model explaining the size distribution of gene and protein families.
Math Biosci 2004, 189:97102. PubMed Abstract  Publisher Full Text

Huynen MA, van Nimwegen E: The frequency distribution of gene family sizes in complete genomes.
Mol Biol Evol 1998, 15:583589. PubMed Abstract  Publisher Full Text

Davis JC, Petrov DA: Preferential duplication of conserved proteins in eukaryotic genomes.
PLoS Biol 2004, 2:E55. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jordan IK, Wolf YI, Koonin EV: Duplicated genes evolve slower than singletons despite the initial rate increase.
BMC Evol Biol 2004, 4:22. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text