Abstract
Background
Most eukaryotic genes are interrupted by spliceosomal introns. The evolution of exonintron structure remains mysterious despite rapid advance in genome sequencing technique. In this work, a novel approach is taken based on the assumptions that the evolution of exonintron structure is a stochastic process, and that the characteristics of this process can be understood by examining its historical outcome, the presentday size distribution of internal translated exons (exon). Through the combination of simulation and modeling the size distribution of exons in different species, we propose a general random fragmentation process (GRFP) to characterize the evolution dynamics of exonintron structure. This model accurately predicts the probability that an exon will be split by a new intron and the distribution of novel insertions along the length of the exon.
Results
As the first observation from this model, we show that the chance for an exon to obtain an intron is proportional to its size to the 3rd power. We also show that such size dependence is nearly constant across gene, with the exception of the exons adjacent to the 5^{′} UTR. As the second conclusion from the model, we show that intron insertion loci follow a normal distribution with a mean of 0.5 (center of the exon) and a standard deviation of 0.11. Finally, we show that intron insertions within a gene are independent of each other for vertebrates, but are more negatively correlated for nonvertebrate. We use simulation to demonstrate that the negative correlation might result from significant intron loss during evolution, which could be explained by selection against multiintron genes in these organisms.
Conclusions
The GRFP model suggests that intron gain is dynamic with a higher chance for longer exons; introns are inserted into exons randomly with the highest probability at the center of the exon. GRFP estimates that there are 78 introns in every 10 kb coding sequences for vertebrate genomes, agreeing with empirical observations. GRFP also estimates that there are significant intron losses in the evolution of nonvertebrate genomes, with extreme cases of around 57% intron loss in Drosophila melanogaster, 28% in Caenorhabditis elegans, and 24% in Oryza sativa.
Keywords:
Evolution of exonintron structure; General random fragmentation process; SimulationBackground
Most eukaryotic genes contain spliceosomal introns, which are removed from mRNA after transcription by the RNA splicing apparatus. The biological origins of introns are uncertain. Since the discovery of introns, there has been significant debate as to whether introns in modernday organisms were inherited from a common, ancient ancestor, the intronearly hypothesis [13], or whether they appeared in genes more recently in the evolutionary process, the intronlate hypothesis [4,5], or indeed whether they result from a mixed model [6,7]. The mixed model suggests that most introns were gained very early in the evolution of eukaryotic genes, followed by intron loss/gain during the course of eukaryotic diversification. The details of such processes, however, remain elusive.
One way to understand the process is to examine the size distribution of internal translated exons, referring to exons that are fully translated and referred to as itexon in [8]. However, to avoid introducing an unfamiliar term, we will simply refer to itexons as “exon” within this communication unless specified. Although the distribution of exons is just a snapshot of the present day world, fitting it to a model based on well characterized mathematical functions may provide insights into the evolution of exonintron structure. In a previous study, Gudlaugsdottir et al. [9] have suggested that the size distribution of exons can be approximated with a combination of Weibull [10] and exponential distributions. The Weibull distribution is a particular case of the generalized extreme value distribution. It is widely used in survival analysis, describing the size of particles generated by grinding, milling and crushing operations, etc. The exponential distribution can be used to describe the length of intervals between uniformly distributed points. Therefore, GJudlaugsdottir et al. hypothesized that the exponential distribution is the outcome of random insertion of introns (intronlate). However, they then related the Weibull distribution to the intronearly theory without providing a stochastic model that explains the observed distribution.
In later work, Ryabov and Gribskov [11] showed that a combination of two lognormal distributions gives the best fit quality to the size distribution of exons. The lognormal distribution could result from a random Kolmogoroff fractioning process [12], which assumes that the chance of fragmentation is independent of exon size. Inserting an intron into an exon is equivalent to fragmentation (splitting) of the exon. Therefore, they hypothesized that the process of intron insertion is independent of exon size.
On the other hand, Tenchov and Yanev [13] demonstrated that the Weibull distribution could result from a uniform random fragmentation process. Here, “uniform” means that the chance of fragmentation is linearly proportional to the size of the particle (or exon). Under certain conditions, the resulted Weibull distribution is indistinguishable from lognormal distribution. Hence they concluded that the model of random fragmentation could not be inferred based on the basis of fit quality. Therefore, the hypothesis that intron insertion is independent of exon size is debatable.
One assumption made in the exon size based approaches [9,11] is that introns are inserted into exon randomly. The notion of random insertion of intron has also been proposed based on the analysis of intron distribution in ancient paralogs [14]. Others have argued that there exist certain favored sites for intron insertion  the socalled protosplice sites [4,15,16]. Unfortunately, none of the sizebased approaches provide evidence to support the assumption of random intron insertion.
In this work, we aim to revisit these competing hypotheses by addressing the following open questions: Do longer exons have an increased chance of gaining a new intron? For intron gain events, will the intron be inserted into exon randomly or at some protosplice sites? Is there an intron gain/loss bias? Are intron insertion events independent of each other? Is there a common mechanism to explain intron gain/loss in different species? In order to answer these and other related questions, we propose a General Random Fragmentation Process (GRFP) to characterize the evolution dynamics of exonintron structures. The parameters of GRFP are determined by combining simulation and analysis of real genomic data.
Methods
GRFP model
The model of GRFP is motivated by generalizing both Kolmogoroff fractioning process and the uniform random fragmentation process. In GRFP, the probability for an exon to split (gaining an intron) is assumed to be exponentially proportional to the length of the kth exon (L_{k}) as L_{k}^{a}. Under such a generalization, the Kolmogoroff fractioning process, in which insertion events are independent of exon length, is a particular case of GRFP with α = 0, while the uniform random fragmentation process, in which insertions are linearly proportional to exon length, is another special case with α = 1. The generalization not only allows GRFP to model either Kolmogoroff or uniform fragmentation process but also allows it to model the fragmentation process of exons (intron gain) with varying α. In the results section, we will use the empirical size distribution of exons to determine the value of α. GRFP also assumes that introns insert into exons randomly and independently, and these assumptions are confirmed by the analysis of real genome data.
The model of GRFP, illustrated in Figure 1, is summarized below:
Figure 1. Demonstration of GRFP on splitting a long exon with initial size L_{0}. The probability of picking which exon to split is proportional to the length of the exon, P_{S}(k) ~ L_{k}^{α}. The probability of picking an inserting point (xє(O, L_{k})) for exon k follows a normal distribution, P_{I}(x) ~ L_{k}*N(^{μ} _{I}, σ_{I}).
1. Given a set of n exons, the opportunity for kth exon to acquire a new intron is proportional to its size to the αth power:
2. Within kth exon, the new intron insertion loci follow a normal distribution:
3. Intron gains are independent of each other.
Where Р_{Ѕ}(K) denotes the probability for kth exon to obtain an intron; Р_{I}(x), probability of inserting an intron after xth position within kth exon; L_{k}, length of the exon k; α, dependency value; μ_{I} and σ_{I}, mean and standard deviation of the distribution of insertion loci. The model of GRFP has three unknown parameters to be determined, α, μ_{I} and σ_{I}.
Simulation testing
We start each simulation with a long exon. The diagram in Figure 1 shows how the splitting of a long exon during evolution is simulated. The diagram shows intron gains in the following order. After inserting the first intron, Ρ_{Ѕ}(KL_{k=1,2}) denotes the picking probability between L_{1} and L_{2}; Assuming L_{1} is selected and split by a new intron; the next exon to be split will be chosen from L_{3}, L_{4} and L_{2} with probability Ρ_{Ѕ}(KL_{k=2,3,4}); Assuming L_{2} is selected, and so on.
For simulations, sequence of pseudorandom number is obtained using the Mersenne Twister algorithm [17] implemented in standard MATLAB 7.12. In this study, all of the simulations start with one single exon. This simplifies the simulation, but it does not imply that the evolution of eukaryotic genes always starts with one long exon. Under such simplification, two more parameters are the initial size of starting exon (L_{0}) and the number of splitting (m). For a given species, we can estimate them from the annotated gene sets.
We evaluate the properties of GRFP using three simulation experiments. In each, we simulate a set of ordered fragments and quantify their statistical characters given different parameters. The three sets of quantifications listed below are used to justify the three assumptions of GRFP respectively for both simulated fragments and real exons.
1. Mean and standard deviation of the size distribution by fitting it with lognormal distribution (equation (3)) or Weibull distribution (equation (4)):
Where z=E/λ, E is exon length, dN the number of exons in a bin (bin size is 0.1 unless specified), N the amplitudes of the peak; k shape parameter, and λ scale parameter of the Weibull distribution; μ_{E} the mean position, and σ_{E} the standard deviation of the lognormal distribution. These and subsequent fittings in this study are performed using the nonlinear TrustRegionReflective curvefitting algorithm [18,19] implemented in MATLAB 7.12. Simulation demonstrates that σ_{E} is primarily determined by the choice of α (in equation (1)).
2. Mean and standard deviation of the insertion ratio defined below:
Where L_{i} and L_{i+1} are the length of two adjacent fragments (exons). This is an indirect estimation of insertion loci (x in equation (2)). Both simulation and real genome data indicate that x follows a normal distribution (with a standard deviation σ_{x}).
3. Correlation between x_{i} and x_{j} defined by equation (6):
Where σ_{x} is estimated from fitting the histograms of ratio x with a normal distribution. In theory, x_{i} + x_{j} is still normally distributed, and the mean value is the sum of the means. However, the variances are not additive if x_{i} and x_{j} are correlated. We can estimate the relationship between x_{i} and x_{j} with equation (6).
In the first experiment, we examine the relationship between GRFP parameters and the size distribution of the simulated fragments. With fixed μ_{I}, σ_{I}, initial size of starting exon (L_{0}), and the number of splitting (m), one long exon is fragmented with different choices of α. μ_{E} and σ_{E} are estimated through fitting a lognormal distribution to the size distribution of the resulted fragments. The correlation between μ_{E}, σ_{E} and α is examined. Then, with fixed μ_{I}, σ_{I}, and α, the relationship between μ_{E}, σ_{E} and initial size of starting exon (L_{0}), the number of splitting (m) is examined through similar simulations.
In the second experiment, we examine the relationship between real σ_{I} (in equation (2)) and estimated σ_{x} (from equation (5)). By fragmenting a long exon, we construct a binary tree to track the splitting process. We classify the adjacent fragments pair (the order is maintained during fragmentation) into four groups based on whether they have the same parent nodes, or if not same parents, comparing their depths. The size distribution of each group and the mixture (equation (5)) is examined. With fixed μ_{I}, L_{0}, m, and α, the correlation between σ_{I} and σ_{x} is examined by simulations with different choices of σ_{I}. Then, by coupling with empirical observations, we use ExpectationMaximization (EM) iteration to determine the value of α and σ_{I}.
In the third experiment, we examined the effects of intron loss on the statistical characters of resulted fragments. By introducing various percentages of intron loss after intron gain, we evaluate how σ_{E}, σ_{x}, and ρ(i,j) of the resulted fragments are changed.
Empirical data analysis
In this study, we obtained the cDNA sequences of 14 species (Homo sapiens (GRCh37.p8), Mus musculus (GRCm38), Rattus norvegicus (RGSC3.4), Danio rerio (Zv9), Caenorhabditis elegans (WBcel215), Drosophila melanogaster (BDGP5), Bos taurus (UMD3.1), Pan troglodytes (CHIMP2.1.4), Gallus gallus (WASHUC2), Sus scrofa (Sscrofa10.2), Arabidopsis thaliana (TAIR10), Oryza sativa (MSU6), Sorghum bicolor (Sorbi1), Zea mays (AGPv2)) from Ensembl and plant Ensembl database [20]. To ensure the quality of the data, we only use the cDNA sequences of protein coding genes with both RefSeq mRNA ID and the known status of both gene and transcript. To examine the size distribution of exons, we extracted the genomic positions of the exons from cDNA sequences to compute exon sizes. We also extracted the genomic positions of the 5′ and 3′ UTRs and used them to identify internal translated exons.
For testing the first assumption of GRFP, we fitted both Weibull and normal distribution to the size distribution of vertebrate exons (logarithm scale). We also grouped exons by positions for testing position bias of intron gain/loss. For the second assumption, we fitted a normal distribution to x (equation (5)). For the third assumption, we computed ρ(i,j) for exon pairs at ith and jth position in all protein coding genes. Finally, we examined the differences in the parameters fitted to vertebrate and nonvertebrate species.
Results
Empirical data analysis
Statistical counts of empirical data
Statistical counts of the extracted data are shown in Table 1. The transcript with the longest CDS (Coding DNA Sequence) for each gene is used for counting the number of protein coding genes, number of splitting, and total CDS length. In these counts, a coding gene is excluded if it does not contain any internal translated exons. The total CDS length is the summation of the length of all exons. The number of splitting events is the total number of exons minus one (for reversion of splitting, a long exon can be reconstructed by connecting all exons together). The last two columns of Table 1 are estimated through GRFP simulations that will be discussed later.
Table 1. Statistical counts of coding genes, splitting (number of exons minus one) and total CDS length (b.p.)
In this study, we ignored noninternal translated exons considering the rate of indels (a type of mutations affecting exon size distribution) is significantly lower in the coding region than the noncoding region [21]. It is true that introns can be inserted anywhere, including noncoding exons or even another intron, but their size distribution is confounded by the appearance of more frequent indels.
Size distribution of exons
Figure 2 shows the histograms of exons for eight vertebrate species. Both Weibull (solid line) and normal (dashed line) functions provide a reasonable fit to the histograms of exons, with the fitted parameters shown in Table 2. Notably, in Table 2, the fitted parameters are almost identical across species (e. g., μ_{E} and σ_{E}), which might indicate that these vertebrate genomes have undergone a similar stochastic process on the exonintron structure during evolution. For the six nonvertebrate species, a mixture of two normal functions (dashed line) fits the histograms well (Additional file 1: Figure S1). This might suggest that the evolution of nonvertebrate exonintron structure has undergone other processes.
Figure 2. Size distributions of vertebrate exons fitting with normal distribution. The histograms of exons are fitted with a Weibull function (solid line) and normal function (dashed line).
Additional file 1: Figure S1. has the size distribution of nonvertebrate exons. Figure S2 has the size distributions of H. sapiens exons grouped by position, supporting the plot in Figure 3. Figure S3 shows that the distribution of protosplice sites within H. sapiens coding sequences is uniform. Figure S4 shows the size distribution of simulated exons with different dependency values. Figure S5 shows the linear relationship between expected and observed standard deviation of insertion ratios. Figure S6 illustrates four different groups of insertion ratios. Figure S7 shows the distribution of insertion ratio for each of the four groups and their mixture. Figure S8 shows the distribution of fragment size after a certain percentage of intron losses, supporting Figure S9A. Figure S10 shows the distribution of insertion ratios after a certain percentage of intron loss, supporting Figure S9B. Figure S11 shows the linear relationship between the number of splitting and total CDS length for each H. sapiens chromosome.
Format: PDF Size: 382KB Download file
This file can be viewed with: Adobe Acrobat Reader
Table 2. Fitted parameters for size distributions of observed vertebrate exons
In order to assess whether the size distribution of vertebrate exons is positiondependent, we grouped their exons from all protein coding genes according to their positions relative to 5′ UTRs/3′ UTRs. For the five well annotated vertebrates, the standard deviations (σ_{E}) of the fitted normal functions at each position (e.g. Additional file 1: Figure S2 for H. sapiens) are shown in the left panel of Figure 3. The right panel shows corresponding α values calculated using equation (9). The mean values of these distributions are almost constant at all positions (results not shown). Figure 3 shows that σ_{E} is almost constant for exons across gene body, with exceptions of the first three exons right after 5′ UTR (see solid line), where it increases markedly. For exons next to the 3′ UTR (in dashed line), no similar trend is observed.
Figure 3. Fitted standard deviation (σ_{E}) and dependency (α) for internal exons with positions relative to 5′ UTRs (solid line) or 3′ UTRs (dashed line). The dependency value α is calculated using equation (9).
These observations suggest that the size distribution of vertebrate exons could be properly fit with either Weibull or normal distribution. The Weibull distribution gives a better fit to both left and right tails (e.g., Additional file 1: Figure S2) because the distribution is skewed to the left. For numerical simulations, we will show that similar size distribution of fragments will be generated based on GRFP model.
Distribution of insertion ratio
For every gene of the selected species, we calculated the insertion ratio x (equation (5)) for each adjacent exon pairs L_{i} and L_{i+1}. Figure 4 shows the histograms of the ratios for the 14 species. The histograms are fitted well with a normal distribution, and the fitted parameters are shown in Table 3. The fitted parameters are almost identical across vertebrate species with μ_{x}= 0.5 and σ_{x}= 0.13. The insertion ratio for nonvertebrates fits a normal distribution reasonably well but with much larger σ_{x}.
Figure 4. Genome wide distribution of L_{i}/(L_{i}+ L_{i+1}). The histograms are drawn with bin size of 0.01, and fitted with a Normal function.
Table 3. Fitted parameters for distribution of insertion ratios from empirical data
Another interesting observation in Figure 4 is the sharp spike at 0.5, which suggests that there are excessive adjacent exons pairs with the same length. This is consistent with the observation of tandem exon duplication [22]. Because the spikes are located right on the center, mathematically such deviation has little effect on the fitting of the histogram.
The normal function fitted in Figure 4 describes where introns get inserted into an exon. To assess whether it is consistent or against the hypothesis of protospliced sites, we calculate the position distribution of four possible protosplice sites (tested in [23,24]) within human coding sequences, and the results are shown in Additional file 1: Figure S3. The position for each of the four sites (GG, AGG, AGGT and (C/A)AG(A/G)) is calculated by dividing the distance between the intron starting site and start codon by the length of the coding sequence. All coordinates are extracted from Ensembl annotation of H. sapiens. The symbol “” stands for the intron position and “/” indicates wo alternative states of one nucleotide site. Additional file 1: Figure S3 shows that these protospliced sites are distributed nearly uniformly within CDS. If introns strongly prefer to be inserted into these sites, the insertion ratio should follow a uniform distribution instead of normal distribution as we observed. Therefore, the analysis here does not favor the protosplice site hypothesis.
Correlation between insertion ratios
The correlations calculated using equation (6) are shown in Figure 5. For better visualization purpose, we set the correlation between x and itself to zero (the diagonal blocks). The color bar on the right indicates the correlation value between insertion ratio and ratios with i or j distance to it. Figure 5 shows that ρ(i, j) is significantly negative if i  j = 1. While ρ(i, j) is close to zero for vertebrates if i  j > 1. The negative correlation between adjacent insertion ratio is not surprising since the calculation uses the same exon length as dividend (equation (5)) but with opposite signs. For example, considering three exons in order with length p, Lp, and q, the insertion ratios are x_{1}= p/(p+Lp) = p/L and x_{2} = (Lp)/(Lp+q) ≈ 1p/L if p ≈ q; Both x_{1} and x_{2} are proportional to p but with different signs.
Figure 5. Correlation of insertion ratios for different species. The correlation between x and itself is set to zero (the diagonal blocks). The color bar on the right indicates the correlation value between insertion ratio and ratios with i or j distance to it.
The key observation in Figure 5 is that nonadjacent insertion ratios are nearly uncorrelated for vertebrate genomes. However, the correlation between intron insertion ratios has quite different patterns for nonvertebrates. Their insertion ratios are more negatively correlated with each other than those for vertebrates, especially for C. elegans, D. melanogaster, and O. sativa.
In summary, analysis of empirical data reveals three significant differences between vertebrate and nonvertebrate genomes. First, a mixture of two normal functions gives a better fit to the size distribution of nonvertebrate exons, instead of one normal function for that of vertebrate exons; Second, the insertion ratio of nonvertebrates also follows a normal distribution but with larger standard deviation than that of vertebrates; Third, the insertion ratios of nonvertebrates are more negatively correlated than that of vertebrates.
Simulation testing
Default values for L_{0}, m, σ_{I}, and μ_{I}
As mentioned before, we start each simulation with a long exon. Using the counts for H. sapiens (Table 1), we set the initial exon size and number of splitting to following values for all simulations unless specified:
For the remaining unknown parameters of GRFP, α, σ_{I}, and μ_{I}, we chose to examine α first with following values for σ_{I} and μ_{I}:
These values are determined through an EM iteration process that will be discussed in the simulation testing section. The EM iteration uses observed values of σ_{x} and μ_{x} for vertebrates (Table 3). The simulation described below shows that σ_{x} overestimates but is linearly proportional to σ_{I}, while μ_{x} approximates μ_{I} extremely well.
Relationship between α, L_{0}, m and σ_{E}, μ_{E}
Using the values of L_{0}, m, σ_{I} and μ_{I} in equations (7) and (8), we performed three GRFP simulations with α values of 0.3, 1 and 3. The size distributions of the GRFP fragments are shown in Additional file 1: Figure S4. Both the Weibull and normal functions were used for fitting to the logarithm size distribution. Fitted parameters are shown in Table 4. Numerically the Weibull function is unstable for fitting the histogram when α is close to zero. Therefore, we picked α = 0.3 to mimic a random Kolmogoroff fractioning process. α = 1 would correspond to a uniform random fragmentation process, and α = 3 will generate size distribution similar to that of real exons for the vertebrate genomes (in both shape and standard deviation). We found that although the Weibull function fits the tails of the distributions better, the fitted parameters for Weibull are extremely sensitive to the minimum size of the GRFP fragments. Therefore, in this study we use σ_{E} of the fitted normal function to characterize the peak width of the size distribution. It is worthwhile reemphasizing that both empirical and simulated distributions are skewed to the left; thus both tails of the peak are better fitted by the Weibull distribution.
Table 4. Fitted parameters for size distribution of simulated exons
These simulations show that σ_{E} (or width of the peak) decreases as α increases. To quantify how σ_{E} is dependent on α, we performed three GRFP simulations for each α value range from 0 to 4. The size distribution of GRFP was fitted to a normal function, and the fitted σ_{E} and μ_{E} values (mean ± 3 standard deviations) were plotted against α in Figure 6A. For α values ranging between 2 and 4, we found that the relationship between σ_{E} and α can be fitted with the following equation:
Figure 6. Relationship between GRFP parameters and σ_{E}, μ_{E}. (A) Plot of σ_{E} and (B) μ_{E} as a function of α; (C) Plot of σ_{E} and (D) μ_{E} as a function of L_{0}); (E) Plot of σ_{E} and (F) μ_{E} as a function of m. 3 simulations are performed for each test and plus/minus three standard deviations are shown in vertical bars. The dashed lines show where σ_{E} and μ_{E} of H. sapiens are (Table 2).
From equation (9), we estimate that α ≈ 3 gives the observed σ_{E} ≈ 0.43 (Table 2). This suggests the chance of intron gain is proportional to the exon length to the 3^{rd} power, which disagrees with the independency hypothesis of earlier work [11].
Similarly, we performed a series of GRFP simulations with different choices of L_{0} and m, and the results are shown in Figure 6CF. Figure 6C and 6E show that the estimated σ_{E} is independent of both L_{0} and m, while Figure 6D and 6F demonstrates that the mean value (μ_{E}) of the resulting size distribution is dependent on both L_{0} and m. From Figure 6F, we can estimate m via the GRFP simulation given σ_{E} and L_{0}, using the intersection between the dashed line (μ_{E} of H. sapiens) and the solid curve. Given that μ_{E} is approximately 4.81 across vertebrate genomes, we used GRFP simulation to estimate the number of splitting (m_{e}) for each species (Table 1). The percentage of intron loss is estimated by comparing m_{e} with m. For other species, the corresponding L_{0} in Table 1 is used to estimate m_{e}. m_{e} and the percentages of intron loss are calculated in the same way. Note that here we use the same μ_{E} value of 4.8 for invertebrates although the size distributions of their exons (Additional file 1: Figure S1) are quite different (we hypothesize that they are resulting from intron loss). These estimations show that there are approximately 24% intron loss in O. sativa, 28% intron loss in C. elegans, and 57% intron loss in D. melanogaster, relative to what is predicted by GRFP model based on CDS length.
In Figure 7, we plot m_{e} (estimated number of splitting, open circle) against CDS length and fit it with a linear function (equation (10)). The observed number of splitting (closed circle) events is also plotted for comparison. The first observation from Figure 7 is that, under the GRFP model, number of splitting is linearly proportional to CDS length. On average, 78 splitting events will occur in an exon with length of 10000 bps (or around 8 events per 1000 bps). The second observation is that the observed number of splitting agrees well with the estimation from GRFP model, with the exception of nonvertebrates, especially O. sativa, C. elegans, and D. melanogaster.
Figure 7. Plot of the number of splittings as a function of total CDS length. Estimated splitting (open circle) is from GRFP simulation with different CDS lengths (Table 1) and fits with a linear function (solid line) shown in equation (10). The observed splitting (closed circle) is also plot. Species are Arabidopsis thaliana (A), Bos taurus (B), Sus scrofa (C), Drosophila melanogaster (D), Danio rerio (F), Gallus gallus (G) Homo sapiens (H), Mus musculus (M), Oryza sativa (O), Pan troglodytes (P), Rattus norvegicus (R), Sorghum bicolor (S), Caenorhabditis elegans (W), and Zea mays (Z).
Parameterizing GRFP via EM iteration
In the previous simulation studies, with the assumption of known σ_{I}, we have shown that σ_{E} is dependent on α but not on L_{0} and m, which suggests that the value of α can be estimated from σ_{E}. However, simulations show that σ_{E} is also dependent on σ_{I}. To derive the values of α and σ_{I} simultaneously without assuming knowing any one of them, here we determine their values through EM iterations, by combining simulations with empirically observed σ_{I} (Table 2), μ_{x}, and σ_{x} (Table 3) for vertebrate genomes.
Before performing EM iteration, we need to quantify how σ_{I} is related to σ_{x}. We performed a series of GRFP simulations with σ_{I} ranging from 0.06 to 0.18 and α = 3. For each simulation, we calculate the insertion ratio (equation (5)) from the resulted fragments, and estimated σ_{x} from fitting a normal function to the histogram of insertion ratios. σ_{x} is plotted against given σ_{I} in Additional file 1: Figure S5A. The plot shows that there is a linear relationship between the two. From this relationship, we estimate that real σ_{I} is closer to 0.11 than the 0.13 estimated from Figure 4. To see the over estimation of σ_{I}, we show the simulation process and results in Additional file 1: Figure S6 and Additional file 1: Figure S7. By classifying adjacent exon pairs into four different groups, we show that the mixture of these four groups still follows a normal distribution but with larger σ_{x}.
For EM iteration, we use σ_{I} to estimate α, then use estimated α to reestimate σ_{I}. The iteration start with σ_{I} = σ_{x} = 0.13 (Table 3).
1. Given σ_{I}, determine the relationship between α and σ using simulation (Figure 6A)
2. With observed σ_{E} (Table 2) and the estimated relationship (equation (9)), estimate α
3. With α, determine the relationship between σ_{x} and σ_{I} (Additional file 1: Figure S5A)
4. With the relationship and σ_{x} = 0.13, estimate σ_{I}
5. Return to step (1), iterate until convergence
The results of the EM iteration are shown in Additional file 1: Figure S5B and Additional file 1: Figure S5C. At the end, σ_{I} converges to 0.11; α converges to approximately 3. Again, α ≈ 3 suggests that, during evolution, longer exon has much higher chance to gain an intron than the shorter one, with a probability proportional to its size to the 3rd power.
Intron losses accounting for increasing σ_{I}, σ_{E} and more negative ρ(i, j)
In Table 3 and Figure 4, we show that σ_{x} of nonvertebrates is significantly larger than those of vertebrate genomes. In Figure 5, we also observed more negative correlation between insertion ratios for nonvertebrates. Additional file 1: Figure S1 also shows that the size distributions of nonvertebrate exons are different from those vertebrates (Figure 2). In Table 1, we have estimated that there is a significant amount of intron losses in nonvertebrate genomes. Next, numerical simulations indicate that these three differences could result from excessive intron loss during the evolution of nonvertebrate genomes.
With simulated GRFP fragments, we gradually introduce 550% of “intron loss” by randomly reconnecting adjacent fragment pairs. The size distributions of the resulted fragments are fitted with a normal function, (Additional file 1: Figure S8) and the fitted σ_{E} is plotted against percentage of intron loss in Additional file 1: Figure S9A. The insertion ratio between each adjacent fragment pairs is calculated using equation (5). Their histograms are fitted with a normal function (Additional file 1: Figure S10) with the fitted σ_{x} plot against intron loss in Additional file 1: Figure S9B. In Additional file 1: Figure S9C, we calculate the correlation of insertion ratios between i and i+4 sites using equation (6) for each of the intron loss simulations and plot them against intron loss. Results in Additional file 1: Figure S9 suggest that intron loss might account for increasing σ_{x} (Figure 4 and Table 3), and more negative correlation between nonadjacent insertion ratios (Figure 5).
Additional file 1: Figure S8 also shows that the size distribution of exons no longer can be fit properly to a normal function. As the percentage of intron loss increases, a second peak is appearing, as the size distribution of exons for nonvertebrates in Additional file 1: Figure S1.
Discussion and conclusion
In this study, we analyze the size distribution of exons for 14 species, including eight vertebrates and six nonvertebrates. Our approach overcomes the limits of using orthologous genes, thus allowing us to infer evolutionary processes affecting the exonintron structure across widely divergent species. The use of size distributions is more reliable than alignment based approaches if considering the accumulation of repeating intron gain/loss. Based on the size distribution of exons, we propose GRFP to characterize the evolution of eukaryotic genes. The solid agreements between GRFP simulations and observations on genomic data provide several key findings on the evolution of exonintron gene structures.
Chance of intron gain is proportional to exon size to the 3rd power
GRFP reveals that longer exons have a higher chance to gain an intron during evolution, and reveals the novel finding that the chance of intron gain is proportional to the exon length to the third power. This finding was derived after investigating real genome data, comparing with numerical simulations, and excluding various effects on GRFP through EM iterations. This finding might explain why long exons are rare in modern organisms. E.g., statistical study has shown that only 3.5% of the primate exons are longer than 300 nt [8,9,25].
The “third power” is derived from σ_{E} (or width) of the exon size distributions. The model of GRFP indicates that σ_{E} will remain constant given the same dependency value α. Thus, the nearly identical σ_{E} (0.43) in Table 2 suggests the existence of a common dependency value (α ≈ 3) across vertebrate species. However, the 5′ deviations of α value might indicate that intron is less preferred there. For nonvertebrate species, α cannot be directly estimated since their exons follow a mixture of two lognormal distributions (instead of one) possibly due to excessive intron loss. For estimation of intron loss in Table 1, we simply assume that α ≈ 3 holds for nonvertebrates.
Why is the probability of intron gain proportional to the exon length to the third power? Given that the third power is usually related to volume, it might be possible that exon occupies a volume proportional to its length to the third power due to dynamic movement, and the chance of an intron attacking it is proportional to this volume. Further investigation will be needed to support this hypothesis.
No evidence for sitespecific bias of intron insertion
We derive this finding from indirectly estimating the position distribution of intron insertion loci. We demonstrate that the insertion loci follow a normal distribution, peaking around the center of the exon with a standard deviation (σ_{I}) of 0.11. This observation does not support the protosplice site hypothesis. If there were protosplice sites in the exon, the insertion loci would follow the position distribution of these sites, which will most likely be a uniform distribution (Additional file 1: Figure S3).
In Figure 5, we also demonstrate that, for vertebrate genomes, intron gains are independent of each other. This observation is also one of the core assumptions of GRFP simulation. It holds on vertebrate genomes and nonvertebrate genomes if the effect of intron losses is considered. Another simplification of GRFP is that the effect of exon duplications is ignored. As mentioned earlier, the sharp spikes in Figure 4 are related to tandem exon duplication [22]. Such effect is not considered since overall their contribution is not significant in estimating either insertion loci or the chance of intron gain. This is illustrated in Additional file 1: Figure S7 and Additional file 1: Figure S10, where no such spikes are observed.
The assumption behind the estimation of insertion ratio (equation (5)) is that the order of the exons within each gene is maintained during evolution. In the cases of tandem exon duplication, exon shuffling, or intron loss, the order is just locally disrupted. Simulation also shows that the estimated insertion ratio is a mixture of four different groups of adjacent fragment pairs (Additional file 1: Figure S7, Additional file 1: Figure S8), but σ_{x} is linearly related to σ_{I} (Additional file 1: Figure S5A).
Suggesting 5′ intron gain/loss bias
By grouping exons by positions within a gene, we demonstrate that exons next to the 5′ UTR have bigger standard deviation (σ_{E}) than other exons. One may argue that the deviation near the 5′ UTR is caused by the fact that on average exons are longer for genes contain fewer exons. If this is the case, similar trend near the 3′ UTR should have been observed. From equation (9), bigger σ_{E} indicates smaller GRFP dependency value (α). The dropping of α values for exons adjacent to the 5′ UTR implies that introns are not favored there; In the GRFP model (Equation (1)), a smaller dependency value indicates a lower chance in acquiring introns during evolution. Alternatively, it might be explained as intron loss bias, that is, introns right after 5′ UTR has a tendency to lose than other introns. Certainly such comparison is limited to introns in the coding region and debatable due to the unclear stochastic process of intron loss.
Excessive intron losses accounting for deviations from GRFP
In this study, we show that exons of nonvertebrates are different from those of vertebrates in three aspects. First, the size distribution of their exons fit a mixture of two normal distributions (Additional file 1: Figure S1) instead of one for vertebrates (Figure 2). Second, their insertion ratios have much larger standard deviations (σ_{x}) as shown in Table 3. Third, their nonadjacent insertion ratios are more negatively correlated as shown in Figure 5.
The estimations in Table 1 (also Figure 7) suggest that there are excessive intron losses in nonvertebrate genomes. Based on this, we performed simulations of intron loss after GRFP fragmentation. Additional file 1: Figure S9 demonstrates that, with increasing intron losses, σ_{E} increases, σ_{I} increases, and ρ(i, j) decreases from zero, consistent with the observation on empirical data analysis here. Therefore, GRFP model holds on nonvertebrate genomes when the effect of intron loss is considered. Comparative approaches also show that frequent intron loss has been inferred during the evolution of Nematode [26,27] and Drosophila genomes [28], though they cannot provide a genome wide estimation of percentage of intron losses.
Here, the excessive intron loss hypothesis in nonvertebrate genomes is interpreted as breaking the equilibrium between intron gain and intron loss. Although GRFP model is built on modeling intron gain events, it does not assume that introns in vertebrate genomes are never lost. Instead, we interpret the straight line in Figure 7 as a “dynamic equilibrium line for vertebrates”, where the genome reaches a state of stability for its intron count. In Additional file 1: Figure S11, we observed the similar linear relationship between intron counts and CDS length by grouping genes by chromosomes for H. sapiens. This further proves that the “equilibrium” is reached in each chromosome and the linear relationship is independent of lineage (since human chromosomes are not related by a simple lineage relationship). If there are many intron losses during evolution, subsequent gains must have also occurred to bring the genome back to the equilibrium line. Therefore, the statistical measurements of the exons can remain the same across examined vertebrate genomes. For the nonvertebrate genomes that fall below the line, equilibrium is shifted to intron loss relative to vertebrates. The shifting (variation of intron density) might be related to the differences in the generation time of each species [29].
The size distribution of exons (Additional file 1: Figure S1), the insertion ratios (Table 3) and the correlation map (Figure 5) suggest that A. thaliana, S. bicolor, and Z. mays underwent intron loss during evolution though not as significant as O. sativa. The estimated intron losses are around 7% for A. thaliana and S. bicolor, and 4% for Z. mays (Table 1). However, such percentage of intron loss is not significant enough to justify the size distribution of exons for these three plant species, a mixture of two Gaussians instead of one as shown in Additional file 1: Figure S1. Another possible explanation is that plants have undergone significant genome duplications and the rate of indels is higher for the duplicate genes [30]. The reason is that one copy of the duplicate genes is freed from the selection pressure.
Weakness and strength of GRFP
In this work, we propose the GRFP model to capture the dynamic processes describing the evolution of exonintron structures. For vertebrate genomes, the model fits well with the well annotated genome data, including exon size distribution, distribution of insertion loci, total CDS length, number of introns, independency among intron gains, and 5′ intron gain bias. For nonvertebrate genomes, simulations show that the deviations from the vertebrate genome can be explained by excessive intron loss. The GRFP model implies that the evolution of gene structure is purely random, from picking which exon to split (gains intron) to picking intron insertion loci on the selected exon. The solid agreements between GRFP simulations and real genome data confirm that GRFP model provides one possible explanation on the exonintron structure evolution.
It is well known that a modern genome is a collection of introns that have accreted (and been deleted) over at least a billion years. Here, by considering the whole process as a black box, we reproduce the output of this box (the current day genomes) with numerical simulations. The size distribution of exons serves as the key component in building GRFP model because of two reasons. First, the dominant factor that can shape such distribution is intron gain/loss (fragmentation). Second, the most prominent cofounding factor on exon sizes  the rate of indels in them during evolution is low. Certainly, the mechanism of intron gain is complicated considering differences across lineage, differences in rates of insertion across sites, the age of introns, the possibility of indels to maximize fit to epigenomic structures that can occur following intron gain, alternative splicing, the different models of intron gain, and so on. The process of exon fragmentation (or intron gain) might be as straightforward as the model of GRFP describes. By focusing on internal translated exons only, we have demonstrated outstanding agreements between empirical observations and GRFP simulations.
It is crucial to note that GRFP does not make any assumptions on the rate of intron gain/loss. Recent studies [6,7,3136] have suggested that the early eukaryotes experienced a rapid burst of intron gain. In later evolution, intron loss dominates the landscape, with occasional bursts of intron gains. This does not contradict the results presented here since GRFP makes no assumptions about the rate of intron gain or loss during evolution, but instead estimates the chance for an exon to gain an intron somewhere along its length and predicts the distribution of those insertion events.
One may argue that the agreement between the GRFP model and well annotated genome structure could be fortuitous. While we cannot rule out that other models might reproduce the exons of modern genomes, the predictive power of GRFP is striking, and we believe that it is a promising approach to understanding the evolution of exonintron structures, and an excellent starting point for new models for revealing the hidden stochastic processes of evolution.
Unanswered questions and future studies
GRFP model provides explicit rules on the exonintron structure evolution. However, it does not address the origin of introns, the mechanism of intron insertion, and the rate of intron gain/loss. In other words, GRFP addresses where introns are inserted (which exon and where in the exon), but not when and how introns are inserted. Future research will focus on extending GRFP to model the evolution of noncoding exon, and developing GRFPbased methods for comparative genomics studies.
Abbreviations
GRFP: General Random Fragmentation Process; UTR: Untranslated region; CDS: Coding DNA Sequence; EM: Expectation Maximization.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LW developed the model, performed the data analysis and designed the simulation experiment. LW and LDS wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank the National Science Foundation (DBI0735191) and National Institute of Health (P41HG02223) for funding aspects of this work.
References

Gilbert W: The exon theory of genes.
Cold Spring Harb Symp Quant Biol 1987, 52:901905. PubMed Abstract  Publisher Full Text

Gilbert W, Glynias M: On the ancient nature of introns.
Gene 1993, 135(1–2):137144. PubMed Abstract

Penny D, Hoeppner MP, Poole AM, Jeffares DC: An overview of the intronsfirst theory.
J Mol Evol 2009, 69(5):527540. PubMed Abstract  Publisher Full Text

Stoltzfus A, Spencer DF, Zuker M, Logsdon JM Jr, Doolittle WF: Testing the exon theory of genes: the evidence from protein structure.
Science 1994, 265(5169):202207. PubMed Abstract  Publisher Full Text

Logsdon JM Jr, Tyshenko MG, Dixon C, DJ J, Walker VK, Palmer JD: Seven newly discovered intron positions in the triosephosphate isomerase gene: evidence for the intronslate theory.
Proc Natl Acad Sci USA 1995, 92(18):85078511. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rogozin IB, Carmel L, Csuros M, Koonin EV: Origin and evolution of spliceosomal introns.
Biol Direct 2012, 7:11. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Csuros M, Rogozin IB, Koonin EV: A detailed history of intronrich eukaryotic ancestors inferred from a global survey of 100 complete genomes.
PLoS Comput Biol 2011, 7(9):e1002150. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang MQ: Statistical features of human exons and their flanking regions.
Hum Mol Genet 1998, 7(5):919932. PubMed Abstract  Publisher Full Text

Gudlaugsdottir S, Boswell DR, Wood GR, Ma J: Exon size distribution and the origin of introns.
Genetica 2007, 131(3):299306. PubMed Abstract  Publisher Full Text

Weibull GW: Citation Classic  a Statistical Distribution Function of Wide Applicability.

Ryabov Y, Gribskov M: Spontaneous symmetry breaking in genome evolution.
Nucleic Acids Res 2008, 36(8):27562763. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kolmogoroff AN: Concerning the logarithmic normal distribution principle of dimensions of particles during dispersal.

Tenchov BG, Yanev TK: Weibull Distribution of Particle Sizes Obtained by Uniform Random Fragmentation.
J Colloid Interface Sci 1986, 111(1):17. Publisher Full Text

Cho G, Doolittle RF: Intron distribution in ancient paralogs supports random insertion and not random loss.
J Mol Evol 1997, 44(6):573584. PubMed Abstract  Publisher Full Text

Rogozin IB, Sverdlov AV, Babenko VN, Koonin EV: Analysis of evolution of exonintron structure of eukaryotic genes.
Brief Bioinform 2005, 6(2):118134. PubMed Abstract  Publisher Full Text

Logsdon JM Jr, Palmer JD: Origin of intronsearly or late.
Nature 1994, 369(6481):526.
author reply 527–528
PubMed Abstract  Publisher Full Text 
Matsumoto M, Nishimura T: Mersenne twister: a 623dimensionally equidistributed uniform pseudorandom number generator.
ACM Trans Model Comput Simul 1998, 8(1):330. Publisher Full Text

Coleman TF, Li YY: An interior trust region approach for nonlinear minimization subject to bounds.
Siam J Optim 1996, 6(2):418445. Publisher Full Text

Thomas FC, Li YY: On the Convergence of InteriorReflective Newton Methods for Nonlinear Minimization Subject to Bounds.

Flicek P, Amode MR, Barrell D, Beal K, Brent S, CarvalhoSilva D, Clapham P, Coates G, Fairley S, Fitzgerald S: Ensembl 2012.
Nucleic Acids Res 2012, 40(Database issue):D8490. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen JQ, Wu Y, Yang H, Bergelson J, Kreitman M, Tian D: Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria.
Mol Biol Evol 2009, 26(7):15231531. PubMed Abstract  Publisher Full Text

Letunic I, Copley RR, Bork P: Common exon duplication in animals and its role in alternative splicing.
Hum Mol Genet 2002, 11(13):15611567. PubMed Abstract  Publisher Full Text

Long MY, De Souza SJ, Rosenberg C, Gilbert WE: Relationship between "protosplice sites" and intron phases: Evidence from dicodon analysis.
Proc Natl Acad Sci USA 1998, 95(1):219223. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Long MY, Rosenberg C: Testing the "protosplice sites" model of intron origin: Evidence from analysis of intron phase correlations.
Mol Biol Evol 2000, 17(12):17891796. PubMed Abstract  Publisher Full Text

Berget SM: Exon recognition in vertebrate splicing.
J Biol Chem 1995, 270(6):24112414. PubMed Abstract  Publisher Full Text

Cho S, Jin SW, Cohen A, Ellis RE: A phylogeny of caenorhabditis reveals frequent loss of introns during nematode evolution.
Genome Res 2004, 14(7):12071220. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kiontke K, Gavin NP, Raynes Y, Roehrig C, Piano F, Fitch DH: Caenorhabditis phylogeny predicts convergence of hermaphroditism and extensive intron loss.
Proc Natl Acad Sci U S A 2004, 101(24):90039008. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

CoulombeHuntington J, Majewski J: Intron loss and gain in Drosophila.
Mol Biol Evol 2007, 24(12):28422850. PubMed Abstract  Publisher Full Text

Jeffares DC, Mourier T, Penny D: The biology of intron gain and loss.
Trends Genet 2006, 22(1):1622. PubMed Abstract  Publisher Full Text

Xu G, Guo C, Shan H, Kong H: Divergence of duplicate genes in exonintron structure.
Proc Natl Acad Sci USA 2012, 109(4):11871192. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yenerall P, Zhou L: Identifying the mechanisms of intron gain: progress and trends.
Biol Direct 2012, 7:29. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Koonin EV, Csuros M, Rogozin IB: Whence genes in pieces: reconstruction of the exonintron gene structures of the last eukaryotic common ancestor and other ancestral eukaryotes.
Wiley Interdiscip Rev RNA 2013, 4(1):93105. PubMed Abstract  Publisher Full Text

Roy SW, Gilbert W: The evolution of spliceosomal introns: patterns, puzzles and progress.
Nat Rev Genet 2006, 7(3):211221. PubMed Abstract  Publisher Full Text

RodriguezTrelles F, Tarrio R, Ayala FJ: Origins and evolution of spliceosomal introns.
Annu Rev Genet 2006, 40:4776. PubMed Abstract  Publisher Full Text

Koonin EV: The origin of introns and their role in eukaryogenesis: a compromise solution to the intronsearly versus intronslate debate.
Biol Direct 2006, 1:122. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Whitney KD, Garland T: Did Genetic Drift Drive Increases in Genome Complexity.