Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA

Ontario Institute for Cancer Research, 101 College St. Suite 800, Toronto, ON M5G0A3, Canada

Department of Molecular Genetics, University of Toronto, 1 King's College Circle, Toronto, ON M5S 1A8, Canada

Abstract

Background

Most eukaryotic genes are interrupted by spliceosomal introns. The evolution of exon-intron structure remains mysterious despite rapid advances in genome sequencing techniques. In this work, we take a novel approach based on the assumptions that the evolution of exon-intron structure is a stochastic process, and that the characteristics of this process can be understood by examining its historical outcome, the present-day size distribution of internal translated exons (hereafter, exons). By combining simulation with modeling of the exon size distribution in different species, we propose a general random fragmentation process (GRFP) to characterize the evolutionary dynamics of exon-intron structure. This model accurately predicts the probability that an exon will be split by a new intron and the distribution of novel insertions along the length of the exon.

Results

As the first observation from this model, we show that the chance for an exon to gain an intron is proportional to its size to the third power. We also show that this size dependence is nearly constant across the gene body, with the exception of the exons adjacent to the 5′ UTR. As the second conclusion from the model, we show that intron insertion loci follow a normal distribution with a mean of 0.5 (center of the exon) and a standard deviation of 0.11. Finally, we show that intron insertions within a gene are independent of each other for vertebrates, but are more negatively correlated for non-vertebrates. We use simulation to demonstrate that the negative correlation might result from significant intron loss during evolution, which could be explained by selection against multi-intron genes in these organisms.

Conclusions

The GRFP model suggests that intron gain is dynamic, with a higher chance for longer exons, and that introns are inserted into exons randomly, with the highest probability at the center of the exon. GRFP estimates that there are 78 introns in every 10 kb of coding sequence for vertebrate genomes, in agreement with empirical observations. GRFP also estimates that there are significant intron losses in the evolution of non-vertebrate genomes, with extreme cases of around 57% intron loss in

Background

Most eukaryotic genes contain spliceosomal introns, which are removed from mRNA after transcription by the RNA splicing apparatus. The biological origins of introns are uncertain. Since the discovery of introns, there has been significant debate as to whether introns in modern-day organisms were inherited from a common, ancient ancestor, the intron-early hypothesis

One way to understand the process is to examine the size distribution of internal translated exons, referring to exons that are fully translated and referred to as

In later work, Ryabov and Gribskov

On the other hand, Tenchov and Yanev

One assumption made in the exon size based approaches

In this work, we aim to revisit these competing hypotheses by addressing the following open questions: Do longer exons have an increased chance of gaining a new intron? For intron gain events, is the intron inserted into the exon randomly or at some proto-splice sites? Is there an intron gain/loss bias? Are intron insertion events independent of each other? Is there a common mechanism to explain intron gain/loss in different species? In order to answer these and other related questions, we propose a General Random Fragmentation Process (GRFP) to characterize the evolutionary dynamics of exon-intron structures. The parameters of GRFP are determined by combining simulation and analysis of real genomic data.

Methods

GRFP model

The model of GRFP is motivated by generalizing both the Kolmogoroff fractioning process and the uniform random fragmentation process. In GRFP, the probability for exon k to split (gain an intron) is assumed to follow a power law in the exon length, that is, to be proportional to the length of exon k raised to the power α. Under such a generalization, the Kolmogoroff fractioning process, in which insertion events are independent of exon length, is a particular case of GRFP with α = 0.

The model of GRFP, illustrated in Figure


**Demonstration of GRFP on splitting a long exon with a given initial size.** The probability of picking exon k to split is proportional to its length raised to the power α. The insertion point within the selected exon is its length multiplied by a draw from the normal distribution N(μ_{I}, σ_{I}).

1. Given a set of

2. Within

3. Intron gains are independent of each other.

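Where the probabilities carry the subscript S for selecting an exon and the subscript I for placing the insertion; k indexes the exon whose length is used; and μ_{I} and σ_{I} are the mean and standard deviation of the distribution of insertion loci. The model of GRFP has three unknown parameters to be determined: α, μ_{I} and σ_{I}.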

Simulation testing

We start each simulation with a long exon. In the diagram of the GRFP figure, once the first split has produced exons 1 and 2, the selection probability over k = 1, 2 determines which of the two is picked. Assuming exon 1 is selected and split by a new intron into exons 3 and 4, the next exon to be split is chosen from exons 3, 4 and 2 with the selection probability over k = 2, 3, 4. Assuming exon 2 is selected next, the process continues in the same way.
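The selection and insertion steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `grfp_simulate`, the use of continuous (non-integer) lengths, and the choice to redraw insertion positions that fall outside the exon are all assumptions made here. The default values α = 3, μ_I = 0.5 and σ_I = 0.11 are the estimates reported in the abstract.

```python
import random


def grfp_simulate(initial_size, n_splits, alpha=3.0, mu_i=0.5, sigma_i=0.11, seed=0):
    """Fragment one long exon n_splits times; return the ordered fragment lengths."""
    rng = random.Random(seed)
    exons = [float(initial_size)]
    for _ in range(n_splits):
        # Selection step: each exon is weighted by its length raised to alpha.
        weights = [length ** alpha for length in exons]
        k = rng.choices(range(len(exons)), weights=weights)[0]
        # Insertion step: relative position drawn from N(mu_i, sigma_i).
        # Assumption: positions outside the open interval (0, 1) are redrawn.
        while True:
            x = rng.gauss(mu_i, sigma_i)
            if 0.0 < x < 1.0:
                break
        left = exons[k] * x
        right = exons[k] - left
        # Replace the chosen exon by its two children, preserving gene order.
        exons[k:k + 1] = [left, right]
    return exons


if __name__ == "__main__":
    fragments = grfp_simulate(initial_size=1e6, n_splits=1000)
    print(len(fragments), sum(fragments))
```

Keeping the two children in the position of their parent preserves fragment order, which the insertion-ratio analysis relies on.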

For simulations, sequence of pseudorandom number is obtained using the _{0}) and the number of splitting (

We evaluate the properties of GRFP using three simulation experiments. In each, we simulate a set of ordered fragments and quantify their statistical characteristics under different parameters. The three sets of quantities listed below are used to test the three assumptions of GRFP, respectively, for both simulated fragments and real exons.

1. Mean and standard deviation of the size distribution by fitting it with lognormal distribution (equation (3)) or Weibull distribution (equation (4)):

Where z = E/λ, μ_{E} is the mean position, and σ_{E} is the standard deviation of the lognormal distribution. These and subsequent fittings in this study are performed using the nonlinear Trust-Region-Reflective curve-fitting algorithm. _{E} is primarily determined by the choice of

2. Mean and standard deviation of the insertion ratio defined below:

Where the lengths indexed i and i+1 are those of two adjacent fragments (exons). This ratio is an indirect estimate of the insertion loci and is denoted by the subscript x.
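As a concrete illustration of the insertion ratio, the sketch below computes, for an ordered list of exon lengths, the length of each exon divided by the summed length of that exon and its right-hand neighbour. The function names and the example gene are hypothetical, and the sample mean and standard deviation are used here as a simple stand-in for the histogram fit described in the text.

```python
import statistics


def insertion_ratios(exon_lengths):
    """Ratio of each exon length to the summed length of it and the next exon."""
    return [left / (left + right)
            for left, right in zip(exon_lengths, exon_lengths[1:])]


def summarize(ratios):
    """Sample mean and standard deviation (a moment estimate, not a curve fit)."""
    return statistics.mean(ratios), statistics.stdev(ratios)


if __name__ == "__main__":
    gene = [120, 80, 200, 200]  # hypothetical ordered exon lengths (nt)
    ratios = insertion_ratios(gene)
    print(ratios)
    print(summarize(ratios))
```

A gene with n exons contributes n - 1 ratios, so single-exon genes contribute none.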

3. Correlation between the insertion ratios at positions i and j, defined by equation (6):

Where σ_{x} is estimated from fitting the histograms of the insertion ratio. The sum of two normally distributed ratios, indexed i and j, is still normally distributed, and its mean is the sum of the two means. However, the variances are not additive if the two ratios are correlated. We can therefore estimate the relationship between them with equation (6).
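The reasoning above rests on the standard identity for the variance of a sum of two random variables. As a hedged sketch (the symbols x_i, x_j and ρ are notation introduced here and may differ from those of equation (6)), if both ratios share the standard deviation σ_x, the correlation follows from the variance of their sum:

```latex
\begin{aligned}
\operatorname{Var}(x_i + x_j) &= \sigma_{x_i}^2 + \sigma_{x_j}^2 + 2\rho\,\sigma_{x_i}\sigma_{x_j},\\
\rho &= \frac{\operatorname{Var}(x_i + x_j) - 2\sigma_x^2}{2\sigma_x^2}
\quad\text{when } \sigma_{x_i} = \sigma_{x_j} = \sigma_x .
\end{aligned}
```

With ρ = 0 the variances add, which corresponds to independent insertions; a sum that is narrower than expected gives ρ < 0.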

In the first experiment, we examine the relationship between GRFP parameters and the size distribution of the simulated fragments. With fixed μ_{I}, σ_{I}, initial size of the starting exon, and number of splitting, μ_{E} and σ_{E} are estimated by fitting a lognormal distribution to the size distribution of the resulting fragments. The correlation between μ_{E}, σ_{E} and μ_{I}, σ_{I}, and between μ_{E}, σ_{E} and the initial size of the starting exon, the number of splitting (

In the second experiment, we examine the relationship between the true σ_{I} (in equation (2)) and the estimated σ_{x} (from equation (5)). By fragmenting a long exon, we construct a binary tree to track the splitting process. We classify adjacent fragment pairs (the order is maintained during fragmentation) into four groups based on whether they share the same parent node or, if not, on a comparison of their depths. The size distribution of each group and of the mixture (equation (5)) is examined. With fixed μ_{I} and initial exon size, the relationship between σ_{I} and σ_{x} is examined by simulations with different choices of σ_{I}. Then, by coupling with empirical observations, we use Expectation-Maximization (EM) iteration to determine the value of σ_{I}.

In the third experiment, we examine the effects of intron loss on the statistical characteristics of the resulting fragments. By introducing various percentages of intron loss after intron gain, we evaluate how σ_{E}, σ_{x}, and

Empirical data analysis

In this study, we obtained the cDNA sequences of 14 species (

For testing the first assumption of GRFP, we fitted both Weibull and normal distribution to the size distribution of vertebrate exons (logarithm scale). We also grouped exons by positions for testing position bias of intron gain/loss. For the second assumption, we fitted a normal distribution to

Results

Empirical data analysis

Statistical counts of empirical data

Statistical counts of the extracted data are shown in Table

| **Number of coding genes** | **Total CDS length (10^7)** | **Number of splitting (m)** | **Estimated splitting (m_{e})** | |
|---|---|---|---|---|
| 17275 | 2.443 | 1.827 | 1.901 | -3.9% |
| 16319 | 2.276 | 1.705 | 1.768 | -2.7% |
| 17354 | 2.193 | 1.722 | 1.703 | 1.0% |
| 15068 | 1.932 | 1.462 | 1.501 | -2.7% |
| 5416 | 0.655 | 0.537 | 0.509 | 5.4% |
| 12508 | 1.694 | 1.295 | 1.316 | -1.6% |
| 11948 | 1.413 | 1.142 | 1.099 | 4.0% |
| 5498 | 0.524 | 0.433 | 0.408 | 6.1% |
| 17684 | 1.833 | 1.024 | 1.425 | -28.4% |
| 8063 | 1.141 | 0.383 | 0.886 | -57.1% |
| 16547 | 1.501 | 1.083 | 1.167 | -7.2% |
| 23566 | 2.255 | 1.329 | 1.749 | -24.0% |
| 17769 | 1.445 | 1.041 | 1.123 | -7.3% |
| 15887 | 1.320 | 0.987 | 1.025 | -3.7% |

Annotation data for each species is extracted from Ensembl database. Protein coding genes are counted only if they contain at least one internal translated exon. Total CDS length is the summation of all internal translated exon length in these genes. Number of splitting is estimated by the number of internal translated exons minus one. Estimated splitting is determined from GRFP simulation.
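The relationship between the observed and estimated splitting counts can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that the percentage column is the relative difference between observed and estimated splitting, that the splitting counts are in units of 10^5, and that CDS length is in units of 10^7 nucleotides. Only the last unit is suggested by the table header; the other two are assumptions, as are the function names.

```python
def percent_difference(observed, estimated):
    """Relative difference of observed versus estimated splitting, in percent."""
    return 100.0 * (observed - estimated) / estimated


def introns_per_10kb(n_splits, cds_length_nt):
    """Number of splits (introns) per 10 kb of coding sequence."""
    return 1e4 * n_splits / cds_length_nt


if __name__ == "__main__":
    # Values from the first table row, scaled under the unit assumptions above.
    observed_m = 1.827e5
    estimated_m = 1.901e5
    cds_nt = 2.443e7
    print(round(percent_difference(observed_m, estimated_m), 1))
    print(round(introns_per_10kb(estimated_m, cds_nt), 1))
```

Under these assumptions the first row gives about -3.9% and roughly 78 introns per 10 kb, in line with the figure quoted in the abstract; other rows may deviate slightly because the tabulated values are rounded.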

In this study, we ignored exons other than internal translated exons, because the rate of indels (a type of mutation affecting the exon size distribution) is significantly lower in coding regions than in non-coding regions

Size distribution of exons

As the figure of vertebrate exon size distributions shows, the fitted parameters (μ_{E} and σ_{E}) are similar across species, which might indicate that these vertebrate genomes have undergone a similar stochastic process on the exon-intron structure during evolution. For the six non-vertebrate species, a mixture of two normal functions (dashed line) fits the histograms well (Additional file


**Size distributions of vertebrate exons fitting with normal distribution.** The histograms of exons are fitted with a Weibull function (solid line) and normal function (dashed line).

has the size distribution of non-vertebrate exons. **Figure S2** has the size distributions of **Figure S3** shows that the distribution of proto-splice sites within **Figure S4** shows the size distribution of simulated exons with different dependency values. **Figure S5** shows the linear relationship between expected and observed standard deviation of insertion ratios. **Figure S6** illustrates four different groups of insertion ratios. **Figure S7** shows the distribution of insertion ratio for each of the four groups and their mixture. **Figure S8** shows the distribution of fragment size after a certain percentage of intron losses, supporting **Figure S9A**. **Figure S10** shows the distribution of insertion ratios after a certain percentage of intron loss, supporting **Figure S9B**. **Figure S11** shows the linear relationship between the number of splitting and total CDS length for each


| **Weibull λ** | **Weibull κ** | **Normal μ** | **Normal σ** |
|---|---|---|---|
| 2.81 | 6.98 | 4.81 | 0.432 |
| 2.81 | 7.01 | 4.82 | 0.431 |
| 2.80 | 6.89 | 4.81 | 0.437 |
| 2.79 | 6.87 | 4.79 | 0.437 |
| 2.80 | 6.80 | 4.81 | 0.442 |
| 2.81 | 6.87 | 4.81 | 0.440 |
| 2.80 | 6.69 | 4.80 | 0.449 |
| 2.82 | 6.44 | 4.81 | 0.472 |

The size distribution of exon (logarithmic scale) for each vertebrate species is shown in Figure

In order to assess whether the size distribution of vertebrate exons is position-dependent, we grouped their exons from all protein coding genes according to their positions relative to 5′ UTRs/3′ UTRs. For the five well annotated vertebrates, we examined the standard deviations (σ_{E}) of the fitted normal functions at each position (e.g. Additional file). σ_{E} is almost constant for exons across the gene body, with the exception of the first three exons right after the 5′ UTR (see solid line), where it increases markedly. For exons next to the 3′ UTR (dashed line), no similar trend is observed.

**Fitted standard deviation (σ_{E}) and dependency (α) for internal exons with positions relative to 5′ UTRs (solid line) or 3′ UTRs (dashed line).** The dependency value α is calculated using equation (9).

These observations suggest that the size distribution of vertebrate exons could be properly fit with either Weibull or normal distribution. The Weibull distribution gives a better fit to both left and right tails (e.g., Additional file

Distribution of insertion ratio

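For every gene of the selected species, we calculated the insertion ratio from each pair of adjacent exons i and i+1. As the genome-wide histograms show, the insertion ratio for vertebrates follows a normal distribution with μ_{x} = 0.5 and σ_{x} = 0.13. The insertion ratio for non-vertebrates also fits a normal distribution reasonably well, but with a much larger σ_{x}.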

**Genome wide distribution of the insertion ratio (the length of exon i divided by the summed lengths of exons i and i+1).** The histograms are drawn with a bin size of 0.01 and fitted with a normal function.

| **μ** | **σ** |
|---|---|
| 0.501 | 0.132 |
| 0.501 | 0.132 |
| 0.501 | 0.135 |
| 0.501 | 0.132 |
| 0.501 | 0.132 |
| 0.502 | 0.134 |
| 0.502 | 0.136 |
| 0.502 | 0.142 |
| 0.499 | 0.185 |
| 0.502 | 0.215 |
| 0.501 | 0.152 |
| 0.502 | 0.226 |
| 0.501 | 0.157 |
| 0.501 | 0.152 |

The distribution of insertion ratios (equation (5)) for each species is shown in Figure

Another interesting observation in Figure

The normal function fitted in Figure

Correlation between insertion ratios

The correlations calculated using equation (6) are shown in Figure _{1}
_{2} = (_{1} and _{2} are proportional to


**Correlation of insertion ratios for different species.** The correlation between

The key observation in Figure

In summary, analysis of empirical data reveals three significant differences between vertebrate and non-vertebrate genomes. First, a mixture of two normal functions gives a better fit to the size distribution of non-vertebrate exons, whereas a single normal function suffices for vertebrate exons. Second, the insertion ratio of non-vertebrates also follows a normal distribution, but with a larger standard deviation than that of vertebrates. Third, the insertion ratios of non-vertebrates are more negatively correlated than those of vertebrates.

Simulation testing

Default values for the initial exon size, μ_{I}, and σ_{I}

As mentioned before, we start each simulation with a long exon. Using the counts for

For the remaining unknown parameters of GRFP, including μ_{I} and σ_{I}, we chose to examine the following values of μ_{I} and σ_{I}:

These values are determined through an EM iteration process that will be discussed in the simulation testing section. The EM iteration uses the observed values of μ_{x} and σ_{x} for vertebrates (see the table of fitted insertion-ratio parameters). In simulations, σ_{x} overestimates but is linearly proportional to σ_{I}, while μ_{x} approximates μ_{I} extremely well.

Relationship between α, the initial exon size, μ_{E}, and σ_{E}

Using the values of the initial exon size, μ_{I} and σ_{I} in equations (7) and (8), we performed three GRFP simulations with different dependency values (α). We use σ_{E} of the fitted normal function to characterize the peak width of the size distribution. It is worth re-emphasizing that both empirical and simulated distributions are skewed to the left; thus both tails of the peak are better fitted by the Weibull distribution.

| **Weibull λ** | **Weibull κ** | **Normal μ** | **Normal σ** |
|---|---|---|---|
| 4.58 | 3.95 | 4.24 | 1.216 |
| 2.88 | 4.42 | 4.72 | 0.688 |
| 2.93 | 7.25 | 4.85 | 0.430 |

The size distribution of simulated exon (logarithmic scale) for each choice of dependency value (

These simulations show that _{
E
} (or width of the peak) decreases as _{
E
} is dependent on _{
E
} and _{
E
} values (mean ± 3 standard deviations) were plotted against _{
E
} and


**Relationship between GRFP parameters and **_{E}**, **_{E}**.** (**A**) Plot of _{E} and (**B**) _{E} as a function of **C**) Plot of _{E} and (**D**) _{E} as a function of _{0}); (**E**) Plot of _{E} and (**F**) _{E} as a function of _{E} and _{E} of

From equation (9), we estimate that _{
E
} ≈ 0.43 (Table ^{rd} power, which disagrees with the independency hypothesis of earlier work

Similarly, we performed a series of GRFP simulations with different choices of _{0} and _{
E
} is independent of both _{0} and _{
E
}) of the resulting size distribution is dependent on both _{0} and _{
E
} and _{0}, using the intersection between the dashed line (_{
E
} of _{
E
} is approximately 4.81 across vertebrate genomes, we used GRFP simulation to estimate the number of splitting (_{e}) for each species (Table _{e} with _{0} in Table _{e}. _{e} and the percentages of intron loss are calculated in the same way. Note that here we use the same _{
E
} value of 4.8 for invertebrates although the size distributions of their exons (Additional file

In the figure of splitting counts, we plot m_{e} (estimated number of splitting, open circle) against CDS length and fit it with a linear function (equation (10)). The observed number of splitting events (closed circle) is also plotted for comparison. The first observation from Figure


**Plot of the number of splittings as a function of total CDS length.** Estimated splitting (open circle) is from GRFP simulation with different CDS lengths (Table ** A**),

Parameterizing GRFP via EM iteration

In the previous simulation studies, with the assumption of known _{
I
}, we have shown that _{
E
} is dependent on _{0} and _{
E
}. However, simulations show that _{
E
} is also dependent on _{
I
}. To derive the values of α and _{
I
} simultaneously without assuming knowing any one of them, here we determine their values through EM iterations, by combining simulations with empirically observed _{
I
} (Table _{
x
}, and _{
x
} (Table

Before performing EM iteration, we need to quantify how σ_{I} is related to σ_{x}. We performed a series of GRFP simulations with σ_{I} ranging from 0.06 to 0.18 and α = 3. For each simulation, we calculated the insertion ratio (equation (5)) from the resulting fragments, and estimated σ_{x} by fitting a normal function to the histogram of insertion ratios. σ_{x} is plotted against the given σ_{I} in the Additional file; from this relationship, σ_{I} is closer to 0.11 than the 0.13 estimated from the genome-wide insertion-ratio figure. The simulation process and results relating σ_{I} to σ_{x} are also shown in the Additional file.

For EM iteration, we use _{
I
} to estimate _{
I
}. The iteration start with _{
I
} = _{
x
} = 0.13 (Table

1. Given σ_{I}, determine the relationship between

2. With observed σ_{E} (Table

3. With σ_{x} and σ_{I} (Additional file

4. With the relationship and σ_{x} = 0.13, estimate σ_{I}

5. Return to step (1), iterate until convergence

The results of the EM iteration are shown in the Additional file; σ_{I} converges to 0.11.

Intron losses accounting for increasing σ_{I}, σ_{E} and more negative correlation

The table of fitted insertion-ratio parameters shows that σ_{x} of non-vertebrates is significantly larger than that of vertebrate genomes. In Figure

With simulated GRFP fragments, we gradually introduce 5-50% of “intron loss” by randomly reconnecting adjacent fragment pairs. The size distributions of the resulting fragments are fitted with a normal function (Additional file), and σ_{E} and σ_{x} are each plotted against the percentage of intron loss in the Additional file.
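The “intron loss” step can be illustrated as follows. In this hedged sketch, the helper name `apply_intron_loss` and the convention that the loss percentage is taken relative to the number of introns present before any loss are assumptions made here, not details given in the text.

```python
import random


def apply_intron_loss(fragments, loss_fraction, seed=0):
    """Remove a fraction of introns by merging randomly chosen adjacent fragments."""
    rng = random.Random(seed)
    merged = list(fragments)
    # Number of introns to remove, relative to the initial intron count.
    n_losses = int(round(loss_fraction * (len(merged) - 1)))
    for _ in range(n_losses):
        # Pick the intron between fragments j and j + 1 and reconnect them.
        j = rng.randrange(len(merged) - 1)
        merged[j:j + 2] = [merged[j] + merged[j + 1]]
    return merged


if __name__ == "__main__":
    exons = [120.0, 95.0, 210.0, 60.0, 150.0]  # hypothetical fragment lengths
    print(apply_intron_loss(exons, 0.5))
```

Merging leaves the total coding length unchanged and keeps the remaining fragments in order, so size distributions and insertion ratios can be recomputed on the output directly.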

Additional file

Discussion and conclusion

In this study, we analyze the size distribution of exons for 14 species, including eight vertebrates and six non-vertebrates. Our approach overcomes the limits of using orthologous genes, thus allowing us to infer evolutionary processes affecting the exon-intron structure across widely divergent species. Size distributions are more reliable than alignment-based approaches when repeated intron gain and loss have accumulated. Based on the size distribution of exons, we propose GRFP to characterize the evolution of eukaryotic genes. The close agreement between GRFP simulations and observations on genomic data provides several key findings on the evolution of exon-intron gene structures.

Chance of intron gain is proportional to exon size to the 3rd power

GRFP reveals that longer exons have a higher chance of gaining an intron during evolution, and yields the novel finding that the chance of intron gain is proportional to the exon length to the third power. This finding was derived by investigating real genome data, comparing them with numerical simulations, and excluding various effects on GRFP through EM iterations. It might explain why long exons are rare in modern organisms. For example, a statistical study has shown that only 3.5% of primate exons are longer than 300 nt

The “third power” is derived from _{
E
} (or width) of the exon size distributions. The model of GRFP indicates that _{
E
} will remain constant given the same dependency value _{
E
} (0.43) in Table

Why is the probability of intron gain proportional to the exon length to the third power? Given that the third power is usually related to volume, it is possible that, owing to dynamic movement, an exon occupies a volume proportional to its length to the third power, and that the chance of an intron attacking it is proportional to this volume. Further investigation will be needed to support this hypothesis.

No evidence for site-specific bias of intron insertion

We derive this finding by indirectly estimating the position distribution of intron insertion loci. We demonstrate that the insertion loci follow a normal distribution, peaking around the center of the exon with a standard deviation (σ_{I}) of 0.11. This observation does not support the proto-splice site hypothesis. If there were proto-splice sites in the exon, the insertion loci would follow the position distribution of these sites, which would most likely be a uniform distribution (Additional file

In Figure

The assumption behind the estimation of the insertion ratio (equation (5)) is that the order of the exons within each gene is maintained during evolution. In cases of tandem exon duplication, exon shuffling, or intron loss, the order is only locally disrupted. Simulation also shows that the estimated insertion ratio is a mixture of four different groups of adjacent fragment pairs (Additional file), and that σ_{x} is linearly related to σ_{I} (Additional file

Suggesting 5′ intron gain/loss bias

By grouping exons by position within a gene, we demonstrate that exons next to the 5′ UTR have a larger standard deviation (σ_{E}) than other exons. One may argue that the deviation near the 5′ UTR arises because, on average, exons are longer in genes that contain fewer exons. If this were the case, a similar trend near the 3′ UTR should have been observed. From equation (9), a larger σ_{E} indicates a smaller GRFP dependency value (

Excessive intron losses accounting for deviations from GRFP

In this study, we show that exons of non-vertebrates are different from those of vertebrates in three aspects. First, the size distribution of their exons fit a mixture of two normal distributions (Additional file _{
x
}) as shown in Table

The estimations in Table _{
E
} increases, _{
I
} increases, and

Here, the excessive intron loss hypothesis in non-vertebrate genomes is interpreted as breaking the equilibrium between intron gain and intron loss. Although GRFP model is built on modeling intron gain events, it does not assume that introns in vertebrate genomes are never lost. Instead, we interpret the straight line in Figure

The size distribution of exons (Additional file

Weakness and strength of GRFP

In this work, we propose the GRFP model to capture the dynamic processes describing the evolution of exon-intron structures. For vertebrate genomes, the model fits the well annotated genome data well, including exon size distribution, distribution of insertion loci, total CDS length, number of introns, independence among intron gains, and 5′ intron gain bias. For non-vertebrate genomes, simulations show that the deviations from the vertebrate genomes can be explained by excessive intron loss. The GRFP model implies that the evolution of gene structure is purely random, from picking which exon to split (gain an intron) to picking the intron insertion locus on the selected exon. The close agreement between GRFP simulations and real genome data confirms that the GRFP model provides one possible explanation of exon-intron structure evolution.

It is well known that a modern genome is a collection of introns that have accreted (and been deleted) over at least a billion years. Here, by considering the whole process as a black box, we reproduce the output of this box (the current day genomes) with numerical simulations. The size distribution of exons serves as the key component in building the GRFP model for two reasons. First, the dominant factor that can shape such a distribution is intron gain/loss (fragmentation). Second, the rate of the most prominent confounding factor on exon sizes, indels within exons, is low during evolution. Certainly, the mechanism of intron gain is complicated considering differences across lineages, differences in rates of insertion across sites, the age of introns, the possibility of indels to maximize fit to epigenomic structures that can occur following intron gain, alternative splicing, the different models of intron gain, and so on. Nevertheless, the process of exon fragmentation (or intron gain) might be as straightforward as the model of GRFP describes. By focusing on internal translated exons only, we have demonstrated outstanding agreement between empirical observations and GRFP simulations.

It is crucial to note that GRFP does not make any assumptions on the rate of intron gain/loss. Recent studies

One may argue that the agreement between the GRFP model and well annotated genome structure could be fortuitous. While we cannot rule out that other models might reproduce the exons of modern genomes, the predictive power of GRFP is striking, and we believe that it is a promising approach to understanding the evolution of exon-intron structures, and an excellent starting point for new models for revealing the hidden stochastic processes of evolution.

Unanswered questions and future studies

The GRFP model provides explicit rules for exon-intron structure evolution. However, it does not address the origin of introns, the mechanism of intron insertion, or the rate of intron gain/loss. In other words, GRFP addresses where introns are inserted (which exon and where in the exon), but not when and how introns are inserted. Future research will focus on extending GRFP to model the evolution of noncoding exons, and on developing GRFP-based methods for comparative genomics studies.

Abbreviations

GRFP: General Random Fragmentation Process; UTR: Untranslated region; CDS: Coding DNA Sequence; EM: Expectation Maximization.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LW developed the model, performed the data analysis and designed the simulation experiment. LW and LDS wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank the National Science Foundation (DBI-0735191) and the National Institutes of Health (P41-HG02223) for funding aspects of this work.