Dipartimento di Informatica Sistemistica e Comunicazione, Univ. degli Studi di Milano-Bicocca, Milano, 20126, Italy

Centro Ricerche e Studi Agroalimentari, Parco Tecnologico Padano, Lodi, 26900, Italy

Dipartimento di Biochimica e Biologia Molecolare "E. Quagliariello", Univ. degli Studi di Bari, Bari, 70126, Italy

Istituto di Biomembrane e Bioenergetica, Consiglio Nazionale delle Ricerche, Bari, 70126, Italy

Dipartimento di Statistica, Univ. degli Studi di Milano-Bicocca, Milano, 20126, Italy

Abstract

Background

A challenging issue in designing computational methods for predicting the exon-intron structure of a gene from a cluster of transcript (EST, mRNA) sequences is guaranteeing accuracy as well as efficiency in time and space when processing large clusters of more than 20,000 ESTs and genes longer than 1 Mb. Traditionally, the problem has been tackled by combining different tools that were not specifically designed for this task.

Results

We propose a fast method based on

The method was implemented in the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides predicting the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts.

Conclusions

PIntron, the software tool implementing our methodology, is available at

Background

A key step in the post-transcriptional modification process is called

A great deal of work has been devoted to solving two basic problems on AS: characterizing the exon-intron structure of a gene and finding the set of different transcript isoforms that are produced from the same gene. Some computational approaches to these crucial problems, based on transcript data, have been proposed; indeed, good implementations are available

In this paper we provide a specifically designed algorithm, efficient from both a theoretical and an empirical point of view, to predict the exon-intron structure of a gene from general transcript data; the predicted structure is optimal with respect to constraints derived from the input data. The algorithm is implemented in a tool called PIntron. Similarly to recent programs

On the other hand, finding putative spliced alignments (first phase) can be a challenging task when more than one alignment exists for the same transcript. For instance, there could be different possible splicing junctions between consecutive exons because of the presence of sequencing errors or repeated genomic regions. As a consequence, choosing the correct spliced alignment of a single EST sequence requires performing a multiple comparison between several spliced alignments of all the EST sequences in order to find the ones that support a common putative gene structure. In

Methods

In this paper we show how to efficiently solve the integration of the two steps of finding the (possibly different) spliced alignments of a cluster of transcripts and using them to compute a common gene structure.

Overall, our new combinatorial method for exon-intron structure prediction can be summarized as a four-stage pipeline where we:

1. Compute and implicitly represent all the spliced alignments of a transcript sequence (EST or mRNA) against a genomic reference sequence by a novel graph representation, called

2. Filter all biologically meaningful spliced alignments. This step is performed with a carefully tailored visit of the embedding graph.

3. Reconcile the spliced alignments of a set of correlated transcript sequences into a maximum parsimony consensus gene structure. To complete this task we use the

4. Extract, classify, and refine the resulting introns in order to provide a putative gene structure supported by transcript evidence.

We point out that our implementation also has a fifth step where it predicts a set of full-length isoforms by employing the graph-based method in

Our method computes a consensus gene structure minimizing the number of exons, called maximum parsimony consensus gene structure. Such a structure is strictly associated to a set of spliced alignments for each sequence in the cluster of transcript data that is also output by our algorithm. Informally, a gene structure (depicted in Figure

**The colored directed graph representing a gene structure**. The represented gene structure, induced by compositions, is composed by 6 genomic exons:

In this paper, we will evaluate all steps of the pipeline. Accuracy and efficiency of PIntron have been assessed by an experimental comparison with ASPic

In this experimental comparison, we focused on human genes given their excellent annotation status. However, PIntron has been conceived to facilitate genome annotation in a variety of organisms for which expressed sequences as well as the reference genome are available. Given the experimental results summarized above, our program enables the investigation of the impact of alternative splicing on a large scale.

The rest of this section is devoted to presenting each algorithmic step of our four-stage pipeline.

Implicit computation of spliced alignments

The first stage of our gene structure prediction method computes the set of all possible spliced alignments of a transcript (EST or mRNA) sequence against the genomic sequence.

A spliced alignment is a particular kind of alignment that takes into account the effects of the excision of the intronic regions during the RNA splicing process. The spliced sequence alignment problem requires computing, given a sequence $P$ and a reference sequence $T$, a set $F_P = \{f_1, \ldots, f_k\}$ of factors with $P = f_1 \cdots f_k$, where each $f_i$

In our novel alignment method, we exploit the small edit distance between each pair _{i }^{2}), where

In the following we detail the notion and construction of the embedding graph. Let us first recall that, according to the traditional notation, given a string $S = s_1 s_2 \cdots s_q$, its substring from position $i$ to position $j$ is $s_i s_{i+1} \cdots s_j$.

A fundamental notion is that of

We say that a pairing _{1 }is a substring of the factor induced by _{1 }is a

A sequence of non-overlapping pairings (i.e. pairings that represent non-overlapping occurrences of common substrings) is called an

**An embedding and its relationships with the genome and a transcript**. The strings labelled 1 to 9 are substrings shared by the genome and the transcript and correspond to pairings. Each common substring (pairing) is longer than a fixed threshold $\ell_E$. The thresholds $\ell_D$ and $\ell_I$ apply to the gaps between consecutive pairings.

Not all embeddings induce a biologically meaningful composition. For example, an embedding made of several short pairings "scattered" along the genome cannot be considered a valid spliced alignment. In order to restrict embeddings to those useful for building a spliced alignment, we fix three parameters $\ell_E$, $\ell_D$, and $\ell_I$, and we call an embedding $\langle (p_1, t_1, l_1), \ldots, (p_n, t_n, l_n) \rangle$ representative if, for each $i$: (1) $l_i \ge \ell_E$; (2) $p_{i+1} - p_i - l_i \le \ell_D$; and (3) either $|(t_{i+1} - t_i) - (p_{i+1} - p_i)| \le \ell_D$ or $(t_{i+1} - t_i) - (p_{i+1} - p_i) \ge \ell_I$.

Indeed, a careful choice of the three parameters ℓ_{E}_{D }_{I }_{E}_{D }_{I}

In this first stage of the pipeline, we tackle the RE problem by using the embedding graph defined as follows.

**Definition** (Embedding Graph). Given a pattern $P$ and a text $T$, the embedding graph of $P$ in $T$ is a directed graph $G = (V, E)$ such that the vertex set $V$ is the set of maximal pairings of $P$ and $T$ that are longer than $\ell_E$

Basically the conditions of the definition of Embedding Graph ensure the following crucial property: Two maximal pairings

We will use this property to build representative embeddings from an embedding graph. Observe that such a property derives from the maximality of the representative embeddings and from the uniqueness of the maximal pairing containing a pairing which belongs to a representative embedding.

We designed an algorithm that builds the embedding graph of a pattern ^{2}). The algorithm is composed of two steps. In the first step, the vertex set ^{2}) procedure. Since the number of maximal pairings is usually very small compared to the length of
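To make the notion of the vertex set concrete, the following is a minimal sketch of how maximal pairings of a pattern and a text could be enumerated. It is not the algorithm used by PIntron: it is a naive dynamic-programming illustration that takes time proportional to the product of the two lengths. The triple layout `(p, t, l)`, the function name `maximal_pairings`, and the use of "at least `min_len`" as the length filter are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation: (p, t, l) = 0-based start in P, start in T, length.
Pairing = Tuple[int, int, int]


def maximal_pairings(P: str, T: str, min_len: int) -> List[Pairing]:
    """Return every maximal common substring of P and T of length >= min_len."""
    n, m = len(P), len(T)
    # lcs[i][j] = length of the longest common suffix of P[:i] and T[:j].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if P[i - 1] == T[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1

    pairings = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = lcs[i][j]
            # Left-maximality holds by construction (longest common suffix);
            # the match is right-maximal if it cannot be extended by one character.
            right_maximal = i == n or j == m or P[i] != T[j]
            if length >= min_len and right_maximal:
                pairings.append((i - length, j - length, length))
    return sorted(pairings)


if __name__ == "__main__":
    transcript = "ACGTTTGCA"
    genome = "ACGTAAAAATTGCA"
    for p, t, l in maximal_pairings(transcript, genome, min_len=4):
        print(p, t, l, transcript[p:p + l])
```

In this toy instance the two reported pairings play the role of two exon seeds separated on the genome by a run of unmatched bases. A production implementation would typically rely on an index such as a suffix tree or suffix array rather than a full quadratic table.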

Extraction of relevant spliced alignments

The next stage of our pipeline is devoted to analyzing and mining the embedding graph to compute the representative embeddings that also induce compositions. The procedure **ComputeCompositions** works in two steps. Initially it extracts a subset of representative embeddings by performing a visit of the embedding graph. Then it computes the compositions by merging consecutive pairings that are separated by short gaps.

Embedding graph visit

The first step of **ComputeCompositions **is a recursive visit of the embedding graph starting from a subset of vertices that we call

Such a procedure visits the embedding graph examining and extracting only pairwise-distinct representative embeddings that are biologically meaningful (for example with respect to the length of gaps representing errors or introns). More precisely, the visit of a vertex

We will now explain the main steps of the procedure. During the visit of vertex $v_k$, we consider each outgoing edge $(v_k, v_{k+1})$ and we "extend" each embedding ending in $v_k$ with $v_{k+1}$, according to the possible relative positions of the two pairings depicted in the figure below, where $v_k = (p_k, t_k, l_k)$ and $v_{k+1} = (p_{k+1}, t_{k+1}, l_{k+1})$. Observe that given two pairings that are connected by an edge in the embedding graph, the corresponding factors might be overlapping in the text or in the pattern. To simplify the notation, in the following we identify a pairing with the factor it induces.

**Possible relative positions of two maximal pairings connected by an embedding graph edge**. The figure presents the possible configurations of relative positions of two maximal pairings $v_k = (p_k, t_k, l_k)$ and $v_{k+1} = (p_{k+1}, t_{k+1}, l_{k+1})$ connected by an embedding graph edge $(v_k, v_{k+1})$. Each box represents a common maximal factor; the left borders of the upper and lower normal boxes mark the start positions of $v_{k+1}$ on the two sequences. The quantity $|(t_{k+1} - t_k) - (p_{k+1} - p_k)|$ is the difference between the displacements of the two pairings on $T$ and on $P$. The four panels cover the cases in which the factors of $v_k$ and $v_{k+1}$ (a) overlap on both $T$ and $P$, (b) and (c) overlap on only one of the two sequences, and (d) overlap on neither $T$ nor $P$.

**Case (a)**. Factors _{k }_{k}_{+1 }overlap on both _{k }_{k}_{+1 }on _{D}_{I}_{k}_{+1 }-_{k}_{k}_{+1 }- _{k}_{D }_{k }_{k }_{k}_{+1 }of _{k}_{+1 }such that they do not overlap and that both _{k}_{+1 }are at least ℓ_{E }_{k}_{+1 }- _{k}_{k}_{+1 }- _{k}_{I}_{k }_{k}_{+1 }to produce a unique factor (exon) of the embedding

**Case (b)**. Factors induced by _{k }_{k}_{+1 }overlap in

**Case (c)**. Factors _{k }_{k}_{+1 }overlap in _{k}_{+1 }- _{k}_{k}_{+1 }-_{k}_{D }_{k}_{+1 }-_{k }_{k}_{+1 }- _{k}_{I}

**Case (d)**. Factors _{k }_{k}_{+1 }do not overlap neither in _{T }_{P }_{k }_{k}_{+1 }in _{P }_{T }_{k }_{k}_{+1 }are part of the same factor or (ii) there is an intron between _{k }_{k}_{+1}. Similarly to Case (a), two different sub-cases may arise. If |(_{k}_{+1 }- _{k}_{k}_{+1 }- _{k}_{D}_{k }_{k}_{+1 }might belong to the same factor of the induced composition. More precisely, _{k }_{k}_{+1 }belong to the same factor if the edit distance between _{T }_{P }_{k}_{+1 }is added to embedding _{k}_{+1 }-_{k }_{k}_{+1 }+ _{k }_{I}_{T }_{P }_{k}_{k}_{+1}) is discarded, otherwise _{k}_{+1 }is added to _{D}

The definition of embedding graph allows the presence of directed cycles, which potentially might be troublesome. However, we claim that the embeddings, computed from a path **ComputeCompositions** guarantees that each possible representative embedding is analyzed. However, the biological criteria that we employ allow to consider only pairings belonging to biologically meaningful embeddings. Since the visit computes pairwise-distinct representative embeddings and every case presented above requires

Composition reconstruction

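The set of compositions is then obtained from the extracted representative embeddings by merging each pair of consecutive pairings $v_k = (p_k, t_k, l_k)$ and $v_{k+1} = (p_{k+1}, t_{k+1}, l_{k+1})$ separated by small gaps, that is, $|t_{k+1} - t_k - p_{k+1} + p_k| \le \ell_D$.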
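The merging step can be illustrated with a small sketch. It assumes that an embedding is given as a list of pairings `(p, t, l)` (start on the transcript, start on the genome, length) sorted by `p`, that two consecutive pairings are merged when the difference of their displacements on the genome and on the transcript is at most a threshold `max_error_gap` (playing the role of the small-gap parameter), and that a new factor is opened when that difference is at least `min_intron_len`. The function name, the parameter names, and the decision to reject every other configuration are choices made for this example, not details confirmed by the text.

```python
from typing import List, Tuple

Pairing = Tuple[int, int, int]        # (p, t, l), 0-based, hypothetical layout
Factor = Tuple[int, int, int, int]    # (p_start, p_end, t_start, t_end), ends exclusive


def build_composition(embedding: List[Pairing],
                      max_error_gap: int,
                      min_intron_len: int) -> List[Factor]:
    """Merge consecutive pairings separated by small gaps into factors (exons)."""
    factors: List[List[int]] = []
    prev_diag = None  # t - p of the previous pairing
    for p, t, l in embedding:
        diag = t - p
        if prev_diag is not None and abs(diag - prev_diag) <= max_error_gap:
            # Small gap: treat it as sequencing errors and extend the current factor.
            factors[-1][1] = p + l
            factors[-1][3] = t + l
        elif prev_diag is None or diag - prev_diag >= min_intron_len:
            # First pairing, or a genomic gap long enough to be an intron.
            factors.append([p, p + l, t, t + l])
        else:
            # Too long for an error gap, too short for an intron.
            raise ValueError(f"pairing {(p, t, l)} does not fit the thresholds")
        prev_diag = diag
    return [(a, b, c, d) for a, b, c, d in factors]


if __name__ == "__main__":
    emb = [(0, 100, 20), (22, 123, 30), (52, 653, 40)]
    for factor in build_composition(emb, max_error_gap=3, min_intron_len=50):
        print(factor)
```

With these inputs the first two pairings collapse into a single factor, and the third one opens a second factor whose genomic start lies 500 bases downstream of the end of the first, which corresponds to an intron in the induced composition.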

Building a gene structure

The first two stages of our pipeline are applied separately to each transcript sequence _{i }_{i}_{i}

We aim to produce a maximum parsimony _{i }_{i}_{i}_{i }_{1}, _{2},..., _{|F|}〉 of genomic factors induced by the compositions in ∪_{i}_{i }

Now, the CG problem can be faced by using the approach

Let us recall the definition of the MFA problem. Let _{1}, _{2},..., _{|F|}〉 be a finite ordered set of sequences over alphabet Σ, called _{j }< i_{j}_{+1 }for 1 ≤ _{s}_{∈}_{S}F

The _{i }_{i}_{i}

By applying the algorithm in

Intron reduction

Although the intron boundaries of the EST spliced compositions are computed by finding the best transcript-genome alignment over the splice site regions and the most frequent intron pattern (i.e. the first and the last two nucleotides of an intron) according to

In the following, let the pair (_{j}_{j}_{+1 }inducing intron _{j }_{j}_{+1 }of a new spliced composition of _{1}, _{2}, _{3 }and _{1}, _{2}, _{3 }that allows the reduction.

Results

We implemented the approach described in the previous section as a set of programs in the software package PIntron. PIntron receives a genomic sequence and a set of transcripts (ESTs and/or mRNAs) and computes a representation of the exon-intron structure of the gene as well as a set of predicted full-length annotated isoforms. PIntron outputs the list of the predicted introns with information such as relative and absolute start and end positions, intron lengths, the donor and acceptor splice sites, and intron types (U12, U2 or unclassified). For each isoform, the output gives its composition in terms of exons and, for each exon, the start and end positions in relative and absolute coordinates, as well as whether a polyA signal is present and the lengths of the 5'UTR and 3'UTR. Moreover, additional information is given for each predicted isoform, such as its length, the CDS start and end positions, the RefSeq ID (if it exists) and the length of the associated protein.
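As an illustration of the intron-level information listed above, the following sketch shows one way such a record could be modelled. It is not PIntron's actual output format: the class name `IntronRecord`, the field names, the 1-based inclusive coordinates, and the forward-strand assumption are all choices made for this example, and the U12/U2 classification is left as a plain label rather than computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntronRecord:
    rel_start: int      # 1-based, inclusive, relative to the input genomic region
    rel_end: int        # 1-based, inclusive
    region_offset: int  # absolute coordinate of the first base of the region
    donor: str          # first two nucleotides of the intron
    acceptor: str       # last two nucleotides of the intron
    intron_type: str = "unclassified"  # "U2", "U12" or "unclassified"

    @property
    def length(self) -> int:
        return self.rel_end - self.rel_start + 1

    @property
    def abs_start(self) -> int:
        return self.region_offset + self.rel_start - 1

    @property
    def abs_end(self) -> int:
        return self.region_offset + self.rel_end - 1


def make_intron(region: str, rel_start: int, rel_end: int,
                region_offset: int) -> IntronRecord:
    """Build a record for an intron on the forward strand of `region`."""
    if not (1 <= rel_start and rel_start + 3 <= rel_end <= len(region)):
        raise ValueError("intron must lie inside the region and span at least 4 bases")
    donor = region[rel_start - 1:rel_start + 1]
    acceptor = region[rel_end - 2:rel_end]
    return IntronRecord(rel_start, rel_end, region_offset, donor, acceptor)


if __name__ == "__main__":
    rec = make_intron("AAAAGTCCCCCCAGTTTT", 5, 14, region_offset=1000)
    print(rec.donor, rec.acceptor, rec.length, rec.abs_start, rec.abs_end)
```

Keeping relative coordinates as the stored fields and deriving the absolute ones from the region offset avoids the two sets of coordinates drifting apart.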

PIntron source code and binaries are available under the GNU AGPLv3 license at

In the following, we discuss an experimental

We have assessed the accuracy achieved by PIntron by comparing it with ASPic

Exogean is a gene prediction tool based on pre-aligned (by Blat

The accuracy assessment has been performed on 13 ENCODE human regions

Main characteristics of the dataset used for the accuracy assessment of PIntron

| **Region** | **Genomic length (nt)** | **Number of genes** | **Number of transcripts** | **Overall transcript length (nt)** |
|---|---|---|---|---|
| ENm004 | 1,700,000 | 18 | 6,964 | 4,497,709 |
| ENm006 | 1,338,447 | 35 | 18,230 | 11,377,148 |
| ENr111 | 500,000 | 2 | 171 | 113,356 |
| ENr114 | 500,000 | 1 | 35 | 120,734 |
| ENr132 | 500,000 | 4 | 855 | 551,266 |
| ENr222 | 500,000 | 2 | 461 | 277,554 |
| ENr223 | 500,000 | 5 | 50,607 | 32,732,634 |
| ENr231 | 500,000 | 11 | 5,637 | 3,534,406 |
| ENr232 | 500,000 | 9 | 4,779 | 2,505,934 |
| ENr323 | 500,000 | 5 | 1,670 | 997,647 |
| ENr324 | 500,000 | 1 | 487 | 343,220 |
| ENr333 | 500,000 | 12 | 7,179 | 4,381,534 |
| ENr334 | 500,000 | 7 | 989 | 611,795 |
| Total | 8,538,447 | 112 | 98,064 | 62,044,937 |

**Supplementary tables**. Characteristics of the first dataset and detailed results obtained in the experimental comparison.

The results of our first assessment are summarized in Table

Summary of the experimental results on the **112** gene loci on the **13** ENCODE regions

| **Level** | **Measure** | **PIntron** | **Exogean** | **ASPic** |
|---|---|---|---|---|
| Exon level | Sn | **0.529** | 0.444 | 0.390 |
| Exon level | Sp | **0.622** | 0.606 | 0.427 |
| Intron level | Sn | **0.874** | 0.733 | 0.633 |
| Intron level | Sp | **0.789** | 0.777 | 0.567 |
| Transcript level | Sn | **0.564** | 0.251 | 0.342 |
| Transcript level | Sp | 0.418 | **0.450** | 0.252 |
| Nucleotide level | Sn | **0.889** | 0.657 | 0.635 |
| Nucleotide level | Sp | **0.916** | 0.865 | 0.632 |
| Annotated genes | | **112** | 104 | 93 |
| Total running time (seconds) | | **2,961** | 3,446 | 168,607 |

The best value of each row is highlighted in boldface.

Running times of PIntron and Exogean on the 26 "critical" genes

| **Gene** | **Genomic length (nt)** | **Number of transcripts** | **PIntron running time (seconds)** | **Exogean running time (seconds)** |
|---|---|---|---|---|
| ACTB | 36,634 | 26,248 | 287.35 | 371.22 |
| ALB | 24,299 | 16,920 | 144.17 | 369.38 |
| ANKS1B | 1,258,645 | 406 | 15.60 | 0.92 |
| ANXA1 | 512,535 | 2,087 | 20.65 | 7.63 |
| ATP1A1 | 619,226 | 3,241 | 27.82 | 11.90 |
| ATP5A1 | 405,213 | 9,864 | 143.33 | 70.93 |
| CDH13 | 1,169,823 | 507 | 10.34 | 1.02 |
| CNTNAP2 | 2,304,964 | 227 | 30.86 | 1.01 |
| CTNNA2 | 1,463,710 | 261 | 12.71 | 0.96 |
| CUGBP2 | 1,081,163 | 864 | 18.04 | 2.42 |
| DAB1 | 1,551,956 | 164 | 14.51 | 0.85 |
| DLG2 | 2,172,263 | 279 | 21.18 | 1.15 |
| DMD | 2,241,933 | 329 | 35.35 | 2.21 |
| ENO1 | 185,661 | 13,131 | 119.84 | 125.51 |
| FGG | 579,042 | 2,033 | 15.40 | 3.56 |
| FHIT (†) | 1,502,110 | 134 | 202.35 | n.a. |
| GAPDH | 46,975 | 15,518 | 149.64 | 232.81 |
| HINT1 | 873,331 | 844 | 12.02 | 3.08 |
| HSP90AA1 | 384,611 | 6,710 | 47.37 | 13.87 |
| HSPA8 | 90,642 | 15,850 | 118.47 | 152.84 |
| KCNIP4 | 1,220,613 | 107 | 10.09 | 0.65 |
| MBP | 154,857 | 21,071 | 251.70 | 1,344.42 |
| NCAM1 | 317,404 | 1,293 | 12.54 | 1.63 |
| RPL3 | 187,677 | 12,208 | 90.15 | 108.12 |
| TBC1D22A | 1,378,585 | 467 | 115.99 | 2.27 |
| TTN | 304,814 | 1,349 | 1,952.58 | 6.77 |
| Total | 22,068,686 | 152,112 | 3,880.05 | 2,837.94 |

† Exogean did not successfully compute a gene structure for FHIT.

Prediction quality has been evaluated by calculating sensitivity (Sn) and specificity (Sp) between ENCODE annotations and predictions at nucleotide, exon, intron, and transcript level, according to Burset and Guigó
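The two measures can be sketched in a few lines. In the gene-structure evaluation convention of Burset and Guigó, sensitivity is the fraction of annotated features that are correctly predicted, TP / (TP + FN), and specificity is the fraction of predicted features that are correct, TP / (TP + FP). The example below applies them at the exon level, where a predicted exon counts as correct only if both boundaries match an annotated exon; the function name and the toy coordinates are illustrative assumptions, not data from the experiments.

```python
from typing import Iterable, Tuple

Feature = Tuple[int, int]  # (start, end) of an exon or intron


def sensitivity_specificity(annotated: Iterable[Feature],
                            predicted: Iterable[Feature]) -> Tuple[float, float]:
    """Return (Sn, Sp) with Sn = TP/(TP+FN) and Sp = TP/(TP+FP)."""
    annotated_set, predicted_set = set(annotated), set(predicted)
    true_positives = len(annotated_set & predicted_set)
    sn = true_positives / len(annotated_set) if annotated_set else 0.0
    sp = true_positives / len(predicted_set) if predicted_set else 0.0
    return sn, sp


if __name__ == "__main__":
    annotated_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted_exons = [(100, 200), (300, 400), (510, 600)]
    sn, sp = sensitivity_specificity(annotated_exons, predicted_exons)
    print(f"Sn = {sn:.3f}, Sp = {sp:.3f}")
```

The same function applies unchanged at the intron level by passing intron coordinates; the nucleotide level would instead compare sets of individual positions.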

**Accuracy achieved by PIntron, Exogean and ASPic at various levels**. The boxplot presents the distribution of specificity and sensitivity achieved by the three tools at the exon, intron, transcript and nucleotide levels. The vertical edges of the boxes represent the first quartile, the median and the third quartiles (from left to right). The cross is the average. The vertical dashed lines represent an estimate of the 95% confidence interval of the median. The circles are all the outliers with respect to such confidence interval.

Our second experimental analysis is devoted to evaluating the efficiency and the scalability of our approach on a subset of

To this aim, we selected 26 "critical" genes and we processed them with PIntron and Exogean on a 4-node Linux cluster running CentOS 5.5. Each node is equipped with a quad-core 2.40 GHz CPU and 32 GiB of RAM. The genomic sequence has an average length of about 848 Kb, and is longer than 1 Mb for 11 of the 26 genes. Moreover, the selected genes have on average more than 5,000 transcripts, and 5 genes have more than 15,000 transcripts. The total running time was about 65 minutes (3,880 seconds) for PIntron and about 47 minutes (2,838 seconds) for Exogean. In this evaluation, we did not take into account ASPic since it was not able to give a solution for any of these genes within an acceptable time. Table

We want to point out that our second experiment has limited scope. In fact, a complete comparison of PIntron and Exogean would also include the accuracy dimension. The results of the first experiment suggest that PIntron is more accurate than Exogean. If confirmed, the greater accuracy would justify the small increase in the running times that we have observed.
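The size of that increase can be checked directly from the totals reported in the running-time tables. The short computation below is only a worked piece of arithmetic on those published figures; the variable names are illustrative. Because Exogean produced no result for FHIT, the ratio is computed on the 25 genes that both tools completed.

```python
# Totals taken from the tables of this section (seconds).
PINTRON_TOTAL_26_GENES = 3880.05
EXOGEAN_TOTAL_25_GENES = 2837.94   # no result for FHIT
PINTRON_FHIT = 202.35
PINTRON_TOTAL_112_LOCI = 2961.0    # first experiment

# Average running time per gene in the two experiments.
avg_second_experiment = PINTRON_TOTAL_26_GENES / 26
avg_first_experiment = PINTRON_TOTAL_112_LOCI / 112

# Like-for-like comparison on the 25 genes completed by both tools.
pintron_total_25_genes = PINTRON_TOTAL_26_GENES - PINTRON_FHIT
slowdown_ratio = pintron_total_25_genes / EXOGEAN_TOTAL_25_GENES

if __name__ == "__main__":
    print(f"PIntron, second experiment: {avg_second_experiment:.1f} s/gene")
    print(f"PIntron, first experiment:  {avg_first_experiment:.1f} s/gene")
    print(f"PIntron / Exogean on 25 genes: {slowdown_ratio:.2f}x")
```

On the shared genes PIntron therefore takes roughly 1.3 times as long as Exogean in total, and most of that difference comes from a single gene, TTN.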

The analysis of the running times of the first and the second part of the experimentation has not shown any significant correlation between the length of the genes and the running times, hence confirming our conjecture that the behavior of our algorithm depends on some properties of the Embedding Graph, and not on the size of the instance. In particular, the structure of the Embedding Graph is strictly related to the quality of the transcripts and to the presence of repetitions and highly duplicated regions in the genomic sequence that, in turn, could influence the size of the graph. Also these results have confirmed our beliefs, since the average running time of the second experiment (149 sec/gene) is not too far from the running times on the smaller genes of the first experiment, where the average value is 26 sec/gene. A fundamental observation is that PIntron has successfully completed the analysis of all 26 "critical" genes, while Exogean did not complete the analysis for

Conclusions

In this work, we presented a new computational pipeline, PIntron, for predicting the exon-intron structure of a gene from a cluster of transcript (EST, mRNA) sequences. PIntron combines two ideas: a novel algorithm with provably small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of the EST sequences, those that allow inferring splice site junctions largely confirmed by the input data. PIntron is freely available at

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YP and RR designed the algorithm, developed the pipeline, designed and helped to perform the experiments, and drafted the manuscript. EP helped to design and to perform the experiments, and interpreted the results. GP helped to design the experiments and supervised the interpretation of the results. GDV helped to design the algorithm, to develop the pipeline, and to draft the manuscript. PB designed the algorithm, helped to draft the manuscript, and supervised the research. All authors read and approved the final manuscript.

Acknowledgements

We thank Marcello Varisco for the implementation of some parts of the pipeline. This research was supported in part by FAR MIUR 60% grant "Algorithmic methods and combinatorial structures in Bioinformatics" (Univ. di Milano-Bicocca) to YP, RR, GDV, and PB, grant "Dote ricerca applicata" 21_ARA (FSE, Regione Lombardia) to YP, and Ministero dell'Istruzione, dell'Università e della Ricerca, Italy: Fondo Italiano Ricerca di Base, "Laboratorio Internazionale di Bioinformatica" (LIBI), "Laboratorio di Bioinformatica per la Biodiversità Molecolare" (DM19410), PRIN 2009; Progetto Strategico Regione Puglia PS 012; Progetto EPIGEN (CNR) to GP.

This article has been published as part of