Department of Mathematics and Statistics, University of Ottawa, Canada

Technische Fakultät, Universität Bielefeld, Germany

Abstract

Background

Mancheron, Uricaru and Rivals (

Results

In this paper we introduce a series of algorithms that improve over the approach of Mancheron et al.

Background

Comparative approaches are an important source of information when it comes to the analysis of newly sequenced genomes. On the level of genes, the use of reciprocal BLAST hits is the most widely accepted approach suitable for tasks like gene annotation and the inference of homologies. However, it is a notoriously slow process, especially when it comes to all-against-all comparisons of several genomes, as commonly used in multiple genome comparisons. The alignment of whole genomes is known to be a computationally hard problem

Recently, Mancheron et al.

In this paper, we revisit the above problem and introduce three new algorithms that improve both the asymptotic time complexity and the practical performance over the approach of Mancheron et al.

Methods

Preliminaries

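For the basic definitions of maximum overlapping intervals we closely follow the notation of Mancheron et al. For two intervals $I_1$ and $I_2$, we have $I_1 \subseteq I_2$ if and only if $\mathrm{start}(I_2) \le \mathrm{start}(I_1)$ and $\mathrm{end}(I_1) \le \mathrm{end}(I_2)$.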

In the following a target genome is given as a sequence _{1}, . . ., _{k}_{j }

An interval _{1}_{k}_{j }

The computational problem we study in this paper is the following: given a set of collections of base intervals, find all its maximum overlapping intervals (MOIs).
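To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It assumes that an MOI is a maximal non-empty interval contained in at least one base interval of every collection, and that intervals are closed and given as integer pairs `(start, end)`. The function name `brute_force_mois` and the input layout are hypothetical choices made for illustration. The enumeration is exponential in the number of collections, so it is only meant as a reference for checking faster algorithms on small inputs. The data are the three small collections of the LinearMOI example figure.

```python
from itertools import product


def brute_force_mois(collections):
    """Return all MOIs of `collections` as a sorted list of (start, end).

    `collections` is a list of k lists of closed intervals (start, end).
    Assumption: an MOI is a maximal non-empty intersection of one base
    interval from every collection.
    """
    if not collections:
        return []
    candidates = set()
    # Intersect one base interval from every collection.
    for combo in product(*collections):
        start = max(s for s, _ in combo)
        end = min(e for _, e in combo)
        if start <= end:  # the intersection is non-empty
            candidates.add((start, end))

    def inside(a, b):
        """True if interval a is contained in interval b."""
        return b[0] <= a[0] and a[1] <= b[1]

    # Keep only candidates that are not contained in a different candidate.
    return sorted(a for a in candidates
                  if not any(a != b and inside(a, b) for b in candidates))


example = [
    [(1, 4), (6, 8), (6, 9)],
    [(2, 6), (4, 9)],
    [(3, 5), (7, 10)],
]
print(brute_force_mois(example))  # [(3, 4), (7, 9)]
```

Under these assumptions, the candidates (4, 4) and (7, 8) are discarded because they are contained in (3, 4) and (7, 9), respectively.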

Upper bound for the number of MOIs

Before presenting our algorithms, we derive a tight upper bound for the number of MOIs in a set of

As already shown in supplementary file 1 of

**Lemma 1**.

**Proof**. Assume the MOIs to be ordered by their beginning from left to right. Clearly, the leftmost MOI must contain at least

To show that the bound is tight, we construct an example where the number of MOIs is actually

**Interval collections with a maximized number of MOIs**. Example of interval collections with a maximized number of MOIs. The bars indicate the location of the

Algorithms for finding maximum overlapping intervals

In this section we present three new algorithms to find maximum overlapping intervals. Without loss of generality we assume that the set of base intervals is

Algorithm LinearMOI

The outline of our first algorithm, LinearMOI, is as follows: while going through the sorted list of base intervals, we track for each of the

Obviously,

**Observation 1**.

Using Observation 1 we can find the current leftmost non-zero entry in

Pseudocode of the algorithm is shown in Algorithm 1 (LinearMOI). An example with
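The counting idea outlined above can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' implementation. It assumes closed intervals with integer positions between 1 and `genome_length`, so that 0 can serve as a sentinel end point for collections that have not been seen yet. The names `linear_moi`, `end_point`, `c`, `min_end` and `last_end` are hypothetical. The sketch sorts its input for convenience, whereas the algorithm in the text expects an already sorted list.

```python
def linear_moi(collections, genome_length):
    """Report MOIs with a counter array over end positions (sketch)."""
    k = len(collections)
    # (start, end, collection index), ordered by start position.
    intervals = sorted((s, e, j) for j, coll in enumerate(collections)
                       for s, e in coll)
    end_point = [0] * k            # largest end seen so far per collection
    c = [0] * (genome_length + 1)  # c[p] = number of collections ending at p
    c[0] = k                       # all collections start at the sentinel
    min_end = 0                    # leftmost non-zero entry of c
    last_end = 0                   # end of the most recently reported MOI
    mois = []
    for idx, (start, end, j) in enumerate(intervals):
        if end_point[j] < end:
            c[end_point[j]] -= 1
            end_point[j] = end
            c[end] += 1
        last_of_group = (idx + 1 == len(intervals)
                         or intervals[idx + 1][0] != start)
        if last_of_group:
            # End points only grow, so this pointer only moves right.
            while c[min_end] == 0:
                min_end += 1
            if min_end >= start and min_end > last_end:
                mois.append((start, min_end))
                last_end = min_end
    return mois


example = [
    [(1, 4), (6, 8), (6, 9)],
    [(2, 6), (4, 9)],
    [(3, 5), (7, 10)],
]
print(linear_moi(example, 10))  # [(3, 4), (7, 9)]
```

Because `min_end` never moves left, the inner `while` loop performs at most `genome_length` steps over the whole run, which is what keeps the scan linear once the input is sorted.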

**Algorithm 1 (LinearMOI)**

**Input:** sorted list of all intervals

**Variables:** largest end point seen so far in each collection

1:

2:

3:

4:

5: **for all **

6: **if ****then **

7:

8:

9:

10: **end if **

11: **if **all intervals with recent start position processed **then **

12: **while ****do **

13:

14: **end while **

15: **if ****and ****then **

16: output MOI(

17:

18: **end if **

19: **end if **

20: **end for **

**Example of algorithm LinearMOI**. Example of algorithm LinearMOI with $C_1 = \{(1, 4), (6, 8), (6, 9)\}$ (dark gray), $C_2 = \{(2, 6), (4, 9)\}$ (light gray) and $C_3 = \{(3, 5), (7, 10)\}$ (medium gray). The left column shows the state of

Algorithm CircularMOI

The space needed to store the counter array

**Observation 2**.

**Observation 3**.

Based on these observations we can replace array

To employ the new data structure, several modifications need to be made in algorithm LinearMOI. Obviously we need to use the modulo operation whenever accessing

To keep

Finally, we know that we can only have an MOI if

In practice this approach is quite slow because of the extensive use of the modulo operation. However when extending the length of
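A circular counter array along these lines might look like the following Python sketch. It is one possible reading of the idea, written under several assumptions: intervals are closed with integer positions of at least 1, the window covers the positions from the current start `s` to `s + ℓ` with `ℓ` taken as the largest value of `end - start`, and a counter `active` (a hypothetical name) tracks how many collections currently have an end point inside the window. All identifiers are illustrative and the code is not the authors' implementation.

```python
def circular_moi(collections):
    """Report MOIs with a circular counter array of size l + 1 (sketch)."""
    k = len(collections)
    intervals = sorted((s, e, j) for j, coll in enumerate(collections)
                       for s, e in coll)
    if not intervals:
        return []
    size = max(e - s for s, e, _ in intervals) + 1
    c = [0] * size       # c[p % size] = collections whose end point is p
    end_point = [0] * k  # 0 is a sentinel below every start position
    active = 0           # collections with end point >= current start
    low = 0              # leftmost position currently represented in c
    min_end = 0
    last_end = 0
    mois = []
    for idx, (start, end, j) in enumerate(intervals):
        # Slide the window: positions left of `start` expire.
        for p in range(low, min(start, low + size)):
            active -= c[p % size]
            c[p % size] = 0
        low = start
        if end_point[j] < end:
            if end_point[j] >= start:
                c[end_point[j] % size] -= 1
            else:
                active += 1  # the collection re-enters the window
            end_point[j] = end
            c[end % size] += 1
        last_of_group = (idx + 1 == len(intervals)
                         or intervals[idx + 1][0] != start)
        # An MOI can only start here if every collection is active.
        if last_of_group and active == k:
            min_end = max(min_end, start)
            while c[min_end % size] == 0:
                min_end += 1
            if min_end > last_end:
                mois.append((start, min_end))
                last_end = min_end
    return mois


example = [
    [(1, 4), (6, 8), (6, 9)],
    [(2, 6), (4, 9)],
    [(3, 5), (7, 10)],
]
print(circular_moi(example))  # [(3, 4), (7, 9)]
```

In this sketch the extra memory depends on the longest interval rather than on the genome length, at the price of a modulo operation on every array access.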

Algorithm TestMOI

Having concentrated on asymptotic runtime so far, we now focus on practical performance. Although the previous algorithms perform very well in our benchmark experiments, as we will demonstrate in the Results section, they use linear extra memory, which might limit their usability for larger datasets.

**Algorithm 2 (CircularMOI)**

**Input:** sorted list of intervals interval[1..n]; number of collections k; length of the longest interval ℓ

**Variables:** largest end point seen so far in each collection endPoint[1..k]; c[0..ℓ]

1:

2:

3:

4:

5:

6: **for all ****∈ ****do **

7: **while ****do **

8:

9:

10:

11: **end while **

12: **if ****then **

13: **if ****then **

14:

15: **else **

16:

17: **end if **

18:

19:

20: **end if **

21: **if **all intervals with recent start position processed **and ****then **

22: **while ****do **

23:

24: **end while **

25: **if ****then **

26: output MOI(

27:

28: **end if **

29: **end if **

30: **end for **

We present a third algorithm, TestMOI, that works without a counter array. To find

**Observation 4**.

Therefore a test for a new minimum needs to be performed only when the smallest of the current endpoints changes. This gives rise to the procedure shown in Algorithm 3 (TestMOI).
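The following Python sketch illustrates this idea under the same assumptions as before: closed intervals, integer positions of at least 1, and 0 as a sentinel end point. The minimum over all current end points is recomputed by a scan over the `k` collections, but only after a collection that held the current minimum has moved. The names `moi_by_rescan` and `min_changed` are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
def moi_by_rescan(collections):
    """Report MOIs without a counter array (sketch)."""
    k = len(collections)
    intervals = sorted((s, e, j) for j, coll in enumerate(collections)
                       for s, e in coll)
    end_point = [0] * k  # largest end seen so far per collection
    min_end = 0          # smallest value in end_point at the last scan
    min_changed = False  # True if a collection holding the minimum moved
    last_end = 0         # end of the most recently reported MOI
    mois = []
    for idx, (start, end, j) in enumerate(intervals):
        if end_point[j] < end:
            if end_point[j] == min_end:
                min_changed = True
            end_point[j] = end
        last_of_group = (idx + 1 == len(intervals)
                         or intervals[idx + 1][0] != start)
        if min_changed and last_of_group:
            min_end = min(end_point)  # O(k) scan, only when needed
            min_changed = False
            if min_end >= start and min_end > last_end:
                mois.append((start, min_end))
                last_end = min_end
    return mois


example = [
    [(1, 4), (6, 8), (6, 9)],
    [(2, 6), (4, 9)],
    [(3, 5), (7, 10)],
]
print(moi_by_rescan(example))  # [(3, 4), (7, 9)]
```

The sketch needs only the `end_point` array as working memory. Its worst-case cost per scan is proportional to `k`, so its appeal is practical speed rather than a better asymptotic bound.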

Assuming the endpoints of the intervals are randomly distributed, the chance of two or more values in

**Algorithm 3 (TestMOI)**

**Input:** sorted list of all intervals interval[1..n]; number of collections k

**Variables:** largest end point seen so far in each collection endPoint[1..k]

1:

2:

3:

4:

5: **for ****∈ ****do **

6: **if ****then **

7: **if ****then **

8:

9: **end if **

10:

11: **end if **

12: **if ****and **all intervals with recent start position processed **then**

13: _{i = 1..k}{

14:

15: **if ****and ****then **

16: output MOI(

17:

18: **end if **

19: **end if **

20: **end for **

Weighted maximum overlapping intervals

We now introduce the concept of

Given a positive integer weight _{1}, . . ., _{k}_{j }

A

We refer to maximal _{1 }and one interval either in _{2 }or in _{3}. An example of weighted MOIs for different thresholds is given in Figure

**Example of weighted MOIs**. Example for weighted MOIs in four collections (distinguished by different gray shadings). The weights of the collections are $w(C_1) = 2$, $w(C_2) = 3$, $w(C_3) = 3$ and $w(C_4) = 4$.

Algorithm LinearWeightedMOI

This algorithm follows the same ideas as Algorithm 1 (LinearMOI) but we have to make several small adjustments. First we need to change the counter array such that it stores the weights of the collections. The second modification is a bit more complicated. In the unweighted case all values in

In the beginning,
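A weighted variant of the counter array could be sketched in Python as shown below. The sketch assumes that a weighted MOI is a maximal interval contained in base intervals of collections whose weights sum to at least `min_weight`, that `min_weight` is positive, and that positions are integers between 1 and `genome_length`. The array `c` stores weights instead of counts, and the hypothetical variable `weight_from_min` holds the total weight of all collections whose end point is at or right of the pointer `min_end`. All names are illustrative and the code is not the authors' implementation.

```python
def linear_weighted_moi(collections, weights, genome_length, min_weight):
    """Report weighted MOIs with a weight array over end positions (sketch)."""
    k = len(collections)
    intervals = sorted((s, e, j) for j, coll in enumerate(collections)
                       for s, e in coll)
    end_point = [0] * k            # largest end seen so far per collection
    c = [0] * (genome_length + 1)  # c[p] = weight of collections ending at p
    c[0] = sum(weights)
    min_end = 0
    weight_from_min = sum(weights)  # weight with end point >= min_end
    last_end = 0
    mois = []
    for idx, (start, end, j) in enumerate(intervals):
        if end_point[j] < end:
            c[end_point[j]] -= weights[j]
            c[end] += weights[j]
            # The collection moves from left of the pointer to its right.
            if end_point[j] < min_end <= end:
                weight_from_min += weights[j]
            end_point[j] = end
        last_of_group = (idx + 1 == len(intervals)
                         or intervals[idx + 1][0] != start)
        if last_of_group:
            # Advance while enough weight remains strictly right of min_end.
            while weight_from_min - c[min_end] >= min_weight:
                weight_from_min -= c[min_end]
                min_end += 1
            if min_end >= start and min_end > last_end:
                mois.append((start, min_end))
                last_end = min_end
    return mois


example = [
    [(1, 4), (6, 8), (6, 9)],
    [(2, 6), (4, 9)],
    [(3, 5), (7, 10)],
]
print(linear_weighted_moi(example, [1, 1, 1], 10, 3))  # [(3, 4), (7, 9)]
print(linear_weighted_moi(example, [1, 1, 1], 10, 2))  # [(2, 4), (3, 5), (6, 9)]
```

With unit weights and `min_weight` equal to the number of collections, the sketch reduces to the unweighted case. Lowering the threshold to 2 yields longer intervals that are covered by any two of the three collections.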

**Algorithm 4 (LinearWeightedMOI)**

**Input:** sorted list of intervals interval[1..n]; number of collections k; weight of the collections weight[1..k]; length l of the target genome; minimum weight minWeight a weighted MOI must have

**Variables:** largest end point seen so far in each collection endPoint[1..k]; c[0..l]

1:

2:

3:

4:

5:

6: **for all ****∈ ****do **

7: **if ****then **

8:

9:

10: **if ****then **

11:

12: **end if **

13:

14: **end if **

15: **if **all intervals with recent start position processed **then **

16: **while ****do **

17:

18:

19: **end while **

20: **if ****and ****then **

21: output MOI(

22:

23: **end if **

24: **end if **

25: **end for **

Algorithm CircularWeightedMOI

In order to adapt Algorithm 4 (LinearWeightedMOI) to the circular memory structure, we basically follow the same strategy as in the unweighted case. However, there is no need to introduce an additional variable as was done for Algorithm 2. This is because we already use variable **if** statement in line 12 to ensure that the endpoint of the current base interval is not smaller than the current

Results and discussion

To analyze the practical run times of our algorithms, we implemented them in C++ and compared them to the original implementation of the approach by Mancheron et al.

All benchmarks were performed on an Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz processor with 4GB RAM running Linux 3.3.4. We used gcc version 4.7 with the compiler flags -Ofast and -march=native set. If no other values are given, the length of the target genome is 10 Mb and the length of the base intervals is normally distributed around a mean of

We ran several tests in order to evaluate the performance of the algorithms for various parameter settings. The results are shown in Figure

**Benchmark experiments**. Dependency of practical runtimes on parameter settings. First line: number of collections, with on average 25,000 intervals per collection; second line: number of intervals, fixed number of collections; third line: length of target genome, fixed number of collections and intervals; fourth line: average interval length relative to genome length for collections generated with ^{5}); the dashed line shows the number of MOIs (scaled by secondary ^{3}).

Conclusions

In this paper we studied an algorithmic problem that was recently introduced by Mancheron et al.

We have presented efficient algorithms to solve this problem, two of which have asymptotically optimal, linear runtime. The third one excelled in terms of practical performance. All three algorithms were shown to outperform the approach introduced by Mancheron et al.

For further work, it may be interesting to assign individual weights to the base intervals. However, we would then also have to consider intervals that are nested in other intervals of lower weight, and would therefore lose the strict linear ordering for processing the intervals. Hence, we cannot expect the algorithms presented here to be easily adaptable to this problem while still running in linear time.

Authors' contributions

KJ, HS and JS jointly developed the algorithms for the unweighted MOI problem. HS developed the algorithm for the weighted case and carried out the computational studies. KJ and JS drafted the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

Part of this work was funded by a postdoctoral fellowship of the German Academic Exchange Service (DAAD).

This article has been published as part of