Department of Basic Sciences, Faculty of Engineering, Sinai University, El-Arish, Egypt

Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt

Center for Informatics Sciences, Nile University, Giza, Egypt

Computer Science Division, Department of Mathematics, Faculty of Science, Ain Shams University, Cairo 11566, Egypt

Abstract

Background

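Given a set of DNA sequences $s_1, \ldots, s_t$, the $(l, d)$ motif finding problem is to find a string $M$ of length $l$ such that each sequence $s_i$, $1 \le i \le t$, contains a substring of length $l$ within Hamming distance $d$ of $M$.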

Results

In this paper we present a new efficient method to improve the performance of the exact algorithms for the motif finding problem. Our method is composed of two main steps: First, we process

Conclusions

Our method speeds up the solution of the exact motif problem. Our method is generic, because it can accommodate any new faster algorithm based on traditional methods. We expect that our method will help to discover longer motifs. The software we developed is available for free for academic research at

Background

DNA motifs are short sequences in the genome that play important functional roles in gene regulation. Because they are short, it is difficult to identify these regions using features intrinsic to their composition. Assuming that motifs are conserved in closely related species because of the importance of their function, they can be discovered by comparing the respective DNA sequences and identifying the subsequences that are very similar to each other.

There are two common combinatorial formulations that identify the motifs: The first is the consensus motif problem which made its first appearance in 1984

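Given a set of sequences $s_i$, find a motif $M$ of length $l$ such that each $s_i$ contains a substring $p_i$ of length $l$ with $d_H(p_i, M) \le d$, where $d_H$ denotes the Hamming distance.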

The planted (_{i }_{i}.

Due to their combinatorial nature, the consensus motif problem and its variant defined above are extremely challenging. On benchmark data of 20 sequences, each 600 characters long, the large instances (15, 5), (17, 6), (19, 7) and (21, 8) have been addressed one after another, and many algorithms have been developed to solve them. These algorithms can be classified into two major categories: approximation algorithms

Exact algorithms are based on exhaustive search techniques. The brute force algorithm proceeds by testing all possible motifs of length ^{l}

The algorithms SMILE ^{2}_{i }

PMSP

Our contribution

In a previous work

In this paper, we present a theoretical method which can be used to determine the appropriate value of

Definitions and related work

In this section, we introduce some notations and definitions that will help us to describe our algorithm and related work in a concise manner.

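**Definition 1 adapted from [29]:** For any string $x$ with $|x| = l$, let $B_d(x) = \{y : |y| = l,\ d_H(x, y) \le d\}$, where $d_H$ denotes the Hamming distance; $B_d(x)$ is the set of $d$-neighbors of $x$.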

**Definition 2 adapted from [29]: **Let _{l }s

**Definition 3 adapted from [29]: **Given an _{1}, ..., _{t}_{i}

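**Definition 4 adapted from [29]:** A string $x$ is an $(l, d)$ motif of $s_1, \ldots, s_t$ if

1) $|x| = l$, and

2) every sequence $s_i$, $1 \le i \le t$, contains a substring of length $l$ whose Hamming distance from $x$ is at most $d$.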
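The motif condition can be checked directly from the Hamming distance. The following is a minimal sketch under the standard $(l, d)$ motif definition; the function names `hamming`, `min_distance` and `is_motif` are illustrative and do not come from the paper or its software.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def min_distance(x: str, s: str) -> int:
    """Smallest Hamming distance between x and any substring of s of length len(x)."""
    l = len(x)
    if len(s) < l:
        raise ValueError("sequence is shorter than the pattern")
    return min(hamming(x, s[i:i + l]) for i in range(len(s) - l + 1))


def is_motif(x: str, sequences: Sequence[str], d: int) -> bool:
    """True if every sequence contains a substring within distance d of x."""
    return all(min_distance(x, s) <= d for s in sequences)


if __name__ == "__main__":
    toy = ["TTACGTTT", "GGACGAGG"]  # toy input, not data from the paper
    print(is_motif("ACGT", toy, 1))
    print(is_motif("ACGT", toy, 0))
```

Each call scans every window of every sequence, so this check costs on the order of $t \cdot n \cdot l$ character comparisons per candidate string.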

**Proposition 1 adapted from [10]: **Let _{d }_{H}_{d}^{n-l+1})^{t}^{l}_{d}^{n-l+1})^{t}

PMSprune Algorithm

Because the first stage of our method depends on the PMSprune algorithm, we review its basic steps using the notation introduced above.

The main strategy of PMSprune is to generate _{d}_{1}, using a branch and bound technique. An element _{d}

where $p_{2d}$ is the probability that the Hamming distance between two strings is at most $2d$
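To make the notion of a neighborhood concrete, the sketch below enumerates every string within Hamming distance $d$ of a given $l$-mer over the DNA alphabet. It is a plain enumeration written for illustration only: it does not implement the branch and bound pruning of PMSprune, and the name `neighborhood` is an assumption of this example.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def neighborhood(x: str, d: int) -> set:
    """All strings of length len(x) within Hamming distance d of x."""
    result = {x}
    for k in range(1, d + 1):
        # choose the k positions that will be mutated
        for positions in combinations(range(len(x)), k):
            # at each chosen position, allow only letters that differ from x
            choices = [[c for c in ALPHABET if c != x[p]] for p in positions]
            for letters in product(*choices):
                y = list(x)
                for p, c in zip(positions, letters):
                    y[p] = c
                result.add("".join(y))
    return result


if __name__ == "__main__":
    print(len(neighborhood("ACGT", 1)))
```

The size of such a neighborhood is $\sum_{k=0}^{d} \binom{l}{k} 3^k$, which grows quickly with $d$; this growth is the reason exact methods prune the search rather than enumerate it in full.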

Implementation

Our proposed strategy

Our new strategy, referred to as _{d}

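**Algorithm 2: HEP** (£, $s_1, \ldots, s_t$, $n$, $l$, $d$)

**Begin**

1) Determine the number of sequences $q$.

2) Run the exact algorithm £ on the $q$ sequences $s_1, \ldots, s_q$ to obtain a set of candidate motifs.

3) For each pattern in the candidate set, check whether it is a valid motif in the remaining $t - q$ sequences.

**End**.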
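The two-stage structure of the algorithm can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a brute-force search (`brute_force_exact`, a hypothetical stand-in) plays the role of the exact algorithm £, and the number of sequences `q` is passed in rather than computed.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def occurs_within(x, s, d):
    """True if some substring of s with length len(x) is within distance d of x."""
    l = len(x)
    return any(hamming(x, s[i:i + l]) <= d for i in range(len(s) - l + 1))


def brute_force_exact(sequences, l, d, alphabet="ACGT"):
    """Stand-in exact algorithm: test all len(alphabet)**l strings of length l."""
    found = set()
    for letters in product(alphabet, repeat=l):
        x = "".join(letters)
        if all(occurs_within(x, s, d) for s in sequences):
            found.add(x)
    return found


def hep(exact, sequences, l, d, q):
    """Hybrid sketch: exact search on the first q sequences, then filtering."""
    if not 1 <= q <= len(sequences):
        raise ValueError("q must lie between 1 and the number of sequences")
    candidates = exact(sequences[:q], l, d)      # step 2: candidate motifs
    remaining = sequences[q:]
    return {x for x in candidates                # step 3: verification
            if all(occurs_within(x, s, d) for s in remaining)}


if __name__ == "__main__":
    toy = ["TTACGTTT", "GGACGAGG", "CCACGTCC"]  # toy input, not paper data
    print(sorted(hep(brute_force_exact, toy, 4, 1, q=2)))
```

Because every motif of all $t$ sequences is also a motif of the first $q$, the filtered set equals what the exact algorithm would return on the full input, whatever value of `q` is chosen; `q` only affects the running time.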

**Theorem 1: **Algorithm 2 correctly finds all (

**Proof: **Step 2 of the algorithm is exhaustive and finds the whole set of

**Theorem 2: **The running time of the HEP is equal to

where T_{£(q) }is the running time of step 2 involving the use of an exact algorithm £ on the _{d}^{n - l + 1})^{q}

Determination of the best

The range of the number of sequences

**Definition 5: **We define _{HEP }

Implementing HEP based on PMSprune

We decided to use PMSprune for implementing the first step in our method, because of its superiority compared to other algorithms as discussed in

Determining

Replacing _{£(q) }by the time of PMSprune on

Replacing _{HEP }_{HEP_PMSprune }_{£ }with _{PMSprune }

Substituting the value of

Dividing both sides by 4^{l }

The inequality (4) provides the range of the values of

Determining

For fixed values of _{HEP_PMSprune }for 1 ≤

**Algorithm 3: Find ons**

**Begin**

1)

2)

3)

4) **for ****do**

**if **_{min }**then**

_{min }=

5) **return **

**End**

The above algorithm computes

Parallel version of HEP_PMSprune(

We propose a parallel version for HEP_PMSprune(

We parallelize the PMSprune algorithm by assigning a set of _{1 }to each processor for establishing the set of neighboring motifs. The resulting sets are stored in candidate motif lists _{i}, i _{i }

We incrementally construct the partial list _{j }_{j }_{j }_{j-1 }such that all elements in _{j }_{j-1 }are discarded. This continues until _{p }^{l}^{l }^{th }entry in this table contains one if a string in _{j-1 }is mapped to _{j }_{j-1}, and check if a strings in _{j }_{j }

In the second step, we validate each candidate motif independently in parallel over the available processors. The running time of this algorithm is _{s}_{s }

The first step in the parallel algorithm does not lead to loss of any motifs. This is because the set _{2}, s_{3}, ..., s_{q}_{1}. That is, each substring is not processed. The second step in the parallel algorithm is also correct, because the elements in
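The independent validation of candidates in the second step lends itself to a simple partitioning scheme, sketched below. The sketch assumes a round-robin split of the candidate list and uses Python threads purely to show the structure; threads in CPython do not give a real speed-up for this CPU-bound work, and the names `validate_chunk` and `parallel_validate` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def occurs_within(x, s, d):
    """True if some substring of s with length len(x) is within distance d of x."""
    l = len(x)
    return any(hamming(x, s[i:i + l]) <= d for i in range(len(s) - l + 1))


def validate_chunk(chunk, sequences, d):
    """Keep the candidates of one chunk that occur in every sequence."""
    return [x for x in chunk if all(occurs_within(x, s, d) for s in sequences)]


def parallel_validate(candidates, sequences, d, workers=4):
    """Split candidates round-robin over the workers and merge the survivors."""
    ordered = sorted(candidates)
    chunks = [ordered[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(validate_chunk, chunks,
                              [sequences] * workers, [d] * workers))
    return set().union(*parts)


if __name__ == "__main__":
    remaining = ["GGACGAGG", "CCACGTCC"]  # toy input, not paper data
    print(sorted(parallel_validate({"ACGT", "TTTT", "ACGA"}, remaining, 1, workers=2)))
```

Since no candidate depends on another, the chunks need no communication until the final merge, which is what makes this step straightforward to distribute.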

Results and discussion

Experiments on simulated datasets

We used the simulated data sets that are used in many articles

Experiments overview

Our experiments address three major issues: The first is the performance of our method compared to the use of PMSprune only. The second, we show that our method for selecting

Performance of HEP on PMSprune

Tables

Time Comparison of PMSPrune and HEP_PMSprune(

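| **l** | **d** | **T** (PMSprune) | **mns** | **T** (HEP_PMSprune) | **Improvement** |
|---|---|---|---|---|---|
| 11 | 3 | 1.92 s | 9 | 1.4 s | 27.1 % |
| 13 | 4 | 33.95 s | 7 | 26.05 s | 23.27 % |
| 15 | 5 | 7.7 m | 6 | 6.4 m | 16.8 % |
| 17 | 6 | 1.55 h | 7 | 1.26 h | 18.5 % |
| 19 | 7 | 18.62 h | 6 | 14.93 h | 19.8 % |
| 21 | 8 | 8.59 dy | 6 | 6.68 dy | 22.23 % |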
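The Improvement column appears consistent with the relative reduction in running time, $(T_{\text{PMSprune}} - T_{\text{HEP}}) / T_{\text{PMSprune}}$. The short sketch below recomputes it for three rows of the table; the formula is inferred from the reported numbers rather than stated in the text, and small differences can arise from the rounding of the displayed times.

```python
def improvement(t_baseline, t_hybrid):
    """Relative running-time reduction, as a percentage of the baseline time."""
    if t_baseline <= 0:
        raise ValueError("baseline time must be positive")
    return 100.0 * (t_baseline - t_hybrid) / t_baseline


# (l, d) -> (PMSprune time, HEP_PMSprune time); both times of a row share one unit
rows = {
    (11, 3): (1.92, 1.4),      # seconds
    (13, 4): (33.95, 26.05),   # seconds
    (19, 7): (18.62, 14.93),   # hours
}

if __name__ == "__main__":
    for (l, d), (t_pms, t_hep) in rows.items():
        print(f"({l}, {d}): {improvement(t_pms, t_hep):.2f} %")
```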

Time Comparison of PMSPrune and HEP_PMSprune(

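| **l** | **d** | **T** (PMSprune) | **ons** | **T** (HEP_PMSprune) | **Improvement** |
|---|---|---|---|---|---|
| 11 | 3 | 1.92 s | 10 | 1.34 s | 30 % |
| 13 | 4 | 33.95 s | 9 | 24.55 s | 27.69 % |
| 15 | 5 | 7.7 m | 7 | 6.02 m | 21.8 % |
| 17 | 6 | 1.55 h | 8 | 1.26 h | 18.65 % |
| 19 | 7 | 18.62 h | 7 | 14.39 h | 22.74 % |
| 21 | 8 | 8.59 dy | 6 | 6.68 dy | 22.23 % |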

Evaluating the choice of

In this section, we experimentally evaluate our algorithm for determining the best

1. We run HEP_PMSprune(_{exp}.

2. Compare the _{exp }against our

Figure _{exp}.

**Performance of our method for different challenging instances**. Behavior of HEP_PMSprune(

We also conducted another experiment, where the problem instances were generated with different

The performance of the HEP_PMSprune(

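| **n** | **d** | **l** | **ons** | **T_ons** | **ons_exp** | **T_onsexp** | **T_pms** |
|---|---|---|---|---|---|---|---|
| 300 | 3 | 11 | 9 | 0.0001 | 3-20 | 0.0001 | 0.0001 |
| 600 | 3 | 11 | 10 | 1.34 | 10 | 1.34 | 1.92 |
| 900 | 3 | 11 | 14 | 4 | 11-16 | 4 | 5 |
| 1200 | 3 | 11 | 17 | 7 | 17 | 7 | 8 |
| 1500 | 3 | 11 | 20 | 16 | 20 | 16 | 16 |
| 300 | 3 | 12 | 6 | 0.05 | 4-20 | 0.05 | 0.05 |
| 600 | 3 | 12 | 8 | 0.83 | 4-20 | 0.83 | 0.83 |
| 900 | 3 | 12 | 8 | 1.5 | 6-20 | 1.5 | 1.5 |
| 1200 | 3 | 12 | 9 | 3 | 6-15 | 3 | 4 |
| 1500 | 3 | 12 | 10 | 5 | 8-12 | 5 | 7 |
| 300 | 4 | 13 | 7 | 3 | 5-20 | 3 | 3 |
| 600 | 4 | 13 | 9 | 24.55 | 9 | 24.55 | 33.95 |
| 900 | 4 | 13 | 11 | 81 | 11 | 81 | 109 |
| 1200 | 4 | 13 | 14 | 190 | 14 | 190 | 217 |
| 1500 | 4 | 13 | 17 | 353 | 17-19 | 356 | 360 |
| 300 | 4 | 14 | 6 | 1 | 4-20 | 1 | 1 |
| 600 | 4 | 14 | 7 | 6.5 | 7-18 | 6.5 | 7 |
| 900 | 4 | 14 | 8 | 21.5 | 8-9 | 21.5 | 24 |
| 1200 | 4 | 14 | 8 | 54 | 8 | 54 | 67 |
| 1500 | 4 | 14 | 9 | 107 | 9 | 107 | 146 |
| 300 | 4 | 15 | 5 | 0.25 | 4-20 | 0.25 | 0.25 |
| 600 | 4 | 15 | 5 | 1.25 | 4-20 | 1.25 | 1.25 |
| 900 | 4 | 15 | 6 | 5 | 5-20 | 5 | 5 |
| 1200 | 4 | 15 | 6 | 12 | 8 | 10 | 13 |
| 1500 | 4 | 15 | 7 | 16.5 | 7-13 | 16.5 | 20 |
| 300 | 4 | 16+ | 5 | 0.002 | 3-20 | 0.002 | 0.002 |
| 600 | 4 | 16+ | 5 | 0.25 | 4-20 | 0.25 | 0.25 |
| 900 | 4 | 16+ | 5 | 1 | 4-20 | 1 | 1 |
| 1200 | 4 | 16+ | 6 | 2.34 | 5-20 | 2.34 | 2.34 |
| 1500 | 4 | 16+ | 6-8 | 4.89 | 5-20 | 4.89 | 4.89 |
| 300 | 5 | 15 | 7 | 38 | 6-10 | 38 | 46 |
| 600 | 5 | 15 | 8 | 361.2 | 8 | 360 | 462 |
| 900 | 5 | 15 | 9 | 1250 | 9 | 1250 | 1847 |
| 1200 | 5 | 15 | 11 | 2976 | 11 | 2976 | 4060 |
| 1500 | 5 | 15 | 13 | 5829 | 13 | 5829 | 6969 |
| 300 | 5 | 17 | 5 | 2 | 5-20 | 2 | 2 |
| 600 | 5 | 17 | 6 | 27 | 13-20 | 19 | 19 |
| 900 | 5 | 17 | 5 | 103 | 7-20 | 92 | 92 |
| 1200 | 5 | 17 | 6 | 231 | 6-8 | 224 | 264 |
| 1500 | 5 | 17 | 6 | 439 | 6-8 | 439 | 552 |
| 300 | 5 | 18+ | 5 | 1 | 5-20 | 1 | 1 |
| 600 | 5 | 18+ | 6 | 5 | 6-20 | 4 | 4 |
| 900 | 5 | 18+ | 6-7 | 14 | 6-20 | 14 | 14 |
| 1200 | 5 | 18+ | 6-7 | 33 | 6-20 | 33 | 33 |
| 1500 | 5 | 18+ | 6-8 | 74 | 6-20 | 74 | 74 |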

The first column includes the sequence length _{exp}"

Note that it was not feasible to list the results for all possible values _{exp }and its time published in this table.

Performance of PHEP_PMSprune(

In Table

Running time of PHEP_PMSprune(

| **l** | **d** | **Time** | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 13 | 4 | 24.86 s | 12.4 s | 8.35 s | 6.1 s | 4.95 s | 4.35 s | 3.6 s | 3.2 s |
| 15 | 5 | 6.34 m | 3.19 m | 2.13 m | 1.61 m | 1.28 m | 1.07 m | 55.2 s | 48.5 s |
| 17 | 6 | 1.28 h | 38.28 m | 25.58 m | 19.16 m | 15.34 m | 12.81 m | 10.98 m | 9.61 m |
| 19 | 7 | 14.56 h | 7.24 h | 4.81 h | 3.61 h | 2.98 h | 2.42 h | 2.07 h | 1.82 h |
| 21 | 8 | 6.68 dy | 3.33 dy | 2.23 dy | 1.67 dy | 1.34 dy | 1.12 dy | 23.18 h | 20.42 h |
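Speed-up and parallel efficiency can be derived from such timings. The sketch below does so for the (13, 4) row, taking the first time as the baseline. The processor counts are not preserved in the table above, so the code assumes that the k-th time column corresponds to k processors; the efficiency values depend entirely on that assumption.

```python
def speedups(times):
    """Speed-up of each run relative to the first (baseline) time."""
    base = times[0]
    return [base / t for t in times]


def efficiencies(times):
    """Speed-up divided by processor count.

    ASSUMPTION: the k-th entry (1-based) was measured on k processors.
    """
    return [s / (k + 1) for k, s in enumerate(speedups(times))]


# Times in seconds for the (13, 4) instance, in column order
times_13_4 = [24.86, 12.4, 8.35, 6.1, 4.95, 4.35, 3.6, 3.2]

if __name__ == "__main__":
    for k, (s, e) in enumerate(zip(speedups(times_13_4), efficiencies(times_13_4)), 1):
        print(f"column {k}: speed-up {s:.2f}, efficiency {e:.2f}")
```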

**Scalability plot of the parallel version**. The plots show speed-up for different number of processors and problem instances.

Experiments on real datasets

We used two collections of real datasets used in previous research papers

Tables

Application of the PHEP_PMSprune(

| **Transcription Factor** | **Genes** | **Detected motif(s) & parameters** | **Published motif(s) & reference(s)** | **Time** |
|---|---|---|---|---|
| PHO4 (600 bp) | PHO5, PHO8, PHO81, PHO84, | CACGTG (6,0) | CACGT[G\|T] | 38 (5%) |
| HSE_HSTF (600 bp) | SSA1, HSP26, SSA4, HSC82, SIS1, CUP1-1 | TTCAGTGAA (9,2) | TTCNNGAA, TTCNNNGAA | 37 (35%) |
| PDR (600 bp) | PDR3, SNQ2, PDR15, HXT9, HXT11, PDR5, YOR1 | TCCGTGGA (8,1), TCCGCGGA (8,1) | TCCG[C\|T]GGA | 27 (13%) |
| MCB (600 bp) | CDC2, CDC9, CDC6, CLN1, POL1, CDC21 | ACGCGT (6,0) | [A\|T]CGCG[A\|T] | 31 (20%) |
| ECB (600 bp) | SWI4, MCM5, MCM7, CDC6, CLN3 | TTTCCCATTAAGGAAA (16,3) | TTtCCcnntnaGGAAA | 41 (49%) |

The first column includes the transcriptional factors (regulatory elements) and the length of upstream sequences. The second column includes the regulated genes. The first three factors and their related genes are available at the SCPD

Application of the PHEP_PMSprune(

| **DNA region** | **Seq. no.** | **Detected motif** | **Published motif** | **Time** |
|---|---|---|---|---|
| Insulin family, 5' promoter (500 bp) | 8 | CCTCAGCCCC (10,1) | CCTCAGCCCC | 87 (10%) |
| | | AAGACTCTAA (10,2) | AAGACTCTAA | |
| | | GCCATCTGCC (10,1) | GCCATCTGCC | |
| | | CTATAAAG (8,0) | CTATAAAG [36, GB] | |
| | | GGGAAATG (8,1) | GGGAAATG | |
| Metallothionein, 5'UTR+promoter (590 bp) | 26 | TTTGCACACGC (11,3) | TTTGCACACG | 7.87 (1%) |
| | | TGCACAC (7,1) | TGCACACGG | |
| Interleukin-3, 5'UTR+promoter (490 bp) | 6 | TTGAGTACT (9,2) | TTGAGTACT | 466 (10%) |
| | | GATGAATAAT (10,1) | GATGAATAAT | |
| | | TCTTCAGAG (9,2) | TCTTCAGAG | |
| | | AGGACCAG (8,1) | AGGACCAG | |
| | | AGGTTCCATGTCAGATAAAG, ATGGAGGTTCCATGTCAGAT, CTATGGAGGTTCCATGTCAG, GAGGTTCCATGTCAGATAAA, GGAGGTTCCATGTCAGATAA, TATGGAGGTTCCATGTCAGA, TGGAGGTTCCATGTCAGATA (all found with (20,0)) | Novel | |
| Growth-hormone, 5'UTR+promoter (380 bp) | 16 | AACTTATCCAT (11,3) | ATTATCCAT | 3.43 (0%) |
| | | ATAAATGTAAA (11,3) | ATAAATGTA | |
| | | TATAAAAAG (9,2) | TATAAAAAG | |
| c-fos, 5'UTR+promoter (800 bp) | 6 | CCATATTAGGAC (12,3) | CCATATTAGGACATCT | 350 (15%) |
| | | GAGTTGGCTGC (11,3) | GAGTTGGCTG | |
| | | CACAGGATGT (10,2) | CACAGGATGT | |
| | | AGGACATCTGCT (12,3) | AGGACATCTG | |
| c-myc, 5'+promoter (100 bp) | 7 | GTTTATTC (8,1) | GTTTATTC | 83.5 (42%) |
| | | CTTGCTGGG (9,2) | TTGCTGGG | |
| | | TGTTTACATC (10,2) | TGTTTACATC | |
| | | CCCTCCCC (8,1) | CCCTCCCC | |
| Histone H1, 5'UTR+promoter (650 bp) | 4 | CAATCACCAC (10,2) | CAATCACCAC [36, GB] | 47.6 (9%) |
| | | AAACAAAAGT (10,1) | AAACAAAAGT [36, GB] | |

The first column includes the gene family and the length of upstream sequences. The second column includes the number of sequences. The third column includes the motif detected by our tool and the respective parameters (

Tables

Conclusions

In this paper, we introduced an efficient method that can enhance the performance of exact algorithms for the motif finding problem. Our method depends on dividing the sequence space into two sets. Over the first set, we generate a set of candidate motifs. Then, we use the remaining set of sequences to verify whether each candidate motif is a real one. The experimental results show that our method reduces the running time of PMSprune by roughly 19% to 30% on the challenging instances and could tackle large problems like (21, 8). Finally, we introduced a scalable and efficient parallel version of the proposed method. Our tool is available for free for academic research at

Availability and requirements

**Project name: **hymotif.

**Project home page: **

**Operating system(s): **Linux.

**Programming language: **C.

**Other requirements: **C/C++ libraries.

**License: **GPL.

**Any restrictions to use by non-academics: **No restrictions.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed to the theoretical and practical developments that form the basis of the HEP method. All authors wrote and approved the manuscript.

Acknowledgements

The authors are grateful to M.M. Mohie Eldin for useful discussion. The authors also thank Sanguthevar Rajasekaran for providing us with the source code of PMSprune and real datasets.

This article has been published as part of