Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA

Biostatistics Branch, The National Institute of Environmental Health Sciences, National Institutes of Health, RTP, NC 27709, USA

Institute for Genome Sciences & Policy, Duke University Medical Center, Durham, NC 27708, USA

Abstract

Background

Identifying functional elements, such as transcription factor binding sites, is a fundamental step in reconstructing gene regulatory networks and remains a challenging issue, largely due to limited availability of training samples.

Results

We introduce a novel and flexible model, the **O**ptimized **Mi**xture of **Ma**rkov models (OMiMa).

Conclusion

Our optimized mixture of Markov models represents an alternative to the existing methods for modeling dependent structures within a biological motif. Our model is conceptually simple and effective, and can improve prediction accuracy and/or computational speed over other leading methods.

Background

Biological sequences, including DNA, RNA and proteins, contain functionally important motifs, such as transcription factor binding sites (TFBS), RNA splice sites, and protein domains. With the increasing availability of genome sequences, identification of such functional motifs not only plays important roles in gene finding and function prediction but also is a fundamental step in reconstructing gene regulatory networks and in revealing gene evolutionary mechanisms.

A commonly used model for motif identification is the Weight Matrix Model (WMM) proposed by Staden

Many models have been developed to incorporate position dependencies. These include the Dinucleotide Weight Matrix Model (DWMM) and the 2nd order Weight Array Model (WAM), which capture dependencies between adjacent positions.

In this paper, we present a new and flexible motif model, the OMiMa, to incorporate position dependencies within a motif. OMiMa can not only adjust model complexity according to motif dependency structures but also minimize model complexity without compromising prediction accuracy. As an integrated part of OMiMa, we also introduce the Directed Neighbor-Joining (DNJ) method to optimally rearrange positions to minimize Markov order. We then describe and discuss the methods for selecting the best model. We implement our model into the OMiMa system that is freely available to the public.

Results

Mixed Markov models

Let X_i be the discrete random variable associated with position i of a motif **X** of length w. For DNA motifs, X_i takes values from the set B = {A, C, G, T}; for protein motifs, X_i takes values from the 20 different amino acids. Each X_i follows a multinomial distribution. Let X_{i-k}...X_{i-1} denote the k positions preceding position i and x_{i-k}...x_{i-1} their observed values, where upper case X (X_i) is a random variable and lower case x (x_i) is a particular value. For a circular chain, indices wrap around the motif: X_{-j} = X_{w-j} and x_{-j} = x_{w-j}. Under a k-th order Markov model (M_k), the probability of observing a motif sequence **x** is just the product of conditional/transition probabilities. Let M_k^L denote a k-th order Markov model of a linear chain, and M_k^C a k-th order Markov model of a circular chain. The probability of a motif sequence is given by equation (1) for a linear chain and equation (2) for a circular chain, respectively:

Pr(**x** | M_k^L) = Pr(x_1...x_k) ∏_{i=k+1}^{w} Pr(x_i | x_{i-k}...x_{i-1})   (1)

Pr(**x** | M_k^C) = ∏_{i=1}^{w} Pr(x_i | x_{i-k}...x_{i-1})   (2)

Compared to a linear Markov chain, a circular Markov chain incorporates additional dependencies that may contain subtle signals that allow the model to distinguish true motifs from false ones, especially when false motifs are similar to true motifs.
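To make the linear and circular chain probabilities concrete, here is a minimal Python sketch for the 1st order case (function and variable names are illustrative, not part of OMiMa; in practice the probabilities would be estimated from training motifs):

```python
from math import log

def linear_chain_logprob(x, init, trans):
    """Linear 1st order chain: Pr(x) = Pr(x[0]) * prod_i Pr(x[i] | x[i-1])."""
    lp = log(init[x[0]])
    for prev, cur in zip(x, x[1:]):
        lp += log(trans[(prev, cur)])
    return lp

def circular_chain_logprob(x, trans):
    """Circular 1st order chain: every position is conditioned on its
    predecessor, and the predecessor of the first position wraps around
    to the last one."""
    lp = 0.0
    for i, cur in enumerate(x):
        lp += log(trans[(x[i - 1], cur)])  # x[-1] is the last base
    return lp
```

Note that the circular chain uses one more transition term than the linear chain and no separate initial distribution, which is exactly how it captures the extra wrap-around dependency.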

Suppose a motif **X** can be divided into m independent sub-motifs, **X** = (**Y**_1, ..., **Y**_m), and each sub-motif is modeled as an independent Markov chain; the probability of the motif is then the product of the sub-motif probabilities, Pr(**x**) = ∏_{s=1}^{m} Pr(**y**_s).

These independent Markov models, each of which is position-optimized for its corresponding sub-motif, form an **O**ptimized **Mi**xture of **Ma**rkov models (OMiMa).

The graphic representation of a mixture of Markov models

**The graphic representation of a mixture of Markov models**. A graphic representation of a mixture of Markov models. On the top is a motif of length 14 bases. On the left, 6 positions, which are independent of each other and all other positions, form a 0^{th }order Markov chain. In the middle, 3 positions form a linear chain of 1^{st }order Markov model. On the right, the remaining positions that closely depend on each other form a circular chain of the 2^{nd }order Markov model.

The graphic representation of the 0-k mixture model for TFBS

**The graphic representation of the 0-k mixture model for TFBS**. The simple mixture of Markov models for TFBS. Since TFBS are short (5–16 bases), a mixture model consisting of 0th order and 1st/2nd order Markov chains is generally adequate for predicting new binding sites. The sub-motif formed by independent positions is modeled by a 0th order Markov model. The sub-motif formed by the remaining positions is modeled by either a 1st or 2nd order Markov chain, which can be either linear (break at dotted arrows) or circular.

Conceivably, the different parts of a motif could have distinct roles in the interaction with their partners. Motif positions involved in the same role can be highly dependent, whereas those involved in unrelated roles are likely independent. A mixture of Markov models is an ideal fit, modeling different signals with different sub-models. A 0th order Markov chain can effectively model strong signals such as those embedded in highly conserved positions where the probability of a certain base occurring is almost one. In addition, positions where base composition contributes little or nothing to motif function need no more complex model than a 0th order Markov model. On the other hand, a higher order Markov model is necessary for detecting subtle dependency signals that can be essential for distinguishing true motifs from false ones.

Motif dissection

To apply the mixture of Markov models to a motif, the first step is to dissect the motif into several independent sub-motifs, each of which is modeled as a Markov chain. For a given set of sequences of a motif, we employ chi-square tests to find significant pairwise dependencies between positions within the motif (see also

1. Calculate base frequencies for each position, and find highly conserved positions where the observed frequency of a certain base (almost) equals 1. These conserved positions are then put into a set of their own and modeled separately.

2. Place the remaining positions in a second set and, for each pair of positions (i, j) in it, test for dependency with the chi-square statistic

χ²_{i,j} = ∑_{b_i ∈ B_i} ∑_{b_j ∈ B_j} [O(b_i, b_j) - E(b_i, b_j)]² / E(b_i, b_j)   (3)

where B_i and B_j are the sets of bases observed in positions i and j, and O(b_i, b_j) and E(b_i, b_j) are the observed and expected counts of the pair (b_i, b_j), respectively. E(b_i, b_j) is the total count multiplied by the product of the observed base frequencies of b_i and b_j. The degrees of freedom of this test are (|B_i| - 1) × (|B_j| - 1), where |B_i| and |B_j| are the numbers of different bases in sets B_i and B_j, respectively.

3. Based on the above χ² tests, find all positions that show little dependence on any other position, and move them to the set of independent positions. Here p_{i,j} is the p-value corresponding to χ²_{i,j}.

4. Group the remaining, significantly dependent positions into disjoint subsets M_1, ..., M_m:

(a) Set s = 1.

(b) For each ungrouped position i, calculate d_i = ∑_{j≠i} I(p_{i,j} < α), the number of positions on which i significantly depends. Find the position with the largest d_i and move it, together with every position j satisfying p_{i,j} < α, into a new subset M_s.

(c) For each remaining position, check if it significantly depends on any position in M_s. If it does, then move it to M_s.

(d) If any dependent positions remain ungrouped, set s = s + 1 and go back to (b); otherwise stop.

Step 4 above essentially groups positions into independent subsets, each potentially forming a functional unit. For the special 0-k mixture model, we simply place all dependent positions into a single subset M_1 at this step.
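The pairwise test driving this dissection can be sketched as follows (a simplified illustration with our own naming; OMiMa's actual implementation may differ):

```python
from collections import Counter

def pairwise_chi2(seqs, i, j):
    """Chi-square statistic and degrees of freedom for dependency between
    motif positions i and j, using only the bases observed at each position.
    seqs: aligned motif sequences of equal length."""
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)
    fi = Counter(s[i] for s in seqs)  # observed bases (and counts) at i
    fj = Counter(s[j] for s in seqs)  # observed bases (and counts) at j
    stat = 0.0
    for bi, ci in fi.items():
        for bj, cj in fj.items():
            expected = ci * cj / n  # n * freq(bi) * freq(bj)
            observed = pair_counts.get((bi, bj), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(fi) - 1) * (len(fj) - 1)
    return stat, dof
```

The statistic would then be converted to a p-value using a chi-square distribution with `dof` degrees of freedom (for example, `scipy.stats.chi2.sf`).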

Markov chain optimization

The next step is to arrange the positions in each subset into a Markov chain. Since the positions in the conserved and independent sets do not depend on any other positions, they are modeled by a 0th order Markov chain. The positions in each subset M_s are different: the position arrangement for each M_s needs to be optimized so that the Markov model can account for most dependencies while minimizing the Markov order. For a given set M_s, we use the median (k_s) of the per-position dependency counts d_j (j ∈ M_s) as the maximum order of its potential Markov model. We then optimize the position arrangement for the k_s-th order Markov chain by the Directed Neighbor-Joining (DNJ) method described below.

The neighbor-joining (NJ) method proposed by Saitou and Nei is widely used to reconstruct phylogenetic trees; our Directed Neighbor-Joining (DNJ) method adapts its agglomerative strategy to build an optimized k-th order Markov chain from a given subset (M_s) of motif positions, as described in the following steps (see the illustration figure below).

Illustration of the DNJ method for Markov chain optimization

**Illustration of the DNJ method for Markov chain optimization**. An example of the DNJ method to optimize the 2^{nd }order Markov chain.

1. For a given set M_s, put each position in the set into a different vector. Here a vector is represented by a letter; an arrow at the top of the letter may be used to indicate the direction of the vector.

2. Create an initial distance matrix D = (d_{i,j}), where d_{i,j} is the p-value of the chi-square test described above.

3. Convert the distance matrix

Where

4. Find the minimum entry of the converted matrix and join the corresponding pair of vectors according to **Algorithm 1** [see supplement] to grow the k-th order Markov chain.

The supplement includes the mathematical formulas for computing the probability of a motif site given a Markov model, the algorithmic pseudo-code for the DNJ method, and the description of the parameter estimation for our model. It also contains supplemental materials for the main results as well as other additional results, such as the application for protein domain identification, the comparison of computational time, and so on.


5. Update the distance matrix to treat the newly joined vector as a single unit.

6. Go back to step 3 if the number of vectors in _{s }is larger than 2, otherwise join the last two vectors according to **Algorithm 1**.

The order of positions in the final vector is the optimized linear chain for the Markov model. Joining the first position to the last position in the vector forms a circular chain. A linear chain can be further optimized by first forming a circular chain from the final vector, then breaking the circular chain between the positions with the weakest dependency, i.e., where p_{i,j} is largest or where the log-likelihood of the corresponding linear chain model is maximized. DNJ not only optimizes position order for linear chain models but also improves circular chain models, particularly when the order of the Markov model is low, e.g., for 1st or 2nd order Markov models.
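As a rough illustration of the chain-building idea (not the full DNJ algorithm, which joins directed vectors via Algorithm 1 and handles higher orders), a greedy first order variant could look like this, with all names ours:

```python
def greedy_chain_order(positions, pval):
    """Greedy sketch of chain building for a 1st order chain: seed with the
    most dependent pair (smallest chi-square p-value), then repeatedly attach
    the remaining position most dependent on either end of the growing chain.
    pval maps frozenset({i, j}) -> p-value of the pairwise chi-square test."""
    pos = set(positions)
    # seed the chain with the most strongly dependent pair
    i, j = min(((a, b) for a in pos for b in pos if a < b),
               key=lambda ab: pval[frozenset(ab)])
    chain, pos = [i, j], pos - {i, j}
    while pos:
        # try attaching each remaining position at the head (0) or tail (-1)
        p, end = min(((p, e) for p in pos for e in (0, -1)),
                     key=lambda pe: pval[frozenset({pe[0], chain[pe[1]]})])
        if end == 0:
            chain.insert(0, p)
        else:
            chain.append(p)
        pos.remove(p)
    return chain
```

Joining the two ends of the returned chain gives the circular variant; breaking a circular chain at its largest p-value recovers an optimized linear chain, mirroring the procedure in the text.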

Model selection

Many different mixtures of Markov models can be formed from combinations of different Markov chains. It is essential to choose the model that minimizes prediction error. In model selection, we first fit each model using maximum likelihood smoothed by a Dirichlet prior [see supplement], and then choose the model minimizing Akaike's information criterion (AIC) or the Bayesian information criterion (BIC):

AIC = -2·log L + 2·EDF,  BIC = -2·log L + EDF·log n,

where log L is the maximized log-likelihood of the training data, EDF is the effective degrees of freedom defined below, and n is the number of training samples.

Effective degrees of freedom

Let B be the set of bases (|B| denotes the number of different bases in B). For a k-th order Markov chain, the standard degrees of freedom (DF) grows exponentially with k: the joint distribution of the first k positions contributes (|B|^k - 1) free parameters, and each subsequent position contributes (|B| - 1) × |B|^k transition parameters. With such a heavy penalty on model complexity, BIC picked the 0th order Markov models for all 61 DNA regulatory motifs when using the DF. To avoid picking overly simple models, we used the EDF described below to calculate AIC and BIC.

Generally, only a subset of bases from B appears in a particular position of a set of biological motifs. The more conserved a position, the fewer bases are in the subset. The EDF for a model is related to the bases observed in training samples. For example, suppose that one would like to estimate nucleotide frequencies occurring in a position in a set of DNA training motifs. If only base A is observed in the position, then one needs to estimate only the frequency of A; the remaining parameters, the frequencies of C, G and T, are effectively fixed at zero. Let B_i be the base set observed in a position i, let P^k be the sequence of motif positions in the k-th order Markov chain, let P^k_j be the j-th element of P^k, and let ∑|P^k| = w (|P^k| is the number of positions in P^k); then we define the EDF for the k-th order Markov chain as

where the contribution of each position depends only on the base sets observed at that position and at its conditioning positions. The EDF of a mixture model is the sum of the EDFs of its component chains, e.g., the 0th and the k-th order chains.
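Assuming the EDF takes the natural form in which each chain position contributes (|B_j| - 1) free parameters per combination of bases observed at its conditioning positions (our reading; the paper's exact formula is in its supplement), it could be computed as:

```python
def edf_kth_order_chain(observed, k):
    """Effective degrees of freedom of a k-th order chain.
    observed: list of sets; observed[j] = bases seen at chain position j
    in the training data. Each position j contributes (|B_j| - 1) free
    parameters for every combination of bases observed at its (up to k)
    conditioning predecessors."""
    edf = 0
    for j, bases in enumerate(observed):
        ctx = 1  # number of observed conditioning contexts for position j
        for prev in observed[max(0, j - k):j]:
            ctx *= len(prev)
        edf += (len(bases) - 1) * ctx
    return edf
```

With k = 0 this reduces to the familiar PWM parameter count, and a fully conserved position (one observed base) contributes nothing, matching the example in the text.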

Performance assessment

We tested the effectiveness of our method on TFBS data and on donor splice sites, where the training data for OMiMa are a set of sequences of a motif. For prediction results, we use the standard abbreviations for empirical quantities: TP and TN for the numbers of true positives and true negatives, and FP and FN for the numbers of false positives and false negatives.

Matthews correlation coefficient

The Matthews correlation coefficient (MCC) is defined as

MCC = (TP·TN - FP·FN) / √[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]

OMiMa can score a motif site **x** in two ways, by log-likelihood or by log-likelihood ratio, defined as

S_LL(**x**) = log Pr(**x** | M_s) and S_LLR(**x**) = log [Pr(**x** | M_s) / Pr(**x** | M_b)],

where M_s is the signal model trained on true motif sites, and M_b is the background model or false signal model trained on background sequences or false motif sites. A sequence **x** is predicted as a positive site if its score is larger than a certain threshold. We select a cutoff threshold using one of the following three criteria: balanced sensitivity and specificity, the maximum prediction accuracy, and the maximum Matthews correlation coefficient. Each potential threshold yields an estimated true positive rate and a false positive rate. The plot of true positive rates against false positive rates generates a Receiver Operating Characteristic (ROC) curve, which can be used for comparing models and selecting the best one.
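Threshold selection by maximum Matthews correlation coefficient, one of the three criteria above, can be sketched as follows (illustrative code, not OMiMa's implementation):

```python
from math import sqrt

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold_by_mcc(true_scores, false_scores):
    """Scan candidate cutoffs over the observed scores and return the
    (max MCC, threshold) pair; sites scoring >= threshold are called positive."""
    best = (-2.0, None)  # MCC always lies in [-1, 1]
    for t in sorted(set(true_scores) | set(false_scores)):
        tp = sum(s >= t for s in true_scores)
        fn = len(true_scores) - tp
        fp = sum(s >= t for s in false_scores)
        tn = len(false_scores) - fp
        best = max(best, (matthews_cc(tp, tn, fp, fn), t))
    return best
```

The same scan, recording true and false positive rates per threshold instead, yields the points of the ROC curve.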

We use a three-symbol notation 'k-m-s' to distinguish different models, where 'k' stands for a 0-k mixture Markov model, 'm' is either 'L' or 'C' to indicate whether the k-th order chain is linear ('L') or circular ('C'), and 's' is either 0 or 1 to indicate whether the log-likelihood score (0) or log-likelihood ratio score (1) is used. For example, '1-L-1' stands for a 0–1 mixture of linear Markov models that uses the log-likelihood ratio to score a motif site.

Effectiveness of DNJ method for optimization

To assess the ability of our DNJ method to optimize a Markov chain, we compared the DNJ method with a random permutation method. In this evaluation, we used a 0-k mixture model (denoted M_DNJ) with its k-th order Markov chain optimized by the DNJ method, and calculated the log-likelihood of the data given the model, log Pr(D | M_DNJ). Second, with the same data, we fitted a new 0-k mixture model (denoted M_R), which is the same as M_DNJ except that the positions in its k-th order chain are ordered by random permutation, and calculated log Pr(D | M_R). This step was repeated 1,000 times, giving 1,000 log-likelihoods of the randomly permuted models. The relative performance of M_DNJ was then measured by the proportion of permuted models whose log-likelihood falls below log Pr(D | M_DNJ).
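The permutation comparison can be sketched generically (our naming; `loglik` stands in for fitting the mixture model with a given position order and returning log Pr(D | M)):

```python
import random

def permutation_comparison(loglik, order, data, n_perm=1000, seed=0):
    """Fraction of randomly permuted chain orders whose fitted log-likelihood
    falls below that of a reference order (e.g. the DNJ optimized one)."""
    rng = random.Random(seed)
    ref = loglik(order, data)
    below = 0
    for _ in range(n_perm):
        perm = order[:]
        rng.shuffle(perm)  # random permutation of the chain positions
        below += loglik(perm, data) < ref
    return below / n_perm
```

A value near 1.0 means the reference ordering beats almost all random orderings, the criterion used to judge DNJ in this evaluation.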

Fifty-three human transcription factors, whose binding sites contain at least four dependent positions by the χ² test given by equation (3), were selected for this evaluation (see the table below). For each factor, we optimized four chain types: the 1st order linear chain, 1st order circular chain, 2nd order linear chain, and 2nd order circular chain.

The optimized 1st order Markov chains for TFBS. The optimized arrangement of dependent positions within TFBS for the 1st order Markov model. N and N_D are the total number of motif positions and the number of significantly dependent positions, respectively.

| ID# | Name | N | N_D | Position order |
|---|---|---|---|---|
| 1 | V$AP1_Q4_01 | 8 | 8 | 7-3-1-2-0-6-5-4 |
| 2 | V$AP1_Q6_01 | 9 | 8 | 2-3-1-4-5-7-8-6 |
| 3 | V$AP1_Q2_01 | 12 | 9 | 4-3-5-6-7-10-11-9-1 |
| 4 | V$CDPCR1_01 | 10 | 9 | 3-4-2-9-6-5-7-8-1 |
| 5 | V$ATF_01 | 14 | 8 | 1-0-10-9-11-2-13-12 |
| 6 | V$CHOP_01 | 13 | 10 | 5-4-6-7-9-10-0-8-11-12 |
| 7 | V$CDPCR3_01 | 15 | 10 | 3-0-1-8-9-13-4-6-2-5 |
| 8 | V$CDPCR3HD_01 | 10 | 5 | 1-8-9-2-7 |
| 9 | V$CREB_Q2_01 | 14 | 8 | 1-11-12-0-2-3-9-8 |
| 10 | V$CREB_Q4_01 | 11 | 6 | 7-6-1-8-9-10 |
| 11 | V$CREB_Q3 | 6 | 4 | 4-5-1-0 |
| 12 | V$CEBP_Q3 | 12 | 9 | 8-9-5-6-4-11-3-2-10 |
| 13 | V$CEBPB_01 | 14 | 4 | 0-13-11-3 |
| 14 | V$E2F_Q4_01 | 11 | 4 | 1-8-7-0 |
| 15 | V$E2F_Q6_01 | 12 | 8 | 8-3-7-0-2-11-9-10 |
| 16 | V$E2F1DP1_01 | 8 | 5 | 3-4-0-6-7 |
| 17 | V$E2F1DP2_01 | 8 | 5 | 5-6-7-3-4 |
| 18 | V$E2F4DP1_01 | 8 | 4 | 3-4-0-1 |
| 19 | V$E2F4DP2_01 | 8 | 5 | 4-3-7-1-0 |
| 20 | V$ETS_Q4 | 12 | 8 | 11-2-5-10-4-3-0-1 |
| 21 | V$ELK1_02 | 14 | 4 | 10-11-2-3 |
| 22 | V$FAC1_01 | 14 | 12 | 12-6-10-11-13-4-9-8-5-1-0-7 |
| 23 | V$FOXD3_01 | 12 | 11 | 1-3-8-7-9-10-11-2-0-4-6 |
| 24 | V$FOXO1_02 | 14 | 11 | 8-9-10-12-7-6-2-0-11-1-3 |
| 25 | V$HNF4_Q6 | 9 | 7 | 4-3-2-6-8-1-7 |
| 26 | V$HNF1_Q6 | 18 | 15 | 3-11-12-1-4-8-13-5-9-0-6-16-14-2-10 |
| 27 | V$HNF3_Q6 | 13 | 11 | 1-10-7-5-3-4-12-9-0-2-8 |
| 28 | V$E2F1DP1RB_01 | 8 | 5 | 1-7-3-0-4 |
| 29 | V$IRF7_01 | 18 | 13 | 3-2-0-16-15-17-1-7-6-8-14-9-12 |
| 30 | V$LUN1_01 | 17 | 8 | 8-9-10-7-12-11-14-13 |
| 31 | V$MZF1_01 | 8 | 4 | 0-1-4-5 |
| 32 | V$MYC_Q2 | 7 | 4 | 4-5-3-1 |
| 33 | V$NFAT_Q4_01 | 10 | 4 | 6-8-9-5 |
| 34 | V$NFKAPPAB_01 | 10 | 4 | 5-7-9-2 |
| 35 | V$NKX22_01 | 10 | 6 | 9-8-6-1-0-7 |
| 36 | V$OCT_Q6 | 11 | 10 | 8-2-0-10-5-3-9-6-4-7 |
| 37 | V$PAX_Q6 | 11 | 10 | 10-6-7-0-9-3-1-5-4-2 |
| 38 | V$PAX6_01 | 21 | 21 | 15-17-16-18-6-8-19-13-11-3-2-1-0-20-7-10-4-9-5-14-12 |
| 39 | V$PBX1_02 | 15 | 10 | 6-12-2-0-3-1-11-13-14-4 |
| 40 | V$RSRFC4_Q2 | 17 | 6 | 6-7-0-13-2-3 |
| 41 | V$RSRFC4_01 | 16 | 8 | 6-7-9-1-2-13-12-8 |
| 42 | V$STAT5A_01 | 15 | 7 | 8-12-1-13-0-4-5 |
| 43 | V$SOX9_B1 | 14 | 9 | 1-13-0-2-11-5-3-10-4 |
| 44 | V$SRY_01 | 7 | 4 | 4-6-0-1 |
| 45 | V$SRY_02 | 12 | 4 | 1-3-11-4 |
| 46 | V$STAT5A_02 | 24 | 16 | 7-12-20-15-16-17-18-19-22-1-21-13-5-6-9-23 |
| 47 | V$SP1_Q2_01 | 10 | 7 | 7-3-8-0-4-9-5 |
| 48 | V$SP1_Q4_01 | 13 | 13 | 0-2-11-12-6-1-3-10-9-8-7-4-5 |
| 49 | V$SP1_Q6_01 | 10 | 10 | 3-5-8-9-0-7-4-2-6-1 |
| 50 | V$USF_Q6_01 | 12 | 8 | 3-11-4-5-7-2-1-8 |
| 51 | V$XBP1_01 | 17 | 9 | 13-5-3-4-15-11-10-12-0 |
| 52 | V$ZID_01 | 13 | 8 | 6-7-4-8-12-10-9-11 |
| 53 | I$DRI_01 | 10 | 7 | 6-9-8-7-0-1-2 |

Results suggest that the DNJ method performed remarkably well in optimizing the 1st order linear Markov chains: in 49 out of 53 cases, the DNJ optimized models were the best or close to the best. The performance for the 2nd order linear chains was slightly worse than that for the 1st order linear chains, partially because the DNJ method relies only on the pairwise dependencies between single positions. Nevertheless, most of the DNJ optimized models were still close to the best [see supplement]. The DNJ method also performed well for the 1st order circular Markov chains and the 2nd order circular Markov chains [see supplement].

We used AP1 (activating protein 1) transcription factor binding sites (Transfac ID V$AP1_Q4_01) as an example of how DNJ optimization can improve the performance of a 0–1 or 0–2 mixture model. We plotted the histogram of the per-instance log-likelihoods of the 1,000 randomly permuted models, with the DNJ optimized model's value, log Pr(D | M_DNJ)/n, marked as a reference line (see the figure below).

The performance of the DNJ optimized 0–1 mixture models

**The performance of the DNJ optimized 0–1 mixture models**. The performance of the DNJ optimized 0–1 mixture models of TFBS. The y-axis is 1 minus the proportion of randomly permuted models whose log-likelihood exceeds that of the DNJ optimized model.

Modeling TFBS V$AP1_Q4_01

**Modeling TFBS V$AP1_Q4_01**. The performance of the optimized model of TFBS V$AP1_Q4_01. The histogram is the log-likelihood score distribution of 1,000 randomly permuted mixture models. The red reference line indicates the relative performance of the DNJ optimized model (a) 0–1 mixture linear model (b) 0–1 mixture circular model.

Theoretically, the optimal model can be found by exhaustively searching through all possible models. An exhaustive search is not always possible in practice, however, as the search space can be very large. The number of possible Markov chains is the factorial of the length of the chain and increases dramatically as the length increases. For example, for a motif of 15 bases there are 15! ≈ 1.3 × 10^12 possible chains, so the computational time can be practically unacceptable. Our DNJ method can handle such long motifs because of its computational efficiency.

TFBS identification

One interesting application of our mixture model is TFBS identification. In this assessment, we used a couple of examples to show how OMiMa can improve prediction accuracy when there are position dependencies within a TFBS. We first tested our method on simulated data where the exact dependency structure of a TFBS is known. We tested whether OMiMa can capture such dependency and optimize the Markov model accordingly. Next, we tested our method on real motif data for AP1. In both examples, we compared OMiMa performance to PWM, PVLMM, and the 1^{st }order Markov model (1stMM) with its motif positions in the natural order. PVLMM, run on Microsoft Windows, is based on the variable length Markov model (VLMM)

Simulated TFBS prediction

Many TFBS are palindromic sites bound by heterodimers/homodimers (

Simulation of two palindromic TFBS. Simulation of two palindromic TFBS, A and B. The first 2 columns give the complementary positions of the palindromic TFBS. The 3rd and 4th columns are simulation parameters, which specify the probabilities of forming a complementary base pair. The last 2 columns are the p-values of OMiMa's pairwise χ² tests of position dependency for the simulated data.

| 1st position | 2nd position | Compl. Prob. (A) | Compl. Prob. (B) | p-value (A) | p-value (B) |
|---|---|---|---|---|---|
| 0 | 11 | 0.99 | 0.90 | 4.88e-88 | 3.84e-63 |
| 1 | 10 | 0.95 | 0.85 | 6.62e-72 | 2.66e-56 |
| 2 | 9 | 0.90 | 0.75 | 3.84e-69 | 2.25e-35 |
| 3 | 8 | 0.65 | 0.65 | 1.44e-19 | 5.89e-24 |
| 4 | 7 | 0.50 | 0.50 | 2.00e-07 | 3.05e-03 |
| 5 | 6 | 0.25 | 0.25 | 3.35e-01 | 1.43e-01 |

In our simulation, positions 5 and 6 were generated independently of all other positions, so they should be in the 0th order chains. However, based on OMiMa's pairwise χ² tests for the training data, the position pair 5–8 (p-value = 0.03) in TFBS A and the position pair 6–10 (p-value = 0.04) in TFBS B were declared dependent. That is why positions 5 and 8 were arranged together in the model for TFBS A, and positions 6 and 10 were together for TFBS B. We compared the prediction results of OMiMa's 0–1 mixture model with those of PWM, 1stMM and the 1st order PVLMM (with depth 1). The results (see the table below) show that OMiMa clearly outperformed the other three methods on both TFBS.

Performance evaluation using simulated palindromic TFBS. Performance comparison of OMiMa (1-L-0) with PWM, 1stMM, and PVLMM (order 1 and depth 1) for predicting two simulated TFBS A and B. The performance was measured as the maximum Matthews correlation coefficient.

| Motif | PWM | 1stMM | PVLMM | OMiMa |
|---|---|---|---|---|
| A | 0.306 | 0.414 | 0.807 | 0.914 |
| B | 0.253 | 0.428 | 0.647 | 0.794 |

Performance comparison on the simulated palindromic TFBS

**Performance comparison on the simulated palindromic TFBS**. The performance comparison of different methods for predicting the simulated palindromic TFBS A. The x-axis shows the number of motif sequences used for training. The y-axis is the Matthews correlation coefficient of each method in predicting the same testing dataset (150 false and 150 true sites, respectively). The figure shows that OMiMa performed significantly better than the other methods, regardless of the number of training samples.

AP1 TFBS prediction

We chose human AP1 TFBS for this evaluation (see the sequence logo figure below). Pairwise χ² tests on the 119 true sites suggested that all positions showed some level of dependency, with the neighboring pairs 0–2, 4–5, 5–6, and 4–6 showing strong dependencies (p-value < 1.0e-6). Noticeably, positions 4, 5 and 6 are also the most conserved positions, so we expected that PWM would be a reasonably good model for this TFBS. We randomly split both the true sites and false sites into 10 roughly equal-sized parts, and used 10-fold cross-validation to compare the performance of OMiMa's 0–1 mixture model with the others. OMiMa had an advantage over the other three models in predicting TFBS that do not have strong long-range dependencies (see the table below).

The sequence logos of AP1 TFBS and the donor site

**The sequence logos of AP1 TFBS and the donor site**. Sequence logos of the AP1 TFBS and the donor splice site. The height of bases represents the information content at each position of a sequence motif. (a) the logo of AP1 TFBS. Note that the positions 4 and 6 of AP1 TFBS are not perfectly conserved. (b) the logo of donor splice site. The positions 0 and 1 are perfectly conserved. The logo plot was created by WebLogo [45].

TFBS V$AP1_Q4_01 prediction. Comparison of OMiMa (1-L-0/1-C-0), PWM, 1stMM, and PVLMM (order 1 and depth 1) for AP1 TFBS prediction. The performance results are the average values of 10-fold cross validation.

| Model | — | — | — |
|---|---|---|---|
| PWM | 0.857 | 0.997 | 0.860 |
| 1stMM | 0.839 | 0.998 | 0.870 |
| PVLMM | 0.789 | 0.999 | 0.847 |
| 1-L-0 | 0.866 | 0.998 | 0.882 |
| 1-C-0 | 0.874 | 0.998 | 0.884 |

Donor splice site recognition

The transcription of most higher eukaryotic genes involves RNA splicing, in which primary transcripts become mature mRNA through the removal of introns. The donor (5') splice sites and the acceptor (3') splice sites at the boundaries of exons and introns provide critical signals for precise splicing. Therefore, splice site recognition has been widely used by gene finding tools such as GENSCAN

Comparison with NNSplice and PVLMM

The test dataset of human donor splice sites (Reese data) was from

First, we tested whether OMiMa, using either AIC or BIC, can correctly pick the best model based on ROC analysis. We fitted a set of 0-k mixture models in which the k-th order chains are either linear or circular and k ranges from 0 to 3.

Comparison of different 0-k mixture models for donor splice site prediction

**Comparison of different 0-k mixture models for donor splice site prediction**. Comparison of different 0-k mixture models for donor splice site prediction by ROC curves. Based on the Area Under Curve (AUC) criterion, the figure indicates that: (a) for the training data, the best models were 3-L-1 and 3-C-1 while the worst model was 0-L-1 (same as 0-C-1); (b) for the testing data, the best models were 1-L-1 and 1-C-1 while the worst models were 3-L-1 and 3-C-1.

Using the best model selected above, we then compared OMiMa with NNSplice and PVLMM. NNSplice is based on a complex neural network model and is trained by both true sites and false sites. Since both OMiMa and NNSplice used the same training and testing data, their prediction results can be directly compared. We compared OMiMa's 1-L-1 and 1-C-1 models with the first order PVLMM (with depth 1) as all have similar model complexity. The results of NNSplice were reported at the NNSplice Web site

Comparison of OMiMa with NNSplice and PVLMM for donor site prediction. Comparing two OMiMa models (1-L-1 and 1-C-1) with NNSplice's neural network model and PVLMM (order 1 and depth 1) for donor splice site prediction.

| Network | PVLMM | 1-L-1 | 1-C-1 |
|---|---|---|---|
| 0.951 | 0.927 | 0.955 | 0.954 |
| 0.904 | 0.793 | 0.928 | 0.947 |
| 0.963 | 0.963 | 0.962 | 0.955 |
| 0.857 | 0.786 | 0.869 | 0.869 |
| 0.942 | 0.889 | 0.938 | 0.952 |
| 0.951 | 0.934 | 0.959 | 0.954 |

Comparison with MEM and PVLMM

Given enough training data, we can use more complicated models than the 0–1 mixture model to improve prediction accuracy. In this evaluation, we tested whether 0-k mixture models can compete with the MEM on a much larger dataset. This large donor site dataset (Yeo data), used to assess the performance of the MEM, was constructed so that a decoy site can have exactly the same sequence as a real site. We applied the original training and testing sets to assess the performance of OMiMa, using only log-likelihood ratio scoring. In addition, we ran a 3-fold cross-validation in which the numbers of sites in the new training and testing sets are roughly the same as those in the original ones [see supplement].

Briefly, the MEM notation has the form "me...", where the constrained marginals range over sets of adjacent positions: first order (X_i), second order (X_i, X_{i+1}), third order (X_i, X_{i+1}, X_{i+2}), fourth order (X_i, ..., X_{i+3}), and fifth order (X_i, ..., X_{i+4}).

Comparison of the top 4 performers from each model class suggested that OMiMa performed comparably with MEM and better than PVLMM (Table

Comparison of OMiMa, PVLMM and MEM for donor site prediction. Comparing OMiMa with PVLMM and MEM for donor splice site prediction. The table shows Matthews correlation coefficients; for PVLMM and OMiMa, the paired values correspond to the original training/testing split and the 3-fold cross-validation, respectively.

| MEM sub-model | CC | PVLMM sub-model | CC | OMiMa sub-model | CC |
|---|---|---|---|---|---|
| me2x5 | 0.659 | P:2-2 | 0.629/0.631 | 3-C-1 | 0.658/0.663 |
| me2x4 | 0.655 | P:3-2 | 0.626/0.632 | 3-L-1 | 0.654/0.657 |
| me2x3 | 0.653 | P:4-2 | 0.625/0.630 | 2-C-1 | 0.647/0.657 |
| me5s0 | 0.653 | P:4-3 | 0.622/0.628 | 2-L-1 | 0.643/0.653 |

| TFBS | 0th chain | 1st chain |
|---|---|---|
| A | 6 | 7-4-1-10-3-8-5-0-11-9-2 |
| B | 5 | 2-9-6-10-1-8-3-7-4-11-0 |

Biological explanation

To compare OMiMa's fitted donor site models with biological knowledge about dependencies among positions, we examined the best donor models for the first donor dataset (Reese data) and for the second donor dataset (Yeo data). For convenience, we mark the invariant 'GT' nucleotides at the exon/intron boundary as positions 0 and 1 of the donor site, respectively (see the sequence logo figure above). For the Reese data, the best model's optimized 1st order chain was:

-2 5 -1 3 4 -3 -7 -6 -5 -4 7 6 2

We found that this position arrangement is supported by the following biological evidence of base-pairing between U1 snRNA and the donor site: (a) a 5'/3' compensation effect, whereby a base pair at position -1 can prevent aberrant splicing caused by a mismatched pair at position 5; and (b) additional compensatory interactions among the base-pairing positions. For the Yeo data, the best model's optimized 3rd order chain was:

2 5 -1 4 -2 3 -3

We can see that this model is consistent with the above evidence (a) and (b). In addition, it is well supported by experimentally verified position dependencies of position 4 on the positions -1, -2, 3 and 5

Discussion

The prediction accuracy of a probabilistic model is largely determined by the effectiveness of the model in characterizing a biological motif. Since there is large variation in the signals embedded in biological motifs, an effective model can be as simple as a consensus sequence or as complex as a fully connected network model. In this paper, we described a mixture of Markov models that allows adjusting model complexity for different motifs. We also extended the traditional linear chain Markov model to the circular chain Markov model, which in some cases can better represent position dependencies within a motif. We presented a novel method, DNJ, for efficiently optimizing the position arrangement of a non-0th order Markov chain to incorporate most dependencies. We described methods for calculating the EDF and for selecting the best mixture Markov model. We implemented these methods in our motif-finding OMiMa system, which is freely available. Finally, we demonstrated in several examples, from different aspects, that OMiMa can improve motif prediction accuracy in biological sequences.

The interaction of biological macromolecules, such as transcription factors bound to DNA sites, usually involves several highly dependent positions functioning as a unit. Many methods including Markov chains, Bayesian trees, and neural networks have been used to model dependency structures within a motif. The Markov model is the simplest, yet can be very powerful when optimized. Our results showed that the optimized Markov models performed better than the neural network model and PVLMM, and comparably with MEM, for splice site prediction. The optimized Markov model can incorporate both local and non-local dependencies, which enables it to compete with tree or network models in predicting short biological motifs. We also showed that the optimized Markov model can be an excellent motif predictor. Moreover, it is computationally efficient due to its simplicity.

Model complexity, measured by parameter number, is an important issue in motif modeling. The more complex a model, the more data are needed for adequate training. For many biological motifs, however, the number of known (experimentally determined) sites is small. This limits the usage of complex models, such as higher order Markov models, Bayesian trees, network models or MEM, even though these models in some cases can perform better than the simpler models given enough training data. For a standard Markov model, the number of its parameters increases exponentially as its Markov order increases. Without sufficient training data, it is difficult to accurately estimate all model parameters, even using more robust methods (

More recently, Zhao

In comparison with other leading methods, OMiMa can incorporate more than NNSplice's pairwise dependencies, OMiMa avoids model over-fitting better than PVLMM, and OMiMa requires smaller training samples than MEM. These are the primary reasons that OMiMa showed superior performance, in terms of prediction accuracy, required size of training data, or computational time, over other leading methods in our results.

With any model selection procedure, the possibility of choosing a model that drastically over- or underfits is a concern. OMiMa employs AIC and BIC, two standard criteria that are widely used because they tend to avoid extreme over- or underfitting. Both have theoretical support

Our OMiMa approach has two features that can be limitations when the size of the training data is small. First, the chi-square test that partitions motif positions into those with dependencies and those without dependencies will, like any statistical test, make mistakes, and its statistical power to detect dependencies will suffer with small training samples. Although the test will not always provide a correct partition, our approach should adapt to strong or weak dependencies overall and improve prediction when dependencies are strong. In addition, weakly dependent positions mistakenly placed in the set with no dependencies are often adequately modeled by a 0^{th }order chain, whereas independent positions mistakenly assigned to the set with dependencies will be placed by the DNJ algorithm in locations with the least impact on the ^{th }order chain. Second, the EDF that we used in model selection is an estimate based on the training data. For degenerate sites, the estimate should be accurate with even small training samples; whereas for conserved sites a larger training sample might reveal additional bases and change the EDF. Still, such additions should be minimal and would generally induce small changes in the EDF, so we expect little impact on model selection. Any methods that employ chi-square techniques to test for dependent sites face similar limitations. Nevertheless, OMiMa with its relatively small parameter space should adapt to small training datasets better than many competitors. Of course, any motif finding algorithm would do better with larger training samples.

OMiMa places no limit on the length of sequences that it can scan, and it could be used to find TFBS in any sequenced organism as long as a training motif set is available. The larger the genome evaluated, the more false positives are likely to be declared. Although OMiMa's prediction accuracy will help, other approaches to reducing false positives will be needed. Cross-species comparisons and relative location compared to transcription start sites have been used to reduce false positives and could be used with OMiMa too. Furthermore, OMiMa's ability to accurately and quickly identify splice sites should be easy to incorporate into probabilistic gene-prediction programs where correct prediction of splice sites is critical.

Conclusion

Our optimized mixture of Markov models represents an alternative to existing methods for modeling dependence structures within a biological motif. Unlike many existing methods, our model is conceptually simple and effective, which is an advantage in large-scale motif prediction. In particular, with its ability to minimize model complexity, our method can work effectively even with limited training data. The optimized mixture of Markov models is implemented in our computational tool OMiMa, which can use a variety of mixture models for motif prediction. OMiMa, in which most parameters are configurable, is freely available to all users.

Authors' contributions

W. Huang provided the principal contributions to the conception and design of this study as well as to its analysis. D. M. Umbach and L. Li contributed to the design of the study and the interpretation of results. All authors contributed to writing and critically revising the manuscript.

Acknowledgements

We thank Drs Bruce Weir and Jeffrey Thorne for critically reading the manuscript, and Drs Clarice Weinberg and Joseph Nevins for helpful comments. This research was supported by Intramural Research Programs of the NIH, National Institute of Environmental Health Sciences.