Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, Boston MA 02115, USA

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 44 Binney Street, Boston MA 02115, USA

Dipartimento di Matematica ed Informatica, Via Archirafi 34, Palermo 90123, Italy

Department of Statistics, University of Wisconsin, 1300 University Ave Madison, WI 53706, USA

Abstract

Background

Genome-wide mapping of protein-DNA interactions has been widely used to investigate biological functions of the genome. An important question is to what extent such interactions are regulated at the DNA sequence level. However, current investigation is hampered by the lack of computational methods for systematically evaluating sequence specificity.

Results

We present a simple, unbiased quantitative measure for DNA sequence specificity called the Motif Independent Measure (MIM). By analyzing both simulated and real experimental data, we found that the MIM measure can be used to detect sequence specificity independent of the presence of transcription factor (TF) binding motifs. We also found that the level of specificity associated with H3K4me1 target sequences is highly cell-type specific and highest in embryonic stem (ES) cells. We predicted H3K4me1 target sequences by using the N-score model and found that the prediction accuracy is indeed high in ES cells. The software to compute the MIM is freely available at:

Conclusions

Our method provides a unified framework for quantifying DNA sequence specificity and serves as a guide for development of sequence-based prediction models.

Background

Of the entire 3GB human genome, only about 2% codes for proteins. The identification of biological functions of the entire genome remains a major challenge

An important question is to what extent a specific protein-DNA interaction is mediated at the level of genomic sequences. While it is well known that specific sequence motifs are crucial for transcription factors (TF) mediated

Despite the success of these recent sequence-based prediction models, it remains difficult to determine which sequences lack intrinsic specificity, because a poor prediction outcome might instead imply that a more sophisticated model is required. A guide is therefore needed for developing sequence-based prediction models. To this end, here we present a simple approach to quantify sequence specificity based on the frequency distribution of

We evaluated the performance of our approach by analyzing one simulated dataset and two real experimental datasets, corresponding to a TF (STAT1) and a histone modification (H3K4me1), respectively. Our results provide new insights into the role of DNA sequences in modulating protein-DNA interactions regardless of motif presence.

Results

A simple measure of sequence specificity

While specific sequence information has been identified in the absence of distinct motifs, to our knowledge, it is always associated with enrichment of certain

- M

- I

- M

Model Validation

Simulated data

As an initial evaluation, we synthetically generated 8 sequence sets each containing 2000 sequences, mimicking TF ChIPseq experiments for which the corresponding TF recognizes a single motif: **TTGACA**. The difference between these sequence sets is the motif strength, which is parameterized by a real number

We calculated the MIM values for each sequence set and evaluated the statistical significance of the resulting values. We found that the MIM values are statistically significant (p-value < 0.001) for

**MIM values for simulated sequences**. (a) The MIM values and corresponding p-values (above the bars) for the simulated data. Note that the MIM values change in the same direction as motif strength; (b) comparison of the MIM values with respect to the null distribution, which is obtained by using 1000 sets of random sequences.

Top 20

**Cell1**

| **KL** | **Bhattacharyya** | **Hellinger** |
|---|---|---|
| **tcaa** | **tcaa** | **tcaa** |
| **gaca** | **gaca** | **gaca** |
| **gtca** | **gtca** | **gtca** |
| acag | acag | acag |
| caag | caag | caag |
| attg | caac | attg |
| acat | acac | acat |
| caac | attg | caac |
| acac | acat | acac |
| aatg | aatg | aatg |
| acaa | acaa | acaa |
| ttaa | cgga | ttaa |
| aaat | caaa | aaat |
| cgga | gacc | cgga |
| caaa | agat | caaa |
| aatt | aaat | aatt |
| cata | taaa | cata |
| gacc | ccgc | gacc |
| agat | cgcc | agat |
| gtaa | aagg | gtaa |
| agat | cgcc | agat |

Distance values on synthetic dataset

| **Cell** | **KL** | **p-value** | **Bhattacharyya** | **p-value** | **Hellinger** | **p-value** |
|---|---|---|---|---|---|---|
| **1** | 4.12E-03 | <0.001 | 2.18E-03 | <0.001 | 2.67E-02 | <0.001 |
| **2** | 3.64E-03 | <0.001 | 1.93E-03 | <0.001 | 2.51E-02 | <0.001 |
| **3** | 4.06E-03 | <0.001 | 2.16E-03 | <0.001 | 2.65E-02 | <0.001 |
| **4** | 3.23E-03 | <0.001 | 1.70E-03 | <0.001 | 2.36E-02 | <0.001 |
| **5** | 2.06E-03 | <0.001 | 1.07E-03 | <0.001 | 1.89E-02 | <0.001 |
| **6** | 9.59E-04 | <0.001 | 5.03E-04 | <0.001 | 1.29E-02 | <0.001 |
| **7** | 7.80E-04 | 0.0262 | 4.09E-04 | 0.0497 | 1.16E-02 | 0.0207 |
| **8** | 6.27E-04 | 0.2367 | 3.75E-04 | 0.1670 | 1.04E-02 | 0.2467 |

Real ChIPseq data

To validate our method using real experimental data, we analyzed a publicly available ChIPseq dataset for STAT1 **TTCCNGGAA** (JASPAR database

We evaluated the level of sequence specificity of the whole set of target sequences by using the MIM measure. The sequences are indeed highly specific (see Figure

**MIM values for STAT1 target sequences**. (a) The MIM values and corresponding p-values (above the bars) for different subsets of STAT1 target sequences: all targets, STAT1 motif containing ones, and STAT1 motif absent ones; (b) comparison of the MIM values with respect to the null distribution, which is estimated by using 1000 sets of random sequences.

Top 20

**STAT1 Motif**

| **KL** | **Bhattacharyya** | **Hellinger** |
|---|---|---|
| aata | atat | aata |
| ttaa | tata | ttaa |
| aaat | aata | aaat |
| aaaa | ttaa | aaaa |
| **ggaa** | atta | **ggaa** |
| atat | aaat | atat |
| atac | taaa | atac |
| tcaa | atac | tcaa |
| aatt | aatt | aatt |
| acat | ataa | acat |
| taca | taca | taca |
| aggg | cata | aggg |
| cgga | aaaa | cgga |
| atta | attg | atta |
| attg | acat | attg |
| taga | tcaa | taga |
| caaa | agcg | caaa |
| acta | gata | acta |
| ccag | taga | ccag |
| agca | cgga | agca |

Distance values on STAT1 dataset

| **Peaks** | **KL** | **p-value** | **Bhattacharyya** | **p-value** | **Hellinger** | **p-value** |
|---|---|---|---|---|---|---|
| **STAT1 Motif** | 3.26E-02 | <0.001 | 2.19E-03 | <0.001 | 7.51E-02 | <0.001 |
| **Non STAT1 Motif** | 3.36E-02 | <0.001 | 2.72E-03 | <0.001 | 7.61E-02 | <0.001 |
| **All** | 3.76E-02 | <0.001 | 3.65E-03 | <0.001 | 8.05E-02 | <0.001 |

Detecting sequence specificity in absence of a dominant motif

STAT1

As mentioned above, while the presence of the STAT1 motif can explain the sequence specificity for 35% of the target sequences, it is unclear how the TF is recruited to the other 65% of the targets. To evaluate the role of DNA sequence specificity for these motif-absent targets, we compared the MIM values between the motif-present and motif-absent subsets of targets. Surprisingly, we found that the MIM value for motif-absent targets is almost indistinguishable from that for motif-present targets (see Figure

To gain mechanistic insights, we searched for enrichment of other TF motifs in the JASPAR database ^{-6}): SP1 and ESR1, both have previously been shown to interact with STAT1 ^{-17}). On the other hand, while the motif-present targets are highly enriched for the voltage-gated calcium channel complex (^{-12}), the motif-absent targets are highly enriched for cytoplasmic components instead (^{-12}).

H3K4me1

Unlike TFs, histone (de)modifying enzymes usually do not directly interact with DNA. The role of DNA sequences in the regulation of histone modification patterns remains poorly understood. As an example, the histone modification H3K4me1 plays an important role in gene regulation by demarcating cell-type specific enhancers

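Top 20

**H1 cell line**

| **KL** | **Bhattacharyya** | **Hellinger** |
|---|---|---|
| tcga | tcga | tcga |
| cgaa | tcca | cgaa |
| attc | attc | attc |
| tcca | atgg | tcca |
| atcg | cgaa | atcg |
| ggaa | ggaa | ggaa |
| atgg | aatg | atgg |
| aatg | atcg | aatg |
| aacg | tata | aacg |
| ctta | ttaa | ctta |
| gcta | ctta | gcta |
| ttaa | aacg | ttaa |
| ctaa | gcta | ctaa |
| agct | taaa | agct |
| ggta | ggta | ggta |
| taaa | ataa | cgga |
| cgga | cgga | taaa |
| acgg | atta | acgg |
| ataa | acgg | ataa |
| atta | aaaa | atta |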

**MIM values for H3K4me1 target sequences**. (a) The MIM values and corresponding p-values (above the bars) for H3K4me1 target sequences in different cell lines. Note that the MIM value for H1 is much higher than for other cell lines; (b) comparison of the MIM values with respect to the null distribution, which is estimated from 1000 sets of random sequences.

Distance values on H3K4me1 dataset

| **Cell** | **KL** | **p-value** | **Bhattacharyya** | **p-value** | **Hellinger** | **p-value** |
|---|---|---|---|---|---|---|
| **H1** | 4.43E-01 | <0.001 | 1.28E-02 | <0.001 | 2.71E-01 | <0.001 |
| **Cd4+** | 1.97E-01 | <0.001 | 8.47E-03 | <0.001 | 1.82E-01 | <0.001 |
| **NHEK** | 1.10E-01 | <0.001 | 4.10E-03 | <0.001 | 1.37E-01 | <0.001 |
| **K562** | 0.083176 | <0.001 | 0.003584 | <0.001 | 0.119491 | <0.001 |
| **Cd133+** | 0.064867 | <0.001 | 0.006796 | <0.001 | 0.105424 | <0.001 |
| **Cd36+** | 0.026875 | <0.001 | 0.002996 | <0.001 | 0.067992 | <0.001 |
| **HUVEC** | 0.014557 | <0.001 | 0.002912 | <0.001 | 0.050102 | <0.001 |

**Choice of the null model for sequence specificity**. (a) The MIM values for H3K4me1 target sequences in different cell lines, with a null model obtained by shuffling the original sequences. (b) The MIM values for the same experiment, using as a null model a set of random sequences extracted from the genome with matching lengths. Note that the H1 cell line is far more specific than the other cell lines independently of the null model chosen.

Since H3K4me1 marks cell-type specific enhancers, one possible explanation for the high sequence specificity in ES cells is that the targets might be associated with a few ES-specific TFs. To test this possibility, we searched for enrichment of TF motifs in the JASPAR database using FIMO. Surprisingly, we were unable to find any significantly enriched motif, suggesting that the specificity is attributable to a different mechanism.

We then investigated whether the H3K4me1 targets in ES cells are indeed highly predictable. In previous work, we developed a sequence-based model, called the N-score model, to predict epigenetic targets

**N-score prediction of H3K4me1 target sequences**. Receiver operating characteristic (ROC) curves for different cell lines using the N-score. Note that the AUC for H1 is much higher than for other cell lines.

Discussion

Recently it has been shown that a large number of proteins may weakly bind to DNA

We also showed that the MIM measure can provide new biological insights. Specifically, we found that the motif-absent targets of a TF may also contain specific sequence information due to interaction with other TFs. We also found that the sequence specificity for H3K4me1 targets is higher in ES cells than in differentiated cell-types, suggesting a unique role of DNA sequence in the recruitment of H3K4me1 in ES cells. Interestingly, this high specificity cannot be explained by enrichment of known TF motifs, suggesting a yet uncharacterized recruitment mechanism in ES cells. The MIM algorithm is implemented in Python and can be freely accessed at :

Conclusion

The role of DNA sequence in gene regulation remains incompletely understood. Our MIM method has extended previous work by further accounting for sequence specificity due to accumulation of weak sequence features. The information can be used as a guide to systematically investigate the regulatory mechanisms for a wide variety of biological processes.

Methods

Synthetic data generation

We simulated ChIPseq data for a TF whose motif sequence is **TTGACA**. In order to simulate the variation of motif sites among different target sequences, we modeled the position weight matrix (PWM) as illustrated in Table

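PWM for synthetic motif generation

| | **1** | **2** | **3** | **4** | **5** | **6** |
|---|---|---|---|---|---|---|
| **A** | ε | ε | ε | 1-3ε | ε | 1-3ε |
| **C** | ε | ε | ε | ε | 1-3ε | ε |
| **G** | ε | ε | 1-3ε | ε | ε | ε |
| **T** | 1-3ε | 1-3ε | ε | ε | ε | ε |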
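The PWM above can be turned into a small simulator. The following is a minimal sketch rather than the authors' generator: it assumes a uniform random background, exactly one implanted site per sequence, and a caller-chosen sequence length, none of which are specified in the surviving text. The function names (`build_pwm`, `sample_site`, `implant`, `simulate_set`) are hypothetical.

```python
import random

BASES = "ACGT"
CONSENSUS = "TTGACA"  # motif stated in the text


def build_pwm(consensus, eps):
    """One {base: probability} column per position: 1-3*eps on the consensus base, eps elsewhere."""
    if not 0.0 <= eps <= 0.25:
        raise ValueError("eps must lie in [0, 0.25] so that 1 - 3*eps >= eps")
    return [{b: (1.0 - 3.0 * eps if b == c else eps) for b in BASES}
            for c in consensus]


def sample_site(pwm, rng):
    """Draw one motif instance, position by position, from the PWM."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in pwm
    )


def implant(background, site, rng):
    """Overwrite a random window of the background with the sampled site."""
    start = rng.randrange(len(background) - len(site) + 1)
    return background[:start] + site + background[start + len(site):]


def simulate_set(n_seqs, seq_len, eps, seed=0):
    """Assumption: uniform background and a single implanted site per sequence."""
    rng = random.Random(seed)
    pwm = build_pwm(CONSENSUS, eps)
    seqs = []
    for _ in range(n_seqs):
        background = "".join(rng.choices(BASES, k=seq_len))
        seqs.append(implant(background, sample_site(pwm, rng), rng))
    return seqs
```

With `eps = 0` every implanted site equals the consensus, and with `eps = 0.25` the PWM is uniform, so the implanted sites carry no signal; intermediate values interpolate between the two.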

ChIPseq data source

Genome-wide STAT1 peak locations in HeLa S3 cell lines were obtained from the

Motif analysis

Motif analysis was done by using several tools in the MEME suite (

Functional annotation

Functional annotation was done by using the GOrilla software

Details of the MIM measure

Each DNA sequence is mapped to numerical values by enumerating the frequency of each $k$-mer. Let **P** = ($p_{ij}$) be the $k$-mer frequency matrix of the sequence set **S** = ($s_i$), where $s_i$ represents a sequence in the set **S**. We generate a set of **R** = (
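The mapping from sequences to frequencies can be illustrated as follows. This is a sketch under stated assumptions, not the released implementation: it uses k = 4 (the tables in this paper list 4-mers), pools counts over the whole sequence set, skips windows containing characters other than A, C, G, T, and adds a pseudocount so that no frequency is zero. The pseudocount and the function names are illustrative choices.

```python
from collections import Counter
from itertools import product

VALID = frozenset("ACGT")


def kmer_counts(seq, k=4):
    """Count overlapping k-mers in one sequence, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if VALID.issuperset(kmer):
            counts[kmer] += 1
    return counts


def kmer_distribution(seqs, k=4, pseudocount=1.0):
    """Pooled k-mer frequencies over a sequence set, as a dict over all 4**k k-mers."""
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]
    total = Counter()
    for s in seqs:
        total.update(kmer_counts(s, k))
    denom = sum(total.values()) + pseudocount * len(alphabet)
    return {kmer: (total[kmer] + pseudocount) / denom for kmer in alphabet}
```

The pseudocount keeps every probability strictly positive, which matters for divergences that take a logarithm of a ratio.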

The MIM value corresponding to **S** is defined as the expected value of $D_{kl}$(**S**, **R**), which is estimated by averaging over 1000 sets of random sequences. The MIM value, using the symmetrical KL divergence, can be interpreted as the expected number of extra bits required to code samples from **S** when using a code based on the background distribution. Note that there exist several alternatives to measure the similarity of two probability distributions

1) The Hellinger distance

whose main differences from $D_{kl}$ are 1) $D_{hl}$ naturally satisfies the triangle inequality; and 2) the range of $D_{hl}$ is the interval [0,1].

2) The Bhattacharyya distance

which has been widely used for pattern recognition in computer science

where **P**$_j$ (**Q**$_j$).
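For reference, the three measures named in this section can be written compactly. The sketch below uses common textbook forms, which may differ from the exact normalization used in the paper: the symmetric KL divergence is taken as KL(p||q) + KL(q||p) in bits, the Hellinger distance as sqrt(1 - BC), and the Bhattacharyya distance as -ln(BC), where BC is the Bhattacharyya coefficient. Inputs are assumed to be aligned probability vectors over the same k-mers.

```python
import math


def bhattacharyya_coefficient(p, q):
    """BC = sum_j sqrt(p_j * q_j); equals 1 for identical distributions."""
    return sum(math.sqrt(pj * qj) for pj, qj in zip(p, q))


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) in bits; requires strictly positive entries."""
    return sum((pj - qj) * math.log2(pj / qj) for pj, qj in zip(p, q))


def hellinger(p, q):
    """Hellinger distance, bounded in [0, 1]."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))


def bhattacharyya(p, q):
    """Bhattacharyya distance; requires overlapping support (BC > 0)."""
    return -math.log(bhattacharyya_coefficient(p, q))
```

All three are zero when the two distributions coincide and grow as they diverge; only the Hellinger distance is bounded and satisfies the triangle inequality.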

In order to estimate the null distribution, we generated 1000 sets of random sequences and then calculated MIM values for each random sequence set. The probability density function (pdf) was estimated by using a kernel method
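One way to turn a set of null MIM values into a p-value is a Gaussian kernel density estimate with a closed-form upper tail. This is only a sketch: the specific kernel and bandwidth used in the paper are not recoverable from the text, so the Gaussian kernel and the normal-reference bandwidth rule below are assumptions, as are the function names.

```python
import math
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth: 1.06 * sd * n**(-1/5). An assumed default."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-0.2)


def kde_upper_tail(observed, null_values, bandwidth=None):
    """P(X > observed) under a Gaussian KDE fitted to the null MIM values."""
    h = bandwidth if bandwidth is not None else rule_of_thumb_bandwidth(null_values)
    if h <= 0:
        raise ValueError("bandwidth must be positive")
    scale = h * math.sqrt(2.0)
    # Each kernel is N(v, h^2); its upper tail at `observed` is 0.5 * erfc((observed - v) / (h * sqrt(2))).
    tails = [0.5 * math.erfc((observed - v) / scale) for v in null_values]
    return sum(tails) / len(tails)
```

In the setting described here, `null_values` would hold the MIM values of the 1000 random sequence sets and `observed` the MIM value of the target set. A smoothed tail of this kind can resolve p-values below 1/1000, which a plain empirical fraction cannot.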

N-score model

The N-score model was described previously

Most informative k-mers selection

Given **P**$_j$ and **Q**$_j$ associated with **S** and **R**, respectively, it is possible to calculate their Kullback-Leibler (KL) divergence for each
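A per-k-mer ranking of this kind can be sketched as follows. The assumption here is that each k-mer is scored by its own term in the symmetric KL sum, (p - q) * log2(p / q), and that the highest-scoring k-mers are reported; the exact scoring used for the "Top 20" tables is not recoverable from the text. The function name `top_kmers` is hypothetical, and both inputs are assumed to be dicts over the same k-mers with strictly positive frequencies.

```python
import math


def top_kmers(p, q, n=20):
    """Rank k-mers by their contribution to the symmetric KL divergence between p and q."""
    contrib = {
        kmer: (p[kmer] - q[kmer]) * math.log2(p[kmer] / q[kmer])
        for kmer in p
    }
    ranked = sorted(contrib, key=contrib.get, reverse=True)
    return [(kmer, contrib[kmer]) for kmer in ranked[:n]]
```

Because each term is non-negative, this score highlights k-mers that are either enriched or depleted relative to the background; comparing `p[kmer]` with `q[kmer]` tells the two cases apart.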

Authors' contributions

LP and GY conceived and designed the study. LP and GL implemented the MIM methodology. LP and BH analyzed the data. LP and GY interpreted the data. All authors wrote, read and approved the manuscript.

Acknowledgements

We thank Zhen Shao for help with H3K4me1 data collection and initial processing. GY's research was supported by the NIH grant HG005085 and a Career Incubator Award from the Harvard School of Public Health.