Department of Computer Science and Engineering, University of Connecticut, Fairfield Road, Storrs, CT 06269, USA

Abstract

Background

Computational approaches to transcription factor binding site identification have been actively researched in the past decade. Learning from known binding sites, new binding sites of a transcription factor in unannotated sequences can be identified. A number of search methods have been introduced over the years. However, one can rarely find one single method that performs the best on all the transcription factors. Instead, to identify the best method for a particular transcription factor, one usually has to compare a handful of methods. Hence, it is highly desirable for a method to perform automatic optimization for individual transcription factors.

Results

We proposed to search for transcription factor binding sites in vector spaces. This framework allows us to identify the best method for each individual transcription factor. We further introduced two novel methods, the negative-to-positive vector (NPV) and optimal discriminating vector (ODV) methods, to construct query vectors to search for binding sites in vector spaces. Extensive cross-validation experiments showed that the proposed methods significantly outperformed the ungapped likelihood under positional background method, a state-of-the-art method, and the widely-used position-specific scoring matrix method. We further demonstrated that motif subtypes of a TF can be readily identified in this framework and two variants called the

Conclusions

We conclude that the proposed framework is highly flexible. It enables the two novel methods to automatically identify a TF-specific subspace to search for binding sites. Implementations are available as source code at:

Background

Transcription of genes followed by translation of their transcripts into proteins determines the type and functions of a cell. Expression of certain genes even initiates or suppresses differentiation of stem cells. It is therefore crucial to understand the mechanisms of transcriptional regulation. Among them, transcription factor (TF) binding is the one that has been given considerable attention by computational biologists for the past decade and is still being actively researched. A TF is a protein or protein complex that regulates transcription of one or more genes by binding to the double-stranded DNA. A first step in computational identification of target genes regulated by a TF is to pinpoint its binding sites in the genome. Once the binding sites are found, the putative target genes can be searched and located in flanking regions of the binding sites.

In general, there are two approaches to computational transcription factor binding site (TFBS) identification: motif discovery and TFBS search. The former assumes that a set of sequences is given and each of the sequences may or may not contain TFBSs. An algorithm then predicts the locations and lengths of TFBSs. The term motif refers to the pattern that is shared by the discovered TFBSs. These algorithms rely on no prior knowledge of the motif and hence are known as

A typical TFBS search method searches for the binding sites of a particular transcription factor in the following manner. It scans a target DNA sequence and compares each length ^{th} letter in an

One assumption the PSSM representation makes is that positions in a binding site are independent, which is often not the case. Osada

In this work, we approach the TFBS search problem from a different perspective. We propose to search for binding sites in vector spaces. Specifically,

In this framework, we propose two novel approaches to constructing a query vector for a TF of interest. We compare the proposed methods to a state-of-the-art method, the ULPB method, as well as the widely-used PSSM method. Performance of a method is assessed by cross-validation experiments on two data sets collected from RegulonDB

The paper is organized as follows. In Methods, we present the novel negative-to-positive vector and optimal discriminating vector methods, in addition to introducing the existing methods compared in this work. Cross-validation results on prokaryotic and eukaryotic transcription factors are presented and discussed in Results and Discussion. Finally, we give the concluding remarks in Conclusions.

Methods

Data sets

To understand the compared methods in this work, we experimented on prokaryotic as well as eukaryotic transcription factors. The known prokaryotic TF binding sites were collected from RegulonDB

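| Name | Length | # TFBSs | Name | Length | # TFBSs |
|---|---|---|---|---|---|
| MetJ | 8 | 29 | Lrp | 12 | 62 |
| SoxS | 18 | 19 | H-NS | 15 | 37 |
| FlhDC | 16 | 20 | AraC | 18 | 20 |
| Fis | 15 | 206 | ArcA | 15 | 93 |
| IHF | 13 | 101 | OmpR | 20 | 22 |
| PhoB | 20 | 17 | GlpR | 20 | 23 |
| OxyR | 17 | 41 | CpxR | 15 | 37 |
| NarL | 7 | 90 | CRP | 22 | 249 |
| TyrR | 18 | 19 | NarP | 7 | 20 |
| Fur | 19 | 81 | LexA | 20 | 40 |
| NtrC | 17 | 17 | FNR | 14 | 87 |
| MalT | 10 | 20 | PhoP | 17 | 21 |
| ArgR | 18 | 32 | NsrR | 11 | 37 |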

The known eukaryotic TF binding sites were collected from the JASPAR CORE database (the 4^{th} release)

**Mus musculus**

| ID | Name | Length | # TFBSs |
|---|---|---|---|
| MA0039.2 | Klf4 | 10 | 4336 |
| MA0047.2 | Foxa2 | 12 | 809 |
| MA0062.2 | GABPA | 11 | 87 |
| MA0065.2 | PPARG::RXRA | 15 | 839 |
| MA0104.2 | Mycn | 26 | 85 |
| MA0141.1 | Esrrb | 12 | 3613 |
| MA0142.1 | Pou5f1 | 15 | 1332 |
| MA0143.1 | Sox2 | 15 | 666 |
| MA0144.1 | Stat3 | 19 | 830 |
| MA0145.1 | Tcfcp2l1 | 14 | 3931 |
| MA0146.1 | Zfx | 20 | 477 |
| MA0147.1 | Myc | 10 | 682 |
| MA0154.1 | EBF1 | 10 | 21 |

**Homo sapiens**

| ID | Name | Length | # TFBSs |
|---|---|---|---|
| MA0037 | GATA3 | 6 | 20 |
| MA0052 | MEF2A | 10 | 31 |
| MA0077 | SOX9 | 9 | 45 |
| MA0080.2 | SPI1 | 7 | 35 |
| MA0083 | SRF | 12 | 26 |
| MA0112.2 | ESR1 | 20 | 472 |
| MA0115 | NR1H2::RXRA | 17 | 22 |
| MA0137.2 | STAT1 | 15 | 2082 |
| MA0138 | REST | 19 | 22 |
| MA0138.2 | REST | 11 | 871 |
| MA0139.1 | CTCF | 11 | 944 |
| MA0148.1 | FOXA1 | 11 | 896 |
| MA0149.1 | EWSR1-FLI1 | 17 | 101 |
| MA0159.1 | RXR::RAR_DR5 | 17 | 23 |
| MA0258.1 | ESR2 | 18 | 356 |

Notation

For clarity, we list and define functions and variables used throughout this paper. Please see Additional file

**Detailed notation.**


• _{i}(

• _{i,j}(

• _{i}(

•

•

where

•

where _{1},_{2},_{1},_{2} ∈ {A, C, G, T}.

• _{i} denotes the information content at position _{i} attains the maximal entropy of 2 and we are most uncertain about the letter at position _{i} is simply defined as

• _{i,j} denotes the information content of the position pair (

where the maximal entropy of 4 is attained when

Embedding short sequences in vector spaces

We describe how a short sequence of _{i} denote its ^{th} nucleotide. Each nucleotide in _{i} is converted to _{i} = 1 for


**Illustration of embedding a short sequence in vector space.** Each nucleotide in the sequence is converted to 4 indicator variables.
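The conversion of each nucleotide into 4 indicator variables can be illustrated with a minimal Python sketch. The function name `embed_single` and the nucleotide order A, C, G, T are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
NUCLEOTIDES = "ACGT"  # assumed ordering of the 4 indicator variables


def embed_single(seq):
    """Map a DNA string of length l to a list of 4*l indicator variables."""
    vector = []
    for base in seq.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        # Exactly one of the 4 indicators is set to 1 for each position.
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector


print(embed_single("ACG"))
# [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
```

A sequence of length 3 therefore becomes a point in a 12-dimensional space, with exactly 3 non-zero coordinates.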

We further consider nucleotide pair (_{i},_{j}), where _{i},_{j}) only if _{i},_{j}) is similarly converted to 16 variables,

In this study, we consider two choices of _{i}’s and _{i,j}’s. For the first choice, all the nucleotides and nucleotide pairs are given the same weight, i.e., _{i} = 1 and _{i,j} = 1 for all ^{th} nucleotide according to the information content at position
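The weighted embedding of nucleotides and nucleotide pairs described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: pairs are restricted to adjacent positions (the exact condition on position pairs is not reproduced here), weights are supplied by the caller (all 1 for the first choice, or values derived from information content for the second), and the name `embed_with_pairs` is hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# The 16 possible nucleotide pairs, in a fixed (assumed) order.
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def embed_with_pairs(seq, single_weights=None, pair_weights=None):
    """Embed seq as weighted indicators: 4 per nucleotide, 16 per adjacent pair."""
    length = len(seq)
    if single_weights is None:
        single_weights = [1.0] * length        # first choice: equal weights
    if pair_weights is None:
        pair_weights = [1.0] * (length - 1)
    vector = []
    for i, base in enumerate(seq):
        vector.extend(single_weights[i] if base == n else 0.0 for n in NUCLEOTIDES)
    for i in range(length - 1):                # assumption: adjacent pairs (i, i+1) only
        pair = seq[i:i + 2]
        vector.extend(pair_weights[i] if pair == d else 0.0 for d in DINUCLEOTIDES)
    return vector


v = embed_with_pairs("ACG")
print(len(v), sum(v))
# 44 5.0
```

With equal weights, a 3-mer yields 3 × 4 + 2 × 16 = 44 coordinates, of which 5 are non-zero. Passing position-specific weights scales the corresponding non-zero coordinates, which is how an information-content weighting would enter the embedding.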

Searching for TFBSs in vector spaces

Given a query vector **t** in space, we score an

where **s** denotes the corresponding vector of

As described above, an **t** to search for binding sites in
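Searching with a query vector can be sketched as a sliding-window scan. The sketch below assumes that the score is the inner product between the query vector and the embedded window, uses only single-nucleotide indicators with unit weights, and scans the forward strand only; the names `embed`, `score` and `scan` are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def embed(seq):
    """One-hot embedding: 4 indicator variables per nucleotide."""
    return [1.0 if base == n else 0.0 for base in seq for n in NUCLEOTIDES]


def score(query, seq):
    """Inner product between the query vector and the embedded sequence."""
    return sum(q * x for q, x in zip(query, embed(seq)))


def scan(query, target, length):
    """Score every window of the given length along the target sequence."""
    return [(pos, score(query, target[pos:pos + length]))
            for pos in range(len(target) - length + 1)]


# Toy query: the embedding of "ACG" itself (illustrative, not a trained vector).
query = embed("ACG")
hits = scan(query, "TTACGTT", 3)
print(max(hits, key=lambda hit: hit[1]))
# (2, 3.0)
```

In practice the query vector would come from a method such as NPV or ODV rather than from a single sequence, and windows above a chosen score threshold would be reported as putative binding sites.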

The NPV method

We first introduce a simple approach to constructing a query vector. Let _{+} binding sites and _{−} non-binding sites of a particular transcription factor. We embed all the

as well as the mean non-binding site vector

The query vector **t** is found by subtracting

**Illustration of the NPV method.** The solid arrow represents the negative-to-positive vector _{+} − _{−}, pointing from _{−} to _{+}. The hollow triangles denote the known binding sites, whereas the circles represent the known non-binding sites. The center of the binding site vectors is marked by the solid triangle, while the center of the non-binding site vectors is marked by the solid circle.

The score of an

We can see that it computes the similarity between

From the perspective of geometry, we note that Score(**t**|| , where ||

we know Score(**t**|| equals the orthogonal projection of


**The orthogonal projection of ** It can be seen that the projection of
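The NPV construction described above (mean binding-site vector minus mean non-binding-site vector, followed by an inner-product score and its geometric reading as a projection) can be sketched in a few lines. This is a minimal illustration, assuming the sequences have already been embedded; the toy 2-dimensional vectors and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def npv_query(positives, negatives):
    """Negative-to-positive vector: mean of positives minus mean of negatives."""
    mu_pos = mean_vector(positives)
    mu_neg = mean_vector(negatives)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def score(query, vector):
    """Inner product of the query vector and an embedded sequence."""
    return sum(q * x for q, x in zip(query, vector))


def projection_length(query, vector):
    """Signed length of the orthogonal projection of vector onto query."""
    return score(query, vector) / math.sqrt(score(query, query))


# Toy embedded vectors standing in for binding and non-binding sites.
positives = [[2.0, 0.0], [4.0, 2.0]]
negatives = [[0.0, 0.0], [0.0, 2.0]]
t = npv_query(positives, negatives)
print(t, score(t, [2.0, 5.0]), projection_length(t, [2.0, 5.0]))
# [3.0, 0.0] 6.0 2.0
```

Dividing the score by the norm of the query vector gives the projection length, so ranking candidate sequences by score is equivalent to ranking them by how far their projections lie along the negative-to-positive direction.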

The ODV method

We have described the NPV method, which offers a heuristic way of constructing a query vector. We now introduce a way of finding an optimal query vector _{+} and |_{−}, that is, there are _{+} binding sites and _{−} non-binding sites for a particular TF. Let _{(1)},_{(2)},…,_{(n_{+})}} and _{(i)} denotes the ^{th} _{+} + _{−}. We find the optimal **β** by solving the following minimization problem:

The constraint in (8) ensures that the projection of a TFBS _{(i)} onto the vector **β**,

The optimization problem in (7) is known as a quadratic programming problem with linear inequality constraints specified in (8), (9) and (10). There are **β**. We can see that (8) and (9) specify

The PSSM and ULPB methods

We briefly describe the ungapped likelihood under positional background (ULPB) method proposed in

where _{i} denotes the ^{th} letter in _{i}(_{i})/_{i}) is used in place of _{i}(_{i}), where _{i}) is the background probability of _{i}. The simpler form in (11) was compared in

The ULPB models a TFBS by a first-order Markov chain and models the background by another first-order Markov chain. The background transition probabilities are estimated using the entire genome of a species and hence the ULPB method uses negative examples implicitly. It scores an

Although ULPB does not consider background probability in the first term of (12), the score is approximately the log-likelihood ratio of the two Markov chains.

The main difference between the PSSM method and the NPV, ODV and ULPB methods is that the PSSM method does not score nucleotide pairs nor does it utilize a background distribution. The NPV and ODV methods explicitly take advantage of negative binding sites, while the ULPB method does it implicitly by using a background distribution. The flexibility of the proposed framework allows the NPV and ODV methods to easily search in subspaces, further distinguishing the PSSM and ULPB methods from the proposed ones.
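To make the contrast between the two baseline scores concrete, the following sketch implements a background-free PSSM-style score and a first-order Markov chain log-ratio score in the spirit of the ULPB description above. It is not the published ULPB implementation: the exact forms of (11) and (12), the use of pseudocounts, and the names `pssm_score`, `background_transitions` and `markov_ratio_score` are all assumptions made for illustration.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"


def pssm_score(seq, sites, pseudo=1.0):
    """Sum of log position-specific frequencies (no background term)."""
    total = 0.0
    for i, base in enumerate(seq):
        count = sum(1 for site in sites if site[i] == base)
        total += math.log((count + pseudo) / (len(sites) + 4 * pseudo))
    return total


def background_transitions(genome, pseudo=1.0):
    """First-order transition probabilities b(y | x) from a background sequence."""
    pair_counts = Counter(zip(genome, genome[1:]))
    probs = {}
    for x in NUCLEOTIDES:
        row_total = sum(pair_counts[(x, y)] for y in NUCLEOTIDES)
        for y in NUCLEOTIDES:
            probs[(x, y)] = (pair_counts[(x, y)] + pseudo) / (row_total + 4 * pseudo)
    return probs


def markov_ratio_score(seq, sites, bg, pseudo=1.0):
    """Log first-letter probability plus log ratios of site vs. background transitions."""
    first = sum(1 for site in sites if site[0] == seq[0])
    total = math.log((first + pseudo) / (len(sites) + 4 * pseudo))
    for i in range(1, len(seq)):
        prev_count = sum(1 for site in sites if site[i - 1] == seq[i - 1])
        pair_count = sum(1 for site in sites
                         if site[i - 1] == seq[i - 1] and site[i] == seq[i])
        p = (pair_count + pseudo) / (prev_count + 4 * pseudo)
        total += math.log(p / bg[(seq[i - 1], seq[i])])
    return total


# Toy data: three aligned sites and a short stand-in for a genome.
sites = ["ACGT", "ACGT", "ACGA"]
bg = background_transitions("AACAGATCCGCTGGTTA")
print(pssm_score("ACGT", sites) > pssm_score("TTTT", sites))
print(markov_ratio_score("ACGT", sites, bg) > markov_ratio_score("TTTT", sites, bg))
```

The first function ignores both nucleotide pairs and the background, while the second uses negative information only implicitly, through the background transition table.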

Results and discussion

Performance assessment and evaluation metrics

The performance of a TFBS search method is evaluated by _{+} TFBSs of length _{test}, called the _{train}, called the _{+} binding sites. It is comprised of all the

The _{+} TFBSs are first divided into _{test} is left out. The rest of the TFBSs are therefore called the _{test} along with the non-TFBSs in _{test} are then scored by the scoring function. To score a test sequence, both the forward and reverse strands are scored and, in case the test sequence is longer or shorter than _{test}, we find its rank relative to all the non-TFBSs in _{test}. Formally, the rank of _{test}|Score(

After the _{+} ranks, each of which corresponds to a TFBS. To allow comparison of methods, we use the area under the ROC curve (AUC) to gauge the performance of a method on the TF. The ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR), displaying the trade-off between TPR and FPR. We refer readers to
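The rank-based evaluation can be sketched as follows. The sketch is a simplification under stated assumptions: a single pool of non-TFBS scores is used for all held-out TFBSs (rather than one test set per fold), the rank is taken as one plus the number of non-TFBSs scoring strictly higher, and ties contribute one half to the AUC. The function names are hypothetical.

```python
def rank_among_negatives(pos_score, neg_scores):
    """Rank of a held-out TFBS: 1 + number of non-TFBSs scoring strictly higher."""
    return 1 + sum(1 for s in neg_scores if s > pos_score)


def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (TFBS, non-TFBS) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.4]
neg = [0.5, 0.3, 0.1, 0.4]
print([rank_among_negatives(p, neg) for p in pos], auc_from_scores(pos, neg))
# [1, 2] 0.8125
```

An AUC of 1 means every TFBS outranks every non-TFBS, whereas 0.5 corresponds to random ordering.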

Prokaryotic transcription factor binding sites

To understand the behavior of search methods on prokaryotic TF binding sites, we conducted 10-fold cross-validation experiments on the 26-TF RegulonDB data set. The proposed NPV and ODV methods were compared to the ULPB method

Figure

**Comparison of the PSSM, ULPB, NPV and ODV methods on the RegulonDB data set.** **(a)** Plot of AUC values across the 26 prokaryotic TFs for each method. **(b)** Matrix of

Eukaryotic transcription factor binding sites

Here we compare the proposed NPV and ODV methods to the ULPB and PSSM methods on eukaryotic TF binding sites. As in the previous section, we conducted 10-fold cross-validation experiments on the 28-TF JASPAR data set. Figure

**Comparison of the PSSM, ULPB, NPV and ODV methods on the JASPAR data set.** **(a)** Plot of AUC values across the 28 eukaryotic TFs for each method. **(b)** Matrix of

Similarly, statistical tests

Motif subtype identification in vector spaces

It has been shown that the binding sites of a TF can be better represented by 2 motif subtypes than by a single motif

We demonstrate that motif subtypes can be readily identified once we embed **μ**

where **μ**

**Illustration of the kNPV method.** The solid arrows represent the negative-to-positive vectors _{+ 1} − _{−} and _{+ 2} − _{−}, pointing from _{−} to _{+ 1} and _{+ 2}, respectively. The hollow triangles denote the known binding sites, whereas the circles represent the known non-binding sites. The centers of the binding site vectors are marked by the solid triangles, while the center of the non-binding site vectors is marked by the solid circle.

Similarly, the

where **β**

We assessed the ^{7} and 8.31 × 10^{−5}, respectively). Results in Figure ^{3} and 3.04 × 10^{−3}, respectively). The

**The kNPV (kODV) method versus the NPV (ODV) method.** The number of motif subtypes

Independent validation on ChIP-seq data

To evaluate the proposed NPV and ODV methods on the whole genome scale, we built TF models using TFBSs in the JASPAR database to scan all the human (build hg19) 1000-base promoter sequences obtained from the UCSC Genome Browser database

| ENCODE | JASPAR | Name | PSSM | ULPB | NPV | S | IC | ODV | S | IC |
|---|---|---|---|---|---|---|---|---|---|---|
| GATA3_(SC-268) | MA0037 | GATA3 | 0.48922 | 0.46841 | 0.50963 | 1 | Y | **0.51441** | 1 | Y |
| MEF2A | MA0052 | MEF2A | 0.42566 | **0.45955** | 0.35283 | 3 | Y | 0.34807 | 3 | N |
| PU.1 | MA0080.2 | SPI1 | 0.50631 | 0.49267 | 0.57575 | 3 | Y | **0.58014** | 3 | N |
| SRF | MA0083 | SRF | 0.34299 | 0.38457 | **0.43920** | 2 | N | 0.43183 | 3 | N |
| NRSF | MA0138 | REST | **0.50615** | 0.46371 | 0.46603 | 1 | N | 0.47956 | 2 | N |
| | MA0138.2 | REST | 0.48031 | 0.48299 | 0.49070 | 3 | Y | **0.49522** | 3 | N |
| ERalpha_a | MA0112.2 | ESR1 | **0.53980** | 0.49058 | 0.52414 | 3 | N | 0.52146 | 1 | N |
| STAT1 | MA0137.2 | STAT1 | 0.55348 | 0.58555 | 0.61733 | 1 | N | **0.62338** | 1 | Y |
| CTCF | MA0139.1 | CTCF | 0.60370 | 0.60377 | 0.63785 | 2 | Y | **0.64769** | 2 | Y |
| CTCF_(C-20) | | | 0.44108 | 0.44696 | 0.53181 | | | **0.54306** | | |
| CTCF_(SC-5916) | | | 0.46729 | 0.47047 | 0.54097 | | | **0.55028** | | |
| FOXA1_(C-20) | MA0148.1 | FOXA1 | 0.48083 | 0.48698 | 0.48994 | 3 | Y | **0.49853** | 3 | N |
| FOXA1_(SC-101058) | | | 0.48897 | 0.48326 | 0.49945 | | | **0.50986** | | |
| EBF | MA0154.1 | EBF1 | 0.50011 | 0.51202 | 0.56084 | 3 | Y | **0.59172** | 3 | N |
| EBF1_(C-8) | | | 0.42214 | 0.43705 | 0.52067 | | | **0.53207** | | |
| FOXA2_(SC-6554) | MA0047.2 | Foxa2 | **0.48328** | 0.39496 | 0.45500 | 3 | Y | 0.47906 | 3 | N |
| STAT3 | MA0144.1 | Stat3 | 0.39145 | 0.33052 | 0.38094 | 3 | Y | **0.43807** | 3 | Y |
| POU5F1_(SC-9081) | MA0142.1 | Pou5f1 | 0.42151 | 0.42793 | 0.40855 | 3 | N | **0.45449** | 3 | N |

Subspaces (S)

For the NPV and ODV methods, the best weight and subspace combination was found by 5-fold cross-validation on the JASPAR TFBSs, while flanking genomic sequences of the TFBSs were the sources of negative binding sites. To assess the 4 compared methods, we considered the part of a ROC curve where FPR is at most 0.01 and calculated the AUC scaled to between 0 and 1. This is nearly equivalent to allowing at most 10 false positive hits per promoter on average. As a peak spans about 200 bases, it is considered recalled when it fully contains a predicted binding site. Similarly, a predicted binding site must be fully covered by a peak to be a true positive hit.
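The restricted AUC used here can be sketched as follows. The sketch assumes trapezoidal integration of the ROC curve up to the FPR cut-off, linear interpolation at the cut-off, and scaling by division by the cut-off so that a perfect ranking yields 1; the name `partial_auc` is hypothetical. The 10-hits figure follows from the cut-off: a 1000-base promoter offers roughly 1000 candidate positions, so an FPR of 0.01 corresponds to about 0.01 × 1000 = 10 false positives.

```python
def partial_auc(pos_scores, neg_scores, max_fpr=0.01):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    Requires 0 < max_fpr <= 1 and at least one negative score.
    """
    labeled = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
                     key=lambda item: -item[0])
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    tp = fp = 0
    area = 0.0
    prev_fpr = prev_tpr = 0.0
    i = 0
    while i < len(labeled):
        # Process all entries sharing the same score as one ROC step.
        j = i
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            tp += labeled[j][1]
            fp += 1 - labeled[j][1]
            j += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr >= max_fpr:
            # Interpolate the TPR at the cut-off, add the last trapezoid and stop.
            frac = (max_fpr - prev_fpr) / (fpr - prev_fpr)
            tpr_at_cut = prev_tpr + (tpr - prev_tpr) * frac
            area += (max_fpr - prev_fpr) * (prev_tpr + tpr_at_cut) / 2
            break
        area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return area / max_fpr


print(partial_auc([0.9, 0.8], [0.7, 0.1], max_fpr=0.5))
# 1.0
```

With `max_fpr=1.0` the function reduces to the ordinary AUC, which provides a simple sanity check.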

In Table **β** vector and hence is less sensitive to the weight used to embed an

Conclusions

In this work, we proposed to search for transcription factor binding sites in vector spaces. The novel NPV and ODV methods were introduced to construct a query vector to search for binding sites of a TF. We compared our methods to a state-of-the-art method, the ULPB method, and the widely-used PSSM method. Cross-validation experiments revealed that the NPV and ODV methods significantly outperformed the ULPB and PSSM methods on prokaryotic as well as eukaryotic TF binding sites. Independent validation on human ChIP-seq data further verified that the NPV and ODV methods are significantly better than the other compared methods.

One of the advantages of our framework is that it allows one to easily search for binding sites in various subspaces. Hence, one can search in the best subspace for each individual TF since one can hardly find an optimal subspace for all the TFs. Another advantage is that under the proposed framework one can readily identify motif subtypes for a TF. Hence, to exploit this advantage, we introduced the

Our future work aims to extend the proposed methods to handle known binding sites of variable lengths. We will seek to approach this problem without resorting to multiple sequence alignment, which is notoriously time-consuming. In the meantime, we will also seek to identify additional promising subspaces in which to search for TF binding sites.

Competing interests

Both authors declared that they have no competing interests.

Authors’ contributions

CL and CH conceived the study. CL collected the data, carried out the experiments and drafted the manuscript. CH guided the study and revised the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

This work was supported in part by National Science Foundation [grant numbers CCF-0755373 and OCI-1156837].