Ottawa Institute of Systems Biology and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada

Abstract

Background

Identifying binding sites for transcription factors is a key component of gene regulatory network analysis. This is often done using position-weight matrices (PWMs). Because of the importance of

Results

The present work applies an optimization algorithm to the existing PWM for the GATA-3 transcription factor and builds a new di-nucleotide PWM. The existing PWM is based on experimental data adopted from Jaspar. The optimized PWM substantially improves the sensitivity and specificity of TF mapping compared to conventional applications. The refined PWM also facilitates

Conclusion

Our proposed di-nucleotide PWM approach outperforms the conventional mono-nucleotide PWM approach with respect to GATA-3. Therefore our new di-nucleotide PWM provides new insight into plausible transcriptional regulatory interactions in human promoters.

Background

Understanding the regulation of gene expression is a complex problem and one of the most challenging domains of biological and biomedical research. Intensive ongoing studies aim to understand the detailed mechanisms of transcriptional regulation in eukaryotes. Transcription factors (TFs) are proteins that regulate the activity of a gene at the level of mRNA synthesis. These factors bind to specific DNA sequences at positions in the genome near the gene and either reduce or enhance its transcription rate

A crucial limitation of the PWM approach is the scarcity of high-confidence, experimentally verified binding sites. One way to address this problem is to include additional transcription factor binding sites (TFBS) identified computationally, namely genomic sequences with substantial similarity to the PWM of a particular TF

Several methods to build PWMs have been described. One of the most successful methods was proposed by Staden

The Staden method does not include the definition of the optimal cutoff to minimize a level of false positive predictions for a given level of true positives

A standard PWM approach is based on the assumption that individual nucleotides contribute independently and additively to the binding of a TF to a given DNA motif

We implemented Gershenzon's method

From the binding sites discovered in the present study some were previously confirmed to be the important binding sites

Results

GATA-3 overrepresentation in the promoter area

We started with computational identification of the GATA-3 binding motifs in the promoters, and then we determined the region where the GATA-3 sites are statistically over-represented. Following _{r} see Methods, Formula 6). Since positional distribution of GATA-3 was basically consistent for both databases (Figure

**The following additional data are available with the online version of this paper (all included in one file).** Additional file 1: Figure S1: z-score distribution of GATA-3 across 10 kb upstream of the EPD promoters. Additional file 1: Figure S2: Comparison of the GATA-3 z-score distribution in promoters divided with respect to the presence of Initiator element. Additional file 1: Table S1: The novel putative GATA-3 sites discovered from the EPD with location in the promoter sequences.



**Distribution of the occurrence frequency of GATA-3 in two different human promoter databases.** Occurrence frequency of the GATA-3 motifs in the promoter databases. x-axis represents distance (bp) with regards to TSS and y-axis represents the z-score. Red line denotes the z-score distribution of OF in EPD and blue is for DBTSS. The promoters were aligned with respect to the TSS which is at 0. The horizontal line denotes the OF_{r} in the shuffled promoter database which is derived from EPD promoter database. The inset figure shows the distribution area from -50 bp to +50 bp to clearly reveal both the peaks.

The difference between the occurrence frequency of the GATA-3 motif in the promoter sequences and occurrence frequency in the shuffled promoter database (shown in the plot as the horizontal line) is much higher in the peak areas than in the rest of the promoter interval. The plot prominently identifies the over-represented area in the promoter with z-score of ~3. The GATA-3 sites are functional in either orientation


**Distribution of the occurrence frequency of TCF-1, Ets-1 and E2F in human promoter database (EPD) in the proximal promoter region.** The individual occurrence frequency distributions of the various transcription factors in human promoters from EPD. The y-axis is the OF of the transcription factors TCF-1 (**A**), Ets-1 (**B**) and E2F (**C**) motifs. The x-axis shows the areas upstream and downstream of the TSS which is at 0. The blue horizontal line shows the expected occurrence frequency OF_{r} for each of the factors calculated from the shuffled sequence dataset.


**Differences in the distribution of the occurrence frequency of transcription factors at different thresholds (relaxed and strict).** The comparison of occurrence frequencies of the four transcription factors (GATA-3, Ets-1, TCF-1 and E2F) for different thresholds. The axes are same as at the Figure _{r} from the shuffled dataset.

As seen from the z-score distribution (Figure

Furthermore, to investigate the reason for the slope we also checked the z-score distribution of other transcription factors, namely Sp1 and Pu1 (Figure


**z-score distribution for Pu1 transcription factor across the promoter sequences from EPD (-499 to +100).** Blue plot is the distribution of z-score with relaxed threshold and red is with stringent threshold. X-axis represents distance with regards to TSS(bp) and y-axis represents the z-score.


**Comparison of the GATA-3 z-score distribution in promoters divided with respect to the presence of TATA-box.** Distribution of the GATA-3 motif in partitioned EPD promoters based on the presence and absence of TATA-box element. Magenta and blue line represent the distribution of GATA-3 in promoters containing (406) and lacking (1464) TATA-box respectively and red line represents the distribution in all promoters (1870). X-axis represents distance with regards to TSS (bp) and y-axis represents the z-score.

However, we have also classified the promoters to the CpG-island-containing (CpG+) and non-CpG-island-containing (CpG-) groups with the same method


**Comparison of the GATA-3 z-score distribution in promoters divided with respect to the presence of CpG islands.** Distribution of the GATA-3 and Sp1 motifs across the EPD promoters divided based on the presence and absence of a CpG-island. **A**) Comparison of the distribution of GATA-3 in promoters containing a CpG-island (red plot) and promoters lacking a CpG-island (blue plot). The x-axis represents distance (bp) relative to the TSS and the y-axis represents the z-score. **B**) Comparison of the distribution of hits of GATA-3 and Sp1. GATA-3 hits are shown in red and blue for promoters with and without a CpG-island, respectively; magenta and black show the distribution of Sp1 hits in promoters with and without a CpG-island, respectively. The x-axis represents distance (bp) relative to the TSS and the y-axis represents the hits.

This means that the slope in the plot is caused by the presence of CpG-islands in the promoters. The over-representation of GATA-3 motif over TSS is caused by the genuine GATA-3-like motif rich area whereas the peak further upstream around -31 bp from TSS is caused by the TATA box rich area.

New PWM for the GATA-3

To start the optimization we built a PWM from the binding sites from Jaspar

Both the mono-nucleotide matrix and the di-nucleotide matrix were optimized in the same window. The initial cutoff value determined for the original mono-nucleotide matrix is -1.0, with a sensitivity of 50%. The two new matrices were optimized with two different cutoff values: -2.0 for the mono-nucleotide and -3.5 for the di-nucleotide matrix. We compared the performance of the initial PWM built from the Jaspar binding sites for GATA-3 as described in the Methods section, and of the new mono-nucleotide and di-nucleotide PWMs at different levels of sensitivity, with the performance of the Match program (the TFBS search algorithm from TRANSFAC) for accession id M00077. Receiver-Operator Characteristic (ROC) curves, which plot the true positive rate against the false positive rate (sensitivity vs. 1 - specificity), are usually used to compare different classifiers. We use the occurrence frequency (OF_{r}) of predicted sites picked from the shuffled sequences as a level of false positives (Formula 8).

The figure comparing the PWMs with Match plots OF_{r} versus sensitivity (the percentage of sites selected from the experimentally verified motifs). The Match program uses three matrix-specific cutoffs, which attempt to minimize either the false-negative error (minFN), the false-positive error (minFP), or the sum of these two errors (minSUM). The OF_{r} for Match is much higher than for the other PWMs at a sensitivity of around 60%. Here, the Match program was run with the minimum FN cutoff provided in TRANSFAC. The sensitivity of Match for the other thresholds was very low (~15% and ~30% for minimal FP and SUM, respectively). The original PWM obtained with Bucher’s method also performs better than Match. Comparing the original PWM built by Bucher’s method with the new mono-nucleotide PWM, we see very little difference between the performances of the matrices. But if we compare the performance of the initial PWM with that of the optimized di-nucleotide PWM, we see that the OF_{r} is much lower (i.e. specificity is much higher) for the di-nucleotide matrix at similar sensitivity. This can be observed in the same figure, where the OF_{r} of Match on the y-axis is around 0.007 at a sensitivity of ~60%. Yet the OF_{r} for the other PWMs reaches only around 0.004 even at 80% sensitivity, which is far below the OF_{r} of Match (0.007).


**Comparison of PWM (mono-nucleotide and di-nucleotide) with the program Match.** Comparison of OF_{r} and sensitivity for the different PWMs and Match. The y-axis shows the occurrence frequencies and the x-axis shows the sensitivities. The average OF_{r} of the Match tool is denoted by a filled circle (upper left corner). The curve for the PWM obtained from the GATA-3 binding sites from Jaspar is shown in blue; the red and magenta curves are those of the new mono-nucleotide PWM and di-nucleotide PWM, respectively.

We have also compared the performance of all the matrices considering specificity as proportion of true hits among all positive predictions using ROC curve (Figure


**A Receiver-Operator Characteristic curve (ROC) of the optimized PWM (mono-nucleotide and di-nucleotide) compared with the program Match.** The red and blue lines represent the mono-nucleotide and di-nucleotide PWMs, respectively. The filled and open circles mark the optimized cutoffs for the mono-nucleotide and di-nucleotide PWMs, respectively. The open triangle at the top right marks the Match minFN cutoff, and the filled and open squares mark the performance of Match at the minFP and minSUM cutoffs, respectively.

Comparison of the new and the original PWM

The new mono-nucleotide matrix is similar to the original matrix with some insignificant differences for T in positions 1, 3, 6 (Table

**A.** Original mono-nucleotide nucleotide counts (top panel), frequency table (middle panel) and calculated PWM (bottom panel). The frequency table and the PWM is calculated from 63 motifs adopted from Jaspar. The first row represents the column index of the matrices. The last row represents the consensus of the original PWM.

**A)**

| | **1** | **2** | **3** | **4** | **5** | **6** |
|---|---|---|---|---|---|---|
| **Nucleotide counts** | | | | | | |
| A | 25 | 0 | 61 | 0 | 39 | 15 |
| T | 20 | 0 | 1 | 58 | 19 | 8 |
| G | 4 | 62 | 1 | 5 | 4 | 37 |
| C | 14 | 1 | 0 | 0 | 1 | 3 |
| **Frequencies** | | | | | | |
| A | 0.40 | 0.0 | 0.97 | 0.0 | 0.62 | 0.24 |
| T | 0.32 | 0.0 | 0.02 | 0.92 | 0.30 | 0.17 |
| G | 0.06 | 0.98 | 0.02 | 0.08 | 0.06 | 0.59 |
| C | 0.22 | 0.02 | 0 | 0.0 | 0.02 | 0.05 |
| **PWM** | | | | | | |
| A | 0.00 | -3.83 | 0.00 | -4.12 | 0.00 | -0.55 |
| T | -0.021 | -3.82 | -4.10 | 0.00 | -0.79 | -1.16 |
| G | -2.19 | 0.00 | -4.47 | -2.81 | -2.63 | 0.00 |
| C | -0.92 | -4.11 | -4.51 | -4.47 | -4.10 | -2.50 |
| **Consensus** | a/t | G | A | T | A | g/a |

**B.** Optimized mono-nucleotide frequency table and PWM. The nucleotide counts (top panel) and frequency table (middle panel) calculated from the 68 sites obtained by the optimization process in the functional window -7 to 0 was used to build the optimized PWM. The bottom panel is the calculated PWM. The first row represents the column index of the matrices. The last row represents the consensus of the optimized PWM.

**B)**

| | **1** | **2** | **3** | **4** | **5** | **6** |
|---|---|---|---|---|---|---|
| **Nucleotide counts** | | | | | | |
| A | 27 | 0 | 66 | 0 | 43 | 16 |
| T | 22 | 0 | 1 | 63 | 20 | 8 |
| G | 4 | 67 | 1 | 5 | 4 | 41 |
| C | 15 | 1 | 0 | 0 | 1 | 3 |
| **Frequencies** | | | | | | |
| A | 0.40 | 0.0 | 0.97 | 0.0 | 0.63 | 0.24 |
| T | 0.32 | 0.0 | 0.01 | 0.93 | 0.29 | 0.12 |
| G | 0.06 | 0.99 | 0.01 | 0.07 | 0.07 | 0.62 |
| C | 0.22 | 0.01 | 0.0 | 0.0 | 0.01 | 0.04 |
| **PWM** | | | | | | |
| A | 0.00 | -3.83 | 0.00 | -4.13 | 0.00 | -0.59 |
| T | -0.19 | -3.82 | -3.48 | 0.00 | -0.76 | -1.27 |
| G | -2.26 | 0.00 | -3.84 | -2.9 | -2.73 | 0.00 |
| C | -0.93 | -3.49 | -4.52 | -4.48 | -3.4 | -2.61 |
| **Consensus** | a/t | G | A | T | A | g/a |
As was stated earlier in

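Optimized di-nucleotide frequency table and PWM. The observed frequencies are provided in the first line for each di-nucleotide, with the following line representing expected di-nucleotide frequencies (calculated from the mono-nucleotide frequencies). The presented are the frequencies of the di-nucleotides from the motifs selected from the interval -7 to 0.

| Di-nucleotide | | **1** | **2** | **3** | **4** | **5** |
|---|---|---|---|---|---|---|
| AA | observed | 0 | 0 | 0 | 0 | 14 |
| | expected | 0 | 0 | 0 | 0 | 11.75 |
| AT | observed | 0 | 0 | 72 | 0 | 10 |
| | expected | 0 | 0 | 71.04 | 0 | 5.88 |
| AG | observed | 32 | 0 | 5 | 0 | 20 |
| | expected | 30.91 | 0 | 5.64 | 0 | 30.12 |
| AC | observed | 0 | 0 | 0 | 0 | 4 |
| | expected | 0.46 | 0 | 0 | 0 | 2.2 |
| TA | observed | 0 | 0 | 0 | 45 | 9 |
| | expected | 0 | 0 | 0 | 46.28 | 5.47 |
| TT | observed | 0 | 0 | 1 | 24 | 0 |
| | expected | 0 | 0 | 1.08 | 21.53 | 2.73 |
| TG | observed | 25 | 0 | 0 | 4 | 17 |
| | expected | 25.18 | 0 | 0.09 | 4.31 | 14.01 |
| TC | observed | 0 | 0 | 0 | 1 | 0 |
| | expected | 0.38 | 0 | 0 | 1.08 | 1.03 |
| GA | observed | 0 | 76 | 0 | 3 | 0 |
| | expected | 0 | 75.55 | 0 | 3.67 | 1.09 |
| GT | observed | 0 | 1 | 1 | 2 | 0 |
| | expected | 0 | 1.14 | 1.08 | 1.71 | 0.55 |
| GG | observed | 3 | 1 | 0 | 0 | 3 |
| | expected | 4.58 | 1.14 | 0.09 | 0.34 | 2.8 |
| GC | observed | 1 | 0 | 0 | 0 | 1 |
| | expected | 0.07 | 0 | 0 | 0.09 | 0.21 |
| CA | observed | 0 | 1 | 0 | 0 | 0 |
| | expected | 0 | 1.13 | 0 | 0 | 0.27 |
| CT | observed | 0 | 0 | 0 | 0 | 0 |
| | expected | 0 | 0.02 | 0 | 0 | 0.14 |
| CG | observed | 18 | 0 | 0 | 0 | 1 |
| | expected | 17.17 | 0.02 | 0 | 0 | 0.7 |
| CC | observed | 0 | 0 | 0 | 0 | 0 |
| | expected | 0.26 | 0 | 0 | 0 | 0.05 |

| PWM | **1** | **2** | **3** | **4** | **5** |
|---|---|---|---|---|---|
| AA | -4.62 | -5.66 | -6.16 | -5.81 | -0.16 |
| AT | -4.14 | -5.18 | 0.00 | -5.33 | -0.02 |
| AG | 0.00 | -5.91 | -3.39 | -6.06 | -0.05 |
| AC | -4.41 | -5.45 | -5.94 | -5.60 | -1.20 |
| TA | -4.02 | -5.06 | -5.55 | 0.00 | 0.00 |
| TT | -4.58 | -5.62 | -4.72 | -1.20 | -4.17 |
| TG | -0.09 | -5.75 | -6.24 | -3.11 | -0.06 |
| TC | -4.69 | -5.73 | -6.22 | -4.48 | -4.27 |
| GA | -4.69 | 0.00 | -6.22 | -3.38 | -4.27 |
| GT | -4.44 | -4.08 | -4.57 | -3.54 | -4.02 |
| GG | -2.69 | -4.83 | -6.72 | -6.38 | -2.27 |
| GC | -3.70 | -6.14 | -6.63 | -6.29 | -3.29 |
| CA | -4.70 | -4.34 | -6.24 | -5.89 | -4.28 |
| CT | -4.84 | -5.89 | -6.38 | -6.04 | -4.43 |
| CG | -0.47 | -5.80 | -6.29 | -5.95 | -2.95 |
| CC | -5.17 | -6.22 | -6.71 | -6.37 | -4.76 |
| **Consensus** | AG/TG/CG | GA | AT | TA | TA/AT/AG/TG/AA |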

If we compare the frequencies presented in the Table ^{th}) column. For example, the expected frequency for di-nucleotide AT at position 5 is 5.88 but the observed value (10) is much higher. The new di-nucleotide PWM includes more variations in the 1^{st}, 5^{th} and 6^{th} positions.

GATA-3 is a factor from the family of DNA binding proteins GATA with consensus motif (A/T)GATA(A/G) ^{th} position, which results in the incorporation of nucleotide T in the 5^{th} and 6^{th} positions in the core consensus motif as compared to the mono-nucleotide PWM. However, like mono-nucleotide PWM, the di-nucleotide PWM also assigns higher weight to AA, TG and AG which confirms the presence of nucleotides A or G in the 6^{th} position. Like for the mono-nucleotide PWM, higher weights are assigned to CG, AG and TG, which confirms the variation of C, A and T at the 1^{st} position. The incorporation of C at the 1^{st} position, T at the 5^{th} position and T at the 6^{th} position makes the new consensus [A/C/T]GAT[A/T][A/T/G] which is different from those by the mono-nucleotide PWM ([A/T]GATA[G/A]).

Genome-wide mapping of GATA-3 binding sites

The new PWMs can be used to search for novel putative binding sites

Table

New sites found in human promoter sequences by new PWMs and Match. The sites selected with highest score by the di-nucleotide PWM are shown. The column Motif shows the sites picked up from the human promoters in EPD database and the column Score shows the score given to these sites. The column Match shows the sites found by the Match program with the threshold provided in TRANSFAC as a minimum FN, minimum SUM and minimum FP level in the same database EPD.

| **Motif** | **Score** | **Match: minFN** | **Match: minSum** | **Match: minFP** |
|---|---|---|---|---|
| TGATAG | -0.14 | Found | Not found | Not found |
| AGATTA | -1.19 | Not found | Not found | Not found |
| CGATTA | -1.66 | Not found | Not found | Not found |
| GGATAT | -2.71 | Found | Not found | Not found |
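As an illustration of how the scores of these sites arise, the following Python sketch sums the di-nucleotide weights of a 6-bp site over its five overlapping di-nucleotide positions. The weights are copied from the optimized di-nucleotide PWM table; the names `DI_PWM`, `score_dinucleotide` and `TABLE_SCORES` are hypothetical and introduced only for this example. With these rounded weights the sums agree with the tabulated scores to within 0.01.

```python
# Optimized di-nucleotide PWM weights (rows: di-nucleotide, columns: positions 1-5).
DI_PWM = {
    "AA": [-4.62, -5.66, -6.16, -5.81, -0.16],
    "AT": [-4.14, -5.18, 0.00, -5.33, -0.02],
    "AG": [0.00, -5.91, -3.39, -6.06, -0.05],
    "AC": [-4.41, -5.45, -5.94, -5.60, -1.20],
    "TA": [-4.02, -5.06, -5.55, 0.00, 0.00],
    "TT": [-4.58, -5.62, -4.72, -1.20, -4.17],
    "TG": [-0.09, -5.75, -6.24, -3.11, -0.06],
    "TC": [-4.69, -5.73, -6.22, -4.48, -4.27],
    "GA": [-4.69, 0.00, -6.22, -3.38, -4.27],
    "GT": [-4.44, -4.08, -4.57, -3.54, -4.02],
    "GG": [-2.69, -4.83, -6.72, -6.38, -2.27],
    "GC": [-3.70, -6.14, -6.63, -6.29, -3.29],
    "CA": [-4.70, -4.34, -6.24, -5.89, -4.28],
    "CT": [-4.84, -5.89, -6.38, -6.04, -4.43],
    "CG": [-0.47, -5.80, -6.29, -5.95, -2.95],
    "CC": [-5.17, -6.22, -6.71, -6.37, -4.76],
}

# Scores reported for the novel sites in the table above.
TABLE_SCORES = {"TGATAG": -0.14, "AGATTA": -1.19, "CGATTA": -1.66, "GGATAT": -2.71}


def score_dinucleotide(site):
    """Sum the weights of the overlapping di-nucleotides of a 6-bp site."""
    n_positions = len(site) - 1  # a 6-bp site has 5 di-nucleotide positions
    return sum(DI_PWM[site[i:i + 2]][i] for i in range(n_positions))


for motif, reported in TABLE_SCORES.items():
    print(motif, round(score_dinucleotide(motif), 2), reported)
```

All four sites score above the optimized di-nucleotide cutoff of -3.5 reported earlier.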

Recently GATA-3 bound regions in the human genome in T-47D epithelial cell line derived from a mammary ductal carcinoma were submitted by the ENCODE Project Consortium

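Columns 2 to 5 refer to the GATA-3 bound sequences; columns 6 to 8 refer to the GATA-3 not bound sequences.

| **PWM** | **Bound: total hits** | **Bound: TP sequences** | **Bound: FN** | **Sensitivity** | **Not bound: total hits** | **Not bound: total sequences with hits** | **Not bound: TN** |
|---|---|---|---|---|---|---|---|
| New-mono | 45095 | 21289 | 7040 | 75% | 39276 | 19760 | 8569 |
| New-di | 58409 | 23378 | 4951 | 83% | 52897 | 22450 | 5879 |
| Match_minFN | 102257 | 27158 | 1171 | 96% | 97461 | 27071 | 1258 |
| Match_minFP | 673 | 663 | 27666 | 2.38% | 639 | 630 | 27699 |
| Match_minSUM | 14988 | 11436 | 16893 | 43% | 12330 | 9846 | 18483 |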

Discussion

Since the prevalent positioning of the GATA-3 motif overlaps the TSS, it can be suggested that the GATA-3 motifs (GATA) and TSS-related motif share some bases. To check this possibility we have divided the promoters with respect to the Initiator element Inr (YYANWYY) with Promoter Classifier

The slope in the z-score distribution of the AT-rich factors like GATA-3, Ets-1 and TCF-1 is the manifestation of the under-representation of their binding sites around the functional window. If we plot the z-score distribution of GATA-3 further upstream i.e. up to 10 kb, the slope in the z-score starts approximately from 1 kb upstream of the TSS. The under-representation becomes more prominent closer to the functional window (Additional file

Our new mono-nucleotide and di-nucleotide PWMs were able to identify novel binding sites for the GATA-3 factor (Additional file

Conclusions

The present work provides computationally refined PWMs for GATA-3 transcription factor along the lines established earlier

High-throughput TFBS data are gradually becoming available from ChIP-chip and ChIP-Seq experiments. Yet the ChIP-chip method does not provide data with the high resolution necessary for building a reliable PWM. High-throughput GATA-3 TFBS data have not yet been published in any frequently used database such as TRANSFAC or Jaspar. To work with a PWM for GATA-3 one still has to resort to the data from TRANSFAC and Jaspar, which are widely used as the best available datasets despite known weaknesses. Therefore any scientist looking for a model to predict putative GATA-3 binding sites in sequences of interest is still limited to the available (even though somewhat inadequate) models. This study focuses on improving the existing GATA-3 PWM with the same limited resources. While we may someday be overwhelmed with binding site information for GATA-3 from technologies like ChIP-chip or ChIP-seq, at present our method provides a substantially better alternative to the existing PWMs from TRANSFAC or Jaspar.

Materials and methods

The method of optimization proposed in

Building initial PWM

To build the initial PWM from the training set of experimentally defined binding sites, we used 63 experimentally defined motifs for human GATA-3 from the Jaspar database

where $n_{bi}$ is the number of times base $b$ occurs at the $i^{th}$ position of the motif and

The expected frequencies were derived from the human promoter sequences from Eukaryotic Promoter Database (EPD)

The database contains 1870 non-redundant experimentally verified human promoter sequences. We extracted 600 bp promoter sequences from this database, which comprise up to -499 positions upstream of the transcription start site (TSS) to position +100 downstream with TSS at 0. The promoters are aligned with respect to the TSS. Therefore the value for L in our case is 600, the length of the promoter area.

The positional distribution of the GATA-3 motif derived from the above database is also compared with Database of Transcription Start Sites (DBTSS)

The weight for each position of the matrix is derived using the formula described in

$$W_{bi} = \ln\left(\frac{n_{bi}}{e_{bi}} + s_i\right) + c_i \qquad (2)$$

Here $n_{bi}$ is the number of times base $b$ occurs at the $i^{th}$ position of the motif, $e_{bi}$ is the expected frequency of base $b$ at the $i^{th}$ position, $c_i$ is a constant providing column maximum value to be zero, and $s_i$ is a smoothing parameter preventing the logarithm of zero (or too small a value).

(The parameter $s_i$ in Bucher’s formulae is used as the smoothing percentage.) We adopted the criteria as described in: $s_i = 0$ if the first term under logarithm in Formula 2 is larger than

To calculate weights for the di-nucleotide matrix we used the same Formula 2. In this case $n_{bi}$ is the number of times di-nucleotide $b$ occurs at the $i^{th}$ position, $e_{bi}$ is the expected frequency of the di-nucleotide $b$ at the $i^{th}$ position, $c_i$ and $s_i$ have the same meaning as for the mono-nucleotide PWM, $s_i = 0$ if the first term under logarithm in Formula 2 is larger than

The mono-nucleotide matrix thus built has 4 rows, one per nucleotide, and its columns represent positions inside the motif. The di-nucleotide matrix has 16 rows, one per di-nucleotide. The number of columns equals the motif length for the mono-nucleotide PWM and is one less for the di-nucleotide PWM.
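The construction of a mono-nucleotide matrix can be sketched in Python as follows. This is a minimal illustration of a Bucher-style log-odds weighting, not the exact procedure used here: the counts are those of the 63 Jaspar motifs, but the background frequencies in `EXPECTED` are hypothetical values chosen to mimic GC-rich promoters, and a fixed smoothing term is always added (the rule that sets the smoothing parameter to zero is not reproduced). The function names are likewise assumptions.

```python
import math

BASES = "ATGC"

# Nucleotide counts per position for the 63 Jaspar GATA-3 motifs.
COUNTS = {
    "A": [25, 0, 61, 0, 39, 15],
    "T": [20, 0, 1, 58, 19, 8],
    "G": [4, 62, 1, 5, 4, 37],
    "C": [14, 1, 0, 0, 1, 3],
}

# ASSUMPTION: illustrative background frequencies, not the values derived from EPD.
EXPECTED = {"A": 0.21, "T": 0.21, "G": 0.29, "C": 0.29}


def build_pwm(counts, expected, smoothing=0.01):
    """Return log-odds weights, shifted so that each column maximum is zero."""
    n_sites = sum(counts[b][0] for b in BASES)
    length = len(counts["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        # Observed frequency over expected frequency, plus smoothing to avoid log(0).
        raw = {
            b: math.log(counts[b][i] / n_sites / expected[b] + smoothing)
            for b in BASES
        }
        col_max = max(raw.values())  # subtracting it plays the role of the column constant
        for b in BASES:
            pwm[b].append(raw[b] - col_max)
    return pwm


def consensus(pwm):
    """Pick the highest-weight base at each position."""
    length = len(pwm["A"])
    return "".join(max(BASES, key=lambda b: pwm[b][i]) for i in range(length))


pwm = build_pwm(COUNTS, EXPECTED)
print(consensus(pwm))
```

Because the column maximum is subtracted, the best possible site scores zero and every other site receives a negative score, which is why the cutoffs quoted in this paper are negative numbers.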

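To calculate the weight score of a candidate site, the weights of its nucleotides are summed over the positions of the matrix:

$$S = \sum_{i=1}^{L_m} W_{bi}$$

where $L_m$ is the length of PWM and $W_{bi}$ is the weight of nucleotide $b$ at position $i$. For the di-nucleotide PWM the sum runs to $L_m - 1$ instead of $L_m$ and $W_{bi}$ represents weight of di-nucleotide.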
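The scoring step can be illustrated with a short Python sketch that scores every 6-bp window of a sequence and reports the windows at or above a cutoff. The weights are those of the optimized mono-nucleotide PWM and the cutoff of -2.0 is the optimized mono-nucleotide cutoff reported in the Results; the function names, the forward-strand-only scan and the example sequence are assumptions made for illustration.

```python
# Optimized mono-nucleotide PWM weights (columns: positions 1-6).
MONO_PWM = {
    "A": [0.00, -3.83, 0.00, -4.13, 0.00, -0.59],
    "T": [-0.19, -3.82, -3.48, 0.00, -0.76, -1.27],
    "G": [-2.26, 0.00, -3.84, -2.9, -2.73, 0.00],
    "C": [-0.93, -3.49, -4.52, -4.48, -3.4, -2.61],
}
MOTIF_LENGTH = 6


def score_site(site):
    """Sum the position-specific weights of the nucleotides in a site."""
    return sum(MONO_PWM[base][i] for i, base in enumerate(site))


def scan(sequence, cutoff=-2.0):
    """Return (start, site, score) for every window scoring at or above the cutoff.

    Only the forward strand is scanned in this sketch.
    """
    hits = []
    for start in range(len(sequence) - MOTIF_LENGTH + 1):
        site = sequence[start:start + MOTIF_LENGTH]
        score = score_site(site)
        if score >= cutoff:
            hits.append((start, site, score))
    return hits


hits = scan("CCAGATAGCC")  # hypothetical test sequence
print(hits)
```

A full scan would also score the reverse complement of each window, since the sites are reported to be functional in either orientation.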

Finding the functional window and optimization of the matrix

To obtain the positional distribution of the GATA-3 motif we compare the observed occurrence frequency of the GATA-3 motif with its background or expected frequency along the promoter sequences. The background frequency is determined by shuffling each sequence from the promoter database which results in a randomized DNA sequence with the same nucleotide content.

Each sequence was shuffled by cutting it at randomly chosen positions into smaller fragments of random length and rearranging these fragments. Fragment lengths ranged from 1 bp to 10 bp. This step was repeated 100 times and the whole process was repeated 100 times. The EPD database was used for the shuffling; therefore the shuffled sequence database contained the same number of sequences, each of the same length. Since we considered only the promoter region for shuffling, the shuffled sequences preserve the nucleotide proportions of the promoter datasets. The proportions are preserved in order to retain the GC-rich property of the promoters in the shuffled sequences. The GC proportion was checked with the program “geecee” from the EMBOSS suite
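A minimal Python sketch of this fragment-based shuffling is given below. It is an illustration under stated assumptions rather than the code used in this study: the function names, the fixed random seed and the way fragment boundaries are drawn (consecutive fragments of random length between 1 and 10 bp) are choices made for the example.

```python
import random
from collections import Counter


def shuffle_by_fragments(sequence, rng, min_len=1, max_len=10):
    """Cut a sequence into consecutive random-length fragments and rearrange them."""
    fragments = []
    pos = 0
    while pos < len(sequence):
        size = rng.randint(min_len, max_len)  # inclusive on both ends
        fragments.append(sequence[pos:pos + size])
        pos += size
    rng.shuffle(fragments)
    return "".join(fragments)


def shuffle_sequence(sequence, rounds=100, seed=0):
    """Apply the fragment shuffle repeatedly; nucleotide composition is unchanged."""
    rng = random.Random(seed)
    for _ in range(rounds):
        sequence = shuffle_by_fragments(sequence, rng)
    return sequence


promoter = "GCGCGGATAAGCTTACGCGGCCATATAGGC" * 20  # hypothetical 600-bp sequence
shuffled = shuffle_sequence(promoter)
print(Counter(promoter) == Counter(shuffled))
```

Because fragments are only rearranged and never altered, the length and the GC proportion of each sequence are preserved by construction.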

To identify the area where the GATA-3 binding site motif is over-represented along the aligned promoter sequences, we looked into the distribution of the z-score derived as

where

The occurrence frequencies were calculated as $OF_i = n_i / N_s$, where $n_i$ is the number of promoters containing considered motif starting at position $i$ and $N_s$ is the number of sequences.

The area where the occurrence of the GATA-3 motif is statistically higher than expected, which is represented by z-scores ~3 or higher, is regarded as the initial “functional window” (Figure

We assume that statistically significant occurrence of the sites in the “functional window” reflects importance of this window in biological function. The functional window thus obtained is the initial approximate interval from where the new sites can be incorporated to build a new PWM. The final matrix after optimization would define the exact functional window.
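The positional analysis can be sketched in Python as follows. The occurrence frequency follows the definition given above (promoters with a motif starting at a position, divided by the number of sequences). The z-score, however, is computed here under an assumption: the standard deviation is taken as the binomial standard deviation of the background frequency, which may differ from the definition used in this study. The function names and the toy data are also hypothetical.

```python
import math


def occurrence_frequency(hits_per_promoter, length):
    """Fraction of promoters with a motif starting at each aligned position."""
    n_seq = len(hits_per_promoter)
    counts = [0] * length
    for hits in hits_per_promoter:
        for position in hits:
            counts[position] += 1
    return [c / n_seq for c in counts]


def z_scores(of, of_r, n_seq):
    """ASSUMPTION: binomial standard deviation of the background frequency."""
    sd = math.sqrt(of_r * (1.0 - of_r) / n_seq)
    return [(value - of_r) / sd for value in of]


def functional_window(z, threshold=3.0):
    """Simplified window: first and last position with z-score at or above threshold."""
    above = [i for i, value in enumerate(z) if value >= threshold]
    return (above[0], above[-1]) if above else None


# Toy data: 4 aligned promoters of length 5, sets of motif start positions.
toy_hits = [{1}, {1}, {1, 3}, {1}]
of = occurrence_frequency(toy_hits, length=5)
z = z_scores(of, of_r=0.1, n_seq=len(toy_hits))
print(of, functional_window(z))
```

In the toy data only position 1 exceeds the threshold, so the window collapses to a single position; on real promoter alignments the window would span a contiguous over-represented interval.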

Calculation of a new GATA-3 PWM from the existing PWM

PWMs are routinely used for prediction of the binding affinities for TFs to a segment of DNA sequence in prokaryotes and eukaryotes

We start with the initial PWM built from the 63 experimentally defined GATA-3 motifs adopted from Jaspar, as described in the previous section. As a control set of sites, we use the 26 unique motifs from the 63 experimentally defined GATA-3 binding sites in human genes from Jaspar. Removing redundancy from the experimental motifs was important to avoid bias toward any motif. The method also utilizes a database from which new putative binding sites of interest are incorporated to build the new PWM. We have used the EPD for this purpose.

Since the initial PWM is not provided with a given cutoff, we determine the cutoff value from the correlation coefficient (CC) distribution. The CC is calculated as:

$$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (7)$$

CC is calculated for each cutoff starting from a very stringent threshold and relaxing the threshold until we get the maximal CC. To calculate CC here, we designate TP as the number of sites from the experimentally defined dataset positively identified by the matrix with a given cutoff. FN is defined as the difference between the total number of sites in the experimental dataset and TP. We designate negative sites as all possible sites from the shuffled sequence datasets, which can be calculated as 594 × 1870 where 594 is length of the shuffled promoter sequence (600) minus the length of the matrix and 1870 is the number of sequences in the shuffled dataset. We define FP as the sites picked up as positive from the total negative sites and TN as the difference between the total negative sites and FP.
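The calculation can be sketched in Python, assuming the correlation coefficient takes the standard Matthews form built from TP, FP, TN and FN. The number of negative sites follows the text (594 × 1870); the TP and FP values in the example are hypothetical and serve only to show the arithmetic.

```python
import math


def correlation_coefficient(tp, fp, tn, fn):
    """Matthews-type correlation coefficient; returns 0.0 if undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Negative sites: all possible start positions in the shuffled promoter set.
N_NEGATIVE = 594 * 1870   # 1,110,780 candidate sites
N_CONTROL = 26            # unique experimentally defined motifs

# Hypothetical outcome at one cutoff: 20 control sites recovered, 5000 shuffled hits.
tp, fp = 20, 5000
fn = N_CONTROL - tp
tn = N_NEGATIVE - fp
print(round(correlation_coefficient(tp, fp, tn, fn), 4))
```

Because the negative set is several orders of magnitude larger than the control set, even a small false positive rate produces many false positives and keeps the coefficient low, which is what makes it a strict criterion for choosing the cutoff.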

The method starts with extracting putative binding sites for GATA-3 based on existing PWM with the cutoff determined at the above step. The PWM extracts putative binding sites from inside the identified initial functional window. The functional window was defined comparing the occurrence frequency distribution of the GATA-3 binding sites against the shuffled sequences. A new matrix is built from these aligned sites using the formulae described in

Sensitivity was calculated as:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

The optimization is done in three levels, as follows: cutoff value, then motif length and finally functional window.

The objective function that is optimized in this method is the correlation coefficient (Formula 7). This criterion utilizes all the four parameters: true positives, false positives, true negatives and false negatives.

The definitions of the TP, FP, TN and FN for the optimization procedure are slightly different from the previously described.

The TP here is defined as the number of sites positively identified by the new matrix in the given functional window that are identical to the sites extracted by the original matrix. FP is defined as the difference between the total number of sites identified as positive by the new matrix and the number of sites identified as positive by the original matrix. FN is defined as the difference between the total number of sites from the experimental dataset recognized by the original matrix as positive and the TP. TN is the total number of possible sites in the functional window minus TP, FP and FN:

$$TN = N_{total} - (TP + FP + FN)$$

where $N_{total}$ denotes the total number of possible sites in the functional window.

CC is calculated every time after building a new matrix by changing any parameters. (See the flow chart of the adopted optimization process in

First the optimal cutoff value is obtained for the given position and size of the functional window and for the given motif length. This is attained by calculating CC parameter for every changed cutoff value. The cutoff value varies around the initial given cutoff value. The range we have used is from -0.5 to -4.0 with the increment of 0.1. The cutoff value is considered to be optimal where the CC reaches the maximum. Next, the length of the matrix is varied while the optimal cutoff value is kept. If the CC reaches higher value than at the previous step, the modified length is considered as optimal at the current stage. Thus we obtain a modified matrix with optimal modified length and cutoff values. This modified matrix is regarded as the initial matrix for further process of optimization. The optimization cycle continues with this new initial matrix and all the aforementioned steps are repeated. This continues until we reach the maximal CC = 1. Usually it takes 6 to 12 cycles for the matrix to converge, which is consistent with the previous work
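The first optimization level, the cutoff scan, can be sketched in Python as follows. The grid from -0.5 to -4.0 in steps of 0.1 follows the text; everything else is a simplification made for illustration: sites are represented only by their scores, positives and negatives are given as two plain lists, the correlation coefficient is assumed to have the standard Matthews form, and the score values are hypothetical.

```python
import math


def correlation_coefficient(tp, fp, tn, fn):
    """Matthews-type correlation coefficient; returns 0.0 if undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def cutoff_grid(start=-0.5, stop=-4.0, step=0.1):
    """Cutoffs from start down to stop, rounded to avoid floating point drift."""
    n_steps = round((start - stop) / step)
    return [round(start - k * step, 1) for k in range(n_steps + 1)]


def best_cutoff(positive_scores, negative_scores, cutoffs):
    """Return (cutoff, CC) for the first cutoff that maximizes CC."""
    best = None
    for cutoff in cutoffs:
        tp = sum(score >= cutoff for score in positive_scores)
        fn = len(positive_scores) - tp
        fp = sum(score >= cutoff for score in negative_scores)
        tn = len(negative_scores) - fp
        cc = correlation_coefficient(tp, fp, tn, fn)
        if best is None or cc > best[1]:
            best = (cutoff, cc)
    return best


# Hypothetical scores of known sites and of shuffled-sequence sites.
positives = [-0.1, -0.3, -1.0, -1.4]
negatives = [-1.8, -2.5, -3.0, -3.9, -5.0, -6.0]
grid = cutoff_grid()
print(best_cutoff(positives, negatives, grid))
```

In the full procedure this scan would be repeated after each change of motif length and functional window, with the matrix rebuilt from the newly selected sites at every cycle.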

The parameters TP, TN, FP and FN described earlier are internal parameters of the procedure, and they are not used to evaluate the sensitivity and specificity of the final PWM. The sensitivity of the optimized PWM is calculated as the number of experimentally confirmed sites recognized by the new matrix. And to calculate the specificity we use the occurrence frequencies of predicted TFBS in the randomized sequences. We assume that the sites recognized as positive from the randomized sequences are the false positives. We calculate occurrence frequency as the average number of positive predictions per bp in the random shuffled dataset:

where OF_{r} designates the occurrence frequency calculated from the shuffled sequence dataset. Therefore, the higher the occurrence frequency in the shuffled sequences, the lower the specificity. We then choose the matrix resulting from the optimization process that has sensitivity and specificity higher than the initial PWM.
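The two evaluation measures can be illustrated with a short Python sketch. The sensitivity example uses the TP and FN counts reported for the new PWMs on the GATA-3 bound sequences. For OF_{r}, the sketch assumes that "per bp" means dividing by the total length of the shuffled dataset (1870 sequences of 600 bp); the function names and the number of predictions are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of known sites (or bound sequences) that are recovered."""
    return tp / (tp + fn)


def occurrence_frequency_random(n_predictions, n_sequences, sequence_length):
    """Average number of positive predictions per bp of shuffled sequence.

    ASSUMPTION: the denominator is the total number of base pairs.
    """
    return n_predictions / (n_sequences * sequence_length)


# Counts reported for the GATA-3 bound sequences.
mono = sensitivity(tp=21289, fn=7040)
di = sensitivity(tp=23378, fn=4951)
print(f"{mono:.0%} {di:.0%}")

# Hypothetical example: 4488 predictions in 1870 shuffled promoters of 600 bp.
of_r = occurrence_frequency_random(4488, n_sequences=1870, sequence_length=600)
print(of_r)
```

A lower OF_{r} at the same sensitivity indicates fewer predictions in randomized sequence and therefore higher specificity.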

Abbreviations

PWM: Position weight matrix; TFBS: Transcription factor binding site; CC: Correlation coefficient; TP: True positive; TN: True negative; FP: False positive; FN: False negative; EPD: Eukaryotic Promoter Database; DBTSS: Database of Transcription Start Sites; ROC: Receiver-Operator Characteristic; OF: Occurrence frequency; OF_{r}: Occurrence frequency from randomized sequences; ENCODE: Encyclopedia of DNA Elements; TCR: T cell receptor; Th: T helper cell; GEO: Gene Expression Omnibus.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SN. performed calculations, analyzed the results and wrote the manuscript. II. conceived the study and directed the work, including data analysis, figure assembly and manuscript writing. Both authors read and approved the final manuscript.

Acknowledgements

The authors thank Alex Blais, Mads Kaern and David Bickel for critical editing of the manuscript and Ashkan Golshani for useful discussions and experimental support. The authors also thank Vyacheslav Morozov, Sergey Hosid and Doo Yang for their essential remarks concerning the project. The research presented here was supported by the Natural Sciences and Engineering Research Council of Canada [grant number RGPIN/ 372240-2009] and the Canada Foundation for Innovation Leaders Opportunity Fund / Ontario Research Fund [grant number 22880].