Open Access Research article

Towards biological characters of interactions between transcription factors and their DNA targets in mammals

Guangyong Zheng1,2*, Qi Liu2,3, Guohui Ding1,2, Chaochun Wei2,3* and Yixue Li1,2*

Author Affiliations

1 Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China

2 Shanghai Center for Bioinformation Technology, 100 Qinzhou Road, Shanghai, 200235, China

3 School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China

BMC Genomics 2012, 13:388  doi:10.1186/1471-2164-13-388

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/13/388


Received: 29 February 2012
Accepted: 29 June 2012
Published: 13 August 2012

© 2012 Zheng et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

In the post-genomic era, the study of transcriptional regulation is pivotal to decoding genetic information. Transcription factors (TFs) are central proteins for transcriptional regulation, and interactions between TFs and their DNA targets (TFBSs) are important for the expression of downstream genes. However, limited knowledge about interactions between TFs and TFBSs still hampers investigation of the mechanism of transcription.

Results

To expand the knowledge about interactions between TFs and TFBSs, three biological features (sequence feature, structure feature, and evolution feature) were utilized to build TFBS identification models for studying binding preference between TFs and their DNA targets in mammals. Results show that each feature performs fairly well in capturing TFBSs, and that the hybrid model combining all three features is more robust for TFBS identification. Subsequently, correspondence between TFs and their TFBSs was investigated to explore interactions between them in mammals. Results indicate that TFs and TFBSs are reciprocal at the sequence, structure, and evolution levels.

Conclusions

Our work demonstrates that, to some extent, TFs and TFBSs have developed a coevolutionary relationship in order to keep their physical binding and maintain their regulatory functions. In summary, our work will help to understand transcriptional regulation and to interpret the binding mechanism between proteins and DNA.

Background

Transcription factors (TFs) are important functional proteins that play central roles in transcriptional regulation by interacting with specific DNA targets. These targets are called transcription factor binding sites (TFBSs), which are short DNA fragments mainly located in promoter regions of genes. Generally, TFs can be grouped into four classes according to their structures and functions: (1) TFs with basic domains (basic-TFs), (2) TFs with zinc-coordinating DNA binding domains (zinc-TFs), (3) TFs with helix-turn-helix patterns (helix-TFs), and (4) beta-scaffold factors with minor groove contacts (beta-TFs) [1,2].

Interactions between TFs and their targets are significantly correlated with gene expression, so comprehensively investigating those interactions is crucial to understanding transcriptional regulation. For this purpose, one of the primary steps is to represent TFBSs with appropriate features. Generally, three features are utilized to describe biological characters of TFs’ DNA targets. (1) Sequence feature, which is the sequence similarity of DNA segments to a position weight matrix (PWM). A PWM is a mathematical model that reflects nucleotide occurrence probability at each position [3,4]. When a DNA segment receives a high score against a valid PWM, it is considered a positive instance. TFBS prediction methods based on PWMs have been successfully applied to some TF data sets [3-6]. However, these methods require prior PWM models, which are not available for many TFs. Besides, PWM-based methods may generate too many false positive predictions when they are executed on a genome-wide scale [7,8]. (2) Structure feature, which is conformational and physicochemical information of a DNA segment. Since transcription factors interact physically with their DNA targets, it is reasonable to depict binding preference between TFs and TFBSs through conformational and physicochemical information. For example, Ponomarenko and his colleagues [9,10] employed the conformational and physicochemical values of DNA segments to predict TFBSs. (3) Evolution feature, which is a conservation score of a DNA segment. Because transcription factor binding sites are functional elements, it is commonly believed that they are conserved in evolution. In fact, some algorithms for TFBS identification have been proposed based on the assumption that TFBSs are more conserved than their surrounding non-functional fragments in order to maintain their functions [11-14].

Pioneering works based on the three features provide promising results and broaden our knowledge of interactions between TFs and TFBSs. Nevertheless, some aspects of these interactions are still unclear. (1) Which feature has the greatest power for describing binding preference between TFs and TFBSs? That is to say, among the models using these three features, which one has the best performance for recognition of TFBSs? In addition, are the features complementary? If so, a hybrid model combining the three features should represent binding preference between TFs and TFBSs more comprehensively. (2) In terms of relationships between TFs and TFBSs, is there any correspondence at the sequence, structure, and evolution levels? Since each of the sequence, structure, and evolution features can denote TFBSs effectively, we can investigate the correlation between TFs and TFBSs with respect to each of them. To be more specific, if the sequences of two TFs are similar, will their TFBSs’ sequences be similar as well? If two TFs can be categorized into a group based on their structure information, will their corresponding TFBSs also be categorized into a group? If a TF is conserved in evolution, will its TFBSs be conserved as well? Answers to these questions may help people understand interactions between TFs and TFBSs and reveal their correlations in evolution.

In this paper, experimentally verified TFs and their corresponding TFBSs were first collected for three mammals (Homo sapiens, Mus musculus, and Rattus norvegicus), and then a TFBS recognition model was constructed based on each feature mentioned above, giving three models in total. The accuracy of each model was used to measure its capability to describe binding preference between TFs and TFBSs. In addition, a hybrid model integrating all three features was built to evaluate the complementarities of those features. After that, the correspondence between TFs and TFBSs was surveyed at the sequence, structure, and evolution levels respectively. Our results may offer new clues for TFBS identification. Moreover, the correspondence between TFs and TFBSs that we obtained adds to the knowledge of interactions between proteins and DNA. Thus, our investigation will shed light on transcriptional regulation in mammals.

Methods

Dataset of transcription factors and their DNA targets

Experimentally verified TFs and their corresponding TFBSs were collected from the TRANSFAC database (v 9.4) [1,2] for three mammals (human, mouse, and rat). A TF was selected when it had more than 10 verified DNA targets. As a result, 326 groups of TFs and their DNA targets (TF-TFBSs) were generated. Of the 326 groups, 309 contained PWM patterns and the remaining 17 had no PWM information [see Additional file 1]. The 309 groups with PWM patterns were named dataset 1, while the remaining 17 groups were termed dataset 2. Moreover, according to the description of the TRANSFAC database, TFs in dataset 2 had less conserved binding sites, since their TFBSs could not be aligned to generate a PWM. Based on our previous work [15,16], 270 of the 326 TFs, those with amino acid sequences, were classified into four classes according to their structures and domains [see Additional file 2 and Additional file 3]. Detailed information on the TF-TFBS datasets is summarized in Table 1. Given a TF, verified DNA targets were used as positive instances. Meanwhile, promoter sequences of the three mammals were obtained from the Eukaryotic Promoter Database (EPD) [17,18] to construct negative instances. First, those promoter sequences were utilized as training data to generate a 3rd-order hidden Markov model. Then the model was employed to produce 5 kb-long pseudo DNA sequences, which had the same nucleotide distribution as those promoter sequences. Subsequently, a window (with the average length of positive instances for a TF) was employed to scan and cut those pseudo sequences to build a negative instance pool. Finally, for each TF, 10 DNA sequence sets were constructed by mixing equal numbers of positive and negative instances. In practice, for each DNA sequence set, the negative instances were randomly selected from the pool.
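The negative-instance procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: it swaps in a plain 3rd-order Markov chain (each base conditioned on the three preceding bases) for the model described in the text, and the function names train_markov, generate, and windows are hypothetical.

```python
import random
from collections import defaultdict

def train_markov(seqs, order=3):
    """Count transitions: context of `order` bases -> next base."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts

def generate(counts, length, order=3, rng=random):
    """Sample a pseudo sequence of the requested length from the chain."""
    seq = list(rng.choice(list(counts)))  # start from an observed context
    while len(seq) < length:
        nxt = counts.get("".join(seq[-order:]))
        if not nxt:
            # Unseen context: fall back to a uniformly drawn base (assumption).
            seq.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*nxt.items())
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq)[:length]

def windows(seq, width):
    """Cut every window of the given width out of a pseudo sequence."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]
```

In this sketch the pool for one TF would be built by calling windows with the average length of that TF's positive instances, after which equal numbers of negatives are drawn at random for each of the 10 sets.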

Additional file 1. Information of 326 transcription factors (XLS, 36 KB).

Additional file 2. Sequences of 270 transcription factors (PDF, 208 KB).

Additional file 3. Classification of 270 transcription factors (with sequences) (XLS, 34 KB).

Table 1. Detailed information of transcription factors and their DNA targets

Sequence feature of a DNA segment

For a DNA segment, its sequence feature was calculated through Equation 1 modified from some previous studies [3,5,6]. The sequence feature presented a score for assessing the similarity of a short DNA fragment to a known PWM pattern.

Equation 1 (MathML): http://www.biomedcentral.com/1471-2164/13/388/mathml/M1

(1)

where n is the length of the DNA segment, j denotes a position in the DNA segment or the PWM, i_j denotes the base (A, T, C, G) at position j, W_j(i_j) is the weight of position j for the DNA segment, C_j is the information content of position j for the DNA segment, f_ij is the frequency of base i occurring at position j in the PWM pattern, and P_i is the observation probability of base i in background sequences. When an instance was evaluated, scores of the Watson and Crick strands were calculated separately, and the higher one was assigned to the instance.
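The two-strand scoring step can be illustrated with a short Python sketch. It uses a generic log-odds PWM score, log2(f_ij / P_i) summed over positions, as a stand-in; this is an assumption and not the paper's Equation 1, which additionally weights positions by information content. The function names are hypothetical, and the PWM is assumed to contain no zero frequencies (for example after adding pseudocounts).

```python
import math

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the opposite strand read 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def log_odds_score(seq, pwm, background):
    """Sum log2(f_ij / P_i) over positions; pwm[j][base] is a frequency."""
    return sum(math.log2(pwm[j][b] / background[b]) for j, b in enumerate(seq))

def best_strand_score(seq, pwm, background):
    """Score both strands and keep the higher value, as described in the text."""
    return max(
        log_odds_score(seq, pwm, background),
        log_odds_score(reverse_complement(seq), pwm, background),
    )
```

Because the higher of the two strand scores is kept, a segment and its reverse complement always receive the same score.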

Structure feature of a DNA segment

For a DNA segment, its structure feature was calculated through an empirical formula (Equation 2) proposed by Ponomarenko and his colleagues [9,10].

Equation 2 (MathML): http://www.biomedcentral.com/1471-2164/13/388/mathml/M2

(2)

where n is the length of the DNA segment, j denotes a position of the segment, and x(b_j b_{j+1}) are empirical values of the 16 dinucleotide combinations at positions j and j + 1 for transcription factor binding sites. For each conformational and physicochemical attribute, its x(b_j b_{j+1}) values are listed in Additional file 4. Based on Equation 2, a structure feature vector was built for each DNA segment from 38 conformational and physicochemical attributes. Detailed information on these 38 attributes is provided in Additional file 4.
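A minimal Python sketch of a dinucleotide-based structure feature follows. It assumes the feature is the mean of the empirical values x(b_j b_{j+1}) over the n - 1 dinucleotide steps of the segment; that averaging form is an assumption about Equation 2, and the property tables passed in are placeholders rather than values from Additional file 4.

```python
def structure_feature(seq, table):
    """Mean property value over consecutive dinucleotide steps (assumed form)."""
    steps = [seq[j:j + 2] for j in range(len(seq) - 1)]
    return sum(table[s] for s in steps) / len(steps)

def structure_vector(seq, tables):
    """One value per attribute table; 38 tables would give a 38-dimension vector."""
    return [structure_feature(seq, t) for t in tables]
```

Each table maps the 16 dinucleotides to one attribute's empirical values, so the vector length equals the number of tables supplied.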

Additional file 4. Information of 38 conformational and physicochemical attributes (PDF, 27 KB).

Evolution feature of a DNA segment

In 2005, Xie and his colleagues [19] presented 174 conserved regulatory motifs [see Additional file 5] through alignment of several mammalian genomes. In our work, the evolution feature of a DNA segment was generated by comparing the segment to those motifs. In practice, the conservation score of a motif was assigned to a DNA segment when the segment was similar to the motif (with a similarity threshold of 0.95). If a DNA segment was similar to several motifs, the maximal conservation score of those motifs was assigned to the segment. If a segment was not similar to any motif, 0 was assigned to the segment (Equation 3).

Equation 3 (MathML): http://www.biomedcentral.com/1471-2164/13/388/mathml/M3

(3)
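The assignment rule for the evolution feature can be sketched in Python as follows. The similarity measure used here (fraction of identical positions between equal-length strings) and the use of "greater than or equal to" at the 0.95 threshold are assumptions; the text does not specify either, and the function names are hypothetical.

```python
def similarity(segment, motif):
    """Hypothetical similarity: fraction of identical positions (equal lengths only)."""
    if len(segment) != len(motif):
        return 0.0
    return sum(a == b for a, b in zip(segment, motif)) / len(motif)

def evolution_feature(segment, motifs, threshold=0.95):
    """Maximal conservation score among sufficiently similar motifs, else 0."""
    scores = [
        score for motif, score in motifs.items()
        if similarity(segment, motif) >= threshold
    ]
    return max(scores, default=0.0)
```

Here motifs is a dictionary from motif sequence to conservation score, which stands in for the 174 motifs of Additional file 5.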

Additional file 5. Information of 174 conservative motifs (XLS, 23 KB).

Construction of the sequence model, the structure model, the evolution model, and the control model

Given a TF and its 10 DNA target sets (each set included positive and negative instances), first, three scores were calculated for each instance according to the three features. Then three TFBS identification models (named the sequence model, the structure model, and the evolution model) were constructed respectively based on these three features. In practice, the C4.5 algorithm [20,21] was utilized to build those TFBS identification models, in which the positive and negative instances with feature information were used as the input and a decision tree model was generated as the output. At the same time, the Match 2.0 method [22,23] was utilized as the control model, since it was adopted by the TRANSFAC database to measure the similarity of DNA segments to a PWM pattern.
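For a single numeric feature, a decision tree reduces to choosing score thresholds. The sketch below illustrates a simplified version of the gain ratio criterion associated with C4.5, applied to one binary split; it is not the full algorithm (no pruning, no recursion), and the function names are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n)
        for c in set(labels)
    )

def gain_ratio(values, labels, threshold):
    """Information gain of the split value <= threshold, divided by split info."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    gain = (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in (left, right))
    return gain / split_info

def best_threshold(values, labels):
    """Candidate thresholds are the observed values except the largest."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t), default=None)
```

With feature scores as values and 1/0 labels for positive and negative instances, best_threshold returns the cut that separates the two classes most cleanly under this criterion.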

Construction of the hybrid model

After using the sequence, structure, and evolution features separately to establish TFBS identification models, an integrated strategy was employed to inspect the complementarities of the three features. First, scores were calculated for each feature. As a result, each instance in a DNA target set was represented by 40 attributes, of which 2 attributes depicted the sequence and evolution features respectively, and the other 38 attributes stood for the structure feature. In practice, we first combined the 40 attributes of the sequence, structure, and evolution features, and then delivered positive and negative instances with 40 attributes to the C4.5 algorithm [20,21]. During this step, attribute selection was carried out to remove redundant attributes using a correlation-based filter method with default parameters [24]. Finally, a decision tree model containing the three features was constructed.

Evaluation of different models

Given a TF, 5 models (the control model, the sequence model, the structure model, the evolution model, and the hybrid model) were built for each DNA instance set of this TF separately. In practice, a 10-fold cross validation test was used to assess the performance of each model. The test was operated as follows: (1) split an instance set into 10 fractions; (2) selected one fraction as the test set and used the remaining 9 fractions as the training set; (3) computed the following four statistical measurements for the subsequent analysis: (a) the true positives (TP), (b) the false positives (FP), (c) the true negatives (TN), and (d) the false negatives (FN). True positives and true negatives were the correct recognition of TFBS and non-TFBS items respectively. A false positive occurred when a non-TFBS item was predicted as a TFBS. Similarly, a false negative occurred when a TFBS item was predicted as a non-TFBS; (4) calculated the sensitivity, specificity, and accuracy through Equation 4; (5) repeated steps (2), (3), and (4) until each fraction had been chosen as the test set in turn.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

(4)
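The evaluation loop can be sketched in Python under the standard definitions of sensitivity, specificity, and accuracy in terms of TP, FP, TN, and FN. The fold assignment shown (every k-th instance goes to the same fold) is one simple choice and an assumption; the text does not say how fractions were formed. Labels are 1 for TFBS and 0 for non-TFBS.

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, 1 = TFBS and 0 = non-TFBS."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

def k_folds(n, k=10):
    """Yield (train_indices, test_indices); each index is tested exactly once."""
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        yield train, test
```

A model would be trained on the train indices of each fold, applied to the test indices, and the resulting counts passed to metrics.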

After that, in order to further evaluate the performance of the models, receiver operating characteristic curves were constructed for the 5 different models, and the area under the curve (AUC) was used as a statistical measure to assess the power of each model to distinguish TFBSs.
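The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one (ties counted as one half). The Python sketch below uses this rank-based identity; how the authors obtained their AUC values is not stated, so this is only an illustration with a hypothetical function name.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of TFBSs from non-TFBSs.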

Results

Performances of different models

10-fold cross validation tests were executed for each TF-TFBS model in dataset 1 (with PWM) and dataset 2 (without PWM). Detailed results of the 10-fold cross validation tests are included in Additional file 6. Since the control model and the sequence model required PWM information, the performance of these two models on dataset 2 is not presented. Detailed results of the AUC measurement are listed in Additional file 7. Figure 1 shows the sensitivity, specificity, accuracy, and AUC distributions of the different models in dataset 1, while Figure 2 shows those distributions in dataset 2. Tables 2 and 3 summarize the mean and standard deviation of model performance for datasets 1 and 2 respectively.

Additional file 6. Results of performance inspection for dataset 1 (TF-TFBS with PWM information) and dataset 2 (TF-TFBS without PWM information) (XLS, 108 KB).

Additional file 7. Results of AUC measure for dataset 1 (TF-TFBS with PWM information) and dataset 2 (TF-TFBS without PWM information) (XLS, 59 KB).

Figure 1. Performance comparison of different models for 309 TF-TFBSs (with PWM information). Panel (a)-(d): boxplots of 5 models (the control model, the sequence model, the structure model, the evolution model, and the hybrid model) for sensitivity, specificity, accuracy and AUC measurement. For a boxplot, the 5 whiskers from bottom to top denote the 5th, 25th, 50th, 75th, and 95th percentile respectively.

Figure 2. Performance comparison of different models for 17 TF-TFBSs (without PWM information). Panel (a)-(d): boxplots of 3 models (the structure model, the evolution model, and the hybrid model) for sensitivity, specificity, accuracy and AUC measurement. For a boxplot, the 5 whiskers from bottom to top denote the 5th, 25th, 50th, 75th, and 95th percentile respectively.

Table 2. Performance of different models for 309 TF-TFBSs (with PWM information)

Table 3. Performance of different models for 17 TF-TFBSs (without PWM information)

Results for dataset 1 were shown in Figure 1. The interval between the 25th and the 75th percentile was also adopted as a model performance measurement. For sensitivity, the intervals of the 5 models (the control model, the sequence model, the structure model, the evolution model, and the hybrid model) were (0.447-0.773), (0.774-0.955), (0.676-0.830), (0.556-0.786), and (0.810-0.938) respectively. For positive instances, sensitivity results demonstrated that: (1) the sequence model had the best performance among the three single feature models; (2) the hybrid model was comparable to the best single feature model (the sequence model) and better than the control model. For specificity, interval values of the 5 models were (0.950-1.000), (0.828-0.928), (0.632-0.818), (0.393-0.842), and (0.808-0.910) respectively. For negative instances, specificity results indicated that: (1) the sequence model was the best one in three single feature models; (2) the hybrid model was comparable to the best single feature model (the sequence model) and worse than the control model. The accuracy values of the 5 models were (0.690-0.873), (0.804-0.930), (0.646-0.818), (0.502-0.768), and (0.806-0.925) respectively. When both positive and negative instances were considered, the accuracy results showed that: (1) among single feature models, the sequence model outperformed the other two for TFBS recognition; (2) the hybrid model was comparable to the best single feature model (the sequence model) and surpassed the control one. For AUC measurement, corresponding values of the 5 models were (0.696-0.877), (0.760-0.913), (0.630-0.831), (0.476-0.726), and (0.804-0.919) respectively. Conclusions hinted by the accuracy measurement were reinforced by the AUC results.

Results for dataset 2 are shown in Figure 2. For sensitivity, interval values of the structure model, evolution model and hybrid model were (0.718-0.879), (0.690-0.741), and (0.771-0.877) respectively. For specificity, accuracy, and AUC measurement, the corresponding values were [(0.800-0.868),(0.455-0.857),(0.790-0.868)], [(0.775-0.857),(0.490-0.809),(0.788-0.856)], and [(0.769-0.866),(0.455-0.802),(0.791-0.872)] respectively. Results of dataset 2 implied that without PWM information: (1) the structure model was better than the evolution model for TFBS recognition; (2) performance of the hybrid model was comparable to the best single feature model (the structure model) for identifying TFBSs.

In order to compare the 5 models more directly, the mean of performance was calculated. Table 2 showed the mean values of model performance in dataset 1. In terms of accuracy, when the hybrid model was compared with the control model and the three single feature models, TFBS identification success rate improved 8.0%, 0.0%, 12.8%, and 21.1% respectively. In terms of AUC, corresponding increments were 6.9%, 2.3%, 12.6%, and 24.5% respectively. Those results suggested, again, that considering both positive and negative instances, performance of the hybrid model was comparable to the best single feature model and surpassed the control one. Table 3 showed the mean values of model performance in dataset 2. When the hybrid model was compared with the structure model, the increased values of accuracy and AUC were 0.7% and 11.3% respectively. When the hybrid model was compared with the evolution model, the increase was 1.2% and 14.1% for accuracy and AUC respectively. According to the results of dataset 2, a conclusion similar to dataset 1’s was made, that the hybrid model was comparable to the best single feature model and outperformed the control one. In addition, as shown in Table 2 and 3, the standard deviation of the hybrid model was smaller than other models’ in most cases, which meant that the hybrid model was more robust and balanced than other models.

In order to survey the power of the hybrid model further, we investigated the frequency distribution of accuracy for the hybrid model and the best single feature model in the two datasets (Figure 3). In dataset 1, the hybrid model was compared with the sequence model, while in dataset 2, the hybrid model was compared with the structure model. As shown in Figure 3, accuracy values of the hybrid model were more concentrated in the high-score region than those of the single feature model. That outcome demonstrated that the hybrid model was more robust than the single feature model.

Figure 3. Distribution of accuracy measurement for different models. Panel (a): the histogram of the sequence model and the hybrid model. The green and red rectangles represent the accuracy frequency of the former and the latter respectively. Panel (b): the histogram of the structure model and the hybrid model. The green and red rectangles represent the accuracy frequency of the former and the latter respectively.

Correspondence between TFs and TFBSs

In the previous section, the capability of the sequence, structure, and evolution features to denote TFBSs was surveyed by constructing TFBS identification models. In this section, biological characters of the relationship between TFs and TFBSs were investigated to better comprehend transcriptional regulation. In practice, we inspected TF-TFBS correspondence in terms of sequence, structure, and evolution to explore their relationships.

Inspecting correspondence between TFs and TFBSs in sequence level

At the sequence level, correspondence inspection was operated as follows: (1) 270 TFs (with sequences) out of 326 TFs were clustered through the BLASTCLUST algorithm [25], which can categorize sequences according to their similarity. In practice, for TF clustering, the length coverage threshold (−L) was changed from 0.60 to 0.95, with 0.05 as the step size, and the identity percentage (−S) was changed from 60 to 95, with 5 as the step size. (2) Simultaneously, the corresponding TFBSs of those 270 TFs were also clustered through the BLASTCLUST algorithm, where the TFBS length coverage threshold (−L) was set to 0.90 (required by the BLASTCLUST algorithm due to the short length of TFBSs), and the TFBS identity percentage (−S) was changed from 60 to 95, with 5 as the step size. (3) Clustering outcomes of TFs and TFBSs were recorded separately, and then for each TFBS cluster, its items were transformed to their TF names according to TF-TFBS interaction pairs. Subsequently, matched clusters between TFs and TFBSs were checked. A TF cluster was regarded as matching a TFBS cluster when one of the following criteria held: (a) over 90% of the items of a TF cluster were contained in a TFBS cluster; (b) the intersection rate (Equation 5) between a TF cluster and a TFBS cluster was over two-thirds. Results of the inspection are summarized in Table 4.

Equation 5 (MathML): http://www.biomedcentral.com/1471-2164/13/388/mathml/M5

(5)
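The matching criteria can be sketched in Python with clusters represented as sets of TF names. The definition of the intersection rate is an assumption here: the sketch uses the size of the intersection divided by the size of the union, which may differ from Equation 5. Function names are hypothetical.

```python
def contained_fraction(tf_cluster, tfbs_cluster):
    """Fraction of the TF cluster's items that also occur in the TFBS cluster."""
    return len(tf_cluster & tfbs_cluster) / len(tf_cluster)

def intersection_rate(a, b):
    """Assumed definition: |A intersect B| / |A union B| (may differ from Equation 5)."""
    return len(a & b) / len(a | b)

def clusters_match(tf_cluster, tfbs_cluster):
    """Criterion (a): over 90% contained; criterion (b): rate over two-thirds."""
    return (contained_fraction(tf_cluster, tfbs_cluster) > 0.9
            or intersection_rate(tf_cluster, tfbs_cluster) > 2 / 3)

def match_rate(tf_clusters, tfbs_clusters):
    """Share of TF clusters that match at least one TFBS cluster."""
    matched = sum(
        any(clusters_match(t, b) for b in tfbs_clusters) for t in tf_clusters
    )
    return matched / len(tf_clusters)
```

The TFBS clusters are assumed to have already been translated to TF names through the TF-TFBS interaction pairs, as described in step (3).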

Table 4. Matching results of TF and TFBS clusters based on sequence information

As shown in Table 4, for TF clustering, when the length coverage threshold and identity percentage increased, the cluster number dropped from 62 to 36, which meant that the TF clustering outcome was sensitive to these two parameters. For TFBSs, when the identity percentage increased, the cluster number was not altered. Since sequences of TFBSs are degenerate to some extent, it was not surprising that their clustering outcome was not sensitive to the sequence parameter. The match rate of TF-TFBS clusters was always over 60%, which demonstrated that most TF clusters had matched TFBS clusters under all conditions. That is to say, to some extent, when some TFs were categorized into a cluster due to their similar sequences, their corresponding TFBSs were also classified into a cluster by sequence similarity. In other words, if some TFs’ sequences were similar, their TFBSs’ sequences were most probably similar as well. Those results suggested that, to some degree, correspondence existed at the sequence level between TFs and TFBSs.

Inspecting correspondence between TFs and TFBSs in structure level

At the structure level, correspondence inspection was executed as follows: (1) 270 TFs (with sequences) out of 326 TFs were categorized into four classes (basic-TFs, zinc-TFs, helix-TFs, beta-TFs) according to their structure information [15,16]. (2) The frequency of the 38 structure feature attributes was recorded during construction of the TFBS recognition models. Meanwhile, a confidence interval, based on the 75th quantile of attribute frequency, was generated through a 10,000-replication bootstrap. Then significant attributes, with frequencies over the median of the interval, were selected for subsequent processing. As a result, 5 (the 27th, 30th, 32nd, 33rd, and 34th attributes) out of the 38 attributes were chosen, and TFBSs of the 270 TFs were encoded with a 5-dimension vector. (3) The Expectation-Maximization (EM) algorithm was employed to estimate the number of TFBS classes, and then that number was delivered to the K-means clustering algorithm as an initial parameter for TFBS classification. (4) For each TFBS class, its items were transformed to their TF names according to TF-TFBS interaction pairs. Then the mapping status between TF and TFBS classes was inspected with criteria similar to those used in the previous section (inspecting correspondence in sequence level). In practice, mapping status was defined as Yes when over 90% of the items of a TF class were found in a TFBS class. The mapping results of the four TF classes are summarized in Table 5.
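The attribute selection in step (2) can be sketched in Python under one possible reading of the text: resample the attribute frequencies with replacement, record the 75th quantile of each resample, take the median of those bootstrap quantiles as the cutoff, and keep attributes whose frequency exceeds it. This interpretation, the fixed seed, and the function names are assumptions.

```python
import random
import statistics

def quantile75(values):
    """75th percentile using the statistics module's default method."""
    return statistics.quantiles(values, n=4)[2]

def bootstrap_q75(freqs, replications=10000, seed=0):
    """Sorted bootstrap distribution of the 75th quantile of attribute frequency."""
    rng = random.Random(seed)
    return sorted(
        quantile75(rng.choices(freqs, k=len(freqs))) for _ in range(replications)
    )

def select_attributes(freqs, replications=10000, seed=0):
    """Return 1-based indices of attributes whose frequency exceeds the cutoff."""
    cutoff = statistics.median(bootstrap_q75(freqs, replications, seed))
    return [i + 1 for i, f in enumerate(freqs) if f > cutoff]
```

Here freqs would hold one usage frequency per structure attribute, in attribute order, so the returned indices correspond to attribute numbers.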

Table 5. Mapping results of TF classes based on structure information

As shown in Table 5, for each TF class, the class-level mapping rate was no less than 90%, which suggested that every TF class found a matched TFBS class. That is to say, according to structure information, when some TFs were grouped into a class, their corresponding TFBSs were most likely categorized into a class as well. Therefore, we concluded that correspondence between TFs and TFBSs existed at the structure level as well.

Inspecting correspondence between TFs and TFBSs in evolution level

At the evolution level, correspondence inspection was carried out as follows: (1) Homolog information of 270 TFs (with sequences) was collected from the InParanoid database, which contains eukaryotic ortholog groups [26,27]. Then each TF was assigned a conservation score based on the number of its orthologs. In practice, a TF obtained a higher score when it had more orthologous genes. (2) Simultaneously, for each TF, conservation of its DNA targets was assessed through their evolution feature during model construction for TFBS identification. In practice, the mean value of the evolution feature over a TF’s DNA targets was assigned as the conservation score of its corresponding TFBSs. (3) Correspondence between TFs and TFBSs was inspected by surveying the correlation of conservation scores between TFs and their DNA targets. Detailed information about the conservation scores of TFs and their DNA targets is listed in Table 6.

Table 6. Conservation scores of 270 TFs and their corresponding TFBSs

A Spearman’s rank correlation test was used to investigate the correlation between TFs and TFBSs. The correlation coefficient between TF and TFBS conservation scores was 0.122 (p = 0.023 < 0.05, one-sided test), which meant that there was, to some degree, a positive correlation between transcription factors and their DNA targets. Those results suggested that when a TF was conserved, its TFBSs were likely to be conserved. In other words, in terms of evolution, there was correspondence between TFs and their TFBSs.
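Spearman’s coefficient is the Pearson correlation of the ranks of the two variables. The Python sketch below computes the coefficient only (no p-value) with average ranks for ties; it illustrates the statistic and is not the software the authors used, and the function names are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Applied to the TF conservation scores and the mean TFBS conservation scores, spearman would return the coefficient that the rank test then evaluates for significance.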

Discussion

In this work, we first evaluated the power of the sequence, structure, and evolution features to describe properties of transcription factor binding sites by constructing TFBS identification models. For TF datasets with PWM information, TFBS identification accuracy of the three single feature models reached 86%, 73%, and 65% for the sequence, structure and evolution models respectively. Given no PWM information, accuracy of the structure and the evolution features was about 80% and 69%. Those results demonstrate that: (1) these features capture TFBSs fairly well; (2) among the three features, the sequence feature is the most powerful for depicting TFBS binding preference. It is noteworthy that prior PWM information is required when computing the sequence feature. In contrast, the structure and the evolution features do not need much prior information when they are applied to TFBS recognition. Thus, the structure and the evolution features are, to a certain degree, more suitable than the sequence feature for ab initio TFBS recognition.

A hybrid model was built to survey the complementarities of the three features. According to the sensitivity, specificity, accuracy, and AUC measurements, the performance of the hybrid model exceeds that of the control model and is comparable to that of the best single feature model. Moreover, the hybrid model performs fairly well not only on TF sets having PWM information (dataset 1) but also on TF sets with poorly conserved TFBSs (dataset 2). The capability of the hybrid model can be explained by the following two reasons: (1) In terms of biological character, the sequence feature presents the similarity of a DNA sequence to a PWM pattern; the structure feature contains conformational and physicochemical attributes, which are thought to be closely related to TFBS binding; the evolution feature depicts the conservation degree of a DNA segment. The three features describe properties of TFBSs in various biological aspects, so combining them can describe TFBS binding preference more comprehensively. (2) In terms of string context, for a DNA segment, the sequence feature gives the contribution of each nucleotide to a valid pattern (PWM pattern); the structure feature is correlated with the dinucleotide distribution, which reflects the relationship of adjacent nucleotides; the evolution feature considers conservation of a DNA segment as a whole. Methodologically, the integrated model uses string context more effectively than a single feature model, so it is not surprising that the hybrid model has better performance for TFBS recognition. In summary, the results illustrate that: (1) the three biological features are complementary to some extent; (2) the strategy of combining different features benefits TFBS identification.

After investigating the ability of the sequence, structure, and evolution features to distinguish TFBSs, we examined the correspondence at each of these levels to explore the interaction mechanism between TFs and TFBSs. The correspondence analysis shows that TFs and TFBSs are reciprocal in three respects. (1) At the sequence level, when TFs have similar sequences, their corresponding TFBSs also have similar sequences. In general, proteins with similar sequences are believed to have analogous functions. TFs are pivotal proteins in transcriptional regulation, and their most important function is to bind TFBSs and thereby regulate the expression of downstream target genes. Hence, it is reasonable that TFs with similar sequences have TFBSs with similar sequences as well. This reciprocity at the sequence level is a functional reflection of the interactions between TFs and TFBSs. (2) At the structure level, when TFs are grouped into one class, their TFBSs are most probably categorized into one class as well. TFs belonging to the same class generally have analogous structural domains. It is well known that interactions between TFs and TFBSs are determined by the structural domains of the former and the fold conformation of the latter. TFs clustered into one class therefore interact with analogous TFBSs, and analogous TFBSs usually have similar fold conformations. It is thus not surprising that we observe structural correspondence between TFs and TFBSs; this result maps the TF-TFBS interaction directly onto the structural level. (3) At the evolution level, when a TF is conserved, its corresponding TFBSs are likely to have low mutation rates. In other words, TFs and their TFBSs have consistent mutation trends in evolution. Consider the opposite situation, in which a TF is conserved (that is, it has a low mutation rate) but its TFBSs are more variable and have a high mutation rate. When the sequences of those TFBSs mutate and their fold conformations change, they can no longer be bound by the original TF, so the interactions between the TF and its DNA targets are lost. Thus, TFs and their TFBSs should have coherent trends in evolution so as to maintain the interactions between them. Given the coherence between TFs and TFBSs at the sequence, structure, and evolution levels, we consider that, to a certain degree, TFs and TFBSs have co-evolved to keep their physical binding and maintain their regulatory functions, which is consistent with the report of Yang et al. [28].
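One simple way to quantify such coherence is a rank correlation between a per-TF conservation score and the mean conservation score of its TFBSs. The Python sketch below implements Spearman's rank correlation for that purpose. It is a hypothetical illustration and not necessarily the statistic used in this study; the function names and the idea of pairing one score per TF with one score per TFBS set are assumptions of the example.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

Given one conservation score per TF and the matching mean score of its TFBSs, a coefficient near 1 would indicate that conserved TFs tend to have conserved binding sites, which is the pattern described above.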

Conclusions

In this work, we provided insight into the biological characters of interactions between transcription factors and their DNA targets. Our results show that the sequence, structure, and evolution features perform well not only in TFBS recognition but also in describing TF-TFBS interactions. In addition, combining the three features is a reasonable strategy for capturing TFBSs. Furthermore, the correspondence observed between TFs and TFBSs contributes to the understanding of transcriptional regulation. On the one hand, the coherence between TFs and TFBSs at the sequence, structure, and evolution levels helps to interpret TFBS binding preference. On the other hand, the reciprocity of TFs and TFBSs at these three levels provides useful information for research on protein-DNA interactions. In summary, our results widen the knowledge of interactions between transcription factors and their binding sites, which will help to further investigate transcriptional regulation and to explore the binding mechanisms between proteins and DNA.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

GYZ collected the datasets, carried out the experiments, and drafted the manuscript. QL and GHD helped to collect the datasets. CCW and YXL directed the research and revised the manuscript. All authors read and approved the manuscript.

Acknowledgements

We thank the anonymous reviewers for their help in improving the article.

Funding: This work was supported by the National Natural Science Foundation of China (No. 31100957, No. 60970050), the K.C. Wong Education Foundation, Hong Kong, the China Postdoctoral Science Foundation (No. 20110490758), the National Basic Research Program of China (973) (No. 2011CB910204), and the Main Direction Program of Knowledge Innovation of the Chinese Academy of Sciences (No. KSCX2-EW-R-04).

References

  1. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, et al.: TRANSFAC: transcriptional regulation, from patterns to profiles.

    Nucleic Acids Res 2003, 31(1):374-378.

  2. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.

    Nucleic Acids Res 2006, 34(Database issue):D108-110.

  3. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.

    Bioinformatics 1999, 15(7–8):563-577.

  4. Nagarajan N, Jones N, Keich U: Computing the P-value of the information content from an alignment of multiple sequences.

    Bioinformatics 2005, 21(Suppl 1):i311-318.

  5. Hertz GZ, Hartzell GW III, Stormo GD: Identification of consensus patterns in unaligned DNA sequences known to be functionally related.

    Comput Appl Biosci 1990, 6(2):81-92.

  6. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.

    Science 1993, 262(5131):208-214.

  7. Abnizova I, Gilks WR: Studying statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the eukaryotic genomes.

    Brief Bioinform 2006, 7(1):48-54.

  8. GuhaThakurta D: Computational identification of transcriptional regulatory elements in DNA sequence.

    Nucleic Acids Res 2006, 34(12):3585-3598.

  9. Ponomarenko MP, Ponomarenko JV, Frolov AS, Podkolodny NL, Savinkova LK, Kolchanov NA, Overton GC: Identification of sequence-dependent DNA features correlating to activity of DNA sites interacting with proteins.

    Bioinformatics 1999, 15(7–8):687-703.

  10. Ponomarenko JV, Ponomarenko MP, Frolov AS, Vorobyev DG, Overton GC, Kolchanov NA: Conformational and physicochemical DNA features specific for transcription factor binding sites.

    Bioinformatics 1999, 15(7–8):654-668.

  11. Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W, Rubin EM, Frazer KA: Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons.

    Science 2000, 288(5463):136-140.

  12. Blanchette M, Tompa M: Discovery of regulatory elements by a computational method for phylogenetic footprinting.

    Genome Res 2002, 12(5):739-748.

  13. Corcoran DL, Feingold E, Dominick J, Wright M, Harnaha J, Trucco M, Giannoukakis N, Benos PV: Footer: a quantitative comparative genomics method for efficient recognition of cis-regulatory elements.

    Genome Res 2005, 15(6):840-847.

  14. Boffelli D: Phylogenetic shadowing: sequence comparisons of multiple primate species.

    Methods Mol Biol 2008, 453:217-231.

  15. Zheng G, Qian Z, Yang Q, Wei C, Xie L, Zhu Y, Li Y: The combination approach of SVM and ECOC for powerful identification and classification of transcription factor.

    BMC Bioinformatics 2008, 9:282.

  16. Zheng G, Tu K, Yang Q, Xiong Y, Wei C, Xie L, Zhu Y, Li Y: ITFP: an integrated platform of mammalian transcription factors.

    Bioinformatics 2008, 24(20):2416-2417.

  17. Praz V, Perier R, Bonnard C, Bucher P: The Eukaryotic Promoter Database, EPD: new entry types and links to gene expression data.

    Nucleic Acids Res 2002, 30(1):322-324.

  18. Schmid CD, Praz V, Delorenzi M, Perier R, Bucher P: The Eukaryotic Promoter Database EPD: the impact of in silico primer extension.

    Nucleic Acids Res 2004, 32(Database issue):D82-85.

  19. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals.

    Nature 2005, 434(7031):338-345.

  20. Quinlan JR: C4.5: programs for machine learning. Morgan Kaufmann Publishers, San Francisco, CA, USA; 1993.

  21. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update.

    SIGKDD Explorations 2009, 11(1):10-18.

  22. Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E: MATCH: A tool for searching transcription factor binding sites in DNA sequences.

    Nucleic Acids Res 2003, 31(13):3576-3579.

  23. Chekmenev DS, Haid C, Kel AE: P-Match: transcription factor binding site search by combining patterns and weight matrices.

    Nucleic Acids Res 2005, 33(Web Server issue):W432-437.

  24. Hall MA, Smith LA: Feature subset selection: a correlation based filter approach. In International Conference on Neural Information Processing and Intelligent Information Systems. Springer, Berlin; 1997:855-858.

  25. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

    Nucleic Acids Res 1997, 25(17):3389-3402.

  26. Berglund AC, Sjolund E, Ostlund G, Sonnhammer EL: InParanoid 6: eukaryotic ortholog clusters with inparalogs.

    Nucleic Acids Res 2008, 36(Database issue):D263-266.

  27. Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O, Sonnhammer EL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis.

    Nucleic Acids Res 2010, 38(Database issue):D196-203.

  28. Yang S, Yalamanchili HK, Li X, Yao KM, Sham PC, Zhang MQ, Wang J: Correlated evolution of transcription factors and their binding sites.

    Bioinformatics 2011, 27(21):2972-2978.