Abstract
Background
Transcription factor (TF)DNA binding loci are explored by analyzing massive datasets generated with application of Chromatin ImmunoPrecipitation (ChIP)based highthroughput sequencing technologies. These datasets suffer from a bias in the information about binding loci availability, sample incompleteness and diverse sources of technical and biological noises. Therefore adequate mathematical models of ChIPbased highthroughput assay(s) and statistical tools are required for a robust identification of specific and reliable TF binding sites (TFBS), a precise characterization of TFBS avidity distribution and a plausible estimation the total number of specific TFBS for a given TF in the genome for a given cell type.
Results
We developed an exploratory mixture probabilistic model for a specific and nonspecific transcription factorDNA (TFDNA) binding. Within ChiPseq data sets, the statistics of specific and nonspecific DNAprotein binding is defined by a mixture of sample sizedependent skewed functions described by KolmogorovWaring (KW) function (Kuznetsov, 2003) and exponential function, respectively. Using available Chipseq data for eleven TFs, essential for selfmaintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, sox2, KLf4, STAT3, E2F1, Tcfcp211, ZFX, nMyc, cMyc and Essrb) reported in Chen et al (2008), we estimated (i) the specificity and the sensitivity of the ChiPseq binding assays and (ii) the number of specific but not identified in the current experiments binding sites (BSs) in the genome of mouse embryonic stem cells. Motif finding analysis applied to the identified cMyc TFBSs supports our results and allowed us to predict many novel cMyc target genes.
Conclusion
We provide a novel methodology of estimating the specificity and the sensitivity of TFDNA binding in massively paralleled ChIP sequencing (ChIPseq) binding assay. Goodnessof fit analysis of KW functions suggests that a large fraction of low and moderate avidity TFBSs cannot be identified by the ChIPbased methods. Thus the task to identify the binding sensitivity of a TF cannot be technically resolved yet by current ChIPseq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIPseq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIpseq data should be carefully revised.
Background
Identification of transcription regulatory elements in the genome is an important problem of molecular systems biology and statistical genomic studies. Among those elements, transcription factor binding sites (TFBSs), short and specific DNA loci targeted by transcription factors (TFs), are considered as basic regulatory elements of gene functional activity and reflect the corresponding proteinDNA interactions in a cell. TFs are the largest set of regulatory proteins in mammalian cells. According to NCBI RefSeq database, about 10% of all known proteins of mammals, including humans, are TFs.
A TFBS serves as a target for a transcription factor which binds to this specific binding site (BS) directly or via other proteins and regulates gene transcription. In a mammalian genome the number of direct and indirect BSs for a given TF could be ranged from several hundreds to hundred thousand [18]. However, these values have not been directly measured and the theoretical estimates provide lowly confident values.
The interactions between the molecules of a given TF and corresponding TFBSs in the genome could be considered in the terms of TFDNA binding events (BEs) which reflect the events of binding in the assay. Any of such events might be specific and nonspecific in the context of a direct physical binding of the TF to a TFBS. The intensity (and the corresponding probability) of an occurrence of a given BE in a given genome locus can be characterized by the level of relative avidity (RA) of the TFDNA binding an integrative quantitative characteristic of availability of a DNA locus (e.g. TFBS and its flanking region) for a given protein (e.g. TF) binding [9].
The population distribution function of RA (i.e. the distribution function of RA for a given set (population) of BE) for a given TF can reveal functional attributes of the TFBSs and the mechanisms of the TFDNA interaction on the genomic scale. However, at the level of single cells or cell samples the distribution function of RA for any TF is unknown, since many technical problems of direct counting of specific protein molecules bound DNA have not been solved yet. Instead, the relative avidity of TFDNA binding in an average genome within a given cell sample can be quantified by an estimate of the number of TF molecules bound to a given locus averaging across the given cell sample. However, quantitative detection of TFs bound to specific loci is a great challenge. A simpler task is a pulldown of short DNA fragments directly or indirectly bound with the molecules of a given TF. Such TFDNA complexes can be detected in Chromatin ImmunoPrecipitation (ChIP) ChIPbased genomewide sequencing experiments [3,4,68].
In a typical genomescale ChIP experiment ~10^{8 }cells are used and the transcription factors are crosslinked to their DNA using chemical cross linkers. After the genomic DNA is isolated and fragmented by ultrasound sonication, an antibody specific to the TF of interest is used to isolate each TF molecule and the DNA fragment which it is bound to. Cloning the whole pool of such fragments and sequencing them (for example, in a serial analysis of chromatin occupancy (SACO) [10], ChIPpaired end tag (ChIPPET) [7], sequence tag analysis of genome enrichment (STAGE) [1]) provides the data for a largescale identification of TFBSs.
However, the experimental information, obtained from SACO, ChIPPET and STAGE experiments, about statistical properties of TF binding at specific physical BSs (and moreover biological functions of the BSs) in a given cell type, at a given environment is highly noisy and essentially incomplete. Only pulleddown TFDNA complexes containing DNA fragments with maximal relative binding avidity could be reliably detected.
Recently, a new generation of sequencing technology, massively parallel sequencing (MPS), has been established [2,5] MPS can sequence many millions of fragments in a single experiment. Combining ChIP and MPS (ChIPseq) provides a genome wide view on TFBSs. In particular, ChIPseq method can accurately detect moderate/high avidity TFBSs at a resolution (up to several base pairs) higher than for any previous ChIP method [2,5]. In detail, during a ChIPseq experiment, TFImmunoprecipitated DNA fragments (sequence reads) are directly sequenced in series of 2536 bp reads, and dozen millions of such short reads are then mapped to the reference genome. After mapping of unique DNA fragments (enriched by DNA bound by TF molecules) onto the genome, the DNA fragment sequences, extended further up to ~200 bp, are clustered by their overlapping subsequences and mapped onto the genome. The number of DNA fragment sequence overlaps is further represented in the analysis as overlap signal with peaks. Each DNA fragment cluster is usually quantified by its highest overlap peak, i.e. by the cluster's internal region with the largest number of overlapping extended ChIPseq DNA fragment sequences. (Figure 1) The number of overlapping DNA fragments in a given cluster can be observed as the RA value of the given TFDNA BS. Technically, TF binding avidity in specific DNA loci could be measured on the genome scale using peak search procedure on the mapped overlap signal. A number of software tools have been proposed to perform this search in the genome, and thus to find genomic regions containing the TFabundant DNA loci and to identify in each locus specific TFbound DNA sequences.
Figure 1. A random sampling model of determination of TF binding avidity potential on the genome defined in a ChIPseq experiment. Sequencing TFBSenriched DNA fragments can be assayed to determine the specific clusters of DNA sequences bound by TF protein. Results are strongly depended from the number of read (sample size). A: Small sample size. B: Large sample size. Blue horizontal stake: specific DNA fragment; white horizontal stake: nonspecific DNA fragment forming nonspecific (falsepositive) clusters. C: BS1BS6 are binding loci presented in the given cells: blue vertical stakes are relative binding avidity in the loci; BS6 might be modified (epigenetically) and suppressed BS (a stake with triangle basis) and therefore it might be not detected in ChIpseq assay. BS1 and BS4 might be not detected in the assay due to sample size limit. D: A scheme of the random Markov process of bindingdissociation of TFDNA realized on the genome scale. The graph illustrates concept of birthdeath random process model utilized in this work (see Methods).
One of the crucial problems with ChIPbased genomewide assays, including ChIPseq, is a statistically reliable identification of biologically meaningful phenomena (e.g. all specific and functionally important binding loci) from the large amounts of generated experimental data. In this context reliable, specific and sensitive mapping of proteinDNA interactions is still essentially dependent on subjective rules of preprocessing and filtration of the DNA fragment sequences, the statistical criteria used to identify specific binding loci and the real TFBSs. As a result, some basic definitions, datasets, statistical models and estimates have been revised after original publications [2,5,9,1113].
A large unexplained "technical" noise component in the experimental measurements and its nonuniform location in the genome are serious limitations for the specificity of the current ChIPbased methods [7,9,1113]. Comparison of ChIPseq data sets analysing the same TF, under similar conditions and the same cell type but using different experimental platforms (ChIPchip, ChIPPET and ChIPseq) revealed a reasonable consistency between the mapped datasets only for high and moderate avidity ChIPdefined binding loci. These loci, however, represent only a small fraction of all binding loci observed in the course of the entire ChIPseq experiment. Therefore genomic localization and biological roles of other reliable but less reactive TFbinding loci (low avidity BSs) cannot be identified from the assays [9,11,12,14]. To analyse statistical properties of the low, moderate and highavidity binding loci for a given ChIPseq data set we need to develop a probabilistic model which would allow us to answer three important open questions.
How many 'hidden' TFspecific BSs with moderate and low avidity are present in the noiserich subset of ChIPseq data?
How many "true negative" specific BSs which should be included in consideration are not present in the given ChIPseq dataset?
How many specific BSs of a given TF exist in the genome?
To answer these questions an analytical model(s) of ChIPseq TFDNA binding experiments should be focused on analysis of specificity and sensitivity of proteinDNA binding.
To do this a simple exploratory probabilistic model of TFDNA binding events was developed and implemented. It allowed us to facilitate noise filtering of ChIPseq datasets and predicting, for a given TF, the number of specific TFDNA binding loci in the entire genome. We applied the proposed analytical approach to the distributions of binding avidity of the TFBS of eleven TFs which are essential for maintaining and differentiation mouse embryonic stem cells.
Results
Natural variation of binding availability of the TFBS for TFs and critical threshold of TF binding specificity
TFs bind to short high affinity DNA response elements (motifs) mostly in putative promoter region of a gene and modify gene activity. However this model of TFDNA binding is too simple and the phenomenon remains poorly understood, although many models of TFDNA binding have been reported [9,13,1517]. For example, wellknown TF cMyc exhibits very diverse and complex patterns of DNA binding activity [12]. CMyc binding region in gene promoter region could often contain multiple copies of a few cMyc Ebox sequences. Such TFBS clusters could be found in low, moderate and high avidity ChIPseq binding loci of the mouse embryonic stem cell (EC) related genes (see below).
The Ebox is a high affinity response/binding element of DNA which able to discriminate cMyc among other different TFs. cMyc forming heterodimers with its binding partners (MAX, MIZ1) and other DNAspecific proteins (e.g. MIZ1, Sp1, NEY) using "canonical" E box 5'CACGTG3', which represents the consensus or other at least five "noncanonical" Eboxes (CATGTG, CATGCG, CACGCG, CACGAG, CAACGTG), and additional one (CGCGAG) which we report in this work (see below). The Ebox motifs are present in putative promoter regions of at least 4000 genes [12]; however, how a given response element can discriminate among all these different transcription factors is currently unclear. Moreover, several types of noncanonical Eboxes could bind cMycMAX complex strongly than canonical [18]. In this respect, several mechanisms of variability of cMycMaxDNA binding avidity have been proposed (and, in some instances, experimentally substantiated) [12,17,18]. For instance, flanking sequences, chromatin structure, methylation status, and relative position within the promoter or interaction with adjacent regulatory elements might contribute to selection of a particular complex [17].
We assume here that transcription factor (TF)DNA binding avidity in a given TFBS defined in ChIPseq experiment could be considered as useful integrative characteristic of availability of the TFBS for a given TF. Naturally, TFDNA binding avidity could be quantified by the number of TF molecules binding a given specific locus including specific binding element(s) and its flanking regions. However, the number of TF molecules bound DNA cannot be directly inferred from ChIPseq experiment. ChIPseq detects the amount of specific DNA fragments directly and indirectly bound by the proteinantibody complexes. Mapping the extended DNA fragments on the genome, an appropriate clustering of overlapping fragments and filtering of the clusters result in an identification of putative TFBSs. Thus the integrative relative avidity of TFBS in ChIPseq experiment could be quantified by the number of overlapping DNA fragments in the cluster loci. When this number is equal to or larger then a specificity threshold, t, such significant DNA loci (specific BEs) could be used to construct and analyze of statistically reliable part of the empirical frequency distribution representing relatively highavidity binding loci of specific TFDNA binding. In particular, we used this part of the distribution for estimation of the number of reliably detected specific TFBSs (called , see Definitions & Methods).
In the datasets [15], which we used in our study, the specificity threshold was calculated using three different methods. The first method implements a model of normalization of observed binding peak height values in a ChIPseq experiment against peak height values in the corresponding negative control experiment (antiGFP [15]). However, the control data provided many strong peaks which are often found in specific genomic regions like satellite repeats creating an unpredictable bias in identification of specificity threshold. We found also a fraction of loci in which negative control and experimental set binding signals (peaks heights) are highly correlated and the fraction of locus in which the peaks in control dataset were higher than in the experimental set.
After removing nonrandom false peaks, a relatively small reduction of the number of the nonspecific DNA fragments even among the smallest peaks (1, 2,..., 7) about 1525% of total sequence read could be excluded. However, many millions DNA fragments within low and moderate abundant peaks of are present in the data sets (data not shown). These findings suggest that the number and the variation of DNA fragments of the negative control experiment do not allow us to explain the source(s) of the major fraction of nonspecific DNA sequences and their clusters occurred in a specific ChIPseq experiment.
The second method implements a model of random occurrence of DNA fragment clusters in the genome. The model assumes that each genomic position has the same probability of producing a ChIPseq DNA fragment (sequence read) and the probability of finding by chance m DNA fragments in a ChIPseq experiment within the same genomic region taken from a genome of size g. The probability calculations were made, using a MonteCarlo simulation, by randomly extracting 200 bp DNA fragments from reference mouse genome (mm8) and estimating the expected numbers of nonspecific overlap peaks with various height values. Then the frequency distribution of the number of random ChIPseq DNA fragments in a cluster overlap (peak heights) was constructed and the threshold value at the given specificity level for the original empirical frequency distribution of the peak height value was defined. The binding loci reported in [15] has been limited to the locis with specificity threshold value defined by the model described above, which was considered as the random background noise mode [15].
The third approach based on qPCR validation data with correction of threshold value for a limited number of ChIPseq DNA loci has been used [15]. ChiPseq data analyses showed that in all cases the simple model (with a random uniform distribution of DNA fragments locations in the genome) produces too optimistic specificity threshold values in comparison with the values defined by qPCR method [15]. For example, for TF Nanog the model overlap peak value 7 is predicted as a cutoff at FDR<0.95 [15]. At this cutoff value 32773 putative ChIPseq BSs were predicted as "true" (or specific) TFBS. However, the analysis of 37 ChIPseq defined BSs using ChIPqPCR suggested the specificity threshold value equal to 11 (at specificity level 98%). At this specificity threshold value, only ~30% (10343/32773) of DNA fragment clusters could belong to "true" BSs. Additionally, the numbers of singletons (single unique ChIPseq DNA sequences occurring only once per a single unique locus) generated in the computational simulation were systematically larger than observed ones (for example, Figure 2A). Inversely, predicted numbers of random clusters with low and moderate avidity loci (e.g. with overlap peak heights 210) might be systematically underrepresented in ChIPqPCR data.
Figure 2. Observed and predicted statistics of TFDNA BEs. A: Fitting and back extrapolation analysis for complete dataset. Decomposition of mixture model (1) for Nanog TFDNA BEs is provided based on curvefitting analysis of the model. Close circle: number of loci of ChIPseq extended DNA cluster overlaps from 1 to 8 BEs. Open circle: number of loci of ChIPseq extended DNA cluster overlaps from 9 to 73 (included) BEs. Noiselike (close circles) data fits well be exponential function with exponent parameter s = 1.05 ± 0.055 (p < 0.0001, ttest). The reliable set of TF BS (at >8 BEs) are equally well fitted by the leftside truncated GDP function (at k = 1.81 ± 0.15 (p < 0.001, ttest) and b = 8.00 ± 1.335 (p < 0.001, ttest)) as well as by KW function (θ = 0.999, a = 6.618, b = 8.29; Table 3). Extrapolation curve predicts the number of Nanog TFBSs in the noiseenriched binding site fraction of the empirical distribution. B: Nanog TFDNA BEs, C: Esrrb TFDNA BEs and D: cMyc TFDNA BEs. B, C and D: KW model fitting on the observed and extrapolated of doubletruncated GDP data to calculate p_{0}. Vertical dotted lines are representing qPCRdefined threshold and the threshold defined based on bestfit doubletruncated GDP function. Triangle symbols show the observed over represented number of TFBSs in compare to bestfit GDP function. N_{0}, N_{1 }and N_{2 }are the numbers of nondetected, potentially detected and high specific (reliable) TFBSs, respectively. More detail information about parameter values of GDP and KW models presents in Additional File 3, 4, 5.
ChIPqPCR is often used to determine in vitro whether a given ChIPseq DNA fragment cluster belongs to a specific TFBS and, subsequently, to estimate the specificity threshold for the original ChIPseq dataset by extrapolation. However, due to the limited number of loci which could be used in ChIPqPCR, a suboptimal design of ChIPqPCR experiment might contain a bias in estimating the ChIPseq BE specificity threshold. In this section we shall give an answer to the question: how to optimize a design of ChIPqPCR experiments to minimize the sample size bias?
On Figure 3 we present the results of our analysis of 94 ChIPseq loci containing 224 distinct DNA fragments from Esrrb TF ChIPseq library selected by the authors of [15] for their validation of computational model estimate (t = 6). Figure 3 shows that ChIPqPCR experiment was designed on the studied affinity value range based on a uniform distribution function of the number of TFDNA BEs per locus. To search for a more realistic frequency distribution representing Chipseq data, 224 random samples of the 94 ChIPseq loci were chosen without replacement from the same ChIPseq library. On Figure 3 the mean values and standard deviations (SD) of the number of BEs (peak heights) over these 224 samples are plotted. This figure shows that the observed frequency distribution of mean peak height is a skewed Paretolike function. In comparison to the uniform distribution, our frequency distribution is significantly overrepresented by the BS with overlap peak values less than 13 and is significantly underrepresented by the with overlap peak values greater than 15. This result suggests that a suboptimal design of a qPCR validation study can provide a false positive increment in the estimation of the number of qPCRconfirmed specific ChIPseq loci and thus it can erroneously predict a larger specificity value than it can be deduced from the whole ChIPseq dataset. If so, the actual specificity threshold values of ChiPseq experiments might be larger than it was reported in [15]. Respectively, the real number of specific TFBSs at the threshold value given by qPCR test could be smaller than it is expected from results of the test. In the next sections we will present an alternative computational method which uses all available Chipseq data and estimates the specificity of Chipseq assay without (i) assumptions regarding physical distribution nonspecific Chipseq DNA clusters (noise BEs) and (ii) the need for a validation of the results in ChIPqPCR assay.
Table 1. Comparative analysis of TF binding specificity estimated based on three methods. Specificity threshold estimates by the uniform noise model [15], by ChIP qPCR [15] and by the bestfit GDP function. The uniform random modelbased estimations of the threshold values defined at FDR <5%. TF binding specificity threshold t was estimated based on the bestfit double truncated GDP function. The last two columns of the table shows that GDPmodel could improve the specificity estimates providing by the Poisson model and ChIPqPCR assay.
Figure 3. Suboptimal design of the ChIPqPCR experiment. Statistics of BEs in Esrrb TF ChIPseq data is following to skewed distribution. Difference in the frequency distributions of BEs for peaks used in qPCR and in random samples chosen from Esrrb TF ChIPseq library at peak values >11 available for this dataset. In Chen et al [15], to determine the specificity, the peak height critical threshold were determined by 3fold enriched qPCR signal/noise criteria.
A mixed model of the samplesize dependent frequency distributions of binding events could be used to estimate the number of low/moderate and large avidity binding loci
A simple visual inspection of the available empirical frequency distributions of TFDNA binding plot in loglog scale (Figure 2, Additional files 1, 2) allows us to suggest that a mixture of at least two distinct skewed samplesize  dependent frequency distributions of binding avidity is present in the data. Statistical analysis of ChIPbased [9,1115,19] has showed that the left part of the empirical distribution is reach with the noise resulted from relatively low avidity BE and the right side of the distribution is represented by relatively highavidity TF DNA binding (Figure 2, Additional files 1, 2). Also we found that the tails of the empirical distributions of TFDNA binding avidity exhibit monotonicallyskewed shape with a greater abundance of low avidity and more gaps among the highavidity loci. Low and moderate avidity TFBSs are highlyabundant in the mouse and other mammalian genomes and could play biologically meaningful functional roles.
Additional file 1. GDP function fitting and extrapolation in noisy events for Esrrb TF library. Empirical relative frequency distribution of peak height intensities for Esrrb is fitted by GDP. Loglog plot: frequency of peak height intensity for the Esrrb library; solid circles: observed frequencies for cut off 12; solid line: best fit GDP function with parameters k = 2.40 ± 0.0778, b = 10.42 ± 0.6828. Extrapolated graph with the same parameters to get the predicted TFBSs in noise enriched binding events of library.
Format: PDF Size: 13KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 2. KW model fits on the observed and bestfit GDPderived data and calculates p0. Vertical dotted lines are representing qPCR experimental threshold and Improved Model threshold. Table 1 is representing the parameters of the KW model fitting for all TFs.
Format: PDF Size: 191KB Download file
This file can be viewed with: Adobe Acrobat Reader
The specificity threshold of the distribution of the number of TFbound DNA loci, t, is the parameter which separates lowly reliable and highly reliablebinding loci at a given specificity level (6).
For quantitative analysis of ChIPseq data based on the model (1) it should be important to estimate the specificity threshold, t, as well as the number of highavidity specific loci , at avidity m ≥ t, and the number of specific loci having low/moderate avidity less than t (m <t). To do that, we use the fitting and backextrapolation method (see Methods). We illustrate our method with the analysis of the ChIPseq Nanog TFDNA binding data as an example (Figure 2A &2B).
We demonstrated that the exponential distribution function fits well to the background noise of the empirical distribution of TFDNA binding, while the truncated GDP function (2) fits well to long tail of the empirical distribution (Figure 2A). Specifically, this noiselike part of the distribution can be fitted well with an exponential function with exponent parameter d = 1.05 ± 0.055 (p < 0.0001, ttest) (Figure 2A). Table 2 and Additional File 3 (Table 1) provide the estimated parameter values and detailed characteristics of statistical tests.
Additional file 3. Fitting statistics of the GDP model to the empirical frequency distribution of binding events. t[k], t[b] are t test value for k, and b respectively. p[k] and p[b] are the p values. F is Fisher criterion (by SigmaPlot)
Format: XLS Size: 30KB Download file
This file can be viewed with: Microsoft Excel Viewer
Table 2. TGDP function parameters for the eleven TF libraries. Nonlinear regression module of SigmaPlot program was used to estimate parameters of GDP function. More details regarding binding statistics and TGDP function parameters are presented in Definitions, Methods and Additional files.
The most noiseaffected data points are located at the left side of the empirical histogram (from 1 to 8 BEs), where the data was accurately fitted by the exponential function using SigmaPlot software (Figure 2A). The values of the exponential function at data points from 0 to 9 were fitted (solid line on Figure 2A) and for larger values were estimated by the bestfit exponential function (extrapolating onto the right). At the specificity threshold (t = 11) identified from the analysis of ChiPqPCR data, the admixture of nonspecific BS (originating from the tail of the exponential function) consists of ~2% of the reliable subset (Sp = 98%, respectively). Importantly, at the predicted specificity level (Sp = 97%) our bestfit mixture model predicts the same specificity threshold (t = 11).
Subtraction of the exponential function values from the observed frequency distribution values allowed us to restore the "noisefree" frequency distribution function for overlap peak value 9 and larger (Figure 2A &2B). This reconstructed doubletruncated frequency distribution could be considered as a reliable segment of the empirical specific binding avidity distribution. After parameterisation of the GDP function we can extrapolate the bestfit GDP to smaller overlap peak values (8 and smaller). Figure 2A shows that the truncated GDP function fits well to the righttail of the empirical frequency distribution of TF specific binding in the interval of peak values larger than 11. This part of the empirical distribution characterises the more reliable fraction of TFDNA binding loci corresponding to highavidity level BSs and represented in ChIPseq experiment by number of overlapping DNA fragments (peak values) up to the maximal observed number of overlaps.
The GDP function, after parameterization in this interval, can be used to predict the values of the function for smaller number of peak values: 8,7,6,...,1. At the threshold 11 Sp= 97%, and the number of GDPestimated specific BSs equals 10213 versus 10343 BSs observed in Nanog ChIPSeq assay (Table 1). The number of observed unique ChIPseq DNA fragments, M_{2, }representing these BSs (N_{2}), equals 299830; GDP estimates 297493 ChIPseq DNA fragments. Interestingly, the last number represents only 3.5% of the total number of the unique DNA fragments of the ChIPseq Nanog library. Similar results we obtained for other TF libraries (Table 1). These results suggest that the major fraction of ChIPseq DNA fragments maps to moderate and low avidity loci specific TFBSs and to background nonspecific DNA loci.
Only part of original ChIPseq sequence datasets is available [15]. The analyzed BEs were limited in the dataset to the loci conforming to the specificity threshold value defined by the random model of uniform distribution of background signals reported in [15]. Using these data sets, the smallest threshold of the most reliable and the most specific BEs (providing the best goodnessof fit statistical criteria, according to [20]), was used to calculate the GDPdefined specificity cutoff value. In these cases, the right tail of the available truncated empirical distribution was used to estimate the parameters of the truncated GDP function.
We calculated the specificity threshold values (8) starting from ChIPqPCRdefined specificity threshold followed by minimizing the distance between the truncated GDP probability function and the observed frequency distribution. Figure 2 and Additional Files 1, 2 show that the truncated GDP function fits well the righttail GDP function corresponding to the fraction of reliable and highly specific BEs. Moreover, due to the high accuracy of parameterization of the mixture GDP function and the exponential function, our curvefitting of ChIPseq data allows us to predict the GDP function for the noiseenriched BEs, located on the histogram in the left part in the empirical avidity distribution function. Figure 2C and Additional File 1 show the truncated frequency distribution of Esrrb TFDNA binding (overlap peak cutoff value 12) and the bestfit GDP function extrapolated to the noiseenriched part of the distribution are presented. The best fit GDP function with parameters k = 2.40 ± 0.0778, b = 10.42 ± 0.6828 allows us to extrapolate the bestfit curve and to predict the number of Esrrb TFBSs the in the noiseenriched binding sites.
Figure 2D contains the results of curvefitting analysis for the cMyc ChIPseq library. It shows that the bestfit GDP function, by extrapolation, could provide an estimation of specific BEs in the highly noisy region of the distribution (the left part of the distribution. Thus, based on these findings, the empirical frequency distribution of TF BEs could be separated into the rightside and leftside regions, relatively to the critical cutoff value, t, discriminating the reliable and stronglyspecific BSs from the less reliable noiserich and low/moderate avidity BSs.
The analysis of fitting the avidity curve with GDP could improve the specificity threshold obtained from ChIPqPCR
Figure 2D shows that at the given level of specificity (sp > 97%) the bestfit GDP function can predict a similar or larger binding specificity threshold, t, than the one obtained from ChIPqPCR data. Suboptimal design of the ChIPqPCR experiment possibly (Figure 3) supports this suggestion. Table 1 and Additional file 2 show that in some Chipseq libraries the frequency values at qPCRdefined specificity thresholds and the values around them systematically deviate from the GDP extrapolation curves. For six TFs (Oct4, sox2, Klf4, cMyc, nMyc, STAT3) the specificity predicted by our model at the qPCRdefined critical threshold values was much smaller (cutting off specificity at 78%, 85%, 88%, 77%, 78%, 78%, respectively) than it was found in qPCRChIP experiments (Table 1). In these cases our model allows us to select more reliable critical threshold values (at a larger overlap peak height) following a single confidence criterion (e.g. specificity cutoff at 94%). For the other four TF ChIPseq libraries,(Essrb, E2F1, Tcfcp211, ZFX) out of 11 studied TFs specificity thresholds were defined by the both methods at a similar specificity cutoff (94%) (Table 1). In particular, for cMyc BEs, at specificity level >0.95, reported in Supplementary File of [15], for qPCRChIP our model predicts t = 12 instead t = 9 (determined by qPCRChIP experiment with Sp = 100% based on 47 qPCR experiments). We could suggest that in these cases "optimistic" specificity estimates could be obtained due to a suboptimal design of qPCRChIP experiments (see discussion of Figure 3) and/or unreliable detection of the peak values for a given range of overlap peak signal enriched by nonspecific BSs.
Estimation of the number of ChIPseq DNA fragments in predicted specific low and moderate avidity binding loci
Parameter a of our mixture model (1) was estimated as the fraction of specific DNA fragments in the specific TFDNA binding loci in the ChIPseq experiment. The value of this parameter we calculated by extrapolation of the bestfit GDP function or the best fit KW function to low and moderate values of peak heights on the empirical histogram of binding avidity. (7). (see Table 3). The estimated values of parameter a show that estimated specific DNA fragments is much smaller than a fraction of non specific background DNA fragments. This parameter varies significantly across the libraries. (table 2S of Additional file 4). We suggest that parameter α can be considered as an important parameter characterising an enrichment of the library with sDNA fragments bound to the specific lociBEs. These parameters can serve as targets for further optimization of ChIPbased TFDNA biding assays.
Additional file 4. The numbers of TFBSspecific DNA fragments according to different specificity thresholds. Observed and bestfit GDP function predicted numbers of the DNA fragments are compared.
Format: XLS Size: 25KB Download file
This file can be viewed with: Microsoft Excel Viewer
Table 3. Best fit KW function estimations: specific components of probabilistic TFDNA binding model
FittingExtrapolation method for KW function and estimation of parameters p_{0 }and N_{tot}
In this section we will answer the question: how many physically specific BSs of a given TF exist in the mammalian genome? We used the theoretical result obtained in the previous section to estimate the parameters a, b and θ of KW function and thus, to estimate p_{0}. Practically, for all ChIPseq datasets we could fit the KW probability function to the GDP function values after fitting the truncated GDP to a highconfidence segment of the empirical frequency distribution of TFDNA binding. This segment of the empirical distribution is assigned by the number of ChIPseq DNA fragments, called M_{2}, which form N_{2 }clusters. The number of these clusters was counted for the peak height values ranged between specificity threshold level t and maximum value of peak height J, where the specificity Sp is defined by the formula (8).
The detailed description of the algorithm of estimation of the parameters of skewed distribution function functions like GDP and KW was reported in [20]. Using this algorithm, we found that for ChIPseq data the GDP function exhibits a fairly accurate approximation of KW function throughout the entire range of BEs (Additional file 5), We used the properties for fitting KW function based on the extrapolated data estimated from the bestfit truncated GDP function values. The estimated parameters of KW function, we used to estimate the probability of nondetected BEs (p_{0 }at m = 0) by the formula (28) and N_{tot}, as follows:
Additional file 5. Mutual agreement bestfit KW and GDP functions. the both functions provide an accurately estimation of the number of specific ChIPseq DNA fragments in reliablydefined TF binding sites.
Format: XLS Size: 23KB Download file
This file can be viewed with: Microsoft Excel Viewer
and estimate the number nonobserved TFBSs
Using p_{0 }(28)_{, }we could estimate TFDNA binding sensitivity
KolmogorovWaring Function predicts a large number of TFBSs with low and moderate binding avidity
GDP can be considered as a good empirical approximation of the KW function (see Methods). Granted this property, we used fitting and backextrapolation method to get an accurate estimate of three parameters of KW function. Figure 2CD Table 3 shows that all empirical distributions of binding avidity are fitted well by the KW function. Using the estimated parameters of KW function (see Methods), we can predict the fraction of specific BSs which have not been detected in the ChIPseq experiments (p_{o}) (28). For example, the parameters of the probabilistic model for Nanog TF data were estimated as θ = 0.997; α = 5.870; β = 7.465, and thus, from (28), we have p_{0 }= 0.22. For Oct4 data: θ = 0.998; α = 5.681; β = 8.32, thus by (28) we have p_{o }= 0.32. Interestingly, for all TF binding data the parameter of the KW function equals to or slightly smaller than 1. This result suggests that for TFDNA bindingdissociation processes the rate of preferential dissociation equals to or little bit larger than the rate of preferential binding.
Using our parameterization algorithm we can estimate the total number of specific BSs in the mouse genome for a given TF. This estimate equals ~6.67 10^{4 }BSs for Nanog and ~2.82 10^{4 }BSs for Oct4. Table 3 shows the estimates for other TFs.
Reliable data points in the righttail region of the empirical frequency distribution (starting from our bestfit GDPdefined cutoff peak value; region N_{2 }on Figure 2B, C, D) and the bestfit predicted GDP data points of the noiserich (left size) region of the distribution (region N_{1}, Figure 2B, Additional file 2) were combined together. KW function fitting was used to estimate the fraction of nondetected BEs, p_{0}, and the total number of TFBSs, N_{tot }(N_{tot}= N_{0}+ N_{1 }+ N_{2}). For example, for the Nanog dataset, at threshold 11, 1.02*10^{4 }reliable BSs were predicted by the GDP and KW. Additionally, KW function predicts in total 51114 Nanog BSs in the genome of E14 cells, where 10223 (20% of N_{tot}) were nonobserved BSs (N_{0}) in the library; 30678 (60% of N_{tot}) specific BSs where predicted in the noiserich BE set (N_{1}) and 10213 (20% of N_{tot}) were predicted in the reliable specific BE set (N_{2}), respectively (Figure 4). The estimates of the parameters of GDP and KW functions and N_{tot }for cMyc, Esrrb and other TF data are given in Table 3 and, partially, in Figure 4.
Figure 4. Three segments in the range of TFDNA BEs count for 11 TFs of mouse E14 embryonic cells.
Figure 4 and Table 3 shows that the fraction of nondetected BSs (p_{0}) varies among different TF ChIPseq libraries from 6% (E2F1) to 54% (ZfX) of the total number of BSs predicted in the genome of mouse ESC E14. The total number of predicted BS ranges from 9874 (cMyc) and 10320 (STAT3) to 178330 (Klf4) and 513739 (ZfX). The total number of cMyc TFBS predicted in the genome mouse ESC is similar to the predicted number of cMyc TFBS predicted in human Blymphocytes [12]. Note, later estimate derived based on the data analysis and modelling ChIPPET data. Figure 4 shows also that reliable and specific BEs form the smallest fraction of the BSs with one exception (E2F1, 52%; Table 3), while the number of low and moderate avidity TFBSs for other ten TFs contain the most abundant subsets of N_{tot}. The nonobserved fraction of TF BS is estimated as the largest for Sox2, ZfX and nMyc. Thus, these results strongly suggest that the sensitivity of ChIPSeq technique is still low and has to be improved essentially.
Significant number of putative low and moderate avidity TF binding loci are admixed to the noiserich binding loci and might be functionally important
According to the prediction of the GDP and KW models, the number of low and moderate avidity TFBSs should monotonously increase when the number of unique overlapping ChIPseq DNA fragments in a binding locus becomes smaller. To validate this prediction, we used motifdiscovery program nminfer (see Methods section) to identify cMyc binding motifs/Eboxes in the ChIPseq binding loci. As a training set we used the loci with cMycDNA binding specificity cutoff values defined by GDP model at ≥12 sequences per locus representing high avidity loci. We found Position Weight Matrix (PWMs) which contains 'canonical' Ebox CACGTG 'noncanonical' Eboxes including CACGCG, CGCGAG and CACATG [18] and a novel noncanonical Ebox CGCGAG which is found in our study (Figure 5A). All four noncanonical motifs were scanned for exact matches in the mouse genome (mm8) using SeqMap software [21]. The total numbers of matches of CACGTG, CACGCG, CGCGAG and CACATG in the mouse genome were 262,133, 83,490, 41,394 and 2,451,550, respectively. Ebox CACATG sequence is highly frequently occurred in nongenic lowcomplexity regions of the mouse genome associated with repeats elements or promiscuous genome regions. To minimize falsepositive or bias in the results, we excluded the Ebox CACATG from our validation and prediction analyses. Then we studied the localization of the Eboxes CACGTG, CACGCG, CGCGAG within ChIPseq binding loci. To do that we construct the frequency distribution of the Ebox sequences around the central nucleotide of ChIPseqdefined cMyc biding loci (Figure 5B). To identify the region of the Ebox localization within binding loci, we narrowed the scan region with ± 150 bp (300 bp) and ± 250 bp (500 bp) around the center of cMyc binding locus (Figure 5B).
Figure 5. Validation of ChIPseq defined cMyc binding loci based on motif finding analysis. A: PWM of cMyc TFBSs defined with NestedMICA program trained with 12 peak height or higher value defined in ChIPseq experiment. B: Distribution of Ebox sequences in ± 1 kb from the centre of ChIPseq defined binding loci. C: Frequency distribution of the number of ChIPseq overlapped DNA fragments (peak height). ◊: All ChIPseq cMyc bound loci for observed peak heights. o: Eboxes positive loci found in vicinity ± 250 bp. ∇: Eboxes positive loci found in vicinity ± 150 bp. D: Venn diagram of cooccurrence of Eboxes in ± 150 bp of cMyc binding loci (left side). Pair of Kappa correlation coefficient of cooccurrence of Eboxes in cMyc binding loci (right side).
In total, 6437 ChIPseq cMyc binding loci with the peak values 7 and higher were found. We found that the number of ChIPseq loci containing the Eboxes in ± 150 bp region is 3527 (Additional files 6 and 7) and in ± 250 bp region is 3948 (Additional files 6 and 7), respectively. These results suggest that cMyc binding loci are strongly enriched with Ebox sequences: we found that 55% (3527/6437) loci of the ± 150 bp region (Additional file 7) and 61% (3948/6437) loci of ± 250 bp region (additional file 7) are Eboxpositive. Each of these regions exhibits at least one copy of the three specific cMyc Eboxes or (CACGTG, CACGCG and CGCGAG).
Additional file 6. Venn diagrams of number of Eboxes colocalization in ChIPseq defined binding loci. A: Venn diagram of number of Eboxes positive loci found in vicinity ± 150 bp of the centre of ChIPseq defined binding loci. B: Venn diagram of number of Eboxes positive loci found in vicinity ± 250 bp of the centre of ChIPseq defined binding loci.
Format: PDF Size: 111KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 7. Number of cMyc binding loci containing Eboxes in different peak height. A: Number of cMyc binding loci (± 150 bp from cMyc loci center) containing Eboxes. B: Number of cMyc binding loci (± 250 bp from cMyc loci center) containing Eboxes.
Format: XLS Size: 31KB Download file
This file can be viewed with: Microsoft Excel Viewer
Figure 5C shows the frequency distribution of the peak height representing cMycDNA binding avidity starting with peak value 7. On the Yaxis, we present the numbers of ChIPseq cMyc binding loci at the given peak height value, the number of ChiPseq binding loci containing E box in the ± 150 bp region of centre of locus, and the number of Chipseq binding loci containing Eboxes in the ± 150 bp region of centre of locus.
Each frequency distribution on Figure 5C shows a similar trend to increase number of binding loci when the peak height decreases. For example in ± 150 bp regions (Additional file 7), the number of the binding loci containing Eboxes with overlap peak values 7, 8, 9, 10, 11 and 12 were 721, 539, 378, 266, 209, and 159, respectively. The low and moderate avidity BSs with peak values from 7 to 11 include the major fraction (~60% (2113/3527)) of the binding loci containing Eboxes found in the ChIPseq library. We also observed that the number of binding loci containing only one of three Ebox sequences increases when the binding avidity decreases (not presented). These trends are in agreement with predictions of our model of TFBS binding (Figure 2, Additional Files 1, 2).
It has been shown that relatively high avidity TFBSs could be preferentially located nearby transcription start site (TSS) of the target genes and overlap with CpG Island [15]. It is also true for a large number of low and moderate avidity cMyc BSs in the genome of human Bcells [12]. We observed that for mouse EC cMyc ChIPseq data [15], Eboxpositive Chipseq binding loci are also often localized in the putative promoter regions of target genes. We found that 50.9% (3277/6437) cMyc binding loci with peak height 7 and larger are localized within ± 1 kb TSS region of 3966 RefSeq genes (mm8) (Additional file 8). We count also the number of cMyc binding loci containing Eboxes and being around ± 1 kb of TSS. 68% (2223/3277) of cMyc binding loci around ± 1 kb of TSS contain at least one Ebox in ± 150 bp around the centre of cMyc binding loci while 32% (1054/3277) of cMyc binding loci have no Eboxes (Table 1 in Additional file 8). When we extend the region from ± 150 bp to be ± 250 bp around the centre of cMyc binding loci, percent of cMyc binding loci around ± 1 kb of TSS contain at least one Ebox increase to 76% (2493/3277) and percent of cMyc binding loci without Eboxes leave 24%(784/3277). These results easily demonstrate strong association of the Eboxpositive binding loci with gene target promoter regions. Perhaps, the most of cMyc target genes in mouse EC cell genome should be considered as direct gene targets, because alternative Eboxes (for instance, CACATG) also found in the cMyc binding loci with high frequency.
Additional file 8. Validation of ChIPseq defined cMyc binding loci based on localization of cMyc binding loci, Eboxes and putative promoter of genes in mouse genome. A: Number of cMyc BSs containing Eboxes and not containing Eboxes around TSS ± 1 kb and Number of genes which have cMyc BSs containing Eboxes and not containing Eboxes. B: Number of BSs which are around TSS ± 1 kb and contain Ebox. The number was separated in two groups; 1: relative low avidity (peak height 78) and 2: moderate and highavidity (peak height 9+)
Format: XLS Size: 31KB Download file
This file can be viewed with: Microsoft Excel Viewer
Interestingly, 2830% of cMyc binding loci containing at least one Ebox sequence exhibit low or moderate avidity (peak height 78) cMyc binding (Table B Additional file 8). These results support the prediction of our probabilistic model (KW) that, in low or moderate avidity binding could be considered as true cMyc binding sites. However information about conserved motifs and other regulatory regions are required for indentifying putative true binding sites in low or moderate avidity of binding. By our analysis, Eboxes positive binding loci often overlapped with CpG island region(s) which are cooccurred with cMyc BSs [12]. The loci with peak height >8 are overlapped with CpG Island in 82% (1305/1598) cases (+/150 region in Table B in Additional file 8); Eboxpositive relatively low and moderate binding avidity (78 peak heights) loci are also overlapped with CpG Islands of the putative gene targets in 75% (469/625) cases. The genes having this moderate avidity BS might be considered as strong candidates for further bioinformatics analysis and experimental validation.
Additional file 9. Number of Eboxes found within cMyc binding loci in different peak height. A: Number of Eboxes found within cMyc binding loci (± 150 bp of center). B: Number of motifs found in cMyc binding loci (± 250 bp of center).
Format: XLS Size: 31KB Download file
This file can be viewed with: Microsoft Excel Viewer
cMyc Binding loci could be represented by multiple copies of Ebox sequences
We found that binding loci could be represented by multiple copies of Ebox sequences: 3527 binding loci of ± 150 bp regions are represented by 5546 Ebox sequences (mean. 1.57 (5546/3527) Eboxes per locus) (Additional file 9), and 3948 binding loci of ± 250 bp regions are represented by 7182 Ebox sequences (mean 1.82 (7182/3948) Eboxes per locus) (Additional file 9).
Figure 5D shows that the Eboxes often co localized in the ChIPseq cMyc binding loci. For instance, Venn diagram on left panel of Figure 5D demonstrates that in ± 150 bp around a centre of binding locus 19% of Ebox CACGTG sequence (363/1895) and 25% (363/1476) of E box CACGCG sequence are co localized in the same binding locus. Kappa correlation coefficient, which measures cooccurrence of the events (StatXact 5; Cytel Software Co), equals 0.66 (p < E230). Similar values of the correlation coefficient we found between two other Ebox pairs, which are presented on the right panel of Figure 5D. Similar results we obtained when 250nt vicinity around a centre of cMyc binding locus was analyzed (not presented).
A casestudy of multiple occurrence of cMyc Eboxes in ChIPseq –defined promoter regions of embryonic SCrelated genes
It was shown that the TFBS found in ChIPseq and ChIPPET study of cMyc have a potential to stimulate embryonic SCspecific gene expression [12,15]. In this section, we shall consider the examples of association between binding avidity level and potential relation of nearby genes with stemcell specific activity.
Figure 6A presents a genomic map of cMyc Eboxes distribution in the WEE1 homolog 1 (Wee1) gene region. ChIPseq cMyc binding locus (pointed by blue arrow) should be considered as highavidity TFBS due to peak height = 76. This binding locus contains Ebox sequences of two canonical Ebox CACGTG presented on Figure 6A. Wee1 was reported to be an important component of hESC cell cycle regulation pathway [22] and might be coregulated with cMyc [23]. In Figure 6B, a genomic region of Nucleoplasmin 3 (Npm3), a chromatin remodelling protein responsible for the unique chromatin structure and replicative capacity of ES cells [24], is presented. A BS with moderateavidity (height peak = 10), of cMyc is located in the region nearby the first intron. The locus contains five Ebox sequences: one canonical and three noncanonical Eboxes in the first intron and another canonical Ebox in the second exon. In addition, this BS and the Eboxes overlap with a CpG island. Figure 6C shows cMyc low avidity BS within FKBP5 (FK506 binding protein 5) promoter region. Two relatively low avidity cMyc BSs contain cMyc Eboxes. First BS located in the upstream region of the gene has a relatively low avidity (overlap peak value 7) and contains a canonical Ebox CACGTG. The second BS located in the first intron also has relatively low avidity (overlap peak value 7) and contains two noncanonical Eboxes CACGCG and two noncanonical Eboxes CGCGAG. The last BS overlaps with a CpG Island which suggests that this locus might be related to the regulation of the transcription of the gene.
Figure 6. Multiple occurrence of cMyc Eboxes in promoter region around transcription start site (TSS) of cMyc target genes identified in ChIPseq experiment. A: WEE 1 homolog 1(Wee1). The highavidity (height peak = 76) cMyc binding sites (pointed by blue arrow) in strong promoter region of Wee1 gene is supported with two canonical Ebox CACGTG. The binding locus and Eboxes are overlapped with CpG Island which might be bound by cMyc. B: Nucleoplasmin 3 (Npm3). The moderateavidity (height peak = 10) as in the previous case the ChIPseq cMyc binding locus is located in the first intron promoter region. The locus is supported with five Eboxes: one canonical and three noncanonical Eboxes in first intron and another canonical Ebox in second exon. In addition, this binding region and the Eboxes are located in CpG Island. C: FK506 binding protein 5 (Fkbp5). Two relatively low avidity cMyc binding sites identified in ChIPseq experiment confirmed with Eboxes. First ChIPseq loci in upstream gene region has relatively low avidity biding site (height peak = 7) supported with canonical Ebox CACGTG. The second ChIPseq loci located in first intron and it is also relatively low avidity peak (height peak = 7) which is supported with two noncanonical Eboxes CACGCG and two noncanonical Eboxes CGCGAG. The last locus overlaps with CpG Island which suggests that this locus might be functional.
FK506 binding protein 5 (FKBP5) belongs to the family of immuneglobulins named for their ability to bind immunosuppressive drugs, also known as peptidylprolyl cistrans isomerases, and also with chaperones (involved in protein folding) [25]. FK506 also plays an important role in cancer tumors growth and chemoresistance through regulating signal transduction of the NFkappaB pathway [25]. The expression and activity of NFkappaB is comparatively low in undifferentiated embryonic cells ES cells, but increases during differentiation of the ES cells [26]. Due to this finding we could suggest that cMyc binds to the low avidity BSs in the promoter region of FKBP5 and, through a modification of expression level of this gene, can provide a regulation of NFkappaB pathway in mouse ES cell. Interestingly, high, moderate, and relatively low avidity cMyc BSs in the promoter regions of Wee1, Npm3, and Fkbp5 genes have their binding association scores correlating with ChIPseq cMyc binding avidity. Binding association score estimates the genomic distance between a BS and a gene TSS [15]. Additional file 10 contains the data demonstrating that Wee1 could be under promoter regulating activity of Oct4, cMyc, and nMyc, while Npm3 could be under promoter activity of cMyc only. However, although Fkbp5 shows a moderate binding association score with cMyc and STAT3, Fkbp5 expression is strongly associated with Nanog and nMyc promoter activity. These results suggest that all three genes could be under the transcriptional control of cMyc and, additionally, that Wee1 and Fkbp5 could be associated with EScell specific expression.
Additional file 10. TFgene association scores. High, moderate, and relatively low avidity cMyc binding loci with multiple Eboxes in putative promoter region of Wee1, Npm3, and Fkbp5, respectively. Gene enrichment classes: I  gene enriched with binding site for Nanog, Oct4, Sox2, Smad1, and STAT3; II  gene enriched with binding site for cMyc and nMyc. The association score estimates distance between each pair of binding locus and gene based on genomic location of the binding locus that is closest to TSS.
Format: XLS Size: 19KB Download file
This file can be viewed with: Microsoft Excel Viewer
Discussion
In this work, we studied statistical characteristics of proteinDNA binding events for the eleven stem cell related transcription factors bound in the genome of mouse embryonic stem cells E14 and detected by ChIPseq assay [15]. Several methods for ChIPseq data analysis have been recently developed [11,13,27]. However, an appropriate mathematical model of TFDNA binding in ChIPseq binding assay and statistical evaluation of the sensitivity of the methods has been not developed. Several approaches to quantitative identification of individual TFbound loci have been recently developed(peak finder algorithms [7,9,1113,27]), however overall binding events for specific DNA loci including low and moderate avidity TFBS at the genome scale have been out of systematic consideration.
Our previous analysis of different ChIPbase TFDNA binding datasets [9] suggested that the mixture probability distribution model (1) could reflect a common property of TFDNA binding events in different cells in different experimental conditions. This model allowed us to estimate the specificity of ChIPbased TFDNA binding events for several TFs. In the present study, we develop a mathematical model of TFDNA bindingdissociation events in both TFspecific and TFin specific genomic loci and use this model to estimate a number of essential parameters of statistical distributions observed in ChIPseq assays. We include in our consideration the binding events of TFs in the entire range of their avidity to their binding loci from very high to very low. We focused on several practically important parameters closely related to analysis of the empirical TFDNA binding distributions: t, Se, Sp, p_{0}, and N_{tot }(Figure 3, Table 3). Interestingly, for all TF binding data, the parameter of the KW function equals to or slightly smaller than 1. It suggests that the Waring probability function could be used as good approximation of KW function for analysis of empirical frequency distributions of ChIPseq binding. Another interesting finding: for TFDNA bindingdissociation process in the mouse genome the rate of preferential dissociation equals to or little bit larger than the rate of preferential binding.
We found that the parameters of the mixture distribution are sensitive to 1) the sample size (number of nonredundant sequence tag reads) M, 2) the proportion of nonspecific sequence reads in the sample, 3) the sampling model and 4) the analytical models and the computational algorithm used for the parameterization of the distribution of TFDNA BEs. We have demonstrated that skewed shape and sample sizedependence is a common property of a specific TFDNA binding distributions [7,9,11,12]. Similar results we obtain here for 11 ChIPseq samples derive from mouse embryonic SCs. Strongly positive values of parameter β in the GDP model describing TFDNA binding in specific loci were found in all TF libraries (Table 2, Additional file 2 (Figure 2S)). Parameter J of the truncated GDP function reflects the maximal observed number of BEs of the GDP model. This parameter positively correlates with the sample size M [9,28]. These results agree with previous findings in ChIPPET TFDNA binding assay [9,12,13].
The values of observed and computationally predicted parameters of the statistical distribution k are found very close to each other for all ChIPseq TF libraries (Table 2, Additional File 1, 2, 3, 4, 5). Using 11 available ChIPseq TFDNA binding datasets, we revisited and improved the estimates of the levels of sensitivity and specificity of the TFDNA BEs.
Our mixture probabilistic model allows us to estimate the specificity cutoff value for ChIPseq library and also estimates the fraction of specific TFDNA binding loci for TFs in a ChIPseq data. Our probabilistic model estimates also the specificity threshold, which value often is close or more stronger than estimates by ChIPqPCR assay. Table 2 shows that for all 11 library our model (1) models provides >94% specificity. Thus, we conclude that the basic concept of mixture skewed scaledependent distribution, originally developed for ChIPPET data analysis [9,12], can be applicable to ChIPseq data.
We conclude that a model of random occurrence of DNA fragment clusters in the genome is not appropriate for quantitative determination of critical threshold of binding specificity in Chipbased genomewide binding analyses, including ChIpseq. This conclusion is in agreement with recent computational simulations of the frequency distributions of TFDNA BEs in ChipSeq data [13], where local background noise for Stat1 TF ChIPseq data [5] was modelled.
We developed a simulation model of unbiased identification of specificity of 'problematic' ChIPseq loci. The method could be used to optimize the experimental design of ChIPqPCR experiments.
Experimental data, as it is demonstrated in this work, has noise enriched BEs accumulated mainly below the specificity threshold values (Figure 2, Additional File 2).
Our probabilistic model shows that at the conventional specificity threshold % (> 95%), the fraction of highavidity specific sequences and TF binding loci containing these sequences are surprisingly low in all studied libraries. These results suggest that the major fraction of true binding sites could not be detected by the ChIPseq method without additional experimental validation and rigorous bioinformatics and extensive statistical analysis of data.
According to our model prediction, the number of low and moderate avidity TFBSs should monotonously increase when the number of unique overlapping ChIPseq DNA fragments in a binding locus becomes smaller. To validate these predictions, we used motifdiscovery program nminfer to identify a position weighted matrix (PWM) of cMyc motifs in the ChIPseq binding loci. Our bioinformatics and statistical analyses revealed that the moderate avidity BSs with peak values 7 to 11 include the major fraction (~60% (2113/3527)) of the Eboxcontaining TFBSs found in the ChIPseq library. We also observed that the number of BSs, containing each of the three major cMyc Ebox sequences and their combinations, monotonously increases when the binding avidity decreases (Figure 5C). All these trends are in agreement with predictions of our model of TFBS binding (Figure 2, Additional Files 1, 2). Thus, in combination with motiffinding techniques (and/or experimental validation assay ChIPqPCR) our modelling approach allowed us to identify the loci of many thousands of novel BSs with characterized with low and moderate avidity of TF. Moreover, the number of undetected low and moderate avidity specific TFBSs was estimated, which addresses common problem of sensitivity of a given ChIPseq assay for a given TF in cells under given experimental conditions. We show that although ChIPseq is a powerful technique still it produces essentially incomplete information about the low and moderate avidity TF DNA binding events in the complex (e.g. mammalian) genomes.
Our mathematical modelling of the mixture strong and weak TFDNA binding and sequence analysis of genomewide binding data suggests that integration of these approaches could help to reveal many new target genes for cMyc and for other studied TFs. For instance, bestfit KW function predicts in total 51114 Nanog BSs in the genome of E14 cells: 20% of these 51114 Nanog BSs were nonobserved in the ChIPseq library and 60% of the BSs were predicted by the model in the noiserich BE set. These modelbased and datadriven predictions of Nanog BSs could be validated in case study experiments. A functional significance of such low and moderate avidity BSs for putative target genes might be investigated in a near future.
The most important thing that our results suggest that low and moderate affinity BSs could have biologically meaningful functional roles. However, biological role of the enormous number of the moderate and low avidity BSs for TF is unknown. We speculate that these bindings could be used in the nucleus to storage large number diverse TFs and their cofactors at quasistationary and thermodynamicallydefined states at vicinity of doublestranded DNA. Such weak, unstable and multiple proteinDNA bindings might be use by a cell for recruitment and redistribution of specific TF molecules along the double strand DNA depending from external and internal regulatory signals. We also speculate that the strength of TFDNA bindingdissociation could be significantly modulated by cooperative interactions among TFs and DNAbinding signals.
For all studied TF library a relatively large fraction of low to moderate binding loci was not detected at all (Table 3). So the estimation of p_{0}, by KW model, is an informative measure of the incompleteness of experimental data. ChiPSeq derived frequency of DNATF BEs can be fitted well with truncated KW probability function and thus, allows us to estimate all specific TFBSs (Ns_{1}+Ns_{2}) which could be found in the library and also to estimate the total number of TFBSs (N_{tot}) entire genome.
However, father analysis of sensitivity and robustness of estimates and extrapolations of the probabilistic model of TFDNA binding model and analysis more complete and large ChIPseq datasets might be important for robust and accurate parameterization of our models and its applications.
Conclusion
We proposed a probabilistic model of TFDNA binding process at the genome level and based on the model we developed a computational method which allows us to 'denoise' ChIPseq datasets and to estimate the specificity and the sensitivity of ChIPseq assays.
Goodnessof fit analysis of GDP and KW functions suggests that the sensitivity problem has not yet been technically resolved by the ChIPbased methods, including ChIPseq. TFBS motif finding analysis supports our results. Due to the proposed improvements in the sensitivity and the specificity of ChIPseq assay, functional roles of an extremely large number of low/moderate avidity TF binding loci in the mammalian genomes can now be investigated. After these studies the models of transcriptional regulatory network in embryonic cells and other cell types should be carefully revised.
The numbers of the low/moderate and high avidity specific TF BSs are estimated here for all the studied data sets. We suggest that many low and moderate avidity BSs have biologically meaningful functional roles. Since in the previous studies only high avidity TF BSs could be reliably detected by ChIPseq assays, identification of other binding sites and elucidation their functional role in genome is a great imperative goal for biotechnology, computational biology and functional genomics.
It is likely that our approach could also be applied to the analysis of highresolution ChIPbased generated profiles of chromatin chemical modifications [29,30] in mammalian genomes. Our work provides a theoretical framework for a comprehensive computational prediction and a robust experimental identification of TFBSs (and other ChIPseq data) when low and moderate avidity sequences are overrepresented in ChIPderived sequence samples.
Methods
ChIPseq data sets
We used ChIPseq datasets (ChIPseq libraries) generated by Solexa sequencing for eleven TFs (Nanog, Oct4, sox2, KLf4, STAT3, E2F1, Tcfcp211, ZFX, nMyc, cMyc and Essrb). These TFs are considered as essential TFs for the maintenance of the pluripotency in mammalian stem cells and were studied using murine E14 embryonic stem cells cultured under feederfree conditions as described in Chen et al [15]. The mapped ChIPseq datasets were downloaded from T2G GIS DB (http://t2g.bii.astar.edu.sg webcite; see also NIH GEO ID:GSE 11431).The extended ChIPseq DNA fragments were clustered and the number of overlapping fragments were summed at each locus and used to construct empirical frequency distribution of TFDNA binding. Some statistical additional characteristics of the libraries are presented in Table 1 and Table 2 and Additional files 4, 5.
Motif finding and target gene counting
To validate predictions of our TFDNA binding model and to extend the sensitivity of ChIPseq derived sequences clustered the loci with low and moderate binding avidity, we used genome coordinates of the available extended ChIPseq fragments and provided computational identification of TFBSs by using motifdiscovery program nminfer from NestedMICA http://www.sanger.ac.uk/Software/analysis/nmica/ webcite[31]. cMyc TF ChIPseq data [15] were used to illustrate our approach NestedMICA program was used to identify position weighted matrixes (PWMs) Eboxes and consensus sequence motifs of cMyc TF into genomemapped ChIPseq DNA fragments. The sequences of strongly specific BEs with height 12 and higher (peak12+) were used as a training set for motif discovery. The training set sequences were downloaded from mouse genome (UCSC mm8 after repeat masked by capital Ns). The test set is peak regions (± 200 bp around centre of the cluster peaks) with height 7 and higher (peak7+), which have been not used in the training set.
The PWM found from training set were computationally scanned on the test set by nmscan program at score threshold3.0. The motifs found from in test set were used to scan and count number of each motif found in the mouse genome (mm8).
Empirical frequency distribution of the avidity of TFDNA binding
To quantify the avidity of TFDNA binding in a given locus, Chen et al. [15] extended the distinct fragments by 200 bps and clustered these fragments that were overlapping by at least 4 bps. In ChIPseq experiment, "a binding signal" for a given binding locus has been represented the number of the DNA fragments formed by these 5end "extended and overlapped" DNA fragments [15]. For Chipseq experiment, this number could the considered as the TFDNA binding event (BE) representing TFDNA "binding avidity" averaged across genome BSs of hundred millions cells.
We define a list of uniquely mapped ChIPseq DNA sequences observed in a given ChIPseq experiment as the Chipseq library. To quantify the data of an individual ChIPseq library, we defined the size of the library, M, as the total number of distinct ChIPseq reads uniquely mapped onto a reference genome. Let m denote the number of the BEs counted by a peakfinding program (e.g. T2G) in a DNA fragment cluster overlap of ChIPseq library. The single DNA fragment mapped on the genome is also included. m = 1, 2, 3,..., m_{max}, where m_{max} = J denotes the maximum value of m. Let n(m, M) denote the average number of genome loci in which the BEs found in a given ChIPseq data exactly m times in the library of a size M. Due to sample size, experimental errors and biological variation across many cells and environmental conditions, observed n is the function averaged across the cells and conditions and this function increases when M becomes larger. Thus, n only approximates a true number of TFs bound to genomic loci with BE value m. However, n=n(m, M) after an appropriate normalization could be used for statistical analysis of relative binding avidity of ChIPseq binding loci.
Let denote the total number of distinct loci counted in the ChIPseq library. Then and we could call M also as the "DNA sequence mass". The empirical frequency distribution of the number of DNA fragments in the locus within the ChIPseq dataset () might be considered as the empirical probability function of TFDNA binding avidity. Such histogram is an essential starting point for further statistical analysis of data and planning of validations studies [7,9,13].
An empirical probabilistic model
We assume that the probability distribution function of TFDNA binding avidity in ChIPseq experiments could be modeled as a mixture of two probability distribution functions
where P_{s }is the probability distribution function of specific binding P_{ns }is the probability function of nonspecific binding. The later function might be more complex and represent the technical background noise and natural biological noise determinate by large number of nonspecific "low avidity" TFDNA binding events. Parameter s is the relative frequency of specific bindings in (1). 0<s<1. m = 0, 1, 2,... The model (1) could be approximated by the following empirical probability distribution function:
where is the empirical probability distribution function of specific binding, is the empirical probability of nonspecific binding. The parameter α could be estimated as the fraction of the extended DNA fragments uniquely mapped on the genome and belonged to the true loci having specific TFBS. 0 <α < 1. Figure 2 shows the examples of nonnormalized empirical frequency distribution of TF BEs (e.g. Nanog TFDNA data) which after normalization to 1 can modeled by (2).
Due to sequence read sampling, the frequency distribution (1) is considered as scaledependent and skewed functions [20,32], i.e. when the sample size, M, increasing, the shapes of the noise and specific frequency distribution functions are changed correspondently with library size. We model the nonspecific avidity distribution function P_{ns }in (1) by an exponential function distribution with continuous exponent parameter as a decay function of sample size M [9,11].
The effect of a limited sample size, complex background noise, critical cutoff values, and the specificity and sensitivity of ChIPseq assays
If one has prior knowledge of the sets of all TFBSs and of all sequences not bound by a given TF in the genome, then conventional calculation of the specificity and sensitivity of genomewide TF BEs is straightforward. However, in the absence of such knowledge, one needs to rely on statistical analysis of datadriven physical models and computational estimates using available highlynoisy and incomplete DNA fragment samples [9,12].
A significant amount of nonspecific genomic DNA fragments (background noise) is always present in the inmmunoprecipitated DNA material of any ChIPderived dataset [9,12]. Some nonspecific DNA might be easily filtered out after computer mapping of the DNA fragments on the genome. In larger library, the number TF specific loci could be increased. However, background (or noise) genomic DNA fragments could nonuniformly located in the genome and thus false clustered should be occurred in any region of the empirical frequency distribution [7,9,12,13,15].
The following basic statistical tasks are becoming imperative: (i) specificity of the library, i.e. to identify statistically significant TFBSs and count their number at the given confidence level, t, of BEs; (ii) power of the library, i.e., to identify "true" specific BEs which are present in the noiseenriched subset of relatively low read counts (0<m<t) in the library and (iii) sensitivity of the detection, i.e. the number of "lost" BSs which are available for TF binding in the given cells at the given condition, but were not detected due to a limit of the TF library size and the technical implementation.
We analyze these problems via probabilistic modelling, goodnessoffit analysis and computational modelling of nonspecific and specific BEs loci for a given TF in the ChIPseq library. This analysis is used to quantify avidity of binding events of eleven TFs studied in the genome of mouse stem cells.
For a given TF library, let N denote the sum of two subsets of BEs:
where N_{1 }is the number of observed 'noiserich' TF DNA binding loci, having relatively low/moderate potential of TF binding avidity and N_{2 }is the number of observed 'specific rich' TFDNA binding DNA loci, having a relatively high potential of TF binding avidity. The parameter t is the TFDNA binding specificity threshold value.
To quantify specific and nonspecific BEs, we could separate the uniquelymapped ChIPseq DNA fragments on two subsets by the following:
where M_{1}is the number of ChIPseq DNA sequences which observed in the subset of 'noiserich' and nonreliable TFbounding DNA loci, M_{2 }is the number of ChIPseq DNA fragments which are observed in subset of reliable specific and TFbounding DNA loci.
For a given TF library, let denote the total number of specific TFDNA binding loci in the ChIPseq library. A set of specific TFDNA binding loci could be split in two subsets by the following:
where, is the estimation of the number of specific BEs at value m. is the estimate of the number of TFDNA binding loci at m≥t. is the estimate of the number of specific TFDNA binding loci at m<t.
To quantify avidity specific BEs and to estimate parameter α in (2), we could estimate the number of ChIPseq DNA fragments in the high confidence loci, , and separate this number on two numbers by the following:
Using (6), the weight parameter α in (2) can be estimated by the following:
Parameter t is an unknown threshold value of a random variable X domain separating the domain on two subdomains a binding specificity level defined by the following:
where and were defined in (2). Let denote an estimate of the total number of BEs in the entire genome in a given cell population at a given experimental conditions (e.g. genome of mouse embryonic stem cell line E14 at given treatment). We defined as the following
where the number of nondetected TFBSs, N_{0 }. Then, the sensitivity of the ChIPseq assay could be estimated by the following:
where is estimated in a domain of mvalue of ChIPseq library (m = 1, 2, 3,..., J).
Truncated Generalized Discrete Pareto (TGDP) function and estimation of , and
To quantify the empirical frequency distribution of the number of ChIPseq TFDNA BEs and to estimate , and , we model the probability function of specific binding in (1) using the truncated GDP (TGDP) function, which could be considered as a good limiting approximation of many random evolution models [20]. The GDP probability function is described as the following
where the random variable X is the number of BEs (m = 1, 2,... J), f (m; k, β, J) is the probability that a randomly chosen specific loci has exactly m BEs. The f involves two parameters, k, and β, where k > 0, and β > 1; the normalization factor ζ is the generalized (due to β > 1) and truncated (due to J < ∞) Riemann Zeta function value [33]:
The continue parameter k characterizes the skewness of the probability function; the continue parameter β characterizes the deviation of the GDP distribution from a simple power law. J denotes the maximum observed number of BEs and used as an empirical parameter of the model (11)(12). This parameter in scaledependent cases is positively correlated with the sample size M [20]. Since in loglog plot the truncated function (11)(12) exhibits systematic change of its shape when the sample size M is changed [20], the model could be cocalled the empirical scaledependent TGDP model [34].
When only the tail of the GDP is available for analysis, the doubletruncated GDP function
where
could be used for quantification of empirical distributions. In this case
Note, if the truncated distribution fits well to the left tail of the mixture distribution (e.g at m ≥ t), then ≈ N_{2}, and thus the number of specific BSs in TF ChIPseq data can be estimated by
Fitting and backextrapolation method for TGDP function
A noise background BEs could mask the specific moderate to low avidity TFBSs. It is important to estimate the numbers of specific moderate to low avidity TFBSs masked by noise background BEs in ChIPseq data. However, these subsets of BEs might be not easily separated due to the distributions overlapping and sample size dependence. To estimate tvalue at the given specificity level and the numbers of specific moderate to low avidity TFBSs associated with this tvalue, we used the fitting and backextrapolation method of recovery of the distributions of specific and nonspecific BEs.
Briefly, our algorithm includes several steps: (i) an identification of the functions which after optimization of parameters could provide the bestfit functions approximating the left side and the right side of the empirical distribution, respectively, (ii) an extrapolation of these functions to the function overlapping region, (iii) identification of the specificity threshold t, (iv) an estimation of the weight parameter α in (2), (v) final correction of the estimated parameters using complete model, (vi) restoration of the values and based on the extrapolation method applied to the bestfit distribution functions. To fit the distribution functions we used optimization criteria and methods reported by [20,35]. We used also the nonlinear regression tools of SigmaPlot software (Version 11).
KolmogorovWaring distribution function: an explanatory model of TFDNA bindingdissociation process
In ChIPseq experiments, short DNA sequence tags are randomly chosen and consequently aggregated onto genome clusters in the result of sampling of the tags derived from a large but finite number of a ChIPseq dataset. What kinds of exploratory models could be used to quantify forming of ensemble of TF clusters bound on the genome DNA?
We [7,9,12,36] have shown that the tail of the empirical distribution function of TFDNA BEs could be approximated by a skewed truncated Paretolike function. This function is sample size dependent. Due to a limited number of sequence reads, some true TFBSs, N_{0}, which are physically available for TF binding in the given cells at the given condition cannot be detected in a TF ChIPSeq library, even if the noisy BEs are fully suppressed. In ChIPbased experiments, the number of nondetected TFBSs, N_{0}, might be larger than the number of reliably detected specific TFBSs, N_{2 }[11].
In general, an estimation of N_{0 }within obtained from essentially incomplete samples is a very difficult statistical task [37]. First, this is due to (i) a large number of the real low and moderate avidity TFBSs (subset of real 'hidden' BEs, , indicated by the extrapolation curve on Figure 2AB) which provide only a minor admixed part of BEs in the histogram associated with noiseenriched BEs found in ChIPseq data. Second, a fraction of the rare BEs (subset N_{0}) which is not presented in the observed dataset could be estimated if an appropriate physicalmathematical model is used. Below, we present a stochastic model of TFDNA binding and dissociation together with the a method which allow us to estimate not only , but also N_{0}.
The GDP model (8)(9) can provide a good approximation of the empirical data; however this model does not allow estimating N_{0 }in a given assay. Eq(8) can be a good approximation of the Waring probability function [20,35,38], which could count explicitly a probability of nonobserved events. This function has been considered as an adequate distribution function derived from several explanatory stochastic process models including sampling genera from a heterogeneous biological population [38] and the aggregation process of particles [39]. The Waring probability function can be considered as a special steadystate solution of the stochastic birthdeath processes called KolmogorovWaring process [20] arising in 'omics' data analyses. In particular, this model has been used for the modelling of protein domains in molecular evolution [20,35]and the sampling of SAGE gene expression tag profiles [28].
We assume that KW function could be considered as an exploratory stochastic model of the evolution of specific TFDNA binding and use the model to estimate N_{0}. We assume that an evolution of specific TFDNA interaction in the genome can be considered as stochastic binding and dissociation events while taking into account two binding and two dissociation transition probabilities. For binding, we consider the preferential attachment process (due to the specific binding potential between TF and DNA) and the Poisson process (nonspecific potential). Similar two processes but with different intensities are assumed for detachments transitions (Figure 1D).
Let p_{m}(t) = P(D_{t }= m) denote the probability function associated with the random TFDNA bindingdissociation process {D_{t}, t ≥ 0} (Figure 1D). Then the rate of the probability functions p_{m }(m = 0, 1, 2...) of the number of TFDNA BEs could be described by the forward Kolmogorov differentialdifference equations:
where m = 1, 2,.... The initial probabilities p_{m}(0) ≥ 0 (m = 0, 1, 2,...) follow to the condition . Probably in the most evolving near steadystate, the random binding and dissociation processes of TF are kept near the equilibrium. This equilibrium solution can be written explicitly by stating dp_{m}/d_{t }= 0; m = 0,1,... in (17)(18) as
At m → ∞, a necessary and sufficient condition for the existence of the nontrivial stationary solution (19)(20) is provided by convergence of the series
where . This condition exists when starting from some i = i_{c }the condition η_{i }≤ ν < 1 takes place for all i ≥ i_{c }(i.e. on the right tail of the frequency distribution).
Using (19)(20) we can obtain the nonzero limiting probability function for the random process D_{t→∞}:
If we assume that the limiting probability distribution exists, then all dp_{m}/d_{t }(m = 1, 2,...) would necessarily converge to 0 as t → ∞ and we can obtain:
For our application purposes, we will consider the bindingdissociation process as a random process such that the intensity rates are the linear functions of the event m:
and
(m = 0, 1, 2,...), where the model parameters are defined by the following conditions . Hence, during an interval (t, t + h) where h is infinitively small, we assume that there are four independent processes: the spontaneous TF binding on and dissociation from a given specific TFBS DNA locus, with constant intensities and , respectively, and the TF binding on and dissociation from a specific TFBS DNA locus with the intensities proportional to the number of TFs already attached to the specific TFBS and respectively.
For steadystate distribution we could estimate three parameters of the steadystate random process. Let us denote . Let us also denote factorial power z^{[m] }= z(z + 1)...(z + m  1), where m = 0, 1, 2,...; z^{[0] }= 1. Using (17)(18) and (21)(22), we can obtain the unique limiting probability function for the process (17)(18) with the intensities given by (23) and (24):
m = 0, 1, 2,.... Γ (x) is the Gamma function, and B(x) is the Beta function [38]. The (25)(26) is KW probability function [20]. When the preferential attachment and the preferential dissociation rates are equalled (), then we have the Waring distribution function [38].
The probability function (25)(26) has the probability generating function [35]
where _{2}F_{1}(a, 1; b + 1; θ) is the hypergeometric Gauss series [33]
at α = a, β = 1, γ = b + 1.
In this work we will considered the following practically important conditions:
Then the limiting probability function is
and the probability
where and m = 0, 1, 2,...; p_{0,0 }≡ p_{0}.
At near steadystate of such a bindingdissociation stochastic process, the KW function can be simply calculated via the following simple recursive formula:
where m = 0, 1, 2, ... the other three parameters a, b and θ are unknown parameters. Importantly, the KW probability function allows us to estimate the value p_{0 }which gives the fraction of undetected (unobserved) events in a given ChIPseq experiment.
where _{2}F_{1 }is the hypergeometric Gauss series [33].
Specifically, if b >a > 0 and θ → 1  0, then
Eq(19) can be used for extrapolation of KW model up to m = 0 (unobserved number of BEs). Than by the following recursive formula (18), we can estimate the frequency of BEs at each value m (m = 1, 2, 3....).
If p_{0 }> 0, we can define the zerotruncated limiting probability distribution as
where m = 1, 2, 3...
Using this formula, we can prove the following useful approximation:
If a → 0+; θ → 1; b > 0, then
where B(b+1, m) is the Beta function.
The expressions (28)  (31) can be used for the quantitative analysis of the distribution of ChIPseq derived TFDNA binding events and for the calculation of p_{0 }. Note that (27)(28) provide more accurate estimates when a fraction of reliably detected TFBSs in ChiIPseq assay becomes larger.
DoubleTruncated KW Distribution and fitting of the ChIPSeq TFDNA binding distribution
The estimation of parameters in the general multiparameter families becomes problematic when the number of unknown parameters increases. However, the three, two or one parameter family distributions (27)(28) could be feasibly fitted to the empirical distributions.
In order to apply the probability function (27)(28) to the data, let us assume that the domain of the random variable X is doubly truncated, e.g. the random variable X is restricted to the range m = t, t + 1,..., J(t > 0, J < ∞). The probability function of the resulting truncated distribution can be renormalized by the following
This probability function corresponds to a common situation for ChIPseq data analysis in which the occurrence values m= 0 and m = J + 1, J + 2,...,∞ are not observed. As from the left, the (32) is transformed into practically useful expression
which as J → ∞ can be simplified to yield .
By analogy to (32), it is easy straight forward to derive the doubletruncated KW function for any low threshold value t (t = 1, 2,...) of the number of TFs bound to a given TFBS, and next using the renormalized KW function, to estimate parameters of the doubletruncated KW function using the optimization algorithm reported in [20].
Discrete Pareto distribution function is an asymptotic of the Warring distribution function
If θ = 1, b >a > 0, then is the zerotruncated KW distribution (32) and as m → ∞:
i.e. the probability distribution (16) approximates the discrete Pareto probability distribution [20,33] in the right tail of the specific case of KW distribution.
List of abbreviations used
(GDP): Generalized Discrete Pareto; (KW): KolmogorovWaring; (TF): Transcription Factor; (TFBS): Transcription factor binding site; (ChIP): Chromatin ImmunoPrecipitation; (SACO): serial analysis of chromatin occupancy; (STAGE): sequence tag analysis of genome enrichment; (SD): standard deviations; (MC): MonteCarlo method; (BEs): binding events; (PWM): position weighted matrix; (TSS): transcription start site.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
VK initiated and designed this study; he analyzed the computational results and wrote the manuscript. PJ provided databasing, prefiltering data, motif search and data analysis. OS provided computational analysis of ChIPseq data and wrote an initial draft of the manuscript. VK provided mathematical theory and computational algorithms, biological interpretation, feedback throughout the work, and wrote the manuscript. All authors have read and approved the final manuscript.
Definitions
ChIPseq library: list of ChIPseq DNA fragments uniquely mapped onto reference genome
m: number of ChIPseq DNA fragments shearing a unique locus in the genome; m could represent a relative level of binding avidity of a TFBS in a given genome locus. m = 0, 1, 2, 3,... J.
n(m, M): number of the loci representing by mvalue in the ChIPseq library.
N: number of distinct loci counted by the peakfinding algorithm (T2G) in a given ChIPseq library.
M: total number of DNA fragments uniquely mapped to the genome and counted in N loci (or " total DNA sequence mass").
t: specificity threshold
M_{1}: number of ChIPseq DNA sequences which observed in the subset of low/moderate avidity loci
M_{2}: number of ChIPseq DNA sequences in low/moderate avidity loci
N_{1}: number of observed 'noiselikely' TFbound DNA loci, having relatively low/moderate avidity
N_{2}: number of observed 'specific rich' TFbound DNA loci having relatively high binding avidity.
P: probability distribution function.
P_{s}: probability distribution function of specific binding.
P_{ns}: probability distribution function of occurrence of nonspecific binding : empirical probability distribution function
: empirical probability distribution function of TFDNA specific binding.
: empirical probability function of occurrence of nonspecific binding
: Predicted number of specific genome loci in the ChIPseq library.
: Predicted number of specific TFbound DNA fragments in .
estimate of the number of specific loci in the subset of observed 'specific rich' TFbound DNA loci.
estimate of the number of specific loci in the subset of observed 'noiserich' TFbound DNA loci.
s: relative frequency of specific BEs in the mixture probability function (1).
α: estimated fraction of specific DNA fragments representing
: Mean of TF specific binding.
TFDNA binding avidity: an integrative quantitative characteristic of availability of a DNA locus (e.g. TFBS and its flanking region) for a given protein (e.g. TF) binding Sp: specificity
Se: sensitivity: (1p_{0})100%.
N_{tot}: total number of specific TF bound loci in the genome
: model predicted total number of specific TFbound loci in the genome
p_{0}: predicted fraction of specific TF bound loci out of ChIPseq library data.
Acknowledgements
We thank Chia Lin Wei for access to original data of ChIPseq libraries. The authors acknowledge Drs. A. Batagov, J. Muller, and V. Boeva for fruitful discussions and comments. Onkar Singh also thanks BII/AStar for providing internship. This work was supported by Bioinformatics Institute/ASTAR, Singapore.
This article has been published as part of BMC Genomics Volume 11 Supplement 1, 2010: International Workshop on Computational Systems Biology Approaches to Analysis of Genome Complexity and Regulatory Gene Networks. The full contents of the supplement are available online at http://www.biomedcentral.com/14712164/11?issue=S1.
References

Bhinge AA, Kim J, Euskirchen GM, Snyder M, Iyer VR: Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE).
Genome Res 2007, 17(6):910916. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Johnson DS, Mortazavi A, Myers RM, Wold B: Genomewide mapping of in vivo proteinDNA interactions.

Loh YH, Wu Q, Chew JL, Vega VB, Zhang W, Chen X, Bourque G, George J, Leong B, Liu J, et al.: The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells.
Nature genetics 2006, 38(4):431440. PubMed Abstract  Publisher Full Text

Mardis ER: ChIPseq: welcome to the new frontier.
Nature methods 2007, 4(8):613614. PubMed Abstract  Publisher Full Text

Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al.: Genomewide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing.
Nature methods 2007, 4(8):651657. PubMed Abstract  Publisher Full Text

Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genomewide analysis of transcription factor binding sites based on ChIPSeq data.
Nature methods 2008, 5(9):829834. PubMed Abstract  Publisher Full Text

Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z, et al.: A global map of p53 transcriptionfactor binding sites in the human genome.
Cell 2006, 124(1):207219. PubMed Abstract  Publisher Full Text

Yang A, Zhu Z, Kapranov P, McKeon F, Church GM, Gingeras TR, Struhl K: Relationships between p63 binding, DNA sequence, transcription activity, and biological function in human cells.
Molecular cell 2006, 24(4):593602. PubMed Abstract  Publisher Full Text

Kuznetsov VA, Orlov YL, Wei CL, Ruan Y: Computational analysis and modeling of genomescale avidity distribution of transcription factor binding sites in chippet experiments.
Genome informatics International Conference on Genome Informatics 2007, 19:8394. PubMed Abstract  Publisher Full Text

Impey S, McCorkle SR, ChaMolstad H, Dwyer JM, Yochum GS, Boss JM, McWeeney S, Dunn JJ, Mandel G, Goodman RH: Defining the CREB regulon: a genomewide analysis of transcription factor regulatory regions.
Cell 2004, 119(7):10411054. PubMed Abstract  Publisher Full Text

Kuznetsov VA: Relative avidity, specificity, and sensitivity of transcription factorDNA binding in genomescale experiments.
Methods Mol Biol 2009, 563:1550. PubMed Abstract  Publisher Full Text

Zeller KI, Zhao X, Lee CWH, Chiu KP, Yao F, Yustein JT, Ooi HS, Orlov YL, Shahab A, Yong HC, et al.: Global mapping of cMyc binding sites and target gene networks in human B cells.
Proceedings of the National Academy of Sciences of the United States of America 2006, 103(47):1783417839. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang ZD, Rozowsky J, Snyder M, Chang J, Gerstein M: Modeling ChIP sequencing in silico with applications.
PLoS computational biology 2008, 4(8):e1000158. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lin CY, Vega VB, Thomsen JS, Zhang T, Kong SL, Xie M, Chiu KP, Lipovich L, Barnett DH, Stossi F, et al.: Wholegenome cartography of estrogen receptor alpha binding sites.
PLoS genetics 2007, 3(6):e87. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, et al.: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells.
Cell 2008, 133(6):11061117. PubMed Abstract  Publisher Full Text

Guillon N, Tirode F, Boeva V, Zynovyev A, Barillot E, Delattre O: The oncogenic EWSFLI1 protein binds in vivo GGAA microsatellite sequences with potential transcriptional activation function.
PLoS One 2009, 4(3):e4932. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Meyer N, Penn LZ: Reflecting on 25 years with MYC.
Nat Rev Cancer 2008, 8(12):976990. PubMed Abstract  Publisher Full Text

Blackwell TK, Huang J, Ma A, Kretzner L, Alt FW, Eisenman RN, Weintraub H: Binding of myc proteins to canonical and noncanonical DNA sequences.
Mol Cell Biol 1993, 13(9):52165224. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhou Q, Liu JS: Extracting sequence features to predict proteinDNA interactions: a comparative study.
Nucleic acids research 2008, 36(12):41374148. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuznetsov VA: Family of skewed distributions associated with the gene expression and proteome evolution.
Signal Processing 2003, 83:889910. Publisher Full Text

Jiang H, Wong WH: SeqMap: mapping massive amount of oligonucleotides to the genome.
Bioinformatics 2008, 24(20):23952396. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Qi J, Yu JY, Shcherbata HR, Mathieu J, Wang AJ, Seal S, Zhou W, Stadler BM, Bourgin D, Wang L, et al.: microRNAs regulate human embryonic stem cell division.
Cell Cycle 2009, 8(22):37293741. PubMed Abstract  Publisher Full Text

Kiessling AA, Bletsa R, Desmarais B, Mara C, Kallianidis K, Loutradis D: Evidence that human blastomere cleavage is under unique cell cycle control.
J Assist Reprod Genet 2009, 26(4):187195. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Motoi N, Suzuki K, Hirota R, Johnson P, Oofusa K, Kikuchi Y, Yoshizato K: Identification and characterization of nucleoplasmin 3 as a histonebinding protein in embryonic stem cells.
Dev Growth Differ 2008, 50(5):307320. PubMed Abstract  Publisher Full Text

Jiang W, Cazacu S, Xiang C, Zenklusen JC, Fine HA, Berens M, Armstrong B, Brodie C, Mikkelsen T: FK506 binding protein mediates glioma cell growth and sensitivity to rapamycin treatment by regulating NFkappaB signaling pathway.
Neoplasia 2008, 10(3):235243. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kang HB, Kim YE, Kwon HJ, Sok DE, Lee Y: Enhancement of NFkappaB expression and activity upon differentiation of human embryonic stem cell line SNUhES3.
Stem Cells Dev 2007, 16(4):615623. PubMed Abstract  Publisher Full Text

Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIPchip and ChIPseq data.
Nat Biotechnol 2008, 26(11):12931300. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuznetsov V: Scaledependent statistics of the numbers of transcripts and protein sequences encoded in the genome. In Computational and Statistical Approaches to Genomics. 2nd edition. USA: Springer US; 2006:416.

Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: Highresolution profiling of histone methylations in the human genome.
Cell 2007, 129(4):823837. PubMed Abstract  Publisher Full Text

Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, et al.: Genomewide maps of chromatin state in pluripotent and lineagecommitted cells.
Nature 2007, 448(7153):553560. PubMed Abstract  Publisher Full Text

Down TA, Hubbard TJP: NestedMICA: sensitive inference of overrepresented motifs in nucleic acid sequence.
Nucleic acids research 2005, 33(5):14451453. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuznetsov VA: Emergence of sizedependent networks on genome scale. In Lecture Series on Computer and computational Sciences. Netherlands: Brill Acad. Publishers; 2006:754757.

Johnson NL, Kotz S, Balakrishnan N: Discrete Multivariate Distributions. New York: John Wiley & Sons, Inc; 1997.

Barabasi AL, Albert R: Emergence of scaling in random networks.
Science 1999, 286(5439):509512. PubMed Abstract  Publisher Full Text

Kuznetsov VA: A stochastic model of evolution of conserved protein coding sequence in the archaeal, bacterial and eukaryotic proteomes.
Fluctuation and Noise Letters 2003, 3(3):295324. Publisher Full Text

Kuznetsov VA, Knott GD, Bonner RF: General statistics of stochastic process of gene expression in eukaryotic cells.
Genetics 2002, 161(3):13211332. PubMed Abstract  PubMed Central Full Text

Bunge J, Fitzpatrick M: Estimating the number of species: a review.
J Am Statist Assoc 1993, 88:364373. Publisher Full Text

Irwin JO: The place of mathematics in medical and biological statistics.
J Royal Statist Soc Ser A General 1963, 126:144. Publisher Full Text

Duerr HP, Dietz K: Stochastic models for aggregation processes.
Mathematical biosciences 2000, 165(2):135145. PubMed Abstract  Publisher Full Text