Abstract
Background
Chromatin immunoprecipitation followed by highthroughput sequencing (ChIPseq) can locate transcription factor binding sites on genomic scale. Although many models and programs are available to call peaks, none has dominated its competition in comparison studies.
Results
We propose a rigorous statistical model, the normalexponential twopeak (NEXTpeak) model, which parallels the physical processes generating the empirical data, and which can naturally incorporate mappability information. The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The comparison study with existing programs on real ChIPseq datasets (STAT1, NRSF, and ZNF143) demonstrates that the NEXTpeak model performs well both in calling peaks and locating them. The model also provides a goodnessoffit test, to screen out spurious peaks and to infer multiple binding events in a region.
Conclusions
The NEXTpeak program calls peaks on any test dataset about as accurately as any other, but provides unusual accuracy in the estimated location of the peaks it calls. NEXTpeak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIPseq data.
Keywords:
ChIPseq; Normalexponential distribution; Continuous mixture; Poisson regression; GoodnessoffitBackground
ChIPseq experiments use chromatin immunoprecipitation and then highthroughput sequencing, primarily to locate transcription factor binding sites across entire genomes, and to better our understanding of biological control systems [1]. As a brief overview of the relevant experimental protocols, they begin by irreversibly crosslinking a transcription factor (TF) molecule to its binding site in genomic DNA. They then shear the DNA into millions of short sequence fragments. Usually, the ends of a fragment are near the corresponding crosslink, but the exact distance between the end of the fragment and the crosslink is random. Moreover, the fragments on the two DNA strands show different systematic biases in the positions of their ends relative to the crosslink. Antibodies to the TF then precipitate each TF molecule along with its attached fragment. Fragments are dissociated from the TF molecules, amplified by polymerase chain reaction (PCR). The fragments are then sequenced into short subsequences, called “tags”. Computational analysis then enters by mapping the tag sequences to a reference genome. If a tag sequence is long enough, the tag matches only one genomic coordinate. Sometimes, however, its sequence is short and maps to more than one coordinate, making the mapping ambiguous. The possibilities of ambiguous mapping, false positive tagreads, and other experimental errors have motivated the development of programs to analyze ChIPseq experiments.
Peakcalling programs locate potential binding sites as “peaks” where mapped tags concentrate. Peakcalling programs use widely differing approaches, none of which has yet emerged as dominant in reviews [24], because relative accuracy of programs varies with the dataset examined [2,3]. Improvement is probably possible, however, because the models underlying existing programs do not consider mapping ambiguities directly, despite the existence of packages for enumerating the ambiguities, e.g., the PeakSeq suite [5]. Moreover, some programs ignore strandspecific information [6,7].
Most programs use (sometimes implicitly) kernel smoothing to compensate for mapping ambiguities, the most popular kernel being the uniform density, which is equivalent to counting tags in sliding windows of fixedlength [812]. Programs also manipulate information about a tag’s strand in various ways: as mentioned, some ignore it [6,7]; some use it explicitly (e.g. SISSRs [12], spp [11]); and some use it indirectly by transposing tag locations over to the other strand (e.g. MACS [8], QuEST [13], CisGenome [9]). Programs that combine strand information and windowing are essentially using two separate uniform densities as kernels for the forward and backward strands. For example, QuEST [13] (among other computer programs) estimates tag densities with a kernel smoother than the uniform density, to mimic the observed shape of tag peaks. QuEST did not dominate in comparisons, however, perhaps because it transposes tag locations, rather than estimating two separate tag densities, one for each strand.
The significance of peaks can be reported either as the number of tags in a window, a pvalue, a qvalue, or a posterior probability. Although pvalues guide a naïve user better than tag numbers, they introduce problematic assumptions. To derive a pvalue, some programs [7,13] assume a globally uniform background intensity of tags, an assumption known to fail in ChIP experiments. To assign the significance of peaks, different programs use various model assumptions such as Poisson [7,12,13], local Poisson [8], binomial [9], and hidden Markov [10] models. Some programs [8,9,1113] use control data to account for local variations in background tag intensity or to compensate for experimental artifacts like PCR overamplification, which can cause a spurious concentration of tags in a few specific locations. The reproducibility of control data is suspect, however, because it varies across cell types and ChIP protocol [4]. Although control data mitigates some experimental artifacts, its unreliability ultimately undermines any inference based on the corresponding pvalue.
Mapping ambiguities can also be problematic for a naïve pvalue calculation. Consider, e.g., a large genomic region where exactly A locations are ambiguously mapped (where A is fixed). In the same region, consider a window containing a total of L locations, including the A ambiguous locations. The window length is an arbitrary parameter (within limits), and as it increases, L increases. The A ambiguous locations are essentially censored data, so the simplest maximum likelihood estimate of the tag count in the window is L/(LA) times the observed tag count. Thus, if observed tag count is fixed, the estimated tag count decreases as the window length increases. If a pvalue decreases with the estimated tag count, it then depends on the window length. False discovery rates (FDR) depend on pvalues, so the use of FDRs does not remove the dependency. Under the circumstances described, therefore, the arbitrary choice of window length influences the number of peaks reported. A recent study on the number of binding sites in a genome [14] indicates that many real binding sites from ChIPseq data go unreported, suggesting that the assumptions and approximations underlying current pvalue estimates leave room for improvement.
Intuitively, the spatial resolution of a peak should also improve as more tags contribute to it. In principle, therefore, a program should also assign errors to its location estimates, but in fact, existing programs do not infer the accuracy of their estimated peak locations.
To examine the performance of the proposed model (NEXTpeak) against current standards, we selected several programs from a recent comparison [2]: HPeak [10], spp package (WTD and MTC) [11], CisGenome [9], MACS [8], QuEST [13], and SISSRs [12]. A summary of selected programs is given in Table 1. The details of NEXTpeak appears in the Methods.
Table 1. Summary of programs used for comparison
Results
Fitting the NEXTpeak model to ChIPseq data
Using ChIPseq datasets for three TFs: STAT1 [15], NRSF, and ZNF143 (see Methods for details), results with and without mappability information were examined; for each dataset, only the better of the two are presented here. Mappability information improved results for STAT1, but degraded results for NRSF and ZNF143.
Searches with positionspecific scoring matrices from JASPAR [16] yielded candidates for actual STAT1, NRSF, or ZNF143 sites within each region with a binding event. The searches used the pvalue cutoff 5×10^{6} for all datasets. See Methods for details on the pvalue computation for finding motif sites. Figure 1a shows a density of the normalexponential twopeak (NEXTpeak) model (see Methods). Figure 1bd displays the tag number, normalized to a probability density, for each location around the position of the candidate sites. The observed tag density is superimposed on the estimated density (derived from model estimates of the expected tag numbers, or in the Methods). Maximum likelihood estimation on the NEXTpeak model produced parameter estimates underlying and . Table 2 reports estimated parameter values for each dataset.
Figure 1. Profile of the normalexponential twopeak (NEXTpeak) density. (a) An example of NEXTpeak density profile without fitting to a particular dataset. The blue curve is for a tag profile on the left (positive) strand, the red curve is for the right (negative) strand. Parameter values are β = 60, and σ = 40 (see Methods). The two density curves mirror each other around the center location. (b) Tag profile of STAT1 ChIPseq data. From the motif search, thousands of motif sites were found. The cumulative tag counts were rescaled and displayed as densities. (c) NRSF. (d) ZNF143. Table 2 reports estimated values and for each dataset.
Table 2. Summary of ChIPseq datasets
For STAT1, the observed tag counts follow the density curve of the NEXTpeak model with a small difference in terms of average trend, except for two unexplained dips (Figure 1b). The dips (one for right tags, and one for left tags) display symmetry around the binding site. The tag counts for NRSF (Figure 1c) also follow the NEXTpeak density with a small trend difference. The tag counts for ZNF143 (Figure 1d) show a larger trend difference from the NEXTpeak density, perhaps because the ChIP experiment was noisier.
Examples of regions with unmappable locations
Figure 2 shows three regions with large number of unmappable locations from STAT1 data. In Figure 2, unmappable locations are marked by grey blocks. In Figure 2a, 49% of locations are unmappable; in Figure 2b, 41%; and in Figure 2c, 38%. The circles indicate motif sites; the triangles, estimated sites from the NEXTpeak model. The estimated sites approximate the motif sites reasonably well. The estimated tag counts due to the binding event are 636.8, 264.3, and 699.3; the total observed tag counts in the region are 396, 209, and 492. Although tags at unmappable locations are not observable, the NEXTpeak model increases the corresponding estimated tag counts to compensate. The compensation permits NEXTpeak to sharpen estimates of binding strength.
Figure 2. A plot of regions with large number of unmappable locations from STAT1 ChIPseq. Tag counts in the left strand are shown as blue bars, tag counts in the right strand, as red bars. The unmappable locations are marked by grey blocks. The circles represent motif sites; the triangles, estimated sites. (a) 49% locations are unmappable. (b) 41% unmappable. (c) 38% unmappable.
Comparison of the programs: top 2,000 peaks
To compare peakcalling programs, we run NEXTpeak along with other popular programs like HPeak [10], spp package (WTD and MTC) [11], CisGenome [9], MACS [8], QuEST [13], and SISSRs [12] (see Background for details). On running these programs including NEXTpeak, following standard practice, we used the default parameters to ensure reproducibility. As the first stage of comparison, we consider the 2,000 top peaks from each. The NEXTpeak program uses the estimated tags per binding event () to rank its peaks. We considered every peak called within 250 bp of a candidate site (as determined by positionspecific scoring matrix search) to be a true positive (TP). Our primary performance measure was the number of TPs within the 2,000 highest peaks. (Precision provides a standard but equivalent performance measure: the precision at 2,000 positives is the number of TPs within the 2,000 top peaks divided by the constant, 2,000.) Our secondary performance measures considered placement of TP peaks: (1) the mean distance between a TP peak center and the nearest motif site, and (2) the mean bias between a TP peak center and the nearest motif site. A TP peak upstream of the nearest motif site contributes to a negative bias; downstream, a positive bias. Thus, small distances and biases are desirable.
Table 3 contains summary statistics for various peakcalling programs. For STAT1, NEXTpeak found 781 TPs, a full 41 TPs more than any other program. For NRSF, NEXTpeak found 1,507 TPs, more than any other program; MTC was the second at 1,498 TPs. For ZNF143, NEXTpeak found 707 TPs, less than only MACS (709 TPs). For all three datasets, NEXTpeak had the smallest mean distances among all programs; it had one of the smallest biases as well. In addition, NEXTpeak is the only program that produced small biases for all three datasets. All other programs show a noticeable bias in at least one dataset. Specifically, HPeak and QuEST had a noticeable bias in STAT1; WTD, MTC, MACS and SISSRs had a noticeable bias in NRSF and ZNF143; CisGenome had a noticeable bias in STAT1 and ZNF143.
Table 3. Program result summary for ChIPseq datasets
Comparison of the programs: top peaks in general
The previous section gives performance measures based on the top 2,000 peaks called by each program. Researchers might wish to compare the measures based on lists of top peaks with different truncations, e.g., lists truncated at rank 1,000, 4,000, or 10,000. Figure 3 shows the precision (fraction of TPs among the top peaks) for top peaks truncated at ranks up to 10,000. For each rank r in the xaxis, the precision is computed for cumulative peaks between rank 1 and rank r. As expected, precision generally decreased with the length of the list. For STAT1, NEXTpeak had the largest precision (of all programs examined) over the full range of lengths, up to 10,000 peaks. For NRSF, NEXTpeak had nearly the best precision up to 4,500 peaks (and in fact, it had the best precision at 2,000 peak; see the previous section.); NEXTpeak had the best precision between 4,500 and 10,000 peaks. For ZNF143, NEXTpeak had near the best precision up and 10,000 peaks. For ZNF143, MACS performed similarly to NEXTpeak between 1,500 and 10,000 peaks, but MACS had a significantly poorer performance between 0 and 1,500 peaks compared to other programs.
Figure 3. A plot of percentage of top peaks with motif. (a) STAT1. (b) NRSF. (c) ZNF143. Some curves were truncated, because QuEST called fewer than 3,000 peaks; and WTD and MTC, fewer than 4,000 peaks. (Large percentages are desirable.)
Figure 4 shows mean distances for TP peaks ranked up to 10,000. In all three datasets, for the most of the range up to rank 10,000, NEXTpeak had the smallest mean distances. Note that other programs did not show the same consistency among three datasets in terms of mean distances. For example, MTC was the second best in STAT1 but performed poorly for NRSF; QuEST was the second best in ZNF143 but performed poorly for STAT1.
Figure 4. A plot of mean distance between top peaks and motif. (a) STAT1. (b) NRSF. (c) ZNF143. Mean distances are average distances between motif sites and estimated sties, where estimated sites contain a motif site within 250 bp distance. (Small distances are desirable.)
Figure 5 shows mean biases up to 10,000 peaks. As noted in the previous section, NEXTpeak is the only program showing small biases for all three datasets. Any other program shows a noticeable bias in at least one dataset. That is, for ZNF143, only NEXTpeak, QuEST, and HPeak had small biases, but QuEST and HPeak had noticeable biases in STAT1, making their performances highly dependent on the dataset at hand.
Figure 5. A plot of mean bias between top peaks and motif. (a) STAT1. (b) NRSF. (c) ZNF143.The bias is the (signed) distance in bp between an estimated site and the nearest motif site. (Small biases are desirable.) For all datasets, NEXTpeak biases were near zero. NEXTpeak was the only program with nearzero bias for all three datasets.
Correlation of estimated standard deviation and distance to motif site
Unlike other programs, NEXTpeak indicates the accuracy of a peak’s estimated location by estimating the corresponding standard deviation. Figure 6 displays smooth scatterplots for the estimated standard deviation versus distance to the nearest candidate site. (The plot is truncated on both axes at 250 bp, to dampen the influence of outliers caused by the omission of a true site among the candidate sites.) Dark regions represent high densities of data points and small dots represent isolated points. Ideally, all points should fall near the line y= x. Most data points, however, are concentrated at the bottomleft corner for all three plots. That is, most standard deviation estimates are small and actual distances to the motif site tend to be small as well. The Pearson correlation coefficients corresponding to the smooth scatterplots are 0.43 (STAT1), 0.50 (NRSF), and 0.33 (ZNF143), indicating that although the relationship is rather weak, the estimated error is positively correlated with the actual distance between the estimated peak location and the motif site.
Figure 6. Smooth scatterplots for estimated standard deviation vs. actual distance to nearest motif site. (a) STAT1. (b) NRSF. (c) ZNF143. Both axes truncated values at 250 bp. For all datasets, the estimated standard deviation and the actual distance to the nearest motif site had a positive Pearson correlation coefficient. Dark regions represent high densities of data points and small dots represent isolated points. The majority of data points are located at the bottomleft corner for all three panels, hence most standard deviation estimates are small and actual distances to the motif site are also small in general.
Discussion
Alone among existing peakcalling programs, NEXTpeak analyzes data with a parametric statistical model. Of the existing programs, therefore, it alone provides a principled foundation for elaborating the statistical analyses of ChIPseq data. One obvious elaboration is to model multiple binding events in a region. This work is currently underway and the results will be reported elsewhere.
NEXTpeak can estimate the average fragment length, even if the experiment does not measure the average fragment length. Let d denote the tag length, e.g., d=27 for STAT1. In the NEXTpeak model, the average distance from a fragment end to a crosslink is β. The average distance between the fragment ends is therefore 2β + d − 1 (because the “location” of the right end is the leftmost position of the corresponding tag). For the STAT1 dataset, =73.5 and d =27, so the NEXTpeak model estimate of the average fragment length is 173.0, consistent with a previous estimate of 174 [15]. For NRSF data, =30.4 and d=36, the estimate of the fragment length is 95.8, and for GABP data, =35.3 and d=36, the estimate is 105.6.
Existing programs simply discard ambiguously mapped reads. In contrast, NEXTpeak explicitly models the locations where reads do not map uniquely into the reference genome. NEXTpeak can therefore adjust for ambiguous mapping while estimating the total number of tags in a region, thereby sharpening its estimates of TF binding strength. Sharper estimates of binding strength can promote better physical interpretation of ChIPseq results.
Existing peakcalling programs require tedious visual screening of up to tens of thousands of binding regions, to eliminate experimental artifacts like spikes in tag numbers caused by PCR anomalies. The goodnessoffit tests in NEXTpeak can reduce the burden of visual screening. Moreover, the same tests can detect the presence of multiple TF binding sites, which are usually found in regions longer than those containing PCR anomalies. The long regions with small pvalues can therefore be set aside for further, more intensive analyses, such as searching for multiple binding events or sequence motifs.
For STAT1, the observed tag distribution follows the NEXTpeak density closely, indicating that the NEXTpeak model captured the essence of the physical processes in the ChIPseq experiment. Consequently, NEXTpeak outperformed its competitors, possibly because the NEXTpeak model successfully mimicked the true experimental kernel. On the other hand, for ZNF143, the observed tag distribution is somewhat deviated from the NEXTpeak density, possibly degrading NEXTpeak’s performance slightly. The observed tag density might reflect a mixture of multiple binding events, however, resulting from TF binding fluctuating between different protein complexes. Mass and structural differences between the protein complexes could cause binding locations or mean fragment lengths to fluctuate. Conventional motif analysis or a more elaborate model including multiple binding sites might expose the proteinprotein interactions, however.
By adding mappability information, STAT1 increased truepositive binding sites by 4.0% on average. Unlike STAT1, mappability information for NRSF and ZNF143 actually degraded the performance of NEXTpeak: on average, it decreased truepositives by 0.5% and 0.6%, a surprising result given that both NRSF and ZNF143 had large numbers of mapped tags (33.1 millions and 25.2 millions). The truncated read length for NRSF and ZNF143 was 36, however, much larger than read length of 27 for STAT1. Thus, fewer genomic locations were mapped ambiguously for NRSF or ZNF143 (10.3%) than for STAT1 (16.2%), diminishing NEXTpeak’s ability to enhance its performance by adding mappability information.
This article examined three ChIPseq datasets with a single dominant binding motif, permitting motif sites to serve as surrogates for the true binding sites. In general, however, even with an antibody specific to a protein, proteinprotein interactions between TF molecules might cause multiple TFs (and hence, multiple motifs) to crosslink to an antibody. The two global parameters σ (the standard deviation for the crosslink locations) and β (the intensity of the Poisson process modeling shearing) then require delicate estimation. One could select a few hundred of the most tagrich regions. One could screen the regions visually, choosing the ones with a good fit to the dual normalexponential density and then estimate σ and β. Alternatively, one could perform a motif search on the tagrich regions. The observed tag density for each motif then can be fit to the NEXTpeak model. Thus, NEXTpeak can analyze any ChIPseq experiment, even without specific information on the protein interactions.
Conclusions
We proposed a new statistical model for identifying binding sites from ChIPseq data. The model successfully mimics the underlying datagenerating process in ChIPseq experiments by using the dual density of a normalexponential twopeak model. The NEXTpeak program produced better prediction with more true positives and a better spatial resolution than any other program tested. The NEXTpeak program tests the validity of its underlying NEXTpeak model without depending (as many programs do) on an unrealistic assumption of a global uniform background tag distribution. The NEXTpeak program stands alone in quantifying errors by reporting a standard error for its estimates of binding intensity. Moreover, smooth scatterplots showed that its standard errors are informative about errors in motif location, as estimated from external standards. The NEXTpeak program also provides a goodnessoffit test, automating screening of the spurious binding, and work is in progress to extend its model to locate multiple binding events in a region.
Methods
ChIPseq datasets
Our analysis used ChIPseq datasets corresponding to three TFs: STAT1, NRSF, and ZNF143. Because the three datasets correspond to known binding motifs, they provide a goldstandard for evaluating peakcalling programs [2]. Table 2 presents summary statistics for the three datasets.
The STAT1 dataset [15] was downloaded at http://www.bcgsc.ca/downloads/chiptf/human/STAT1/stimulated/july_23_2008/ webcite. The NRSF[SRA:SRR577995] and ZNF143[SRA:SRR243553] datasets were downloaded from the SRA database at http://www.ncbi.nlm.nih.gov/sra webcite. Bowtie [17] mapped tags into a reference human genome (NCBI Build 36.1) for all three datasets. Mismatches of up to 2 bases were permitted, if they mapped uniquely within the genome.
For all datasets, the PeakSeq suite [5] then determined whether tag sequences in the reference genome were unique. PeakSeq requires tags of a uniform length. The tags for the downloaded STAT1 dataset, however, had varying length although most tags had length 27. We truncated the tags to length 27, if they were longer, or we discarded them, if they were shorter. Thus, we used the mappability information for 27 bp tags to approximate the complete STAT1 data. The downloaded NRSF had tags of length 50 and the downloaded ZNF143 had tags of length 40. We truncated tags from both datasets to length 36 to make it easy to investigate the effect of the mappability information (36 bp mappability information was used). Additional file 1 also reports on results from three additional datasets, MAX, GABP, and FoxA1.
Additional file 1. Supplementary material. Supplementary material contains results for additional datasets, MAX, GABP, and FoxA1. This file contains four supplementary figures and two supplementary tables: Figure S1. A plot of percentage of top peaks with motif. Table S1 reports estimated values and for each dataset. Figure S2. A plot of percentage of top peaks with motif. Some curves were truncated in (a), because QuEST called fewer than 5,000 peaks; MTC, fewer than 7,000 peaks; and WTD, fewer than 9,000 peaks. In (b), QuEST, WTD, and MTC called fewer than 4,000 peaks. Figure S3. A plot for mean distance between top peaks and motif. Mean distances are average distances between motif sites and estimated sites, where estimated sites contain a motif site within 250 bp distance. Figure S4. A plot of mean bias between top peaks and motif. The bias is the (signed) distance in bp between an estimated site and the nearest motif site. Table S1. Summary of additional ChIPseq datasets. Basic characteristics of additional datasets. Table S2. Program result summary for additional ChIPseq datasets.
Format: PDF Size: 134KB Download file
This file can be viewed with: Adobe Acrobat Reader
Notations for the ChIPseq data
Let some preliminary method (see NEXTpeak algorithm for detail) flag possible crosslinks in candidate genomic regions R_{r} (r = 1, …, S). Computational time is a consideration, because S can be on order of 10^{4} or more. Consider a specific genomic region R_{s}, where s ∈ {1, …, S}, and let R_{s} have width w_{s}, with the nucleotide bases having coordinates 1, …, w_{s}. Call the forward and backward DNA strands “left” and “right”, so the bases at each location j on the left and right strands are complementary (j = 1, …, w_{s}). Let denote the set of locations within R_{s} where a tag sequence is not unique within the genome, so the corresponding tag maps ambiguously.
The superscripts “L” and “R” distinguish quantities pertaining to the left and right strands: note in particular that “L” and “R” are not exponents. Within R_{s}, let a total of n^{L} “left tags” be observed on the left strand; n^{R} “right tags”, on the right strand. Define the “location” of a tag as its leftmost position. Let the location of the left tags map to (i = 1, …, n^{L}); the location of the right tags, to (i = 1, …, n^{R}). Thus, the data are where no location or is in .
An alternative representation is occasionally useful. Let be the number of left tags observed at location j ∈ {1, …, w_{s}}; , the number of right tags observed at position j. Tags cannot be observed at locations . Thus, provides an equivalent representation of the data X. The model parameters for the data in R_{s} are (μ_{s}, ν_{s}, ρ_{s}) and (σ^{2}, β), defined below. Parameters specific to the region R_{s} are subscripted with “s”; the parameters common to all genomic regions lack subscripts.
Dual density for a binding event
Let the standard normal distribution have the density function ϕ(•) and the cumulative distribution function Φ(•). Each observed right tag location in R_{s} corresponds to an underlying (and unobservable) random variate ξ_{i}, the coordinate of the corresponding crosslink. For mathematical convenience, assume ξ_{i} ~ N(μ_{s}, σ^{2}), i.e., ξ_{i} is a normal variate with mean μ_{s} (specific to R_{s}) and variance σ^{2} (common to R_{r} for r = 1, …, S). The density of the distribution of ξ_{i} is
Assume that upon shearing, the breaks in the DNA form a Poisson process over the entire genome. Thus, the break point at location () corresponds to an exponential random variate . For simplicity, assume that the mean β of the exponential distribution is common to all regions R_{r} (r = 1, …, S), so the density function corresponding to is
The joint density ofis therefore. Integrate ξ_{i} out, to derive the density function of :
The above distribution is a marginal distribution of the normalexponential joint density, which we call a “normalexponential” distribution in short. Figure 7 shows a schematic representation for the role of parameters σ and β in a ChIPseq experiment. Figure 1a shows a normalexponential density for both left and right tags (as discussed in Results).
Figure 7. ChIPseq experiment with NEXTpeak model parameters. A genomic location of the center of the TF is denoted as μ. Green bidirectional arrows represent crosslinks between a TF and a genomic sequence. Crosslinks are assumed to be normally distributed with standard deviation σ. Tags are shown as small black rectangles at the 5′ end of fragments. The distance from a crosslink to a tag location is assumed to be exponentially distributed with mean β. When tags are mapped to a reference genome, then tags are projected onto the corresponding genomic locations. Blue arrows represent “left” tags mapped on the forward strand; red arrows, “right” tags mapped on the backward strand. The tag distribution is the NEXTpeak under the previous assumptions.
Poisson regression model for the observed tags
Within R_{s}: (1) let ν_{s} be the expected number of right tags for each TF molecule that binds; (2) let ρ_{s} be the uniform background intensity of right tags; and (3) let be the expected number of right tags at location j. Assume
Approximate the sum over all locations j with an integration, to produce a consistency check: ν_{s} = ∫ ν_{s} · f^{R}(xθ)dx. The expected total number of right tags within R_{s} due to binding is therefore approximately ν_{s}, ν_{s} = 0 being equivalent to the absence of TF binding in R_{s}. On the other hand, the expected total number of right tags within R_{s} due to noise is w_{s}ρ_{s}.
Let be the random variable that counts right tags at j, and assume Y_{j}^{R} has a Poisson distribution with mean , i.e.,
for y ϵ {0, 1, 2, …}.
The models for the tag locations and on the left and right strands share the parameters (μ_{s}, σ^{2}, β, ν_{s}, ρ_{s}) and differ only in mirroring the tag location densities around the link location μ_{s}, i.e.,
The models for the left and right tags share ν_{s} and ρ_{s}, so as in the model for the right tags, ν_{s} is the expected number of left tags for each TF molecule that binds, and ρ_{s} is the uniform background intensity of left tags.
Figure 1a shows a normalexponential twopeak (NEXTpeak) density. In this example, μ = 0, β = 60, and σ = 40. The two density curves mirror each other around the center location μ = 0. The curves are asymmetrical distributions skewed in opposite directions. Although both tails of each density rapidly approach zero, one tail approaches zero much faster than the other. Figure 1a motivates the model name: the “normalexponential twopeak” (NEXTpeak) model.
Let θ_{s} = (μ_{s}, ν_{s}, ρ_{s}). Under the NEXTpeak model in Figure 1a, the likelihood of the data in R_{s} is
Thus, the likelihood of the complete dataset is
where θ = (θ_{r} : r = 1, …, S).
Because σ and β do not depend on the region R_{s}, training data can yield maximum likelihood estimates (MLEs) and . Fix the values and , so the remaining parameters requiring estimation for the NEXTpeak model are θ = (θ_{1}, …, θ_{S}). Maximization of the profile likelihood then yields the estimate within each region R_{s}. The estimate’s components are: (1) the estimated mean location of a binding event, (2) the estimated mean total number of right (or left) tags within R_{s} due to the binding event, and (3) the estimated uniform background intensity of right (or left) tags within R_{s}. As usual, the inverse of the Fisher Information matrix (i.e., the inverse of the negative expectation of the Hessian of the loglikelihood) estimates the asymptotic variancecovariance matrix for θ_{s}.
Let denote the maximum likelihood estimate (MLE) of θ_{s} = (μ_{s}, ν_{s}, ρ_{s}) under the restriction ν_{s} = 0. To test whether or not a binding event occurred in R_{s}, under the null hypothesis H_{0}:ν_{s} = 0, the likelihood ratio (LR) statistic
has an asymptotic chisquare distribution with 1 degree of freedom. Unlike the null hypotheses in many existing programs, the NEXTpeak model does not assume a globally uniform background intensity. Instead, its locally uniform background intensity is equivalent to assuming that the expected number of background tags per location varies slowly enough to be almost constant within each region R_{r} (r = 1, …, S).
Goodnessoffit test
The following can test whether observed tag data are consistent with the NEXTpeak model. PCR overamplification and other experimental artifacts can cause spikes in the observed tag distribution. Likewise, multiple binding events in a region R_{r} cause deviations from the unimodal density of the model. Accordingly, consider Pr (DH_{0}), the probability of the data D under the null hypothesis H_{0} of a NEXTpeak model. Under H_{0}, the counts of right tags at location j are generated with the Poisson intensity given above, the counts of the left tags being generated with the mirror intensity . Consider also the probability Pr (DH_{1}) of the data under an alternative model H_{1} where the underlying intensity is unrestricted. The LR statistic 2 log[Pr(DH_{1})/Pr(DH_{0})] for the models follows a χ^{2} distribution, with degrees of freedom equal to the difference of the number of parameters in H_{0} and H_{1}. The LR test yields a pvalue for each region R_{s}, with a small pvalue indicating a poor fit within R_{s} to the NEXTpeak model underlying H_{0}.
Pvalue computation for finding motif sites
From a positionspecific score matrix, any segment receives a score by adding the corresponding columns scores from the positionspecific score matrix. The probability of observing the score or higher is computed based on the null model that nucleotides (A, C, G, and T) can appear at random with equal probabilities. We used the Staden’s method [18] for the convolution computation of the score distribution, The principal of our pvalue computation has been used for PSIBLAST [19] and AGLAM [20], among others, for their pvalue computation, where pvalues are used to compute Evalues.
Screening based on goodnessoffit tests
Like many peakcalling programs, NEXTpeak masks locations having anomalously many tags, but it does so in a principled manner, with pvalues based on goodnessoffit tests. Long regions with small pvalues suggest multiple binding sites, so for all datasets NEXTpeak masked regions less than a certain length long and with small pvalues. When motif site locations are available, based on the area under the precision curve up to the top 10,000 peaks (e.g. Figure 3), the NEXTpeak program automatically reports a cutoff recommendation for each dataset. The Results section uses the length 400 and the pvalue 10^{8} as a cutoff for STAT1 and the length 300 and the pvalue 10^{4} as a cutoff for NRSF and the length 400 and the pvalue 10^{6} as a cutoff for ZNF143, all suggested by the NEXTpeak output.
NEXTpeak algorithm
The NEXTpeak program goes through the following procedures for producing an output (Figure 8 shows a flowchart for the NEXTpeak program). (1) Read the mapped tag location file, e.g., from a Bowtie [17] output. (2) Select regions based on the tag count with a user specified length of the window (default: 150) and a user specified minimum count (default: 15). When a neighboring window has more than the minimum count, the window under scrutiny is combined with its neighbor. The region lengths range from the minimum length (default: 150) to several thousands. (3) When motif site locations are available, estimate σ and β by maximizing the likelihood using motif site locations. For a known TF, a publicly available motif pattern is used, e.g. from JASPAR [16]. For an unknown motif, run the NEXTpeak program with default values (σ = 30 and β = 50), and identify the strongest motif from a motif search. Alternatively, a user can estimate these parameters with selected regions. (4) For each region, estimate μ and ν by maximizing the likelihood. It computes the standard deviation estimates for these estimates. Then, perform a goodnessoffit test for each region. (5) As a postprocessing step, compute the region length and pvalue cutoff recommendations to screen out potential spurious regions when the motif site locations are available.
Figure 8. A flowchart for NEXTpeak algorithm. First, the program reads the mapped tags. Then, it selects regions based on the tag count with a user specified length of the window (default: 150) and a user specified minimum count (default: 15). If motif site locations are provided, it estimates σ and β using motif site locations. For an unknown motif, run the NEXTpeak program with default values (σ = 30 and β = 50), and identify the strongest motif from a motif search. For each region, the program estimates μ and ν. It computes the standard deviation estimates for these estimates. As a postprocessing step, the program computes the region length and pvalue cutoff recommendations to screen out potential spurious regions when the motif site information is available.
NEXTpeak software
The NEXTpeak program is implemented in C++. The code is publicly available at http://www.odu.edu/~nxkim/nextpeak/ webcite. A typical computation time is about 15 ~ 45 minutes, depending on the size of the input data.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
NKK conceived the study, developed the model, developed the program, performed a comparison study, and drafted the manuscript. RVJ helped parameter estimation and performed a comparison study. JLS conceived the study, developed the model and drafted the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by the Summer Research Fellowship Program of Old Dominion University Research Foundation and the Intramural Research Program of the National Library of Medicine at the National Institutes of Health.
References

Johnson DS, Mortazavi A, Myers RM, Wold B: Genomewide mapping of in vivo proteinDNA interactions.
Science 2007, 316(5830):14971502. PubMed Abstract  Publisher Full Text

Wilbanks EG, Facciotti MT: Evaluation of algorithm performance in ChIPseq peak detection.
PLoS One 2010, 5(7):e11471. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Laajala TD, Raghav S, Tuomela S, Lahesmaa R, Aittokallio T, Elo LL: A practical comparison of methods for detecting transcription factor binding sites in ChIPseq experiments.
BMC Genomics 2009, 10:618. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Pepke S, Wold B, Mortazavi A: Computation for ChIPseq and RNAseq studies.
Nat Methods 2009, 6(11 Suppl):S22S32. PubMed Abstract  Publisher Full Text

Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIPseq experiments relative to controls.
Nat Biotechnol 2009, 27(1):6675. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel shortread sequencing technology.
Bioinformatics 2008, 24(15):17291730. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq.
Nat Methods 2008, 5(7):621628. PubMed Abstract  Publisher Full Text

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W: Modelbased analysis of ChIPSeq (MACS).
Genome Biol 2008, 9(9):R137. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIPchip and ChIPseq data.
Nat Biotechnol 2008, 26(11):12931300. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Qin ZS, Yu J, Shen J, Maher CA, Hu M, KalyanaSundaram S, Chinnaiyan AM: HPeak: an HMMbased algorithm for defining readenriched regions in ChIPSeq data.
BMC Bioinforma 2010, 11:369. BioMed Central Full Text

Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIPseq experiments for DNAbinding proteins.
Nat Biotechnol 2008, 26(12):13511359. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genomewide identification of in vivo proteinDNA binding sites from ChIPSeq data.
Nucleic Acids Res 2008, 36(16):52215231. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genomewide analysis of transcription factor binding sites based on ChIPSeq data.
Nat Methods 2008, 5(9):829834. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuznetsov VA, Singh O, Jenjaroenpun P: Statistics of proteinDNA binding and the total number of binding sites for a transcription factor in the mammalian genome.
BMC Genomics 2010, 11(Suppl 1):S12. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A: Genomewide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing.
Nat Methods 2007, 4(8):651657. PubMed Abstract  Publisher Full Text

Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an openaccess database for eukaryotic transcription factor binding profiles.
Nucleic Acids Res 2004, 32(Database issue):D91D94. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memoryefficient alignment of short DNA sequences to the human genome.
Genome Biol 2009, 10(3):R25. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Staden R: Methods for calculating the probabilities of finding patterns in sequences.
Comput Appl Biosci 1989, 5(2):8996. PubMed Abstract

Altschul SF, Koonin EV: Iterated profile searches with PSIBLAST–a tool for discovery in protein databases.
Trends Biochem Sci 1998, 23(11):444447. PubMed Abstract  Publisher Full Text

Tharakaraman K, MarinoRamirez L, Sheetlin S, Landsman D, Spouge JL: Alignments anchored on genomic landmarks can aid in the identification of regulatory elements.
Bioinformatics 2005, 21:I440I448. PubMed Abstract  Publisher Full Text  PubMed Central Full Text