Abstract
Background
The nucleosome is the fundamental packing unit of DNAs in eukaryotic cells. Its detailed positioning on the genome is closely related to chromosome functions. Increasing evidence has shown that genomic DNA sequence itself is highly predictive of nucleosome positioning genomewide. Therefore a fast software tool for predicting nucleosome positioning can help understanding how a genome's nucleosome organization may facilitate genome function.
Results
We present a duration Hidden Markov model for nucleosome positioning prediction by explicitly modeling the linker DNA length. The nucleosome and linker models trained from yeast data are rescaled when making predictions for other species to adjust for differences in base composition. A software tool named NuPoP is developed in three formats for free download.
Conclusions
Simulation studies show that modeling the linker length distribution and utilizing a base composition rescaling method both improve the prediction of nucleosome positioning regarding sensitivity and false discovery rate. NuPoP provides a userfriendly software tool for predicting the nucleosome occupancy and the most probable nucleosome positioning map for genomic sequences of any size. When compared with two existing methods, NuPoP shows improved performance in sensitivity.
Background
Most eukaryotic genomic DNA is wrapped in nucleosomes, which occlude and strongly distort the wrapped DNA. Accumulating evidence shows that the DNA sequence itself is highly predictive of nucleosome positioning in vivo [17], and that nucleosome positioning is closely related to chromosome functions [1,811]. A fast software tool for predicting nucleosome positioning is highly desirable.
Several statistical methods for nucleosome positioning prediction have been proposed in the literature. In [2] a method was proposed based on crosscorrelation with a nucleosome DNA sequence signature. In [1] a Markov model was used together with consideration of steric exclusion and thermodynamic equilibrium. In [3], a support vector machine (SVM) was trained based on the differential kmer usage between nucleosome and linker DNAs. In [4], the authors proposed an Nscore model to discriminate nucleosome and linker DNAs using wavelet energies as covariates in a logistic regression model. In [5], a webinterface called "nuScore" for estimation of the affinity of histone core to DNA and prediction of nucleosome positioning was developed based on the DNA deformation energy score. In [6,7], the model from [1] was improved by incorporation of differential kmer usage (most notably, poly(dA:dT) tracts, which are strongly disfavored by nucleosomes). This model can be further improved by accounting for nucleosomenucleosome interaction [12].
While the nucleosomal features are universal, eukaryotic genomes vary in nucleosomal repeat length [13] and base composition. The nucleosomal repeat length is dictated by the length distribution of linker DNAs that separate neighboring nucleosomes, and it determines the overall nucleosome density in the chromatin fiber. The contribution of this paper is a duration Hidden Markov Model and a software tool called NuPoP for genomewide nucleosome positioning prediction. We show that incorporation of linker length information can achieve better sensitivity in prediction. In addition, we propose a rescaling method to adjust for base composition variation when using yeast models to make predictions for other species. A relatively superior performance of this approach is established by comparing it with other existing tools.
Methods
Model
The hidden Markov model (HMM) has been known for decades. An excellent and famous tutorial is Rabiner's 1989 paper [14], in which the model, algorithms, and applications were carefully and thoroughly reviewed. A conventional HMM implicitly assumes a geometric duration distribution for each state, which can be wrong in real applications. Modeling the explicit duration of each state can improve the prediction of HMMs (e.g., [1417]). We model each chromosomal DNA sequence with a duration hidden Markov model (dHMM) of two oscillating states: nucleosome (N) and linker (L), where the nucleosome state has a fixed length of 147 bp, and the linker state has a variable length. We assume that at the end of each state, the chain must transit to the other state; additionally, a complete chromatin sequence must start with and end in a linker state. We trained a 4th order timedependent Markov chain for the the N state, and a homogeneous 4th order Markov chain for the L state to distinguish the kmer usage preferences for k up to 5 between the nucleosome and linker states as shown in other methods, e.g., [3,6] (see below for details).
Let e = e_{1}, ..., e_{147 }be a nucleosome DNA sequence. Let P_{N }be the probability of observing e as a nucleosome, computed as the product of probabilities for both Watson and Crick strands under the 4th order Markov Chain model. We assume that the linker DNA length of a given species has an unknown distribution F_{L}(k) defined for k = 1, ..., τ_{L }(the maximum linker length we allow). An observed linker DNA sequence e = e_{1}, ..., e_{k }carries two pieces of information, the length is k bp, and given which, the emitted letters are e_{1}, ..., e_{k}. Let G_{L}(ek) denote the homogeneous Markov chain model for the linker DNA (again including both strands). Then observing e as a linker DNA has probability
Suppose x = x_{1}, ..., x_{n }is a genomic DNA sequence of length n, where x_{i }= A/C/G/T. Let z = z_{1}, ..., z_{n }be the corresponding hidden state path, where z_{i }= 1 if x_{i }is covered by a nucleosome state, and 0 otherwise. Suppose that the path z = z_{1}, ..., z_{n }partitions x into k consecutive nucleosome or linker state blocks, in which the nucleosome blocks have a uniform length of 147 bp, whereas the length of linker blocks may vary. We denote these blocks as y = y+, ..., y_{B}, and their state identification as s = s_{1}, ..., s_{B}, where s_{i }= 1 if y_{i }is nucleosome state, and 0 otherwise. The probability of observing (x, z) is given by
where π_{0}(s_{1}) and πe(s_{B}) stand for the probabilities that the chain initializes and ends with state s_{1 }and s_{k }respectively, and I is an indicator function. Since we assume that a complete chromatin sequence must start with and end in a linker state, π_{0}(s_{1 }= 0) = π_{e}(s_{B }= 0) = 1. We define the nucleosome occupancy at a specific position i, denoted o_{i}, as the posterior probability that z_{i }= 1, i.e.,
We also define the histone binding affinity score at position i as the log likelihood ratio for the region x_{i}73, ... x_{i}, ..., x_{i}+73 to be a nucleosome vs. a linker, i.e.,
Given the models P_{N}, G_{L }and F_{L}, the optimal path z can be found by the standard Viterbi algorithm, and the nucleosome occupancy score can be estimated using forward and backward algorithms.
Data and model training
We utilized the 503,264 yeast nucleosome DNA reads from 454 pyrosequencing published in [6] for model training and assessment. Among 371,914 reads that each were mapped to a unique region of the yeast genome, we first selected reads of length between 146 and 149 bp. If multiple such reads existed for the same nucleosome, we selected the one with the highest BLAST score. The resulting nonredundant set of 18,547 nucleosome sequences were center aligned to train the nucleosome model P_{N}. The 4th order time dependent Markov chain can be defined by the base composition at the first position qN(x_{1}), and the transitional probabilities qN(x_{2}x_{1}), qN(x_{3}x_{1}, x_{2}), qN(x_{4}x_{1}, x_{2}, x_{3}), qN(x_{k}x_{k4}, x_{k}3, x_{k2}, x_{k1}), for k = 5, ..., 147, x_{i }= A/C/G/T, i = 1, ..., 147, where the subscript k, i index the positions within a nucleosome. These probabilities are trained using the corresponding observed fractions or conditional fractions based on the center alignment, with a three bp moving average (as explained in [1,18]). We further identified 8,090 readsfree regions of length 7500 bp to train the linker state model G_{L}. The 4th order homogeneous Markov model for the linker DNAs can be completely defined by the stationary base composition qL(xi), and the transition probabilities qL(x_{i}x_{i1}), qL(x_{i}x_{i1}, x_{i2}), qL(x_{i}x_{i1}, x_{i2}, x_{i3}), qL(x_{i}x_{i1}, x_{i2}, x_{i3}, x_{i4}). By "homogeneous", we mean that these probabilities are all constants as functions of i. These probabilities were trained using their observed values as in the nucleosome model. For example, qL(x_{i}x_{i1}, x_{i2}, x_{i3}, x_{i4}) was trained by calculating the fraction of occurrences of transitions from any four letters to the fifth letter in the selected putative linker DNA sequences.
Our initial nucleosome/linker model was trained using the yeast data. A complication arises when predicting nucleosomes for other species because organisms may differ significantly in their DNA base composition. We propose to scale up or down the probabilities in the Markov models by a factor determined by the difference of the base composition between the current species and yeast. For example, in C. elegans, the fraction of A plus T bases is 0.645 compared to 0.617 in yeast. For a specific transition probability qN(A....) at any specific nucleosomal position defined for yeast, we scaled it up as qN(A....) × 0.645/0.617 for the corresponding transition probability at that given position for C. elegans. Likewise the transition probabilities for G/C will be scaled down by a factor of 0.355/0.383.
All the rescaled probabilities are then normalized. The same rescaling applies to the linker model. We shall use simulations below to show that rescaling improves prediction regarding sensitivity and false discovery rate. Using the trained nucleosome model (P_{N}) and linker model (G_{L}), we further train the linker DNA length distribution as follows. We assume that the linker DNAs in any given species or cell type have a maximum length τ_{L }= 500 bp.
This algorithm contains the following steps:
1. Initialize the algorithm with a uniform distribution for F_{L}(k) for k = 1, ... τ_{L }where τ_{L }is the maximum allowable linker length.
2. Use the forward and backward algorithm to obtain the posterior expectation of F_{L}(k) for each k. For a sequence x = x_{1}, ..., x_{n}, let n_{k }be the number of linker DNAs of length k. Then
for k = 1, ..., τ_{L }.
for k = 1, ..., τ_{L }.
3. Update the empirical linker length distribution from step 2 using a kernel smoothing method as follows:
where K is the standard Gaussian kernel and h is the bandwidth parameter optimally chosen by the leaveone crossvalidation method as in [19].
4. Use the updated linker length distribution from step 3 to compute the nucleosome occupancy and optimal positioning.
Compared to Viterbi training (i.e., using linker length predicted from the Viterbi algorithm), using the posterior expectation obtained in Eq. (1) combined with the kernel method in Eq. (2) performs overwhelmingly better in minimizing the summed square errors (unpublished work [17]). In the developed software tool NuPoP, we have trained the linker DNA length distributions for 11 different species including human, mouse, rat, zebrafish, D. melanogaster, C. elegans, S. cerevisiae, C. albicans, S. pombe; A. thaliana and maize. The linker DNA length distribution (F_{L}) for each species has been trained by scanning the corresponding genome sequences based on τ _{L }= 500. We found that the rescaled nucleosome and linker profiles, together with the trained linker length distribution, not only roughly recover the genomewide base compositions, but also the dinucleotide frequencies for different species. The frequency of each single or dinucleotide in simulated genomes typically differs by ≤ 1% from that observed in the corresponding real genomes (results not shown). As different cell types from the same organism (with the same genome) can exhibit quite different linker DNA length distributions [13], a useful future refinement would utilize high quality nucleosome maps for the given cell type, when such data become available.
Software tools
We have developed a software tool called NuPoP, implemented in three different formats: an R package tested for Windows XP, Linux and Mac OS X; a standalone Fortran program; and an NuPoP web server, all available from http://nucleosome.stats.northwestern.edu webcite. The R package is built upon the Fortran program. It provides additional handy functions to visualize the resulting Viterbi (most probable nucleosome position map) and nucleosome occupancy predictions. Both the R package and Fortran program can handle a genomic sequence of any length with a RAM demand <400 M bytes. The predicted results are stored locally in the working directory. The web server provides an interface through which the user can submit their own sequence up to 500 K bp in length for fast online prediction. When making a prediction, the user is required to specify which species the genomic sequence is from. If the species is not on the list, NuPoP will calculate the base composition of the input DNA sequence and then choose the nucleosome and linker models from a species that has the most similar base composition. An alternative model with a 1st order timedependent Markov chain for the nucleosome state and a homogeneous 1st order Markov model for the linker state, trained in the same way, is also implemented in NuPoP as an option.
Results
Updating F_{L }improves prediction
Updating the linker length distribution not only helps recover the true nucleosome density, but also improves prediction. We demonstrated this by simulation as follows. We simulated 10 genomic sequences with the 4th and 1st order yeast models respectively, each containing 10,000 nucleosomes and 10,001 linkers. The linker DNA length was simulated from a Normal distribution (μ = 100, σ = 20) and a Gamma distribution (α = 1, β = 1/40). If a nucleosome is predicted within ±35 bp of a true nucleosome, we define it as a correct positive prediction. The rate of the correct positive prediction, referred to as sensitivity, is defined as the percentage of the 10,000 nucleosomes that are correctly predicted. In addition, we include the false discovery rate (FDR), defined as the fraction of the predicted nucleosomes that reside > ±35 bp away from any true nucleosomes, as the second measure for model performance. In analogy to statistical hypothesis testing, the sensitivity measures the power of prediction, while the FDR measures the fraction of type I errors in the positive claims. The results are presented in Table 1 and 2. In both cases, the linker length was initialized as a uniform distribution with τ_{L }= 200. Compared to the Gamma distribution, the Normal distribution is relatively flatter. Thereby updating the linker length distribution did not significantly change the total number of predicted nucleosomes (or nucleosome density). The sensitivity increased on average by ~ 45% and the FDR dropped by ~ 5% after one update (for both the first and fourth order models). Further updating continued to improve the prediction until it stabilized after four iterations. In contrast, the Gamma model is much more skewed. The uniform linker length distribution resulted in an underestimation of the total number of nucleosomes. By four updatings, the sensitivity increased by 8%, while FDR remained relatively more stable.
Table 1. Updating linker length improves prediction  Normal linker length model
Table 2. Updating linker length improves prediction  Gamma linker length model
We also observe that under the same setting and condition, the fourth order model performs slightly but uniformly better than the first order model in both sensitivity and FDR. This is given that the true models are known and we scan the sequence using the true models. In theory, the first order Markov chain model is nested in the fourth order model. Therefore if the true model is the first order, a well trained fourth order model will have the same prediction power as the first order model, but not vice versa. Since training a higher order Markov chain model requires more data, inadequate training can undermine the prediction power.
Rescaling vs. not Rescaling
To illustrate the advantages of rescaling, we rescaled the yeast profiles according to the base composition of the maize genome (G/C scaling factor in maize is 1.2). Using the rescaled profiles we simulated 10 genomic sequences that each contain 10,000 nucleosomes and 10,001 linkers. The linker DNA length followed the same two distributions as in Table 1 and 2. We compare the prediction results from the scaled and nonscaled yeast profiles in Table 3 and 4. We found using the rescaled ("correct") profile yields a lower FDR than using the yeast profile. In addition, updating the linker length under the correct profile consistently improves the sensitivity and FDR until prediction stabilizes. In contrast, while using the yeast profile to scan the simulated maizelike genome, the prediction drastically deteriorates as the linker length updating proceeds. The same simulation was repeated on other species including human and C. elegans, where the base composition is similar to yeast (A/T scaling factor is 0.96 for C. elegans, and 1.03 for human). Unsurprisingly, the results from the scaled yeast profile were still better than those from the original yeast profile in terms of both sensitivity and FDR, while the difference is much smaller than for the maize case (results not shown).
Table 3. Rescaling models improves prediction  Normal linker length model
Table 4. Rescaling models improves prediction  Gamma linker length mode l
NuPoP vs. other software tools
We briefly assess the prediction performance of NuPoP by comparing it with two existing methods: the Nscore method of [4] (results kindly provided by Dr. G. Yuan, personal communication) and the Markov model/thermodynamic equilibrium method of [7] (to be called MM/TE method below). As the exact genomewide nucleosome positioning map is unknown, we utilize the 371,914 454 highthroughput sequence reads to identify welldefined nucleosomes. We first selected sequences of length between 130160 bp and constructed a readsbased occupancy map. The readsoccupancy score at a specific position is defined as the number of reads that covered this position. Then we calculated the moving average of this reads occupancy score using a 147 bp window. A sharp peak in the average occupancy curve indicates a nucleosome with welldefined positioning. Considering that the average linker DNA length is 20 bp in yeast [13], we quantified the sharpness of the peak by calculating the slope from the peak point to the up/downstream 20 bp point on the average occupancy curve. We set one condition for the peak to be selected as the center of a nucleosome to be that the absolute value of slope from either side should be > 0.01. Secondly, we required that the peak height itself must be ≥ 1.9, i.e., a welldefined nucleosome must be testified by at least two welloverlapped reads. We chose the threshold value as 1.9 instead of 2.0 because the overlap of the two reads can be less than 147 bp, resulting in a peak on the moving average curve slightly lower than 2.0. With these criteria, a total of 20,471 welldefined nucleosomes are selected from the 16 chromosomes of yeast. A snapshot of a region with many selected welldefined nucleosomes is presented in Figure 1. Figure 2 provides a snapshot of nucleosome occupancy predicted by NuPoP together with the readsoccupancy.
Figure 1. A plot of the experimentally defined reads occupancy score (red curve) for a region of yeast chromosome 4 showing the selected welldefined nucleosomes (grey shaded bars).
Figure 2. A snapshot of predicted nucleosome occupancy from NuPoP (shaded grey) compared with the experimentally obtained readsoccupancy (red).
We first assess the sensitivity of predictions from NuPoP using the welldefined control set. If there is a predicted nucleosome within ±k bp of any welldefined nucleosomes (center to center), we count this as one correct prediction. We varied k from 5, 10, ... 70, 73 to investigate the sensitivity behavior at different precision thresholds. The Nscore model predicted 48,394 nucleosomes. The current software tool for MM/TE method does not provide Viterbi predictions, but only the nucleosome occupancy scores. Therefore we calculated the moving average of the occupancy score using a 147 bp moving window. The resulting peaks were treated as the centers of predicted nucleosomes. If two peaks reside within 127 bp, we discarded the one with smaller moving average of occupancy score. This procedure identified 43,979 predicted nucleosomes (if we had required two nucleosomes to be 147 bp away, even fewer nucleosomes would have been identified). Likewise, using the occupancy scores from NuPoP, we identified 52,327 and 51,380 predicted nucleosomes under the 4th and 1st order models respectively. In Figure 3a, we compare the sensitivity estimates from the 4th order model of NuPoP with the Nscore and MM/TE methods at different threshold values of prediction accuracy (the fourth order model from NuPoP performs better, but very slightly, than the first order model. Hence the latter was omitted in Figure 3 for better presentation). As the sensitivity tends to increase with an increase in the total predictions, we further selected 43,979 and 48,394 best predictions from NuPoP model (here "best" is in the sense of the largest sum of occupancies over 147 bp) to compare with the Nscore method and in Figure 3b and 3c respectively.
Figure 3. Comparing sensitivity of NuPoP predictions with existing methods. The sensitivity is assessed based on 20,471 welldefined nucleosomes from 454 nucleosomes reads. We call a prediction correct if a nucleosome is predicted within +/ k bp distance (Xaxis) of a welldefined nucleosome (center to center) for k = 5 to 73. (a) Sensitivity plot of NuPoP (black) compared with Nscore method (blue), MM/TE method (red), and the duration Hidden Markov Model using a uniform distribution on 1,...,500 (pink). The random expectation curve is not calculated because the total number of predictions varies in different methods. (b) Sensitivity plot of NuPoP (black) compared with MM/TE method (red) and random expectation (green) while controlling the total predictions to be the same as MM/TE method. Green: random expectation. (c) Sensitivity plot of NuPoP (black) compared with Nscore method (red) and random expectation (green) while controlling the total predictions to be the same as Nscore method.
The sensitivity results suggest that the predictions from NuPoP outperforms the other two methods in two senses. Firstly, the sensitivity from NuPoP is 4.98.3% higher than the other two methods at different threshold values (Figure 3a). Secondly, while controlling the total predictions to be the same, NuPoP has ~ 3.25.3% higher sensitivity than the MM/TE method (Figure 3b), when the precision threshold is ≤ ±35. As the precision threshold gets less stringent, the difference attenuates and eventually vanishes. The contrast between NuPoP and the Nscore method is even larger as shown in Figure 3c.
As a further comparison, we computed the predictions of the dHMM method under a uniform linker length distribution defined on 1, 2,..., 500. This method predicted 48,334 nucleosomes under the 4th order models, achieving a sensitivity 3.06.6% higher than the MM/TE method (Figure 3a). When controlling the total predictions to be the same as MM/TE method or Nscore method, the resulting sensitivity curve almost perfectly overlaps with that from NuPoP. Therefore we omitted these results from Figure 3b and 3c.
One could further attempt to evaluate the false positive rate (FPR), measuring the fraction of linker regions that were falsely classified as nucleosome regions (or similarly the false discovery rate, FDR). This task requires welldefined linker regions. A problem, however, is that the average length of linker DNAs in yeast (20 bp; [13]) is smaller than the dispersion in lengths of the nucleosome DNAs as isolated biochemically (which is often 3050 bp full width at half maximum, notwithstanding that the nucleosome as defined crystallographically has precisely 147 bp of DNA). Thus existing nucleosome maps lack the precision needed to define such short linker DNAs. Moreover, various sampling biases such as the DNA sequence preferences of the micrococcal nuclease used to liberate nucleosomes biochemically (which preferentially cleaves A/T rich regions) could yield longer genomic regions that are free of recovered nucleosome DNA reads even if they are actually nucleosome occupied [20]. Attempts to evaluate the FPR given these problems in the data could result in misleading conclusions. For these reasons, FPR evaluation is not pursued in this paper.
Discussion
The duration Hidden Markov model proposed in this paper is a generic model for the oscillating structure of nucleosome and linker DNAs in chromatin fiber. The Markov models can be replaced by any other models for the nucleosome and linker states. The kernel method for linker length training is nonparametric and typically robust. We showed in the simulation that updating the linker length distribution iteratively improves sensitivity and FDR in prediction if appropriate nucleosome and linker models are used. In particular, the first iteration often achieves the most pronounced improvement. In contrast inappropriate nucleosome and linker models could lead to the opposite outcome, as shown in the simulation studies (Table 3 and Table 4). In reality, the genomic DNAs are complicated by their biological functions. The models trained based on typical nucleosomes or linkers may not well fit some special genomic regions like repeated elements. To avoid possible risks due to such complications, we trained the linker length distribution less greedily by using only one iteration in NuPoP.
Limitations may still exist in the model training and assessment used in this study. The MNase is known to have strong preference to cleave dinucleotides containing only A/T [21]. Consequently the MNasemapped nucleosome sequences can be systematically biased in some regions. This bias could undermine the prediction power because of the dampened signal in the trained nucleosome model. The systematic bias may also exist in the welldefined nucleosomes, causing inaccuracy in sensitivity estimation. A better map of nucleosomes is highly desirable for both purposes. For species other than yeast, we currently lack highquality genomewide nucleosome sequence data (e.g., like the 454 reads) for model training and model validation. The advantages of the rescaling method shown using simulation in this paper need to be further assessed once such highquality data becomes available. Moreover, the results from different methods in this paper were all based on the default settings. The Nscore method was originally trained based on a much smaller set of nucleosome and linker sequences. A better training using a larger set could improve this method's predictions. In addition, different settings in the Nscore or MM/TE methods can lead to different predictions, which we did not further investigate here. Finally, the software for MM/TE method only provides the occupancy score. Different ways to call a predicted nucleosome based on the occupancy score might lead to different conclusions.
Finally, we address the question of which subset of the available 454 reads data might best be used for training the nucleosome model. In NuPoP, we trained the nucleosome model using selected nonredundant nucleosome reads of length within a short range (146149 bp), to retain strong high resolution nucleosome sequence signatures, e.g., the _10 bpperiodic dinucleotide signals. As comparisons, we trained two additional nucleosome models: one using the selected nonredundant reads of length 122177 bp (retaining the nonredundancy but yielding far more training data), and the other using all reads of length 122177 bp. The resulting models both contain the kmer usage information that distinguishes nucleosomes from linkers (e.g., [3,4]), while the dinucleotide signals in these models are much weaker due to poor alignment of these reads. Furthermore, as the reads count at a nucleosome site is heavily biased by the G/C content due to MNase specificity and other effects in the experiment, the model trained from the redundant reads tends to be overenriched in G/C. When combined with the linker model from NuPoP, the two alternative nucleosome models yielded comparable sensitivity as NuPoP in predicting the approximate positioning of nucleosomes, assessed based on the 20,471 welldefined nucleosomes used above. This comparison, however, is not sensitive to spatial precision of the predictions. Therefore, we asked further, given that a nucleosome is predicted within ±73 of a true nucleosome, which model predicts the location more accurately? To investigate this, we simulated genomic sequences using the nucleosome and linker models from NuPoP. We compared the prediction from the three models and found that the true model with strong signals achieves much better prediction accuracy than the two alternative models. For example, 16.1% of the predictions from the true model were prefect (with 0 bp offset), compared to 8.7% and 5.9% respectively from the other two models (results not shown).
Conclusions
The dHMM model proposed in this paper is effective in characterizing the oscillating structure of nucleosome and linker DNAs in chromatin fiber. Explicit modeling of linker length improves the prediction of nucleosome positioning regarding sensitivity. The developed software tool NuPoP provides a userfriendly interface for predicting nucleosome occupancy and the most probable nucleosomes positioning map genomewide.
Availability and requirements
NuPoP software tools are freely available from http://nucleosome.stats.northwestern.edu webcite. The R package shall be made available through bioconductor http://www.bioconductor.org webcite upon publication. To run the NuPoP Fortran standalone program, a Fortran compiler is required. For the NuPoP R package, an R version later than 2.9 is required.
Authors' contributions
LXi did all the data analyses. JPW, JW, and LXi wrote the paper. JPW and JW directed the research. YFM conducted all lab work for data generation and validation. JPW and LXi developed the NuPoP software tools. LXia and JF implemented the NuPoP webserver. All authors read and approved the final manuscript.
Acknowledgements
The research is supported by NIH grant R01GM075313 and NCI grant U54CA143869. We acknowledge with gratitude the gift of a parallel sequencing run from Roche/454 Life Sciences, and thank Ms. Jolene Osterberger (Roche/454 Life Sciences) and Dr. Nadereh Jafari (Northwestern University) for arranging this. The authors would also like to thank Drs. Eran Segal, Guocheng Yuan, Zhiping Weng and Yutao Fu for help in providing data and their codes.
References

Segal E, FondufeMittendorf Y, Chen L, Thåström A, Field Y, Moore I, Wang JPZ, Widom J: A genomic code for nucleosome positioning.
Nature 2006, 442:772778. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ioshikhes I, Bolshoy A, Derenshteyn K, Borodovsky M, Trifonov EN: Nucleosome DNA Sequence Pattern Revealed by Multiple Alignment of Experimentally Mapped Sequences.
J Mol Biol 1996, 262:129139. PubMed Abstract  Publisher Full Text

Peckham H, Thurman R, Fu Y, Stamatoyannopoulos J, Noble W, Struhl K, Weng Z: Nucleosome positioning signals in genomic DNA.
Genome Res 2007, 17(8):11707. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yuan GC, Liu JS: Genomic sequence is highly predictive of local nucleosome depletion.
PLoS Computational Biology 2008, 4:e13. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tolstorukov M, Choudhary V, Olson W, Zhurkin V, Park P: nuScore: a webinterface for nucleosome positioning predictions.
Bioinformatics 2008, 28:14561458. Publisher Full Text

Field Y, Kaplan N, FondufeMittendorf Y, Moore I, Sharon E, Lubling Y, Widom J, Segal E: Distinct modes of regulation by chromatin encoded through nucleosome positioning signals.
PLoS Comput Biol 2008, 4(9):e1000175. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kaplan N, Moore I, FondufeMittendorf Y, Gopssett A, Tillo D, Field Y, LeProust E, Hughes T, Lieb J, Widom J, Segal E: The DNAEncoded Nucleosome Organization of a Eukaryotic Genome.
Nature 2009, 458:362368. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schones D, Cui K, Cuddapah S, Roh T, Barski A, Wang Z, Wei G, Zhao K: Dynamic regulation of nucleosome positioning in the human genome.
Cell 2008, 132(5):88798. PubMed Abstract  Publisher Full Text

Lee C, Shibata Y, Rao B, Strahl B, Lieb J: Evidence for nucleosome depletion at active regulatory regions genomewide.
Nature Genetics 2004, 36:900905. PubMed Abstract  Publisher Full Text

Lee W, Tillo D, Bray N, Morse R, Davis R, Hughes T, Nislow C: A highresolution atlas of nucleosome occupancy in yeast.
Nature Genetics 2007, 39(10):123544. PubMed Abstract  Publisher Full Text

Whitehouse I, Tsukiyama T: Antagonistic forces that position nucleosomes in vivo.
Nat Struct Mol Biol 2006, 13:633640. PubMed Abstract  Publisher Full Text

Lubliner S, Segal E: Modeling interactions between adjacent nucleosomes improves genomewide predictions of nucleosome occupancy.
Bioinformatics 2009, 25(12):i348.
i355
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Rabiner LR: A tutorial on hidden markov models and selected applications in speech recognition.
Proc IEEE 1989, 77:257285. Publisher Full Text

Rabiner L, Wilpon JG, Soong FK: High performance connected digit recognition using hidden Markov models.
IEEE Transaction on Acoustics, Speech and Signal Processing 1990, 37(8):12141225. Publisher Full Text

Burshtein D: Robust parametric modeling of durations in hidden Markov models.
IEEE Transactions on Speech and Audio Processing 1996, 4:240242. Publisher Full Text

Wang JP, Xi L: Duration estimation for Hidden Markov Models.

Wang JPZ, Widom J: Improved alignment of nucleosome DNA sequences using a mixture model.
Nucleic Acids Research 2005, 22(22):674355. Publisher Full Text

Wang JP: Estimating the species richness by a Poissoncompound Gamma model.

Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek J, Costa G, McKernan K, Sidow A, Fire A, Johnson S: A highresolution, nucleosome position map of C. elegans reveals a lack of universal sequencedictated positioning.
Genome Res 2008, 18(7):105163. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang JP, FondufeMittendorf Y, Xi L, Tsai GF, Segal E, Widom J: Preferentially quantized linker DNA lengths in Saccharomyces cerevisiae.
PLoS Computational Biology 2008, 4(9):e1000175. PubMed Abstract  Publisher Full Text  PubMed Central Full Text