Abstract
Background
In eukaryotes, most DNAbinding proteins exert their action as members of large effector complexes. The presence of these complexes are revealed in highthroughput genomewide assays by the cooccurrence of the binding sites of different complex components. Resampling tests are one route by which the statistical significance of apparent cooccurrence can be assessed.
Results
We have investigated two resampling approaches for evaluating the statistical significance of bindingsite cooccurrence. The permutation test approach was found to yield overly favourable pvalues while the independent resampling approach had the opposite effect and is of little use in practical terms. We have developed a new, pragmaticallydevised hybrid approach that, when applied to the experimental results of an Polycomb/Trithorax study, yielded pvalues consistent with the findings of that study. We extended our investigations to the FL method developed by Haiminen et al, which derives its null distribution from all binding sites within a dataset, and show that the pvalue computed for a pair of factors by this method can depend on which other factors are included in that dataset. Both our hybrid method and the FL method appeared to yield plausible estimates of the statistical significance of cooccurrences although our hybrid method was more conservative when applied to the Polycomb/Trithorax dataset.
A highperformance parallelized implementation of the hybrid method is available.
Conclusions
We propose a new resamplingbased cooccurrence significance test and demonstrate that it performs as well as or better than existing methods on a large experimentallyderived dataset. We believe it can be usefully applied to data from highthroughput genomewide techniques such as ChIPchip or DamID. The Cooccur package, which implements our approach, accompanies this paper.
Background
A large number of proteins are known to bind DNA in a locationspecific manner. These include transcription factors, replication factors and chromatin components. It is widely accepted that individual proteins do not usually act in isolation but form multiprotein effector complexes on DNA. When the binding sites of the individual proteins within a complex are determined by genomewide highthroughput assays, these complexes are revealed as regions where the binding sites of multiple proteins are clustered. When evaluating the apparent colocalisation of the binding sites of a pair of proteins it is necessary to determine whether they genuinely colocalise or whether the observations could have arisen by chance. Many methods have been proposed for assessing the statistical significance of such clusters (reviewed in [1]). The classical statistical methods rely on obtaining the null distribution for a statistic that correlates with the phenomenon of interest. The two key design decisions are therefore the choice of how the statistic is computed and how the null distribution is obtained. We will discuss how we have addressed these questions when considering the merit of a particular test and when devising an improved test.
Choice of statistic
General considerations
Some latitude exists in the choice of statistic for a cooccurrence test. With cooccurrence, we are primarily interested in different DNAbinding factors being located closely together more frequently than might be expected 'by chance', the latter being determined by the null distribution. A useful statistic is expected to be increased by the proximity of the factors and the number of clusters of factors in the genome. Using the term heterogeneous overlap to refer to overlap between the binding sites of two different factors, one can readily envisage as useful statistics either counting the number of pairs of locations where heterogeneous overlaps occur, or the total number of bases of heterogeneous overlap. With the former, it is the presence or otherwise of an overlap that determines whether to increment the statistic while the latter weights overlaps according to the degree they are overlapped.
We desire a general framework for a statistic that is applicable to different policies for scoring cooccurrence. Additionally, we also wish to divorce the specific scoring policy from the statistical test itself. A cooccurrence matrix can satisfy both objectives.
Cooccurrence matrix
We will use the term binding profile to refer to the list of locations bound directly or otherwise by a specific DNAassociated factor (DAF). Let and be the binding profiles of a pair of DAFs, A and B, respectively and define their joint set of locations as .
Further, let there be a cooccurrence matrix, C with elements, c_{ij}, such that the cooccurrence statistic can be defined as
i.e. the cooccurrence statistic is the sum of the scores of every pairwise combination of locations where one is drawn from those bound by A and the other from one bound by B (Figure 1 with the specific implementation used in this paper described in Figure 2).
Figure 1. An example of a cooccurrence matrix. (a) A binding profile for two proteins, A and B, are displayed schematically in red and blue respectively. The sequence coordinate is along the horizontal axis and each rectangle represents a location. (b) A schematic of the cooccurrence matrix for the joint profile of (a). Locations are numbered sequentially by start coordinate. Rows and columns are associated with factors A and B respectively. The metric is used is whether locations overlap. Overlapped pairs have nonzero c_{ij }and are indicated by diagonal hatching of the element. For example, we see from the matrix that locations 3 and 4 overlap and when they are bound by factors A and B, c_{34 }is nonzero and therefore the statistic is increased by its value. (c) The binding profiles of A and B shown in (a) have been indicated on the cooccurrence matrix with faint red and blue backgrounds respectively. The elements that will be summed in evaluating the statistic are outlined in bold. These represent every pairwise combination of the two profiles.
Figure 2. Statistic used for assessing cooccurrence. For the purposes of this paper, a simple statistic was used that counted each instance of an overlap between locations bound by different factors. Examples of clusters scoring (a) one, (b) two, and (c) four are shown. Note that in (c), in multiplyassigned locations, each assignment is treated as an independent location.
The cooccurrence matrix embodies all information relevant to the calculation of the statistic. c_{ij }is the contribution to the statistic when location i is bound by A and location j is bound by B. That C is defined in general terms allows its use with a wide range of different approaches to scoring cooccurrence.
For example, if the cooccurrence metric is whether the features overlap irrespective of its extent, c_{ij }is set to unity when an locations i and j overlap and to zero otherwise. Alternatively, if the number of bases overlapped between two locations is the metric, c_{ij }is set to the number of bases overlapped between locations i and j. Other scoring strategies can be readily considered and implemented via a cooccurrence matrix in a similar manner. The use of a matrix removes subsequently irrelevant details such as the actual base coordinates and the specific overlap scoring metric from further consideration since it incorporates their entire effect on the calculation of the statistic. Computation of the statistic can therefore be performed during resampling solely by reference to the cooccurrence matrix.
Choice of null distribution
The difference between statistical methods frequently reduces to the manner in which the null distribution is obtained. With parametric methods, the observed data is used to parameterise one of the classical statistical distributions which is then used as the null distribution. Nonparametric methods use resampling approaches to approximate the null distribution.
Chromatin components naturally bind in a highly nonuniform manner and the null distribution needs to account for this appropriately. Indeed, failure to account for so called bursty sequence effects can result in many false positives [2]. In addition, cooccurrence tests must be cognisant of the provenance of the binding site data. Binding profiles arise from experiments and the mapped sites reflect the actual physical state of chromatin. The contribution of the underlying sequence is therefore already entirely accounted for in the binding profile. In contrast, attempts to identify clustering of sequencespecific proteins by the cooccurrence of their binding site motifs try to reconstruct the chromatin state from sequence: in this case, the bindingsite motifs specified by the positionweight matrices (PWMs) and the underlying sequence composition are pertinent to the specification of the null distribution: for example, two similar PWMs will have a large number of cooccurring hits anyway. So while both sources of data generate sets of locations that can be analysed for cooccurrence, the null distributions are likely to be very different. Here, our interest is strictly confined to the simpler case of binding profiles alone and our method is unlikely to be applicable to motif cooccurrence data. A survey of the motif cooccurrence problem can be obtained from papers cited in [3].
Tests based around resampling strategies have been previously explored as a means of accounting for variation in binding site distribution [4]. As they rely on randomising the observed binding sites for the factors, they can be expected to automatically account for binding site nonuniformity. In the following sections, we will demonstrate that the utility of these methods is highly dependent on the resampling scheme used.
Resampling strategies rely on sampling a pool of locations to infer the null distribution. The pool can be constituted in two ways. First, it can comprise only the binding profiles of the pair of factors being examined for cooccurrence. Alternatively, it could also be constituted by pooling the binding profiles of a large number of factors, some or all of which are to be investigated for pairwise cooccurrence. All methods we discuss except for the FL method of Haiminen et al use the former approach.
Resamplingbased tests
Resampling methods operate by randomly sampling the observed data to construct a null distribution. When evaluating the statistical significance of cooccurrences between a pair of binding profiles containing m and n locations respectively, their constituent sites are combined to form a joint pool of locations. Further pairs of binding profiles containing m and n locations respectively are then repeatedly sampled from this joint pool and scored for cooccurrence to estimate the null distribution of the test statistic.
We can describe resampling formally with two sampling operators F(X, n) and G(X, n) that draw n elements from X. Then, at each iteration, two new lists of binding sites, and , the same size as the originals are generated:
and the distribution of is used as the null distribution. and are the numbers of elements in and .
In the following sections, we report the results of our investigations into three different resampling strategies, including the previously described permutation test approach.
Permutation test
In this approach, a joint pool of all binding sites is created and sites randomly assigned to DAF A. The remaining sites are then assigned to DAF B (see Figure 3(a)).
Figure 3. Examples of permutation schemes. In the top panel a binding profile in which two factors, A (red) and B (blue), bind at four different locations yielding two overlapped clusters. The colour of outlines indicate how that location can be assigned. A location outlined in black can only be assigned once. Red and blue outlines indicate that location can be assigned to A or B respectively. Assignments are without replacement, i.e. multiple assignments of the same location to the same factor are not permitted. Asterisks show where an overlap is scored. Examples of the combinations that arise from different schemes are shown. (a) Permutation: all locations are assigned to either A or B but not both. No locations are left unassigned. Overlaps can only occur where overlaps have been observed before. (b) Independent assignment. All locations can be assigned to either A or B or both factors. Previously unobserved overlaps arise when the same location is assigned to both factors. (c) Hybrid: Locations previously observed to be involved in cooccurrences can only be assigned to A or B but not both. Singletons can be freely assigned to either factor or both factors.
Let ρ and be operators for sampling with replacement and without replacement respectively. The permutation test can then be formally described in set notation as:
Independent resampling
Instead of permuting the joint pool of locations, this approach reconstitutes and independently (see Figure 3(b)), i.e.
Note that the same location can be assigned to both factors with independent sampling.
Hybrid method
The approach here is to split the joint locations into two pools. The first pool comprises locations at which cooccurrence has been observed in the dataset. The second pool comprises the remaining singleton locations.
During resampling, locations in the first pool may either remain unassigned or allocated to one factor only (see Figure 3(c)). Thus, factors cooccur in this pool only when they are assigned to a pair of overlapped locations. Locations in the second pool are freely assigned to the factors. But since the locations in this pool are singletons, overlaps arise only when both factors are assigned to the same location.
A formal statement of this strategy proceeds along the following lines. The locations in are partitioned into sets of heterogeneously overlapped and singleton locations. We can define the set of singleton locations with reference to the cooccurrence matrix by
and the set of heterogeneously overlapped locations is then
The resampling is then performed by:
Multiple factor methods
Haiminen et al proposed and analysed the performance of several interesting approaches that can be applied where the binding sites of many factors are simultaneously known on the same sequence extents, a situation that arises frequently with highthroughput datasets [2].
Two approaches of theirs are of particular interest. In the FL approach, the joint pool comprises all binding sites for all factors in the dataset. Permutation is performed by assigning sites to DAF A from this pool. A further sites are then assigned to DAF B from the remaining unassigned members of the pool. The null distribution is the distribution of cooccurrence scores obtained by this permutation method. In the FL(r) method, the binding sites of the first factor, DAF A, remain unchanged while those of the second factor are permuted as described above.
The FL but not the FL(r) approach will be included in our analysis. We are not entirely at ease with treating one of a pair of factors differently from the other nor with the possibility that a cooccurrence may be significant when one factor is fixed but not when the other is fixed.
Desired behaviour of a cooccurrence significance test
It is important to form some expectation of the pvalues that could arise from a sensible method for assessing factor cooccurrence. First, greater significance would be expected when a larger proportion of sites cooccur. For example, one might expect 50 cooccurrences where each factor only bound to 100 sites to yield a lower pvalue than if each factor bound to 10000 sites. In addition, given the same proportion of cooccurring sites, a greater significance is expected if the total number of sites is larger.
Results and Discussion
A simple model system
A particularly simple cooccurrence model is one with m paired cooccurrences out of a total of (n_{A }= ) and (n_{B }= ) binding sites for DAFs A and B respectively.
The behaviour of different approaches to implementing permutation tests will initially be investigated on this analytically tractable system. Unless otherwise stated, the statistic used is the number of cooccurrences as described above and a cooccurrence is deemed to occur if two locations bound by different factors overlapped each other. As we note above, the specific overlap definition used has little influence on the validity or otherwise of the approaches.
Permutation test
This approach appears to be similar to that reported by Hannenhalli and Levy [4].
Permutation test: synthetic data
When a simulation was conducted with m = 10 and n_{A }= n_{B }= 10, i.e. wherein all 10 sites bound by A and B were paired, 139 cooccurrences were observed in 100000 resamplings yielding a pvalue of 0.00139. When the total number of sites bound was increased to n_{A }= n_{B }= 1000 without change to the number of cooccurrences, 98 cooccurrences were observed which yields an estimated pvalue of 0.00098. An apparent anomaly arises therefore where the pvalue is only modestly changed when the proportion of cooccurring sites changes from 1 in 2 to 1 in 100. More surprisingly, the pvalue actually falls with this change. What is the source of this anomaly? A better understanding can be gained from examining the mathematics behind this approach.
The pvalue for the simple model can be derived analytically. The observed number of cooccurrences in this case is also the highest value it could attain  no permutation can cause a cooccurrence at any of the singleton locations since each of these can only be assigned to one of the factors. The pvalue is therefore the probability of exactly m paired cooccurrences given the values of n_{A }and n_{B}. This probability is, in turn, the product of the probability of having exactly m sites of each factor assigned to the overlapped pairs, P_{m }and the probability of them being arranged as pairs, P_{pair}. The former is described by the hypergeometric distribution
The latter is the product of selecting m pairs starting with m of each factor. The probability of each pair is described by the hypergeometric distribution from the number of sites left to assign. On simplifying
For any given ratio of n_{A }to n_{B}, p_{m }changes little with increasing n = n_{A }+ n_{B}. For example, with n_{A }= n_{B }= 20, the probability of assigning 10 sites to each factor is 0.247. With n_{A}= n_{B }= 1000, it is 0.177. The pvalue is therefore dominated by P_{pair}, i.e. it is the number of overlaps that determines the pvalue: the proportion of sites overlapped has little influence on it. For example, P_{pair }for 10 overlaps is 0.0055 which is the upper limit for pvalue irrespective of the total number of sites. With 20 overlaps, P_{pair }= 7.6 × 10^{6 }which guarantees statistical significance on almost any reasonable threshold irrespective of how small a proportion of the total number of sites bound that might be. This does not accord with what might be deemed reasonable behaviour from a cooccurrence test.
Permutation test: other implications
As previously noted, the permutation test approach appears to have been used by Hannenhalli and Levy [4] albeit in a different statistical framework. We were interested to determine whether their framework circumvents the drawbacks we identified with a permutation test approach assessed with the hypergeometric distribution.
Hannenhalli and Levy studied the cooccurrence of hits generated by a pair of positionweight matrices (PWMs) and counted the number of times hits from different PWMs were within some threshold distance of each other. They then computed a colocalization index (CI):
where N_{ij }is the number of cooccurrences observed between two PWMs, i and j, and R_{ij}is the number observed after a permutation. Repeated permutations of the dataset can be expected to yield differing values of R_{ij }and it is not clear how the value of R_{ij }used above is selected. Further, it should also be noted that both the logratio and the ratio have been used to define CI [4,5]. To get an idea as to how the distribution of CI varies with the fraction of sites overlapped, we simulated 5000 resamplings of a simple model having 40 paired overlaps and a varying number of singleton sites (Figure 4). It is clear that the distribution of CI varies little with the fraction of cooccurring sites and is almost wholly determined by the absolute number of overlapped sites. Hannenhalli and Levy [4] commented on the high frequency of cooccurring transcription factor binding and the paradox that higherorder cooccurrences were not enriched amongst factors showing highlysignificant pairwise cooccurrences. While cooccurring PWM pairs will be expected to yield many overlapped pairs and therefore generate high CI scores, it could be that some of the highly scoring PWM pairs had large number of hits such that a small fraction of overlaps is suficient to yield enough overlaps to give a high CI thereby causing those factors to be incorrectly identified as cooccurring pairs.
Figure 4. CI distribution for different degrees of overlap. The distribution of cooccurrence index was computed with 5000 resampling iterations for 40 overlapped pairs and varying numbers of singletons (1/10/100/1000 singletons per PWM).
Independent resampling
A major reason for the highly significant pvalues observed with the permutation test approach is that the observed score is maximal and there are so many ways in which m sites of each factor can be arranged to result in fewer overlaps. Instead of permuting the available sites, if two independent draws of size n_{A }and n_{B }are made, it is possible for the same location to be assigned to both factors (see Figure 3(b)). The observed score is then no longer maximal since overlaps can potentially occur at any location. Further, when both locations involved with the observed overlaps are assigned to both factors, four overlaps can be scored instead of the single overlap observed (e.g. fig 3(b)). This change in sampling procedure has the effect of reducing statistical significance since the observed score can be matched or exceeded in many different ways.
In contrast to the permutation test approach, which leads to overly significant pvalues, statistical significance was never achieved with independent resampling. For example, using synthetic data with 20 locations, all presenting as overlapped pairs, a surprisingly high value of 0.64 is obtained. This arises because whenever a pair of overlapped locations has each of its locations assigned to both factors, it increases the score by four. This is four times that obtained when two isolated locations, each bound by a different factor, overlap each other.
Hybrid approach
The behaviour of this model is expected to be intermediate between permutation test and independent draws in that the first pool is treated with the former strategy while the second is treated with the latter.
Hybrid approach: synthetic data
With a profile consisting of paired overlaps only, as with the earlier example with 10 pairs (m = 10), the result is identical to that obtained from the permutation test strategy, i.e. the pvalue is approximately 0.005. However, merely adding 5 singletons of each factor to the model raises the pvalue to approximately 0.12. We now compare other scenarios against this baseline example. With more pairs, the pvalue becomes more tolerant of singletons. For example, with m = 20, adding 5 singletons to each factor yields a pvalue of around 0.001 (as opposed to 0.12 with the baseline). We expect that for the same proportion of overlapped locations, the pvalue should fall with an increase in the total number of locations. Adding ten locations to each factor with m = 20 gives the same proportion of overlapped locations as the baseline but yields a pvalue of around 0.02 (which is lower than the baseline value of 0.12). Both these behaviours are concordant with the desired behaviour of an appropriate cooccurrence test.
The asymptotic behaviour of a statistic S_{hyb }computed by the hybrid method can be considered for the case where a binary scoring is used. At one extreme, when there are no singleton sites, the method is entirely equivalent to the permutation test and should yield significant pvalues provided the number of sites overlapped is large enough. At the other extreme, when a negligible fraction of the locations are overlapped, the distribution of S_{hyb }can be expected to approach that where n_{A }As and n_{B }Bs bind a total of n_{A }+ n_{B }sites. The probability mass function (pmf) of S_{hyb }for this limiting case is
where 0 ≤ k ≤ min(n_{A}, n_{B}). (derivation in Materials and Methods section).
As the average value of S_{hyb }is then n_{A}n_{B}/(n_{A }+ n_{B}), we may expect that statistical significance will only be achieved if the observed number of overlaps considerably exceeds this.
Hybrid approach: Polycomb/Trithorax data
The hybrid technique was tested on data published by Schuettengruber et al [6]. The authors published ChIPchip [7] binding profiles for Polycomb and Trithorax proteins (PC, TRXN, TRXC), other chromatin proteins, Polyhomeotic, Pleiohomeotic, Pleiohomeoticlike, Dorsal switch protein1, GAGA factor (PH, PHO, PHOL, DSP1, GAF) as well as histone marks (Me3K4, Me3K27) in Drosophila melanogaster. In all, ten chromatin components bound over 28068 locations were examined.
In this test, the presence of cooccurrence was investigated for all possible pairs of binding profiles. Of the 45 potential pairs, 11 highlysignificant cooccurring pairs (pvalue < 0.05) were identified. These were PCMe3K27, PCPH, Me3K27PH, Me3K4PHO, Me3K4PHOL, PHOPHOL, Me3K4TRXN, DSP1GAF, DSP1PHO, PHOLTRXN and PHOTRXN.
The cooccurrences divided the factors nicely into two clusters (Figure 5) with a single cooccurrence linking the clusters (PHPHO). The first cluster includes the two Polycomb recruitment complex I (PRC1) proteins (PH, PC) while the Pleiohomeotic recruitment complex (PhoRC) proteins (PHO, PHOL) localise with the other cluster. It was particularly pleasing to detect a PHPHO cooccurrence since these are similar proteins that bind the same target sequence in vitro but have somewhat different binding profiles in vivo [6]. Unlike PHOL, which only interacts with the PRC2 component Esc [8], PHO is known to interact directly with the PRC1 components PC and PH as well as with Esc [9].
Figure 5. Clusters of statistically significant cooccurring pairs in Polycomb/Trithorax dataset. This figure shows two clusters formed by statistically significant pairwise cooccurrences (pvalue << 0.001; marked by bold lines) between binding profiles of factors analysed in Schuettengruber et al [6]. Two lower confidence cooccurrences are shown with dashed (pvalue < 0.05) and dotted lines (pvalue < 0.1) respectively.
Trithorax is known to be cleaved into two independently acting fragments, TRXN and TRXC [10,11]. A very weak score for the PHTRXC cooccurrence was again consistent with the finding that while TRXN is associated with Me3K4marked sites, TRXC is associated with PRC1 complexes [6].
We were interested in how the pvalue behaved with different degrees of overlap between factors and in particular where the threshold between statistically significant and insignificant overlap lay (Figure 6). It is noteworthy that this method does not only detect cases where both sets overlap extensively, i.e. where percentage overlaps in both cases exceed 50%, but also detects occasions where one factor almost wholly cooccurs with a subset of binding sites of the other factor. Our hybrid method appears to detect cooccurrences broadly consistent with known interactions. A summary of results is shown in Table 1 with the pvalues obtained by each of the methods described in this paper. Note that our earlier objections to the permutation test and the independent resampling approach have been amply confirmed when applied to this dataset.
Table 1. Summary table of Polycomb/Trithorax analysis
Figure 6. Relationship between pvalue and degree of overlap. The data in this plot is derived from the Polycomb/Trithorax dataset [6]. The axes represent the percentage of binding sites overlapped for the a pair of factors, arranged such that the larger percentage is plotted on the xaxis. Factors with significant cooccurrence are plotted in red while those without are plotted in black.
FL method
An experimental implementation of the FL method [2] was used for the following studies.
FL method: synthetic data
Where the dataset consists of only the pair of factors of interest the FL method is identical to the permutation test. However, this is not the context for which the FL method is intended and we examine the effect of additional binding site data from further factors. Consider two factors present as 40 cooccurring pairs. The pvalue obtained over 100000 trials is, as expected from our foregoing permutation test analysis, very low: < 1 × 10^{5}. The presence of other factors with a very similar pattern of binding can be examined by adding further pairs of factors with the same binding sites as the first pair. The pvalue escalates rapidly on doing so, to 0.012, 0.108 and 0.198 on adding one, two and three replicates respectively. If the effect of the presence of a factor with a disjoint set of binding sites can be readily seen by adding to the fourreplicate set, a further factor binding to 100 novel nonoverlapping sites. The pvalue then declines precipitiously from 0.198 to very significant score of ≈ 1 × 10^{4}. The synthetic data analysis presented above illustrates that the FL method is potentially sensitive to the choice of factors in the dataset, which can be expected given that the null distribution is obtained from combining all binding sites in the dataset. Where factors cooccur frequently, the statistical significance of an observed cooccurrence is reduced accordingly. Correspondingly, where that is not the case, an observed cooccurrence is more readily deemed statistically significant.
FL method: Polycomb/trithorax data
To further investigate the potential of the FL method, it was applied to the same Polycomb/Trithorax data previously examined with the hybrid method. A summary of the results from this method are juxtaposed with the results of all other methods in Table 1. In general, with this dataset, the FL method reported more significantly cooccurring pairs than our own hybrid method (27 vs 11 of the 45 possible pairs). All pairs scored as cooccurring by the hybrid method were also cooccurring by the FL method indicating that the FL method was either more sensitive/less conservative than the hybrid method.
Given our concerns over the sensitivity of the FL method to the choice of factors in a dataset, we also determined the effect of removing each factor in turn from the dataset and repeating the analysis. In general, the FL method was found to be quite robust to these exclusions. The results remained unchanged when any one of five factors were excluded (GAF, Me3K4, Me3K27, Pc, TrxN). Exclusion of one of Dsp1, Ph, Pho, Phol or TrxC caused the GAFMe3K4 cooccurrence to become statistically significant. In addition excluding Pho also caused the PholTrxC cooccurrence to be deemed statistically significant. The Suppressor of Hairywing insulator protein, Su(Hw), binds at 3794 sites in the Drosophila genome, almost all of which are distinct from those of other known chromatin components and its addition to this dataset may be expected to raise the statistical significance of cooccurrences between other factors. This was indeed observed, with both the GAFMe3K4 and PholTrxC cooccurrences already mentioned above becoming significant, as well as three previously nonsignificant cooccurrences, GAFTrxN, Me3K27Pho and TrxCTrxN. It is noteworthy that while sites in the Su(Hw) binding profile only comprise around 15% of the total original number of locations, they were enough to drive a further five candidate cooccurring pairs to significance.
Performance
As with other resampling methods, our hybrid method test is computationally expensive, especially when low pvalue thresholds are required. The current implementation is written in pure R and is singlethreaded. The user time for analysing the PCMe3K27 cooccurrence data with 5000 resamplings was 1231 s on a Core 2 Quad Xeon T5600 (1.83 GHz). To improve compute times we exploited the inherent parallelism in the resampling process, utilising the R interface to a MessagePassing interface (MPI) implementation, Rmpi [12,13]. When run on a dual Core 2 Quad Xeon T5600 with six slave processes, the elapsed time for the PCMe3K27 task fell to 328 s, a reduction of 3.75fold which amounts to an acceptable 0.8fold per slave process. Further speedup may be possible by reimplementing parts of the code in a lowerlevel language.
Discussion
Our analysis of four different approaches to the use of permutation tests showed the pvalues obtained to be highly sensitive to the manner in which the resampling is done. Two aspects are of concern. First, the independent sampling approach cannot be useful in any practical context as it is incapable of yielding significant pvalues under any practical conditions. It readily explains why it has not been previously encountered. Of greater concern is the permutation test approach. This method readily yields highly significant pvalues with our statistical framework even in cases that should not warrant it. The permutation test approach has been previously used for cooccurrence analysis and we urge caution be used when interpreting data obtained in this way. The FL and hybrid methods were found to be effective on a realworld dataset with the latter being more conservative on this dataset: all pairs scored as significantly cooccurring with the latter were similarly scored with the former.
It is difficult to determine which method is more appropriate. Clustering of factors represent a multiway cooccurrence which may not be adequately detected/rejected by the presence or otherwise of a pairwise cooccurrence. The use of the FL method is sensitive to the selection of factors used to determine the null distribution and some thought should go into selecting appropriate factors to include in the dataset. In particular, a data mining exercise screening large numbers of binding profiles for cooccurrences might favour the inclusion of the hybrid method in the test repertoire because of its insensitivity to factor selection. Given the limited data available to validate the relative performance of the methods with regard to selectivity and sensitivity, it may be advantageous to use both. Generally, resampling tests are limited in the range of pvalues they can yield by the number of resamplings performed and the use of two methods with different sensitivities may allow the very high confidence cooccurrences to be differentiated from the weaker ones. To conclude, we report the development of a hybrid approach based on pragmatic grounds that we believe has utility. Our evaluation, using a comprehensive realworld data set, indicates that the derived cooccurrence data appear reasonable. Indeed, we believe our approach is conservative in that it assumes that all singletons are capable of generating a cooccurrence. While the method may be computationally expensive, it is no more so than other techniques relying on resampling.
An initial release of the Cooccur package implementing our method accompanies this paper (Additional file 1).
Additional file 1. The Cooccur package. A development version as installation package suitable for installation with the R CMD INSTALL command. Subsequent versions will be released via Bioconductor.
Format: GZ Size: 138KB Download file
Conclusions
We have proposed a hybrid approach to sampling in a permutation test for detecting pairwise cooccurrences of factor binding. We have also demonstrated that the method is applicable to realworld data and yields results consistent with previous expectations. It is likely to be useful for analysis of data from highthroughput genomewide screens of factor binding.
Methods
Software
All statistical manipulations were implemented in R, a dialect of the S language [14].
Assessment of cooccurrence
The cooccurrence table used in this study was defined with binary weights, with a unit weight used for overlapping locations. Fig. 3(c) shows an example of our implementation works.
Extension to multiple contigs
The technique is described above with all locations being on the same sequence. However, as this technique is concerned with assessing cooccurrence only, it is possible to build a cooccurrence table across multiple sequences. Note that locations on different sequences can never result in an overlap with each other and the overlap table can therefore be assembled a contig at a time.
To compute the cooccurrence table for the entire dataset, the locations were searched for overlaps, one sequence at a time and overlaps were appended to the table.
Runtime optimisations
When determining statistical significance of cooccurrence between a pair of factors, a short survey run of 100 samples with a pvalue cutoff of 0.1 was used to filter out factors that clearly do not cooccur. A further 5000 iterations was then performed and an estimate of the pvalue obtained from that.
Derivation of nonoverlapped site distribution
In this case, we have a total of h = n_{A }+ n_{B }sites, all singletons and we wish to determine the probability of obtaining an k sites which are assigned to both DAFs A and B where 0 ≤ k ≤ min(n_{A}, n_{B}).
Then, the probability of a cooccurrence of size k is
List of abbreviations used
ChIPchip: chromatin immunoprecipitation on chip; ChIP on array; DamID: CI: colocalization index; DNA adenine methyltransferease identification; DAF: DNAassociated factor; PWM: positionweight matrix
Authors' contributions
DH conceived the approach used in this study, wrote the software used in this study and characterised its performance on published experimental data. SR directs the research group. Both authors wrote the paper.
Acknowledgements
DH thanks Jelena Aleksic for bringing the problem of evaluating the statistical significance of binding site cooccurrence to my attention. I am grateful to Boris Adryan and Audrey Fu for providing their review prior to publication. BA also provided useful comments during the drafting of the manuscript. We also thank the referees for helpful comments on how the manuscript could be improved. DH is funded by a Grand Challenges in Global Health grant to a consortium led by Austin Burt.
References

Fu AQ, Adryan B: Scoring overlapping and adjacent signals from genomewide ChIP and DamID assays.
Molecular BioSystems, in press. PubMed Abstract  Publisher Full Text

Haiminen N, Mannila H, Terzi E: Determining significance of pairwise cooccurrence of events in bursty sequences.
BMC Bioinformatics 2008, 9:336. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Pape UJ, Klein H, Vingron M: Statistical detection of cooperative transcription factors with similarity adjustment.
Bioinformatics 2009, 25:21032109. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hannenhalli S, Levy S: Predicting transcription factor synergism.
Nucl Acids Res 2002, 30:42784284. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Levy S, Hannenhalli S, Workman C: Enrichment of regulatory signals in conserved noncoding genomic sequence.
Bioinformatics 2001, 17:871877. PubMed Abstract  Publisher Full Text

Schuettengruber B, Ganapathi M, Leblanc B, Portoso M, Jaschek R, Tolhuis B, van Lohuizen M, Tanay A, Cavalli G: Functional Anatomy of Polycomb and Trithorax chromatin landscapes in Drosophila embryos.
Plos Biology 2009, 7:e1000013. Publisher Full Text

Solomon MJ, Larsen PL, Varshavsky A: Mapping proteinDNA interactions in vivo with formaldehyde  evidence that histone H4 is retained on a highlytranscribed gene.
Cell 1988, 53:937947. PubMed Abstract  Publisher Full Text

Wang L, Brown JL, Cao R, Zhang Y, Kassis JA, Jones RS: Hierarchical recruitment of Polycomb group silencing complexes.
Mol Cell 2004, 14:637646. PubMed Abstract  Publisher Full Text

MohdSarip A, Venturini F, Chalkley GE, Verrijzer CP: Hotspots of transcription factor colocalization in the genome of Drosophila melanogaster.
Proc Natl Acad Sci USA 2006, 103:1202712032. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hsieh JJD, Cheng EH, Korsmeyer SJ: Taspase1: A threonine aspartase required for cleavage of MLL and proper HOX gene expression.
Cell 2003, 115:293303. PubMed Abstract  Publisher Full Text

Hsieh JJD, Ernst P, ErdjumentBromage H, Tempst P, Korsmeyer SJ: Proteolytic cleavage of MLL generates a complex of N and Cterminal fragments that confers protein stability and subnuclear localization.
Mol Cell Biol 2004, 23:186194. Publisher Full Text

Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A, Castain RH, Daniel DJ, Graham RL, Woodall TS: Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation.
Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary 2004, 97104.

Rmpi home [http://www.stats.uwo.ca/faculty/yu/Rmpi/] webcite

R Development Core Team: [http://www.Rproject.org] webcite
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2009.
[ISBN 3900051070].