The Bioinformatics Centre, Department of Biology & Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaløes Vej 5, 2200 Copenhagen N, Denmark

Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, 2100 Copenhagen Ø, Denmark

Abstract

Background

A central question in molecular biology is how transcriptional regulatory elements (TREs) act in combination. Recent high-throughput data provide us with the location of multiple regulatory regions for multiple regulators, and thus with the possibility of analyzing the multivariate distribution of the occurrences of these TREs along the genome.

Results

We present a model of TRE occurrences known as the Hawkes process. We illustrate the use of this model by analyzing two different publicly available data sets. We are able to model, in detail, how the occurrence of one TRE is affected by the occurrences of others, and we can test a range of natural hypotheses about the dependencies among the TRE occurrences. In contrast to earlier efforts, pre-processing steps such as clustering or binning are not needed, and we thus retain information about the dependencies among the TREs that is otherwise lost. For each of the two data sets we provide two results: first, a qualitative description of the dependencies among the occurrences of the TREs, and second, quantitative results on the favored or avoided distances between the different TREs.

Conclusions

The Hawkes process is a novel way of modeling the joint occurrences of multiple TREs along the genome that is capable of providing new insights into dependencies among elements involved in transcriptional regulation. The method is available as an R package from

Background

Uncovering the details of the machinery involved in gene regulation remains an open problem in both experimental and computational biology. Part of this machinery is the collection of factors, along with the cognate transcription regulatory elements (TREs) that they bind to, that are responsible for the transcription of a given gene. This includes transcription factors and their sites, as well as histone modifications and other DNA-associated proteins. How these factors interact is to a large extent unknown. A fundamental problem in gene regulation bioinformatics is the limited information in the DNA binding typically displayed by transcription factors, which leads to many false positives when predicting binding sites in genomic sequences (reviewed in

Until recently, it was only possible to study the organization of binding sites for regulatory elements via computational methods, since experimental determination of single sites was time-consuming. Examples include

Despite the technical and experimental developments, we still lack a suitable multivariate model for the joint occurrences of multiple transcription factor binding sites and other TREs. Computational approaches generally only treat co-occurrence of sites in a pairwise manner. Pairwise analyses of the TREs, like the inter-motif distance analysis in

To describe the phenomenon that TREs do not occur completely independently of each other, we will throughout this paper use the terms

We analyze two data sets; for comparison, we review previous studies based on these data. The first data set is from Chen et al.

The second data set is from the ENCODE project _{ij}

While correlation analyses based on MTLs or genomic bins might provide insights into occurrences of sites, we have some concerns with these approaches. Most importantly, the choice of the clustering distance in the case of MTLs and the genomic bin size limit the analyses to dependencies that are compatible with the chosen scale. In particular, all specific details of dependencies within the locus or genomic bin are lost. In addition, the binning is initiated at an arbitrary starting point and the choice of starting point could affect the results obtained; in other words, the placement of the bins might affect the final result. Consequently, the correlation analyses may not provide a complete picture of how correlated TREs affect one another's occurrences. Moreover, the analyses in

Our main result is to show that multivariate point process models - and in particular the Hawkes process - are suitable for analyzing TRE occurrences. Moreover, we provide detailed results of separate analyses of the distribution of eleven TREs from Chen et al

Results

The objective is to investigate whether the occurrence of one TRE directly affects the occurrences of other TREs. The correct scale for studying the organization of TREs on the genome seems to be a scale where most regulators show point-like interactions with the genome at binding sites that each cover only a few nucleotides, since this corresponds to actual binding site sizes. At this stage, it is helpful to review the ChIP-seq and ChIP-chip techniques. ChIP-seq/chip are based on a protocol that first fixes DNA-bound proteins to DNA by cross-linking, followed by shearing of the DNA. Antibodies are then added to isolate the DNA bound by a protein of interest (see

Figure

**ChIP-chip to point process**. Illustration of how data from a ChIP-chip experiment can be viewed as a point process. In each cell, the different TREs are positioned along the double-stranded DNA sequence (top). The abundance of binding sites across cells at a particular position of the sequence results in a signal generated by the ChIP-chip experiment (middle). The midpoint of the interval where the signal is above a specified cutoff is used as a proxy for the actual binding site. The midpoints for each of the TREs considered are viewed as points from a multivariate point process along the line (bottom).

Multivariate analysis of TREs for mouse embryonic stem cell data

For the application of our model to the ChIP-seq data from Chen et al

denote the set of these TREs. Inspired by _{m, k }
_{m, k}
_{m, k }

In the multivariate analysis of the 11 TREs we initially allow for all 121 potential interactions among the TREs. As described in the Methods section, each of the _{m, k }

Estimates of the 121

**Estimated g-functions, forward direction, mouse embryonic stem cell data**. Plots of the

It is important to point out that the implementation of the Hawkes process treats the genome simply as a line along which events (the occurrences of TREs) happen. This means that the descriptors "downstream" and "upstream" are dependent only on the direction we assign to the genome and not on the actual direction of genes. The estimated _{m, k}

The overall impression from Figure

The largest positive effects are found among the four TREs in the upper left corner, E2f1, Zfx, c-myc and n-myc and among the three TREs Nanog, Sox2 and Oct4. This indicates that the factors in the two groups often bind in proximity to each other. In addition, the three TREs Stat3, p300 and Smad1 seem to be more related to the group consisting of Nanog, Sox2 and Oct4 than to the other group. This is consistent with the analyses by Chen et al

Based on the log _{m, k}

The functions on the diagonal (from upper left corner to bottom right corner) in Figure

**Clustering of TREs based on interaction graphs, mouse embryonic stem cell data**. Result of a hierarchical clustering procedure based on the Ward method of the graphs for each TRE given in Figure 2. The clustering is based on the integral of the absolute value of the logarithm of the functions in Figure 2.

In most cases, transcription factors have no strand preference relative to their regulated gene when it comes to binding, and regardless, strand information is lost in the ChIP experiment and the experiment will not explicitly tell us which gene is the regulatory target of a given site. Consequently, we fit the model to a mixed signal. If two TREs typically occur in a specific order when involved in the regulation of a gene, then the order is reversed from the forward direction point of view if the TREs are involved in the regulation of a gene in the reverse direction. We argue that if there is an equal distribution of TREs involved in regulation in the forward and reverse directions, the mixed signal should be approximately symmetric, which would then imply that the shapes of $g_{m,k}$ and $g_{k,m}$ are similar. We do see this for some pairs (e.g. $g_{\mathrm{E2f1},\mathrm{Zfx}}$ and $g_{\mathrm{Zfx},\mathrm{E2f1}}$) but we also see some deviations from this (e.g. $g_{\mathrm{Nanog},\mathrm{Smad1}}$ and $g_{\mathrm{Smad1},\mathrm{Nanog}}$).

To further investigate the estimated effects for each combination of

This is the hypothesis of local independence of the _{0}(_{0}(_{0 }is rejected shown as red squares. As in Figure

**Tests for local independence, mouse embryonic stem cell data**. This figure shows results for the 121 parallel likelihood ratio tests for local independence between all pairs of the 11 TREs in the multivariate model. We show the results for the model estimated in the forward direction (squares, effect of TRE (column) on TRE (row)). The size of the symbol for each test corresponds to the magnitude of the test statistic. After correcting for multiple testing using Holm's procedure, the hypotheses of local independence that are rejected are shown in red, while the hypotheses that are not rejected are shown in blue.
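Holm's step-down procedure, which is used to correct the parallel tests, can be sketched as follows. This is a minimal illustration only: the function name `holm_reject` and the significance level of 0.05 are assumptions made for the example and are not taken from the analysis itself.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where Holm's procedure rejects.

    alpha=0.05 is an assumed level chosen for illustration.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first hypothesis that is not rejected
    return reject


# Toy p-values standing in for four of the parallel tests.
decisions = holm_reject([0.01, 0.04, 0.03, 0.005])
```

In the figure, rejected hypotheses would correspond to the entries where `decisions` is `True`.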

We found a significant effect of the occurrence of Suz12 on downstream occurrences of Oct4 but not the reverse. We argue that this asymmetry is a consequence of the inclusion of self-dependence terms in the model, combined with Suz12's strong self-dependence, as can be seen in Figure

Multivariate analysis of TREs for the pilot ENCODE regions

To further investigate the applicability of our model we analyze a subset of the ENCODE pilot data produced by Affymetrix: the "Affymetrix Sites" track from the UCSC ENCODE browser database resulting from a study of retinoic acid-treated HL-60 cells 0, 2, 8 and 32 hours after treatment. Initially, we focus on the 8-hour post-treatment results from the data and investigate the effects of the oriented specification of the model and the inclusion of histone modifications in the model. Subsequently, we compare the results obtained at the four different time points. We focus on 10 TREs, selecting classical transcription factors, the transcription machinery and chromatin boundary elements. Because some regulatory elements, such as histone modifications, cannot always be regarded as point-like, we include the two histone modifications, H3K27me3 and H4Kac4, as covariates in the modeling of the remaining eight TREs. We let

denote the set of these TREs. Aside from the inclusion of the histone modifications, which is an important feature, the model is the same as in the previous section. The intensity of the occurrence of a TRE at a given location depends on upstream occurrences of other TREs and on whether the histone modifications are present at the same location.
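The dependence of the intensity on upstream occurrences and on histone covariates can be sketched in a few lines. The sketch assumes a log-linear form, which is suggested by the exponential transformation and the log baseline intensity described in the Methods section but is not spelled out here. All names (`intensity`, `h`, `beta`, `alpha`, `max_range`) and all numerical values are hypothetical and chosen for illustration only.

```python
import math


def intensity(t, k, points, h, beta, alpha, covariates, max_range=1000.0):
    """Assumed log-linear intensity for TRE k at position t.

    points:     dict TRE name -> list of occurrence positions
    h:          dict (m, k) -> function of the distance t - s (log scale)
    beta:       dict TRE name -> log baseline intensity
    alpha:      dict (covariate, k) -> covariate effect (log scale)
    covariates: dict covariate name -> list of (start, end) enriched regions
    """
    eta = beta[k]
    for m, positions in points.items():
        for s in positions:
            d = t - s
            if 0.0 < d <= max_range:  # only upstream occurrences within range
                eta += h[(m, k)](d)
    for name, regions in covariates.items():
        if any(start <= t <= end for start, end in regions):
            eta += alpha[(name, k)]  # covariate present at this location
    return math.exp(eta)


# Toy configuration: one occurrence of "A" upstream of position 150.
points = {"A": [100.0], "B": [500.0]}
h = {("A", "B"): lambda d: 0.5 if d < 200 else 0.0,
     ("B", "B"): lambda d: 0.0}
beta = {"B": -5.0}
alpha = {("H3K27me3", "B"): -1.0}
covariates = {"H3K27me3": [(0.0, 150.0)]}
rate = intensity(150.0, "B", points, h, beta, alpha, covariates)
```

On the exponential scale, a positive value of `h` multiplies the intensity by a factor greater than one, and a negative value by a factor less than one.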

The set of TREs available from the ENCODE data is quite diverse and potential interactions among them are to a large extent previously undescribed. In the multivariate analysis of the 8 TREs, we initially allow for all 64 potential dependencies among the TREs. Again, as described in the Methods section, each of the

Estimates of the 64

**Estimated g-functions, forward direction, ENCODE data**. Plots of the

Investigation of the oriented specification of the model

To investigate whether the estimated effects are statistically significant, for each combination of _{0}(_{m, k }

**Tests for local independence, ENCODE data**. This figure shows results for the 64 parallel likelihood ratio tests for local independence between all pairs of the 8 TREs in the multivariate model adjusting for histone modifications and different baseline intensities. We show the results for the model estimated in the forward direction (squares, effect of TRE (column) on TRE (row)) as well as in the reverse direction (circles, effect of TRE (row) on TRE (column)). The size of the symbol for each test corresponds to the magnitude of the test statistic. After correcting for multiple testing using Holm's procedure, the hypotheses of local independence that are rejected are shown in red, while the hypotheses that are not rejected are shown in blue.

Keeping in mind that we fit the model to a mixed signal, we would expect to reject the hypothesis _{m, k }
_{k, m }

Regardless of a mixed signal, if we fit the model using the reverse direction of the genome, we would expect the estimate of _{k, m }
_{m, k }
_{k, m }
_{m, k }

**Estimated g-functions, reverse direction, ENCODE data**. Plots of the g-functions modeling the effect of the occurrence of one TRE (row) on the occurrence of another TRE (column), estimated in the reverse direction. Note that the figure is transposed compared to Figure 5. The effects are estimated in the multivariate model adjusting for the histone modifications and allowing for different baseline intensities for the ENCODE regions. A value less than one indicates that this inter-TRE distance tends not to occur while a value greater than one indicates an inter-TRE distance that is likely. Point-wise 95% confidence intervals for the functions are also shown.

We would also hope that the conclusions reached would be qualitatively symmetric - that is to say, that we reject _{0}(_{0}(_{0 }is rejected shown as red circles. As in Figure

An explanation for seeing a direct relation of TRE

In conclusion, certain findings are consistent when estimating in either direction. The self-dependencies are all significant, and RARA seems to have direct relations to all or most of the other TREs both up- and downstream. RARA is known to function as an active repressor by recruiting corepressors and/or deacetylases when its ligand is not present, and an activator when the ligand is present

Effect of histone modifications on occurrences of TREs in the pilot ENCODE regions

As mentioned above, the two histone modifications are included as covariates in our model. The effects of the histone modifications on the intensity for the occurrence of TRE

Figure

**Effect of histone modifications, ENCODE data**. Estimates and 95% confidence intervals for the parameters

The effect of H3K27me3 on the occurrence of PU1 and POL2 is, on the other hand, negligible, which might have to do with its effect as a repressor

Results for all four time points for the pilot ENCODE regions

Since data are available for 0, 2 and 32 hours post-treatment, in addition to the 8 hours analyzed initially, we can investigate whether our findings are consistent over time. Hence we fit our multivariate model to the data at these time points in the forward direction and test the hypotheses _{m, k }

**Tests for local independence, all four time points, ENCODE data**. This figure shows the results for the 64 parallel tests for local independence between all pairs of the 8 TREs in the multivariate model, adjusting for all covariates, at the four time points (0, 2, 8, 32 hours post-treatment). The models are estimated in the forward direction with the effect of TRE (column) on TRE (row). The size of the symbol for each test corresponds to the magnitude of the test statistic. After correcting for multiple testing using Holm's procedure, the hypotheses of local independence that are rejected are shown in red, while the hypotheses that are not rejected are shown in blue.

Discussion

The analyses presented here for the Pilot ENCODE data are based on a single set of ChIP-chip data from the Pilot ENCODE project, which only covers 1% of the genome. The interactions between the factors in this set have not been verified experimentally to date. This means that the findings from our analysis should be interpreted with some caution, in particular when extrapolating them to the whole genome. As with all computational methods, experiments are needed to verify that specific interactions are significant; the role of computational analyses is to give good starting points for experimental studies. However, our analysis of the genome-wide ChIP-seq data from mouse embryonic stem cells shows that our method is able to identify interactions that can be verified experimentally. Moreover, the estimated

Ultimately, the goal is to understand the causal relations among the many components involved in the regulation of gene expression. Our analysis provides a step in that direction but, as always with statistical analyses of observational data, we cannot prove that an observed direct relation is causal - and even if it is, the analysis cannot show the direction of the causality. To draw such conclusions, we need either experimental data or stronger causal assumptions

A major contribution of our multivariate analysis consists of the collection of local independencies among the TREs that we identify and which would not have been revealed using pairwise methods. Our analysis enables us to say which TREs do not seem to interact in the regulation of genes, allowing subsequent experimental studies to be focused on other combinations of TREs. To illustrate this point, we observe in Figure

The Hawkes process is not the only suitable model for these data. The main reason for focusing on the Hawkes process is that it is a flexible class of models, and the specification of the model as given in the Methods section allows us to compute the likelihood function directly, such that we can easily apply standard methods (maximum-likelihood estimation and likelihood-ratio tests). It might be argued that a drawback of the model is that we can only include information about upstream events in the conditional specification of the Hawkes process. However, the specification is a purely technical matter that does not by any means rule out the possibility of the Hawkes process capturing relations in both the up- and downstream directions, as seen from the comparison of results from the analysis in the forward direction with those from the analysis in the reverse direction. If we were to use spatial point process models, it would be possible to specify models that have no directionality in their specification. However, we have shown that most of the conclusions we obtain from the analysis using the Hawkes process are robust to the direction we use for estimation. A range of point process models, including the Hawkes process, are used to model financial trading data

It can be argued that the choice of knots, and hence the number and range of the spline basis functions used in the specification of the Hawkes process, could affect the results obtained. We used a relatively small number of knots and a relatively narrow range, but although details might change had we used more sophisticated knot positioning strategies, we found that our results, the qualitative conclusions in particular, were robust to the actual choice of knots.

We illustrated the use of the Hawkes process with analyses of ChIP-chip/seq data for TREs but the model can be equally useful for other types of multivariate, positional genome data, whether these data are experimental or computational. Examples of such data are transcription factor binding sites, small RNAs, or even genetic polymorphisms in different individuals. Compared with the use of alternative methods such as in

Conclusions

We have presented a statistical method to analyze the multivariate distribution of TREs along the DNA sequence. We have shown that by using the point process approach, we can perform a detailed analysis of the multivariate distribution of TREs, providing both insightful qualitative information about local independence among the TREs and quantitative information on how the TREs affect the occurrence of one another. Furthermore, we have shown that our method is able to detect experimentally verified interactions, as well as interactions missed by other computational methods. We find that to understand the interactions among many TREs, it is crucial to carry out the analyses in a multivariate framework that includes all available information and relevant covariates; such an analysis emphasizes direct relations rather than indirect relations among the TREs investigated.

Methods

Mouse embryonic stem cell data

The analysis of the core transcriptional network in mouse embryonic stem cells presented in

**Mouse embryonic stem cell data - Part I**. Coordinates of loci bound by Nanog, Oct4, Sox2, E2f1, Smad1, Zfx, c-myc, n-myc and Stat3.

**Mouse embryonic stem cell data - Part II**. Coordinates of loci bound by p300 and Suz12.

ENCODE data set

In this analysis, we consider ChIP-chip data produced by Affymetrix for the ENCODE pilot project as given in the supplementary material, Additional file

**ENCODE pilot data - hr00**. Affymetrix ChIP-chip sites for the ENCODE pilot project to time hr00 with chromosome number, start position and end position for the enriched regions.

**ENCODE pilot data - hr02**. Affymetrix ChIP-chip sites for the ENCODE pilot project to time hr02 with chromosome number, start position and end position for the enriched regions.

**ENCODE pilot data - hr08**. Affymetrix ChIP-chip sites for the ENCODE pilot project to time hr08 with chromosome number, start position and end position for the enriched regions.

**ENCODE pilot data - hr32**. Affymetrix ChIP-chip sites for the ENCODE pilot project to time hr32 with chromosome number, start position and end position for the enriched regions.

**ENCODE pilot data - The 44 pilot regions**. The locations and names of the 44 ENCODE pilot regions.

**Illustration of the occurrences of TREs in the ENCODE pilot regions**. Illustration of the pilot ENCODE regions with the occurrences of the 10 TREs marked as point processes.

The ChIP-chip regions, which in this study have a mean length of approximately 400 base pairs, are regions of DNA enriched with a regulatory element. To model the binding site locations from the ChIP-chip experiments as a point process, we choose the midpoints of the ChIP-chip signals as the binding sites (see Figure

The two histone modifications enter as covariates in the model; in this case, we choose to use the whole enriched sequence by including them as indicator functions.
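The two data-preparation steps described above, reducing each enriched region to its midpoint and turning the histone modification regions into indicator functions, can be sketched as follows. The function names and the toy coordinates are hypothetical. Regions are assumed to be given as (start, end) pairs in base pairs on a single sequence.

```python
def midpoints(regions):
    """Reduce each enriched (start, end) region to its midpoint, sorted by position."""
    return sorted((start + end) / 2.0 for start, end in regions)


def indicator(t, regions):
    """Return 1 if position t lies inside any enriched region, otherwise 0."""
    return 1 if any(start <= t <= end for start, end in regions) else 0


# Toy regions of roughly 400 base pairs for one TRE.
tre_regions = [(1000, 1400), (100, 500)]
tre_points = midpoints(tre_regions)

# Toy enriched regions for one histone modification, kept as whole intervals.
histone_regions = [(250, 900)]
covariate_at_points = [indicator(t, histone_regions) for t in tre_points]
```

Here `tre_points` plays the role of the observed point pattern for one TRE, and `covariate_at_points` gives the value of the histone covariate at each of those points.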

Point processes

A point process is a model for points or events that occur randomly in time and/or space. Here, we consider point processes for points occurring along the DNA sequence, i.e., points that can be represented on a one-dimensional line. We assume that no more than one point occurs at the same location, yielding a simple point process. Points of interest along the DNA sequence will typically be the locations of TREs, but the points could represent the positions of any feature e.g. transcription start sites. We use simple point processes on ℝ_{+ }consisting of a sequence of points, (_{
i∈ℕ}, where 0 ≤

Since there is a one-to-one correspondence between the point process and the corresponding counting process, the point process will simply be denoted by

The best-known point process is the Poisson process, for which points occur completely at random. For a homogeneous Poisson process, the points occur with a constant intensity (rate),

A point process on the line can be uniquely specified by defining the intensity process,

A marked point process is a simple point process with marks in a set _{+ }and _{k}

The history of a marked point process contains information about both the location of points and the type of mark at each point.

The likelihood function

When the marked point process is specified by a family of parameterized intensities,

(see

Given _{i }
_{
i = 1,..., M
}, the log-likelihood function is the sum of the individual log-likelihood functions above:

where

In our analysis, the

Interchanging the first two sums in the formula for the log-likelihood function yields a sum of

Multivariate nonlinear Hawkes process

Our setup consists of multiple observations of point processes within bounded intervals, [_{i}
_{i}

The _{i}
^{(i)k
}is a parameter vector. Included in _{i }
^{(i)k
}is the logarithm of the baseline intensity for sequence ^{(i)k
}do not vary with

The

then if a point of type _{m, k}

we observe that

In principle, the

where the ^{mk}
_{l}

The value of the largest knot gives the maximum range within which we will be able to detect dependencies with the method and hence must be chosen carefully. The number of knots determines how detailed the description of the dependencies can be. Choosing too many knots will cause the model to be over-fitted. To select the placement and number of knots, we conducted two pilot studies. One was based on an analysis of the occurrences of three TREs on one chromosome of the mouse genome and the other was based on an analysis of the occurrences of three TREs from the ENCODE data in the pilot ENCODE regions. These pilot studies suggested that 8 equidistant knots in the range -400 to 1000 base pairs were computationally feasible while still sufficiently flexible for the current analysis.
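To make the knot choice concrete, the following sketch evaluates a B-spline basis on 8 equidistant knots between -400 and 1000 base pairs using the Cox-de Boor recursion. The cubic degree is an assumption made for this example. It gives four basis functions on these knots, which matches the 4 degrees of freedom quoted for the likelihood ratio tests, but the degree is not stated explicitly in the text. The function name `bspline_basis` is hypothetical and is not part of ppstat.

```python
def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions of the given degree at x.

    Uses the Cox-de Boor recursion and assumes strictly increasing knots.
    Returns len(knots) - degree - 1 values.
    """
    # Degree 0: indicator function of each half-open knot interval.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        new_basis = []
        for i in range(len(knots) - d - 1):
            left = (x - knots[i]) / (knots[i + d] - knots[i]) * basis[i]
            right = ((knots[i + d + 1] - x)
                     / (knots[i + d + 1] - knots[i + 1]) * basis[i + 1])
            new_basis.append(left + right)
        basis = new_basis
    return basis


# 8 equidistant knots from -400 to 1000 base pairs (spacing 200).
knots = [-400.0 + 200.0 * i for i in range(8)]
values_at_300 = bspline_basis(300.0, knots)
```

A function on the log scale would then be a linear combination of these four basis functions, with one coefficient per basis function.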

We have established a fully parameterized specification of our model and, given a realization of a point pattern, we estimate the parameter values by using maximum likelihood methods. This is implemented in the R package ppstat, given in Additional file

**Information on installation of the R package ppstat**. A PDF file of the web page for the R package ppstat (as of 12 August 2010) including information on installation.

**Source code for the R package ppstat**. Source code for the R package ppstat.

**Note on the computations of the log-likelihood function**. Note on the computations of the log-likelihood function and its first and second derivatives.

With a probability tending to one, the likelihood function has one local maximum and the maximum likelihood estimates are approximately normally distributed with mean equal to the true parameter value and a covariance matrix that can be estimated as the inverse of the matrix of second-order partial derivatives of the negative log-likelihood function, see

The properties of the likelihood function enable us to construct pointwise confidence intervals for the

where _{0.975 }is the 97.5% quantile for the normal distribution. These confidence intervals are transformed using the exponential function to yield confidence intervals for the

The likelihood ratio test statistic for $H_0: g_{m,k} \equiv 1$ is $2(\ell_1 - \ell_0)$, where $\ell_0$ is the value of the maximized log-likelihood function under the restriction on $g_{m,k}$ and $\ell_1$ is the value of the maximized log-likelihood function for the full model. In this case, the null distribution for the test statistic can be approximated by the $\chi^2$ distribution with 4 degrees of freedom.
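The test can be illustrated with a short sketch that computes twice the difference between the maximized log-likelihood values and the corresponding p-value from the chi-square approximation with 4 degrees of freedom. For an even number of degrees of freedom the chi-square survival function has a closed form, so no statistics library is needed. The function names and the log-likelihood values are hypothetical.

```python
import math


def chi2_sf_even(x, df):
    """Survival function P(X > x) of the chi-square distribution, even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    # Closed form: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def lr_test(loglik_null, loglik_full, df=4):
    """Return the likelihood ratio statistic and its approximate p-value."""
    stat = 2.0 * (loglik_full - loglik_null)
    return stat, chi2_sf_even(stat, df)


# Made-up maximized log-likelihood values for the restricted and the full model.
stat, p_value = lr_test(-105.0, -100.0)
```

The resulting p-values, one per pair of TREs, are the input to the correction for multiple testing.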

Local independence

In the context of multivariate point processes there is a concept of local independence between the _{B }
_{A }
_{B }
_{C }
_{B }
_{0 }: _{m, k }
_{k }
_{m }
_{m, k }
_{k }
_{m}
_{k }
_{m }
_{k }
_{m }
_{k }
_{m}
_{m, k }

Clustering

To find groups of TREs that are likely to act together in the regulation of genes, we propose a simple cluster analysis based on the results from the multivariate analysis. We consider a hierarchical cluster analysis based on the

as a measure of similarity. Euclidean distance is used to create the distance matrix and a hierarchical clustering procedure is applied based on the Ward method
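The first steps of this cluster analysis can be sketched as follows: the integral of the absolute value of the logarithm of an estimated function is approximated by the trapezoidal rule, and Euclidean distances are computed between per-TRE feature vectors. How the integrals are arranged into one vector per TRE is an assumption of this sketch, and the function names and toy values are hypothetical. The Ward clustering step itself is not shown. The distance matrix would be passed to a hierarchical clustering routine that implements the Ward method.

```python
import math


def abs_log_integral(distances, g_values):
    """Trapezoidal approximation of the integral of |log g| over a distance grid."""
    total = 0.0
    for i in range(1, len(distances)):
        width = distances[i] - distances[i - 1]
        left = abs(math.log(g_values[i - 1]))
        right = abs(math.log(g_values[i]))
        total += 0.5 * width * (left + right)
    return total


def euclidean_distance_matrix(features):
    """Pairwise Euclidean distances between rows of a feature matrix."""
    n = len(features)
    return [[math.sqrt(sum((a - b) ** 2
                           for a, b in zip(features[i], features[j])))
             for j in range(n)]
            for i in range(n)]


# Toy function values on a grid of distances (base pairs).
grid = [0.0, 100.0, 200.0]
effect_size = abs_log_integral(grid, [1.0, math.e, 1.0])

# Toy feature matrix: one row of integrated effect sizes per TRE.
dist = euclidean_distance_matrix([[0.0, 0.0], [3.0, 4.0]])
```

A function that is identically one contributes zero to the integral, so TREs with no estimated effects on each other end up with small feature values.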

Computational considerations

The central computations in the current implementation involve a large, sparse model matrix. The number of columns in the matrix is of the order _{0}
_{0}), where _{0 }is the number of spline basis functions and _{0 }is the number of TREs analyzed. The number of rows is of the order

List of abbreviations

PCA: principal component analysis; TRE: transcriptional regulatory element; BRG1: SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 4; CEBPE: CCAAT/enhancer binding protein (C/EBP), epsilon; CTCF: CCCTC-binding factor (zinc finger protein); c-myc: myelocytomatosis oncogene; E2f1: E2F transcription factor 1; H3K27me3 (H3K27T): Histone H3 tri-methylated lysine 27; H4Kac4 (HisH4): Histone H4 tetra-acetylated lysine; Nanog: Nanog homeobox; n-myc: v-myc myelocytomatosis viral related oncogene, neuroblastoma derived; Oct4: POU domain, class 5, transcription factor 1; p300: E1A binding protein p300; POL2: polymerase (RNA) II (DNA directed) polypeptide A, 220 kDa; PU1: Spleen focus forming virus proviral integration oncogene; RARA (RARecA): Retinoic Acid Receptor-Alpha; SIRT1: sirtuin (silent mating type information regulation 2 homolog) 1; Smad1: MAD homolog 1; Zfx: zinc finger protein X-linked; Sox2: SRY-box containing gene 2; Stat3: signal transducer and activator of transcription 3; Suz12: suppressor of zeste 12 homolog.

Authors' contributions

LC developed, with assistance from NRH and OW, the multivariate Hawkes model as used in the paper. LC and NRH implemented the model. LC carried out the data analysis and wrote a first draft of the paper. AS assisted with obtaining, selecting and interpreting the ENCODE data and with the biological implications of the data analysis. All authors collaborated on the interpretation of the data analysis. LC and NRH wrote, with assistance from AS, the final version of the paper. All authors read and approved the final manuscript.

Acknowledgements

LC and NRH were supported by the Danish Natural Science Research Council, grants 272-06-0442 and 09-072331. LC, AS and OW were supported by a grant from the Novo Nordisk Foundation to the Bioinformatics Centre. The European Research Council has provided financial support to AS under the EU 7th Framework Programme (FP7/2007-2013)/ERC grant agreement 204135.