Abstract
Background
The task of spatial cluster detection involves finding spatial regions where some property deviates from the norm or the expected value. In a probabilistic setting this task can be expressed as finding a region where some event is significantly more likely than usual. Spatial cluster detection is of interest in fields such as biosurveillance, mining of astronomical data, military surveillance, and analysis of fMRI images. In almost all such applications we are interested both in the question of whether a cluster exists in the data, and if it exists, we are interested in finding the most accurate characterization of the cluster.
Methods
We present a general dynamic programming algorithm for gridbased spatial cluster detection. The algorithm can be used for both Bayesian maximum aposteriori (MAP) estimation of the most likely spatial distribution of clusters and Bayesian model averaging over a large space of spatial cluster distributions to compute the posterior probability of an unusual spatial clustering. The algorithm is explained and evaluated in the context of a biosurveillance application, specifically the detection and identification of Influenza outbreaks based on emergency department visits. A relatively simple underlying model is constructed for the purpose of evaluating the algorithm, and the algorithm is evaluated using the model and semisynthetic test data.
Results
When compared to baseline methods, tests indicate that the new algorithm can improve MAP estimates under certain conditions: the greedy algorithm we compared our method to was found to be more sensitive to smaller outbreaks, while as the size of the outbreaks increases, in terms of area affected and proportion of individuals affected, our method overtakes the greedy algorithm in spatial precision and recall. The new algorithm performs onpar with baseline methods in the task of Bayesian model averaging.
Conclusions
We conclude that the dynamic programming algorithm performs onpar with other available methods for spatial cluster detection and point to its low computational cost and extendability as advantages in favor of further research and use of the algorithm.
Background
The task of spatial cluster detection involves finding spatial regions where some property deviates from the norm or the expected value. In a probabilistic setting this task can be expressed as finding a region where some event is significantly more likely than usual. Spatial cluster detection is of interest in fields such as biosurveillance, mining of astronomical data, military surveillance [1], and analysis of fMRI images [2]. A wellknown early method for spatial cluster detection is the spatial scan statistic developed by Kulldorff [3]. This approach is formulated as a frequentist method for cluster detection and can be summarized as the process of maximizing the likelihood ratio statistic
for each subregion of interest S and computing the statistical significance of the statistic using randomization testing. Here H_{0 }represents the null hypothesis that no clusters (e.g., disease outbreak regions) are present, and H_{1}(S) represents the alternative hypothesis that subregion S is the location of a cluster. While the original paper [3] is restricted to scanning for a single circular cluster, the method has been extended to ovals and other shapes as well as to spacetime statistics in later works [410].
One of the main drawbacks of Kulldroff's approach is the computational expense associated with the randomization testing required to calculate statistical significance. Neill and Moore [11] address this problem by developing a method of cleverly reducing the number of different possible subregions considered. While the cost of randomization testing of each hypothesis stays the same, reducing the number of hypotheses to consider reduces the overall computational cost significantly. Their method makes use of an overlapkd tree data structure to apply bounds on F(S) to prune the search. This method was later extended to multiple dimensions in [12].
Another approach taken in the literature is the development of Bayesian variants on Kulldroff's statistic [13,14]. The Bayesian approach replaces the likelihood ratio with a posterior probability
The use of a Bayesian approach eliminates the need for significance testing since the purpose of significance testing is to measure how likely the alternative hypothesis is given the data (more precisely, it tells us how likely we are to obtain a likelihood ratio that is at least as extreme under the null hypothesis). Since the posterior probability gives us the probability of the alternative hypothesis given the data directly, the need for timeconsuming randomization testing is eliminated. An additional benefit of the Bayesian approach is that the introduction of prior probabilities enables these methods to incorporate prior information to potentially enhance detection [13,15]. In some cases, however, the task of selecting useful priors can prove challenging.
The methods mentioned above so far are "countbased" methods which detect clusters based on aggregate counts of data such as the total number of ED visits in a given ZIP code on a given day, for example. An alternative approach is "agentbased," where each individual in the population is modeled. This is the approach taken by Cooper et al. [16] in developing a disease surveillance system called PANDA. The work of Jiang et al. [17] extends PANDA by incorporating spatial information, resulting in an agentbased Bayesian scan statistic. The disease model used in our work is also agentbased and can be viewed as a simplified version of the model used by PANDA. It should be noted that unlike the disease model that we use to illustrate this algorithm, PANDA and its extensions model multiple diseases and are designed to differentiate between outbreaks of the different diseases modeled.
A common way of articulating the spatial scan problem is one that considers the surveillance region to be a rectangular R × C grid of cells. The methods that use this approach, among which are [7,1113,17,18], are sometimes referred to as gridbased methods. With this representation every subset of cells is a possible cluster S, or in the context of outbreak detection, a potential outbreak region.
Most gridbased methods have mainly focused on the detection of single rectangular clusters aligned with the grid.
However, since an outbreak may take other shapes, or may occur in multiple disconnected regions of the surveillance grid, it is of interest to develop algorithms that are able to detect more general subregions. Jiang and Cooper [18] developed a recursive algorithm that operates by detecting the most likely rectangular subregion that contains a cluster and refines it through a combination of unions with additional rectangular subregions as well as refined scanning within each rectangle. While this approach improves the detection of nonrectangular subregions it comes at the cost of repeatedly scanning the same region multiple times as the recursion occurs.
Complementary to these approaches are methods for calculating the posterior probability of the presence of clusters anywhere in the region using Bayesian model averaging, such as the work of Shen et al. [19]. They present a method that is sensitive to the presence of irregularly shaped or sparsely distributed outbreaks but does not favor spatially grouped clusters since individual cells are considered independently.
The task of detecting irregularly shaped clusters has been addressed by distancebased methods which when, given a set of points in (possibly manydimensional) space find clusters of elevated or uniform density [20,21]. While these methods may be considered more suitable than gridbased methods for certain domains, they are not addressed in the current paper, which aims to build on previous gridbased methods.
We implement, develop, and test an algorithm for finding multiple rectangular clusters on a grid that employs dynamic programming in order to consider an exponential number of hypotheses in polynomial time. The algorithm finds a hypothesis H_{1}(S_{i}) which is most probable in the Bayesian sense given the data.
We also present an adaptation of the algorithm that averages over all considered hypotheses H_{1}(S_{i}) to obtain a posterior probability for the presence of a cluster anywhere in the surveillance region.
Methods
Below, we describe and investigate an algorithm for the detection of multiple rectangular spatial clusters. The algorithm is applied to the simple outbreak model described next. While we restrict the analysis to this particular model, the algorithm is general and can be applied to any underlying model that can provide a likelihood function such as the function lik described in Equation (8) below. We then describe the scanning algorithm itself.
Outbreak model
We define the disease model as a Bayesian Network (BN) model, which is a type of graphical model for representing joint probability distributions; it is particularly suited for modeling conditional independence between variables. In the graphical representation each node represents a random variable and the distribution of each variable is defined as a conditional distribution that is conditioned on the variables from which it receives incoming edges, or as a prior probability distribution if the node has no incoming edges [22]. An extension to the BN representation employs plates when subgraphs of the network are replicated many times. In the graphical plate representation, plates are shown as boxes, the nodes and edges that are on each plate are replicated the number of times shown on the plate, replicated nodes are indexed, and edges that cross plate boundaries are also replicated [23]. In particular, in Figure 1 we have m copies of nodes L_{i}, D_{i}, and I_{i}. We further extend the notation by allowing the number of replications to depend on random variables, as represented by the dotted arrow from SUB to the plate indexed by j. Here, the values that SUB takes are ordered sets and the number of replications n is defined to equal the size of the set SUB. The model is explained in further detail below, where we first describe the model and the meanings of the variables in general terms, and then provide the particular parameterization that we used to instantiate the variables in our tests.
Figure 1. Simple disease model. A plate representation of the simplified entitybased model used.
Model definition and likelihood function
We employ an agentbased model where each cell of the R × C grid contains a number of individuals. The entire population, made up of m individuals, is modeled.
Figure 1 shows a BN plate representation of the model that governs the state of each individual: SUB = (rect_{1}, ..., rect_{n}) represents the (possibly disconnected) hypothesized location of an ongoing outbreak as an ordered collection of n nonintersecting rectangles, OB_{j }are binary variables each of which determines whether an outbreak is present in a given rectangle indexed by j, and F_{j }∈ [0,1] represent frequencies of outbreak disease cases per day, each corresponding to an outbreak rectangle indexed by j. We will use the vector notation shorthand F = (F_{1}, ..., F_{n}) where convenient. K ∈ [0,1] represents the proportion of the population that visits the ED per day in the absence of an outbreak, D_{i }represents the disease state of a given individual, I_{i }represents the observable evidence about the individual, L_{i }∈ 1, ..., R} × {1, ..., C} represents the grid cell in which the individual is located, and Φ_{i }is the manifested outbreak frequency, which is the frequency of the outbreak disease per day in the individual's grid cell. The dotted arrow pointing from SUB to the plate containing OB_{j }and F_{j }is meant to make explicit that the number of indexes j, and hence copies of these nodes, is dependent on the size of SUB.
The random variable Φ_{i}, which represents the outbreak frequency manifested in individual i's grid cell, is not used to capture uncertainty as random variables normally do, but rather plays the role of a logic switch. It is defined as follows:
Note that this is a welldefined probability distribution only under the constraint that the members of SUB do not intersect. We could alternatively merge this logic into the definition of D_{i}, however, this decomposition of the model yields a clearer presentation.
The presence of OB_{j }as separate from F_{j }serves a similar logical purpose. The distribution of OB_{j }will depend on the hypothesis space and is left for the model instantiation section. The distribution of F_{j }is defined below in terms of OB_{j }and a "template" frequency distribution F that is also a detail of model instantiation.
Where (F_{j}OB_{j }= true) ~ F indicates that F_{j }is i.i.d. according to the distribution of F when OB_{j }= true.
The primary purpose of the model is to enable us to calculate the likelihood of the presence of an outbreak in a given subregion, where a subregion is taken to be any set of cells. Throughout this paper, we will often also refer to rectangular subregions, or simply rectangles, to make explicit instances where we require the sets of cells to form multicell rectangles on the grid. We will also introduce below the concept of tilings and tiles. In the context of this paper, tiles are always rectangular subregions.
The data likelihood of the presence or absence of an outbreak in a subregion is given by the likelihood of the data about individuals located within the subregion's limits supporting a uniform outbreak of some frequency f from the frequency distribution F or the absence of an outbreak, respectively. In order to arrive at the likelihood of an outbreak, consider the case of calculating the likelihood of a particular observation I_{i }given a manifested outbreak frequency Φ_{i }and a particular K:
Note that the likelihood for the individual given the absence of an outbreak can be obtained from the expression above by letting Φ_{i }= 0. To obtain the likelihood of an outbreak state (not just a particular outbreak frequency) in a subregion of the grid S, we obtain a likelihood for each frequency by taking the product over all perperson likelihoods and take an expectation over those likelihoods to get the expected likelihood of the desired outbreak state. Formally, we will use the variable x to represent the outbreak state. Let x = 0 indicate that a subregion S contains no outbreak, x = 1 indicate that S is an outbreak subregion, and let the notation E_{K,F }represent the expectation over the random variables K and F. Then the likelihood for any set of grid cells S under an outbreak state x is given by:
Here we are simplifying some of the calculation by capturing the logic behind determining the manifested frequency Φ_{i}. Particularly, we are using the fact that the manifested outbreak frequency in a nonoutbreak cell is 0, which is the end result of multiplying x by F when x = 0 to indicate the absence of an outbreak.
Model instantiation
Three disease states are modeled: noED, flu, and other, which respectively represent the events that an individual did not come in to the ED, came in to the ED due to influenza, or came to the ED for some other reason. The probability that an individual i comes in to the ED due to influenza is equal to the manifested outbreak frequency Φ_{i}. Those individuals that do not come in to the ED due to influenza have a probability K of coming in to the ED for some other reason (e.g. due to acute appendicitis). Under this model the distribution of D_{i }is defined by:
In the initial stages of the investigation K was treated as a constant value, i.e., a distribution with probability mass 1 at K = 3.904 × 10^{4 }which is an estimate that comes from data obtained for ED visits in Allegheny County, Pennsylvania in May of the years 20032005. We also consider a model where K is a discrete distribution with positive mass at multiple values estimated from that data. However, due to practical limitations mentioned in our discussion of additional tests below, most tests were performed with the model that uses a constant K.
To estimate the distribution F that ultimately governs the distributions of outbreak frequencies in the model, we base it on a distribution previously used for a similar, more complex disease model in [17]. We adjusted the distribution to reflect a uniform density over the interval (0, 6.50 × 10^{4}].
Four evidence states are modeled for the evidence variable I_{i}, in the form of chief complaints, taking the values cough, fever, other, and missing. I_{i }is governed by the following distribution:
A value of "missing" indicates that individual i did not come in to the ED. The values that I_{i }takes for individuals that did come in to the ED (either due to influenza or other reasons) are governed by a probability distribution that is based on expert assessments that are informed by the medical literature.
The proper choice of prior distributions is closely tied to the proper choice of the prior probability of an influenza outbreak being present anywhere in the region. For that purpose, we take the value P(outbreak) = 0.04 from previous work [17,24]. This value will appear at multiple points throughout this paper.
The distributions of the subregion, outbreak, and frequency variables
The distributions of SUB, OB_{j}, and F_{j }are inherently tied together, as made explicit in the BN structure, and since the space of SUB is the space of considered outbreak subregions, it is different for different algorithms that consider different hypothesis spaces. This section describes the distributions of SUB and F_{j }for the two baseline methods we use in our evaluation and for the dynamic programming algorithm.
The singlerectangle case
The simplest case is the case where only singlerectangle hypotheses are considered. Under that model, SUB can represent any of the R(R + 1)C(C + 1)/4 rectangles that can be placed on the R × C grid, each representing a hypothesis of an outbreak within the rectangle and no outbreak outside the rectangle. SUB can take an additional value to represent a nonoutbreak hypothesis. Since we defined SUB to be an ordered set of rectangles, formally, each singlerectangle hypothesis is represented as a singleelement set, and the nonoutbreak hypothesis is represented by the empty set ∅. The associated prior distribution of SUB is:
Note that when SUB is the empty set, OB_{j }and F_{j }do not need to appear in the BN as they play no role in the distribution of Φ_{i }and the rest of the BN. In the case when SUB represents a rectangle, it is a singleelement set, and there is only one j, namely j = 1. OB_{1 }is then taken to be true and consequently F_{1 }~ F.
The multirectangle case
The case that corresponds to the greedy algorithm to which we will be comparing our algorithm is the case where each element sub of the hypothesis space is an ordered set of nonoverlapping rectangles. Call this hypothesis space , that is, is a set of values that the random variable SUB can take, where each value sub is an ordered set of rectangles. Here the prior distribution of SUB is governed by a structure prior parameter q as follows:
The parameter α can be used to adjust the prior probability of the absence of an outbreak, while the parameter q controls the relative prior probability of having a more complex outbreak hypothesis (that is, a hypothesis with more hypothesized rectangles) vs. a less complex outbreak hypothesis. Z_{M }is a normalization constant described by:
Where \ is the set difference operator and sub denotes the size of the ordered set sub (a particular value that SUB can take) in terms of the number of rectangles in the set.
Since in our experiments this model is used only in the context of selecting the most likely outbreak hypothesis, there is no need to specify α or calculate Z_{M}. The value q = P(outbreak) × 4/(R(R + 1)C(C + 1)) was used for the structure prior parameter.
Since in this model every rectangle that appears in SUB is an outbreak rectangle, we again have OB_{j }= true for all j and consequently F_{j }~ F are i.i.d., one per rectangle.
The tiling case
The hypothesis space used by the dynamic programming algorithm, the centerpiece of this work, is a space of tilings (discussed in more detail in the next section.) To express this hypothesis space, each value of SUB is a tiling: a collection of nonoverlapping rectangles such that every cell of the grid is covered by exactly one of these rectangles. In order to represent an outbreak hypothesis as a tiling we use the variable OB_{j }to indicate whether a tile is hypothesized to represent an outbreak or an outbreakfree region. In the tiling context the distribution over the variables SUB and OB_{j }is defined as follows:
Here Z_{T }is a normalization constant and p is a structure prior representing the conditional probability that the rectangle rect_{j }∈ sub (where sub is a value in the range of the r.v. SUB) is an outbreak rectangle given that it is indeed a rectangle of the hypothesis. The prior distribution of SUB is taken to be uniform since we assume that we have no information that leads us to favor one tiling over another. The calculation of Z_{T }is left for a later section that also discusses the relationship between p and the prior probability of an outbreak, as well as the process which was used to select the value of p that corresponds to the prior probability of a flu outbreak equaling P(outbreak) = 0.04.
In the text that follows, we will refer to the choice of a set of tiles sub and the associated assignment of the variables OB_{1}, ..., OB_{n }as a colored tiling, or when it is clear from the context we may refer to colored tilings simply as tilings. We also discuss the particular space of colored tilings that the dynamic programming algorithm considers in more depth, since this space is a subset of the space of all possible tilings. The possibility of using a structure prior P(OB_{j }= trueSUB) that is not identical for every rectangle in every tiling SUB is also discussed.
Tilings
The algorithm presented in this paper searches a space of hypotheses where each hypothesis is a possible tiling of the surveillance grid by nonoutbreak and outbreak rectangles. For example, Figure 2 shows the six possible tilings of a 1 × 2 grid.
Figure 2. 1 by 2 colored tilings The 6 possible tilings of a 1 × 2 grid. White tiles represent hypothesized nonoutbreak regions and gray tiles represent hypothesized outbreak regions.
Note that we make a distinction between two adjacent outbreak tiles (e.g., Hypothesis 5) and one large outbreak tile (e.g., Hypothesis 6). The rationale for this is that each tile represents a region over which the outbreak frequency is uniform. In the 1 × 2 example, we would expect Hypothesis 5 to be more likely than Hypothesis 6 in a case where, for example, 5% of the population in the left cell has the outbreak disease and 10% of the population in the right cell has the outbreak disease. Conversely, if the outbreak disease cases are distributed uniformly among the two cells we would expect Hypothesis 6 to be more likely.
Computing tiling scores
In computing tiling scores, we assume conditional independence among separate tiles given a particular tiling, even if the tiles are adjacent. As an alternative, modeling dependencies could help in the cluster detection task to adjust the prior probability of the presence of a disease in one cluster when it is near another cluster. A model that takes such effects into account could achieve improved outbreak detection, especially when modeling infectious diseases like influenza where being near an infected individual increases the probability of transmission of an infection. Modeling such dependencies incorrectly, however, could also hinder detection. In this sense, our choice not to model spatial dependencies can be seen as a cautious approach to avoid making informative spatial dependence assumptions that may be incorrect and therefore deleterious to outbreak detection performance. Even so, our basic choice of priors does favor finding a smaller number of tiles for a region, which often leads to spatially grouping neighboring disease cases.
The independence assumptions we make enable the dramatic computational efficiency that we gain by using dynamic programming, as explained in detail below. Assuming conditional independence does not constrain the outbreak hypotheses that can be identified in principle from the data, if given enough data. Although valid assumptions of conditional dependence may yield a more accurate performance in light of available data, representing and reasoning with conditional dependence carries an enormous computational burden that we avoid. It may be possible to extend our method to take spatial dependencies into account and still maintain computational tractability. We leave this issue as an open problem for future research.
Conditional independence allows us to define the score of a given tiling to be the product of the scores of the individual tiles it is composed of. The score of each individual tile is given by the data likelihood of that tile, as defined by Equation (8), multiplied by the prior probability of the tile's outbreak state. That is, the score of the hypothesis that tile T contains an outbreak is p ⋅ lik(1, T) and the score of the hypothesis that it does not contain an outbreak is (1  p) ⋅ lik(0, T).
To illustrate the effects of multiplying the likelihood by a prior, suppose that in the 1 × 2 example in Figure 2 the likelihoods of the tiles in Hypothesis 5 and Hypothesis 6 were the same, both equal to some value l. In that case the score of Hypothesis 6, which contains only one tile, would be lp while the score of Hypothesis 5, which contains two tiles, would have the lower value of lp^{2}. Hence, all other things being equal, our prior favors tilings composed of fewer tiles.
A multiplicative prior for each tile is used to allow the decomposition of a tiling score into a product of individual tile scores. While we use the same prior for all tiles, it is possible to assign a different prior probability for each tile based on its location and size while maintaining the multiplicative property that the score of a tiling is a product of tile scored. Also note that while the priors for each tile add up to 1, the priors for tilings, which are a product of tile priors, are not normalized. We discuss the details of normalization when we present Bayesian model averaging, since it is especially relevant in that context. The process of selecting the value of the structure prior p and its relationship to the prior probability that an outbreak is present anywhere in the region (the "global prior") is also discussed alongside normalization. Below, let us denote the score of an outbreak state x for a tile T that spans rows R_{L }through R_{H }and columns C_{L }through C_{H }by
Dynamic programming algorithm
It is clear that the space of possible outbreak hypotheses is exponential in the size of the grid, since there are 2^{R×C }possible outbreak subregions and an even larger number of possible tilings. In order to efficiently search the space of hypotheses, a dynamic programming algorithm is presented for finding the most likely tiling that exploits the fact that the tiling of a large grid can be decomposed into multiple tilings of smaller grids.
To present the operation of the algorithm, we will first describe the algorithm for finding the highestscoring tiling of a 1 × C horizontal strip in general terms, then walk through an example of tiling the top row of a 5 × 5 grid illustrated in Figure 3. Next we describe how the algorithm is extended to two dimensions with the aid of the example on a 5 × 5 grid illustrated in Figure 4, and finally we provide a formal definition of the algorithm as pseudocode.
Figure 3. Onedimensional tiling selection. Illustration of an iteration of the tiling selection algorithm on a 1 × 5 strip of the grid. Tick marks indicate the boundaries of between the potential tile to be added and the rest of the tiling. See text for more details. (a) Best tilings for C_{L }∈ {1, 2, 3, 4}. (b) Candidate tiles for each value of C_{L}. Highest scoring tile for each pair is shown as "best.". (c) Candidate tilings for each value of C_{L}. Highest scoring tiling shown as "best."
Figure 4. Twodimensional intermediate tilings. Illustration of intermediate candidate tilings on a 5 × 5 grid for R_{H }= 4.
In order to find the best (highest scoring) tiling of a 1 × C strip, we first number the cells consecutively from 1 to C. Let denote the best tiling of cells numbered C_{L } 1 (inclusive) and lower. When C_{L }= 1, (or equivalently ) is a tiling of an empty set of cells. There is only one such possible tiling, the empty tiling, and we assign it a score of 1 because it is the multiplicative identity. Having laid out this grounding we can describe the operation of the algorithm iteratively: Given that we have found the best tiling for each C_{L }such that 1 ≤ C_{L }≤ C_{H}, we can determine the best tiling for cells 1, ..., C_{H }as follows: For each C_{L }≤ C_{H}, determine whether the tile that spans the cells C_{L}, ..., C_{H }(inclusive) should be an outbreak tile or a nonoutbreak tile to maximize its score, then multiply the score by the score of the best tiling . This product is the score of the tiling obtained by appending tile to tiling , and this resulting tiling is then a candidate for tiling . Then out of all the candidates obtained for the different values of C_{L }in 1, ..., C_{H}, the highest scoring one is guaranteed to be the highest scoring tiling of the range of cells 1, ..., C_{H}. Thus we obtain the best tiling for the entire strip by iterating over values of C_{H }from 1 to C.
Figure 3 illustrates a single iteration of this algorithm when looking for the best scoring tiling of the first (top) row of a 5 × 5 grid. The cells are numbered left to right. Particularly, the figure shows us the iteration that obtains the best tiling of the first (leftmost) four cells, thus, throughout this iteration, C_{H }= 4. At this point, we have already obtained the best tilings of the lower ranges of cells in previous iterations. These are shown in Figure 3(a). Figure 3(b) shows that for each value of C_{L }from 1 through C_{H }= 4, we consider whether an outbreak or a nonoutbreak tile scores higher. The best scoring tile from each pair is indicated. In Figure 3(c), we show each of those highest scoring tiles combined with the corresponding best tiling from Figure 3(a). In our example suppose that of those resulting tilings the bottom one (obtained with C_{L }= 4) resulted in the highest score. This tiling is then the best tiling of cells 1...4, and it is cached for use in the next iteration.
This method is extended to the twodimensional case by performing the same iteration over rows, where each row takes the role analogous to a cell in the onedimensional version discussed above. For example, Figure 4 is analogous to Figure 3(c) in that we know the best tiling for the rectangular region that spans rows 1 through R_{L } 1, and we are adding the columnwise tiling of the rectangular region spanning rows R_{L }through R_{H }to get a candidate tiling of the entire area above the line labeled R_{H}. The main difference is that, where in Figure 3(b) we were considering whether to add an outbreak tile or a nonoutbreak tile, in Figure 4 we are simply adding the best columnwise tiling of the range of rows between R_{L }and R_{H}, as is illustrated by the fact that the best tiling of row 1 comes from an extension of the example in Figure 3.
For readability purposes, a version of the algorithm that only computes the score of the best tiling is presented in Figure 5. A version of the algorithm that keeps track of the actual tiling is detailed in Additional file 1: Appendix A.
Figure 5. Highest tiling score selection algorithm. Pseudocode for the dynamic programming algorithm for highest tiling score selection.
Additional file 1. Appendix A: Tiling selection code. Pseudocode of the tiling selection algorithm that returns the highest scoring tiling and its score.
Format: PDF Size: 85KB Download file
This file can be viewed with: Adobe Acrobat Reader
The algorithm takes Θ(R^{2 }⋅ C^{2}) iterations. In the general case, the computational cost of each iteration depends on the likelihood model used. In our particular implementation we compute a table of likelihoods for all possible grid rectangles prior to running the algorithm. Since our model is agentbased, the computational cost of populating the table is linear in the population of the surveillance region. The computational advantage that dynamic programming gives us is the ability to scan an exponential number of possible outbreak subregions. The exact number of possible tilings scanned is (2 × 3^{C1 }+ 1)^{R1 }× 2 × 3^{C1 }= Θ(2^{R(Clg3+1lg3)}); of these tilings, (2^{C1 }+ 1)^{R 1 }× 2^{C1 }are nonoutbreak hypotheses, and the rest are outbreak hypotheses, that is, hypotheses where at least one tile is colored. The calculation of the number of tilings is presented in Additional file 2: Appendix B.
Additional file 2. Appendix B: The number of tilings. Calculation of the number of tilings considered by the algorithm.
Format: PDF Size: 104KB Download file
This file can be viewed with: Adobe Acrobat Reader
We can express the value computed by the algorithm formally as follows: Let a rowwise partition of the grid be denoted by where is the space of all rowwise partitions of the grid. Each partition is a set of pairs of row indexes, where each pair (R_{L}, R_{H}) bounds the rectangular region R_{LH }that spans the width of the grid (columns 1 through C) and rows R_{L }through R_{H }(inclusive). Let a columnwise partition of the rectangular region R_{LH }be denoted by , where is the space of all columnwise partitions of R_{LH}. We use the notation to mean that there is a tile spanning columns C_{L }through C_{H }(and rows R_{L }through R_{H}) in the partition. Using this notation, the algorithm finds:
The space of tilings considered by the algorithm can be described as the space of rowwise tilings of columnwise subtilings. Intuitively, this space has the desired property of being able to capture every cellwise coloring of the grid, however, this is not an exhaustive search over all general colored tilings. Figure 6 illustrates this on a 10 × 10 grid with a pair of examples: Figures 6(a) and Figure 6(c) show two configuration of outbreak rectangles representing multiple outbreaks. It is desirable for a tiling found by the algorithm to be able to (1) cover all outbreak regions with outbreak tiles and all nonoutbreak regions with non outbreak rectangles (that is, get the coloring right), (2) cover each outbreak rectangle with a single tile, and (3) minimize any unnecessary fragmentation of the nonoutbreak region. For the outbreak configuration in 6(a), there exists a tiling in the search space of the algorithm that satisfy all three conditions, namely, the tiling in Figure 6(b). However, such a tiling does not always exist: The outbreak configuration in Figure 6(c) shows this limitation. A tiling in the space of all colored tilings that satisfies all three conditions is shown in Figure 6(e). Note that this tiling is not a rowwise tiling of columnwise subtilings and is hence not in the search space of the algorithm. In fact, a rowwise tiling that minimizes the fragmentation of the outbreak rectangles (to the extent possible for a rowwise tiling) is shown in Figure 6(d). Note that in Figure 6(d) the nonoutbreak region has to be broken up into eleven tiles (as opposed to just eight in Figure 6(e)) and the three outbreak rectangles need to be broken up into five tiles. Thus, the most parsimonious rowwise tiling shown in Figure 6(d) is still not as parsimonious as the tiling in Figure 6(e).
Figure 6. Matching tilings to rectangles. Two arrangements of rectangles and corresponding tilings that illustrate the capabilities and limitations of the dynamic programming algorithm's tilings. (a) The first sample outbreak configuration. (b) An optimal tiling of the first sample outbreak configuration that is within the algorithm's search space. (c) The second sample outbreak configuration. (d) The most optimal tiling of the second sample outbreak configuration within the algorithm's search space.(e)An optimal tiling of the second sample outbreak configuration outside the algorithm's search space.
Model averaging
In the above algorithm it was assumed that the task at hand is model selection, that is, finding the most likely tiling given the data. Below we present an adaptation of the dynamic programming algorithm to perform Bayesian model averaging as well, in order to derive a posterior probability that an outbreak is occurring given the data. In simple terms, this is done by replacing maximization with summation to obtain two sums: S_{0}(Data), the sum of the scores of all tilings that do not include outbreaks (blank tiles only); and S_{01}(Data), the sum of the scores of all tilings considered. More specifically, replacing maximization by summation leads to the dynamic programming algorithm for summation over tilings in Figure 7.
Figure 7. Tiling summation algorithm. Psuedocode for the dynamic programming algorithm for summing over tiling values that are products of individual tile values defined by a function h(⋅).
The summation algorithm calculates the sum
Here, the function h(R_{L}, R_{H}, C_{L}, C_{H}) defines the tilewise values we would like to sum over. By substituting each of the following alternative functions for h we obtain the desired sums S_{01}(Data) and S_{0}(Data), respectively:
Score normalization
Under the above setup, the tiling scores need to be normalized in order to have the prior probabilities over all hypotheses sum to 1. The normalization constant Z_{T }can be simply calculated as a sum over the priors associated with all tilings:
Hence the normalization constant Z_{T }is simply the sum over tilings in (28) with h ≡ 1. Additional file 3: Appendix C shows that for an R × C grid, this value can be calculated in closed form as Z_{T }= f(R, C, y) using Equation (32) with y = 1:
Additional file 3. Appendix C: Normalization function. Proof of equation (32).
Format: PDF Size: 144KB Download file
This file can be viewed with: Adobe Acrobat Reader
Similarly, the prior probability, as governed by the structure prior parameter p, of the absence of an outbreak anywhere in the grid can be calculated as a sum over the priors of all nonoutbreak tilings:
This relationship was used to pick the value of p that matches the particular value of P(outbreak) = 0.04 for the prior probability of an influenza outbreak based on expert assessment. Since we have not found a closedform inverse to this relationship, the appropriate the value of p was found numerically using a binary search over numbers in [0,1].
In order to obtain the posterior probability of an outbreak, we normalize the sums of scores to obtain: S_{0,1}(Data)/Z_{T }= P(Data) and S_{0}(Data)/Z_{T }= P(Data, no outbreak). This allows us to obtain the posterior probability of an outbreak
In a typical biosurveillance application this is the posterior that we would use to detect whether an outbreak is occurring anywhere in the surveillance region by testing whether it rises above a predefined alert threshold.
Evaluation methods
This section describes an evaluation of the dynamic programming (DP) algorithm, both in terms of model selection and in terms of model averaging. The evaluation was preformed using real background data consisting of information about chief complaints and home ZIP codes of ED patients who are presumed not to have an outbreak disease. Synthetic data were generated to simulate the presence of influenza outbreak cases. The evaluation consists of a comparison to a baseline that scans over singlerectangle hypotheses (SR), as well as a comparison for model selection against a greedy algorithm (GR) that hypothesizes one or more nonoverlapping outbreak rectangles. The next section describes these algorithms and the section that follows it describes the data in further detail.
Baseline methods
The simpler of the baseline methods that the DP algorithm is compared to is a method of scanning over singlerectangle hypotheses (SR). For model selection, it consists of iterating over all possible placements of a single rectangle on the grid and finding the placement that maximizes the posterior probability of an outbreak, calculated as:
where S_{i }represents some rectangular region on the grid and is its complement. In the implementation used in this evaluation, the prior probability P(H_{1}(S_{i})) of having an outbreak in rectangle S_{i }is set to be equal for all rectangles, hence maximization of the posterior probability here is equivalent to maximization of the likelihood . The prior probabilities P(H_{1}(S_{i})) are chosen so that the prior probability P(Outbreak) of an outbreak anywhere in the region matches the one for the dynamic programming algorithm.
Let the notation S(R_{L}, R_{H}, C_{L}, C_{H}) denote the rectangle S_{i }defined by those row and column boundaries, and let . Then model averaging using the SR algorithm consists of simply calculating the posterior:
The other baseline method that is used in the evaluation is the greedy algorithm (GR) without recursion by Jiang and Cooper [18]. We use this algorithm for model selection only, and it operates iteratively: First it selects the single rectangle that maximizes the score , where q is a structure prior. Next the nonoverlapping rectangle that maximizes the score is selected. This process of adding one rectangle at each step and multiplying by the structure prior q with each new added rectangle is repeated until the score can no longer be increased. The result is a collection of nonoverlapping rectangles with the associated score
Outbreak rectangles are considered to be independent conditioned on their positions. The value of q is chosen to be equal to the value used in the SR algorithm for P(H_{1}(S_{i})).
Data generation
Multiple outbreak scenarios were generated on a 10 × 10 grid with a population distribution based on ZIP Code Tabulation Area populations in Allegheny County, Pennsylvania as reported in the 2000 Census. Real ED data with home ZIP code information was used as a nonoutbreak background scenario. Data from the months of JuneAugust of the years 20032005, a total of 181 days, were used to evaluate the background scenario. The data for each day consisted of a deidentified list of records for patients who visited emergency departments in Allegheny county on that day, where each record contained the patient's home zip code and chief complaints. Ethics approval for use of the data was provided by the University of Pittsburgh IRB.
Outbreaks of varying shapes, sizes, and frequencies were then injected into the data to obtain outbreak scenarios. In simulating outbreaks, the outbreak frequency F was sampled as a continuous uniform random variable from one of the following four frequency ranges: low (0 to 5.91 × 10^{5}), mid (5.91 × 10^{5 }to 3.55 × 10^{4}), high (3.55 × 10^{4 }to 6.50 × 10^{4}), and very high (6.50 × 10^{4 }to 0.5).
For each frequency range, four sets of outbreak specifications were generated where the rectangle sizes were limited to a maximum of 10, 40, 60, and 100 cells. Each of these sets was in turn composed of five sets of specifications with n = 1 to 5 nonoverlapping rectangles, 10 scenarios for each value of n, giving a total of 800 outbreak specifications. The positioning of each rectangle on the grid was uniformly random. Figure 8 shows a sample of the randomly generated outbreak shapes that were used. Each rectangle j was assigned an outbreak frequency F_{j }sampled from the previously determined frequency range.
Figure 8. Randomly generated outbreaks. A sample of some randomly generated outbreak shapes for testing.
For each of the 181 days of background data, each of the 800 outbreak specifications was used to create an outbreak scenario by injecting disease cases into that day by selecting an individual in a an outbreak rectangle indexed by j with probability F_{j }and setting the corresponding symptom I_{i }by sampling the distribution of P(I_{i}D_{i }= flu) defined in our disease model.
For each outbreak scenario, the outbreak's coverage (the proportion of the grid that is covered by outbreak rectangles) was also recorded. This value is later used to stratify the test results.
Results and discussion
Evaluation of outbreak presence detection
We evaluated model averaging of the dynamic programming (DP) and the single rectangle (SR) algorithms by obtaining the posteriors for each day in the background data (when no outbreak is occurring) and for each day in the outbreak scenarios, and aggregating these results to obtain Receiver Operating Characteristic (ROC) curves. Note that days are treated as single snapshots of the surveillance region with no temporal context, and the area under the ROC curve gives us a measure of the quality of detection of an outbreak from observing only a single day of the outbreak regardless of how far into the outbreak's progression that day is. This is unlike another popular metric in the outbreak detection literature, the Activity Monitoring Operating Characteristic (AMOC) curve, where a progressing outbreak is simulated over a span of multiple days and the time to detection is measured. We use an evaluation measure that does not take time into account because the algorithms in question themselves are purely spatial. Additionally, the need to simulate outbreak progression for AMOC analysis inadvertently introduces additional assumptions about the dynamics of the outbreaks to be detected, which is not the focus of the current evaluation.
In these tests the disease model used assumes a singlevalued K. Table 1 reports the areas under the curve (AUC) and the results of a statistical significance test based on [25] of the difference in areas under the curve of the two methods. The table shows the AUC for each algorithm, the lower and upper 95% confidence limits for the difference in areas (DPSR) and the associated pvalues. Note that in the midrange frequencies and above, the posteriors for all outbreaks with coverage 35% and above were all equal to 1 up to numerical precision for both methods. As a result, the ROC curves obtained for those samples are identical making the comparison trivial. In the "very high" frequency range all outbreaks yielded posteriors that are 1 up to numerical precision, and for this reason this frequency range is omitted from Table 1.
Table 1. Model averaging comparison in terms of area under the ROC curve
The results show that both methods perform similarly and do not achieve statistically different results, except in the case of midrange frequencies with outbreak coverage of 1634% of the surveillance grid where the SR method performs statistically significantly better, but the performance difference was inconsequential from a practical standpoint. We can also see that for both methods the area under the ROC curve is strongly influenced by both the frequency and coverage of the outbreak, increasing as these parameters increase, as expected.
Additional investigation
The result that the DP algorithm does not perform better than the SR algorithm is surprising since the DP algorithm is able to consider multiplerectangle hypotheses which would be expected to drive the posterior probability up significantly when there are multiple simulated outbreaks.
Since it is not immediately obvious from inspection of the algorithms why the anticipated improvement is absent, we performed targeted tests with the same background data and speciallydesigned outbreaks, such as the one illustrated in Figure 9. Such "doughnut shaped" outbreaks are expected to be more challenging for the SR algorithm because any placement of a single rectangle on the grid will either miss a significant portion of the outbreak cells or include a significant number of nonoutbreak cells. The outbreak shape in Figure 9 and the possible combinations of three, two, and one rectangle of the four shown in Figure 9 (making a total of 15 different outbreak shapes) were injected into the background data at the low, mid, and high frequency ranges to obtain ROC curves. As the statistical comparison in Table 2 shows, these tests show that even in multirectangle outbreaks designed to be difficult to approximate with a singlerectangle cluster, the SR algorithm performs at least as well as the DP algorithm in terms of area under the ROC curve.
Figure 9. Doughnutshaped outbreak. An example of a specially designed outbreak for further comparison of SR and DP model averaging.
Table 2. Additional model averaging comparison
In light of these results we also followed another line of investigation, namely, one of changing the underlying likelihood function with the thought that a model that more accurately represents the background scenarios would yield more representative likelihoods, and possibly better differentiation between the nuances of the two algorithms. Since in reality the proportion K of people coming to the ED for nonoutbreak diseases varies from day to day, modeling K as a random variable with a nontrivial distribution is more realistic. Under this new model, the DP algorithm performed similarly to its performance under the old model. The SR algorithm, however, took a severe turn for the worse as it marked almost every nonoutbreak scenario with a very high posterior probability, numerically indistinguishable from one. For this reason, we do not present a comparison between the methods using the distributionbased model. We have carefully checked the correctness of the implementation of the SR algorithm and must therefore conclude that this behavior is indeed a drawback of the the SR algorithm. Further investigation is required to fully understand this behavior, however.
Comparison to SaTScan™
We have compared DP to SaTScan™,^{a }the software package developed by Kulldorff that can use spatial, temporal, or spacetime scan statistics [26]. SaTScan™ implements a popular set of algorithms that (1) have been extensively evaluated by its creators, (2) perform well in practice, (3) are made available for free download for research purposes, and (4) have been used by others as a benchmark ofscanning performance. Therefore, we believe SaTScan™ provides a good point of comparison to DP, although we describe below some caveats regarding this comparison. Since the current implementation of DP is purely spatial, we restricted the comparison only to the spatial scan statistic.
It is important to note that there are fundamental differences between the DP algorithm and SaTScan™. Firstly, SaTScan™ is designed to detect clusters of higher or lower than normal values of some measurement and report those as clusters, while our method uses a disease model to calculate a likelihood of a state of interest based on observations. This means that in our testing, we cannot use SaTScan™ to directly detect high numbers of influenza cases per se, since the disease state is never directly observed in the data. For this reason, we used SaTScan™ to detect abnormally high counts of patients that came in with a chief a complaint of either "cough" or "fever", as they are indicative of the locations of influenza cases. We leave the information provided to the DP algorithm unchanged (the values of the chief complaint variable are provided) in this comparison. Secondly, SaTScan™ does not scan over rectangular regions, but rather over circular or ellipseshaped regions [4]. We still use the same generated outbreaks for this comparison, meaning that the true outbreaks are rectangular. SaTScan™ does have the ability to detect multiple clusters in an iterative process that is similar to the greedy algorithm. In this mode, SaTScan™ scans for the most likely ellipseshaped cluster, then removes the data it captures, and repeats the process until the pvalue (the probability of the observed data under the null hypothesis that no cluster exists) of the newly detected cluster falls above 0.05 [27]. Since this is the setting which is most suited for detecting multiple and irregularlyshaped clusters, it is the one we used in our comparison. Of the models available to SaTScan™ we used the Poissonbased model [3] since it is the most appropriate for the setting where we are detecting abnormally high event counts for a known population at risk.
Table 3 reports the AUC and the statistical significance of ROC differences comparing the DP algorithm to SaTScan™. In order to obtain ROC curves from SaTScan™, we used the pvalue of the most likely cluster found as the criterion variable, with lower pvalues corresponding to a positive decision (in favor of an outbreak). The table shows that SaTScan™ performs statistically significantly worse than DP in all tests.
Table 3. DP compared to SaTScan™ in terms of area under the ROC curve
A possible factor influencing the poor performance of SaTScan™ is that the task that it performs is not a perfect fit for our evaluation measure. We are measuring performance in the task of identifying injected disease outbreaks, while SaTScan™ performs the task of detecting elevated counts of the "cough" or "fever" chief complaint. Also, we injected rectangular outbreaks, whereas SaTScan™ is looking for elliptical shaped outbreak regions. Another potential factor to the performance of SaTScan™ is that the background (nooutbreak) scenarios also contain clusters of elevated chief complaint counts of cough and fever, even in the absence of injected influenza outbreak cases. DP itself is not free from the same problem, however, since it does assign high posteriors to some background scenarios, but to a lower extent.
Evaluation of outbreak location detection
In order to compare the quality of the detection of the location of outbreak clusters, the cellwise spatial precision and recall of each algorithm were measured in each outbreak scenario. In terms of the most likely outbreak subregion Q detected by each algorithm and the actual subregion SUB covered by the simulationgenerated outbreak rectangles in each scenario, the cellwise spatial precision is taken to be the proportion Q ∩ SUB/Q and the cellwise spatial recall to be Q ∩ SUB/SUB where the notation X signifies the area of subregion X. The measurement is restricted to populated cells only.
The disease model used in these tests is also the one that assumes a singlevalued K, due to details of the greedy algorithm implementation. Recall from equation (37) that the likelihood of an irregular region, needs to be calculated at each step. When K is a distribution, this calculation needs to be redone for every considered hypothesis because the expression is a sum of products, leading to much longer runtime. However, when K is a single value, since the likelihood of that region is a simple product, we can take advantage of canceling terms and calculate it much faster from cached values as
Table 4 shows the average cellwise spatial precision for the DP, GR, and SR methods. The table also shows the result of paired ttests for significance of the difference between the DP and GR algorithms. Since it is clear that the singlerectangle algorithm performs worse than the other two, the statistical comparison to SR is not shown here. Table 5 shows a similar table for cellwise recall.
Table 4. Comparison of model selection in terms of spatial precision
Table 5. Comparison of model selection in terms of spatial recall
First note that all differences observed between the DP and GR algorithms, while small, are statistically significant at the large sample sizes used. In terms of both spatial precision and recall, the results show that the greedy algorithm does slightly better than the dynamic programming algorithm at low frequencies, even though the performance of both is very poor. This may be an indication that the greedy algorithm is more sensitive to small differences in likelihood. For the most part, as frequency and coverage of the outbreak increase, the performance of the dynamic programming algorithm overtakes that of the greedy algorithm. Most notably, the dynamic programming algorithm yields much better spatial precision for outbreaks with higher spatial coverage (and therefore more outbreak rectangles). The differences in recall, while statistically significant, do not amount to much from a practical standpoint.
We also compare the DP algorithm to SaTScan™ in terms of spatial precision and recall. Table 6 compares the algorithms in terms of spatial precision and Table 7 compares the algorithms in terms of spatial recall. In order to obtain precision and recall readings from SaTScan™, we considered locations that are contained in a detected cluster with a pvalue below 0.05 to be marked as outbreak locations, and all others to be marked as nonoutbreak locations.
Table 6. Comparison of spatial precision against SaTScan™
Table 7. Comparison of spatial recall against SaTScan™
The spatial precision results show that SaTScan™ achieves precision not statistically significantly different from DP for outbreaks of low coverage and frequency, while for other cases DP achieves statistically significantly higher precision. The spatial recall results show that SaTScan™ achieves recall that is statistically significantly higher than DP for outbreaks of coverage below 35% and low frequency. Both methods perform generally poorly in this range. In the remainder of the cases DP outperforms SaTScan™ in terms of recall as well as precision.
We emphasize that DP has the advantage of performing inference about whether an individual has influenza using the same disease model that was used to generate the injected outbreaks. SaTScan™, however, merely searches for elevated chief complaint counts. Also note that the injected outbreaks are rectangular in shape, while SaTScan™ uses elliptical search windows. An ellipse is not able to fit some of the outbreak scenarios well. Additionally, unlike our model which has prior parameters defining a distribution over the chief complaints that are observed when no outbreak is present, SaTScan™ computes the expected counts for the absence of a cluster based on the counts outside the search window. As a result, when an elliptical window fails to accurately enclose the outbreak, SaTScan™'s assessment of the expected level of chief complaint counts is affected. We believe that the combination of these factors is what is responsible for SaTScan™'s lower performance.
Evaluation of running time
A major motivating factor for developing a dynamic programming spatial scan is that brute force multiregion spatial scans are computationally intensive. In order to evaluate the feasibility of using the DP algorithm in a practical application setting, we performed a variety of timing tests on input data of varying grid sizes and populations. The population size affects the running time not due to any feature of the dynamic programming itself, but rather due to the agentbased disease model. In order to properly interpret the timing results, it is first useful to briefly describe some technical aspects of our implementation.
The implementation consists of two modules, one module reads in the data to be scanned and uses the disease model to compute the loglikelihood of each possible grid rectangle for each value of the flu incidence frequency F (and the "other" disease state frequency K when it is a distribution rather than a constant), including F = 0, and stores this table of likelihoods in a data structure. The second module is the actual DP scanning algorithm which queries the data structure constructed by the first module every time the likelihood of a tile needs to be calculated. We actually reuse the first module in the implementations of the SR and greedy algorithms since queries to the same tables of likelihoods need to be made by each of those algorithms as well.
Figures 10 and 11 show median timing results for the likelihood calculation module, the scan modules of the SR and DP algorithms, and SaTScan™. All tests were performed on a desktop PC with a quad core 2.66 GHz processor and 4 GB of RAM. Figure 10 shows timing results for grids of size 10 × 10, 20 × 20, 50 × 50, and 100 × 100. We produced data with these varying grid sizes by dividing the same original map of Allegheny county into smaller cells to obtain "larger" grids (in terms of cell counts). The results are in line with our theoretical analysis that the runtime of the DP scan, like that of the SR scan, scales as a square of the number of cells in the grid. The results also show that grid size does not appreciably affect the time taken to compute the table of likelihoods that the DP algorithm uses. It can be seen that SaTScan™ takes approximately five minutes to complete a scan on a 50 × 50 grid, while the DP algorithm can complete a scan on a 100 × 100 grid in the same amount of time. A scan on a 100 × 100 grid by SaTScan™would take on the order of hours (not shown). The longer runtimes incurred by SaTScan™ for larger size grids are largely influenced by the 999fold randomization testing that SaTScan™ uses to compute pvalues for the scan statistic. The same observation was made by Neill, Moore, and Cooper in previous work [13].
Figure 10. Comparison of running time across a range of grid sizes. Comparison of running time across grid sizes of 10 × 10, 20 × 20, 50 × 50, and 100 × 100 using the original 2000 census population of Allegheny County. The plot shows the running time of the calculation of the table of likelihoods, the scan of the SR algorithm, the scan of the DP algorithm, and SaTScan™.
Figure 11. Comparison of running time across a range of population sizes. Comparison of running time for a 10 × 10 grid across populations of 1, 5, 25, and 100 times the original 2000 census population of Allegheny County. The plot shows the running time of the calculation of the table of likelihoods, the scan of the SR algorithm, the scan of the DP algorithm, and SaTScan™.
Figure 11 shows timing results for 1, 5, 25, and 100 times the original census population of Allegheny County. We produced data with these varying population sizes by merely counting each individual multiple times. The results are consistent with our theoretical analysis that the likelihood calculation phase of the implementation scales linearly with the population size due to the agentbased nature of our model. The DP and SR scan times are not affected by variations in populations size, since those algorithms always query likelihoods for entire rectangles. The runtime of SaTScan™ is not appreciably affected by changes in the population size, which can be explained by the fact that as a countbased method, it deals with perlocation counts.
Conclusions
This paper described a dynamic programming algorithm for spatial cluster detection. The algorithm specifies alternative clustering hypotheses in terms of colored tilings of the surveillance grid. We showed how the algorithm can be used both for characterizing the location and shape of the most likely clusters and for calculating the posterior probability of the presence of clusters in the data.
Tests of general detection in terms of area under the ROC curve show that the new method is not statistically significantly better than the naïve calculation that only averages over singlerectangle hypotheses. This result appears to remain the case even when the true outbreaks cannot be reasonably approximated by a single rectangle. We conjecture that the additive nature of the process of model averaging either gives the naïve method more detection power than expected or hinders the dynamic programming method. On the one hand, even though the singlerectangle method cannot accurately capture multiple clusters with a single hypothesis, it can accurately capture parts of the multicluster scenario (the single rectangles that compose a multirectangle scenario). These partial agreements may make a sufficient contribution to the likelihoods so that averaging over all hypotheses results in relatively high posterior probability for the multicluster scenario. On the other hand, while the dynamic programming algorithm is able to accurately capture a multicluster scenario in a single hypothesis, it does consider an exponential number of alternative hypotheses that do not accurately describe the outbreak scenario. It is possible that when averaging over the entire hypothesis space the score of the accurate hypotheses is simply outweighed by the alternative scores. The reasons for why the two methods perform similarly may be a combination of both of these reasons and possibly others. Further investigation is needed to explain these results fully.
In spite of the lack of improvement in general detection power, tests of location and shape characterization show that under certain conditions the dynamic programming method performs better than the naïve SR method and the greedy method. The most notable improvement is in the precision of cluster locations detected, as DP algorithm tends to output fewer falsepositive outbreak cells than the other algorithms. Recall of cluster locations is on par with the greedy method.
Measurements of the DP algorithm's running time over varying grid and population sizes showed that it scales well to larger grids, being able to complete a scan of a 100 × 100 grid in about five minutes. The measurements also show that the dynamic programming algorithm has a general advantage of taking no more time than the naïve singlerectangle scan. Since the scan itself relies on having access to a likelihood function for the likelihood of an outbreak in a particular rectangle, the runtime will depend on nature of the likelihood calculation. In our implementation and using our particular disease model, this calculation could be performed in a separate module and timed separately. We observed that the likelihood calculation scales linearly with the population using our agentbased model, taking under four minutes for a population of almost 150 million.
Just like the baseline methods considered, the dynamic programming algorithm can be applied to other likelihood models as long as they provide a likelihood as a function of location. In particular, this means that even though an agentbased model was used in the experiments, these methods are also suited for use with a countbased model. These methods can also be extended to additional space dimensions as well as time, and multiple cluster classes (outbreak states). The DP algorithm can be also modified to perform finer characterization such as finding the most likely values of F and K associated with each tile. By extension, using a different underlying model the algorithm could be modified to estimate the most likely values for that model's parameters. In that sense, there are many potential applications for the dynamic programming algorithm, as well as many potential avenues for further investigation.
Endnotes
^{a}SaTScan™ is a trademark of Martin Kulldorff. The SaTScan™ software was developed under the joint auspices of (i) Martin Kulldorff, (iii) the National Cancer Institute, and (iii) Farzad Mostashari of the New York City Department of Health and Mental Hygiene.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
GF Cooper conceived of the basic dynamic programming algorithm presented here, advised and supervised all stages of research, and participated in the drafting of the manuscript. X Jiang performed initial tests and initial implementation of the model selection version of the dynamic programming algorithm and performed the analysis of the number of tilings searched. Y Sverchkov contributed to the development of the model averaging version of the dynamic programming algorithm, implemented and tested all the algorithms presented in this paper, and took the lead in drafting this paper. All authors read and approved the final manuscript.
Acknowledgements
This work was funded by CDC grant P01HK000086 and NSF grant IIS0911032.
References

Kulldorff M: Spatial scan statistics: models, calculations, and applications.
In Scan Statistics and Applications Edited by Glaz J, Balakrishnan M, Birkhauser. 1999, 303322.

Wang X, Hutchinson R, Mitchell TM: Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects.
Advances in Neural Information Processing Systems 2004, 18:709716.

Kulldorff M: A Spatial scan statistic.
Commun Stat Theory Methods 1997, 26(6):14811496. Publisher Full Text

Kulldorff M, Huang L, Pickle L, Duczmal L: An elliptic spatial scan statistic.
Stat Med 2006, 25:39293943. PubMed Abstract  Publisher Full Text

Kulldorff M, Athas W, Feuer E, Miller B, Key C: Evaluating cluster alarms: A spacetime scan statistic and cluster alarms in Los Alamos.
Am J Public Health 1998, 88:13771380. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kulldorff M: Prospective timeperiodic geographical disease surveillance using a scan statistic.
J R Stat Soc A 2001, 164:6172. Publisher Full Text

Neill DB, Moore AW, Sabhnani MR: Detecting elongated disease clusters.
Morb Mortal Wkly Rep 54
doi:2005. Supplement on Syndromic Surveillance

Duczmal L, Assunção R: A simulated annealing strategy for the detection of arbitrary shaped spatial clusters.
Comput Stat Data Anal 2004, 45:269286. Publisher Full Text

Patil GP, Taillie C: Upper level set scan statistic for detecting arbitrarily shaped hotspots.

Tango T, Takahashi K: A flexibly shaped spatial scan statistic for detecting clusters.
Int J Health Geogr 2005, 4:11. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Neill DB, Moore AW: Rapid Detection of Significant Spatial Clusters.
In Conference on Knowledge Discovery in Databases (KDD) 2004 Edited by Guerke J, DuMouchel W. 2004.

Neill DB, Moore AW, Pereira F, Mitchell T: Detecting Significant Multidimensional Spatial Clusters.

Neill D, Moore A, Cooper G: A Bayesian spatial scan statistic.
In Advances in Neural Information Processing Systems Edited by Y Weiss ea. 2005, 18:10031010.

Neill DB, Cooper GF: A Multivariate Bayesian Spatial Scan Statistics.

Neill DB, Cooper GF, Das K, Jiang X, Schneider J: Bayesian network scan statistics for multivariate pattern detection.
In Scan Statistics: Methods and Applications Edited by Glaz J, Pozdnyakov V, Wallenstein S, Birkhäuser. 2009.

Cooper GF, Dash DH, Levander JD, Wong WK, Hogan WR, Wagner MM: Bayesian Biosurveillance of Disease Outbreaks. In Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI04). AUAI Press; 2004:94103.

Jiang X, Neill DB, Cooper GF: A Bayesian network model for spatial event surveillance.
Int J Approximate Reasoning 2010, 51(2):224239.
Bayesian Model Views
Publisher Full Text 
Jiang X, Cooper GF: A Recursive Algorithm for Spatial Cluster Detection.
Proceedings of the Symposium of the American Medical Informatics Association (AMIA) 2007, 369373.

Shen Y, Wong WK, Levander JD, Cooper GF: An Outbreak Detection Algorithm that Efficiently Performs Complete Bayesian Model Averaging Over All Possible Spatial Distributions of Disease.

Ester M, peter Kriegel H, S J, Xu X: A densitybased algorithm for discovering clusters in large spatial databases with noise. In 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR. AAAI Press; 1996:226231.

Pei T, Jasra A, Hand DJ, Zhu AX, Zhou C: DECODE: A new method for discovering clusters of different densities in spatial data.
Data Min Knowl Discov 2009, 18(3):337369. Publisher Full Text

Jiang X: A Bayesian Network Model for SpatioTemporal Event Surveillance. In PhD dissertation. University of Pittsburgh, Department of Biomedical Informatics; 2008.

DeLong ER, DeLong DM, ClarkePearson DL: Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach.
Biometrics 1988, 44(3):837845. PubMed Abstract  Publisher Full Text

Kulldorff M, Information Management Services, Inc: SaTScan™ v8.0: Software for the spatial and spacetime scan statistics. [http://www.satscan.org] webcite
2009.

Zhang Z, Assunçãdo R, Kulldorff M: Spatial Scan Statistics Adjusted for Multiple Clusters.
Prepublication history
The prepublication history for this paper can be accessed here: