Computational Sciences Center of Emphasis, Pfizer Worldwide Research & Development, Cambridge, MA, USA

Mathematics Department and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA

Compound Safety Prediction, Pfizer Worldwide Research & Development, Cambridge, MA, USA

Abstract

Background

Causal graphs are an increasingly popular tool for the analysis of biological datasets. In particular, signed causal graphs--directed graphs whose edges additionally have a sign denoting upregulation or downregulation--can be used to model regulatory networks within a cell. Such models allow prediction of downstream effects of regulation of biological entities; conversely, they also enable inference of causative agents behind observed expression changes. However, due to their complex nature, signed causal graph models present special challenges with respect to assessing statistical significance. In this paper we frame and solve two fundamental computational problems that arise in practice when computing appropriate null distributions for hypothesis testing.

Results

First, we show how to compute a p-value for agreement between observed and model-predicted classifications of gene transcripts as upregulated, downregulated, or neither. Specifically, how likely are the classifications to agree to the same extent under the null distribution of the observed classification being randomized? This problem, which we call "Ternary Dot Product Distribution" owing to its mathematical form, can be viewed as a generalization of Fisher's exact test to ternary variables. We present two computationally efficient algorithms for computing the Ternary Dot Product Distribution and investigate its combinatorial structure analytically and numerically to establish computational complexity bounds.

Second, we develop an algorithm for efficiently performing random sampling of causal graphs. This enables p-value computation under a different, equally important null distribution obtained by randomizing the graph topology but keeping fixed its basic structure: connectedness and the positive and negative in- and out-degrees of each vertex. We provide an algorithm for sampling a graph from this distribution uniformly at random. We also highlight theoretical challenges unique to signed causal graphs; previous work on graph randomization has studied undirected graphs and directed but unsigned graphs.

Conclusion

We present algorithmic solutions to two statistical significance questions necessary to apply the causal graph methodology, a powerful tool for biological network analysis. The algorithms we present are both fast and provably correct. Our work may be of independent interest in non-biological contexts as well, as it generalizes mathematical results that have been studied extensively in other fields.

Background

Causal graphs are a convenient representation of causal relationships between variables in a complex system: variables are represented by nodes in the graph and relationships by directed edges. In many applications the edges are also signed, with the sign indicating whether a change in the causal variable positively or negatively affects the second variable. Causal graphs can serve as predictive models, and conclusions can be drawn from comparing the models' predictions to experimental measurements of these variables. Pollard et al.

Published research in biology provides a wealth of regulatory relationships within the cell that we mine to produce a causal network. The edges in this network are directed (by the flow of causality among the corresponding variables) and signed (by the sign of the correlation between the variables). Directed paths within the network thus predict putative upregulation and downregulation that would be effected downstream by changes in the level of a given entity (i.e., vertex in the graph). Our companion paper

Illustration of the causal graph methodology

**Illustration of the causal graph methodology**. Schematic depiction of a set of relationships curated from the literature and transformed into a causal graph, used to explain gene expression data.

In this paper, we study the problem of evaluating statistical significance of the conclusions drawn from a causal graph-based model given a particular gene expression dataset. To form a null distribution, either the correspondence between gene transcripts and experimental expression values or the connectivity of the graph can be randomized. Thus, the statistical significance question splits into two subproblems. First, how likely is it for the same level of agreement between predicted and observed regulation to be achieved when the classification of gene transcripts (as upregulated, downregulated, or neither) is randomly drawn from a family of all classifications with similar characteristics? Second, how likely is it to occur when the causal graph is randomly drawn from a family of all causal graphs with similar characteristics?

Answering the first question amounts to computing the distribution of the dot product of two vectors with components in {-1, 0, 1}, each drawn randomly from the family containing all such vectors with a fixed number of components of each value. This problem, which we call Ternary Dot Product Distribution, generalizes Fisher's exact test
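As a concrete (if naive) illustration, the distribution can be computed by brute force for tiny vectors by enumerating every distinct arrangement of the randomized classification. The sketch below is in Python (the authors' implementation is in R), with function and parameter names of our own choosing; it is practical only for very small n, serving as a reference point for the faster algorithms developed later.

```python
import itertools
from collections import Counter
from fractions import Fraction

def ternary_dot_distribution(fixed, counts):
    """Exact distribution of <fixed, v>, where v ranges uniformly over all
    arrangements of a multiset with counts[c] components equal to c, c in {-1, 0, 1}."""
    pool = [1] * counts[1] + [-1] * counts[-1] + [0] * counts[0]
    perms = set(itertools.permutations(pool))  # distinct arrangements only
    tally = Counter(sum(f * v for f, v in zip(fixed, perm)) for perm in perms)
    total = sum(tally.values())
    return {score: Fraction(c, total) for score, c in tally.items()}

# Fixed vector (1, 1, -1, 0); randomized vector has one +1, one -1, and two 0s.
dist = ternary_dot_distribution((1, 1, -1, 0), {1: 1, -1: 1, 0: 2})
```

For this toy instance there are 12 distinct arrangements, and, for example, the maximal score 2 occurs with probability 1/6.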

Answering the second statistical significance question analytically does not appear to be possible, but the desired likelihood may be approximated by sampling uniformly at random from the family of all causal graphs with the same basic structure as the original causal graph: namely, the same positive and negative in- and out-degrees of each vertex. Because of the structure of the problem, even drawing one causal graph from this family is challenging. We call this the Causal Graph Randomization problem. Previous work on the problem of graph randomization has focused on undirected graphs

The rest of this paper is organized as follows. We begin by describing the regulatory network model based on causal graphs and discuss the way conclusions are drawn from it and the importance and subtleties of computing their statistical significance. We then describe the Ternary Dot Product Distribution problem and present two efficient algorithms to solve it: an algorithm with complexity cubic in the number of variables (i.e., vertices) in the graph but requiring computation in exact arithmetic, and an algorithm with a weaker complexity guarantee but numerically stable and efficient in practice. Finally, we discuss the challenges of the Causal Graph Randomization problem and present a practical algorithm for it using local graph operations, and conclude by describing future work.

Model Description

The two fundamental properties of causal relationships between biological entities are (1) the direction of causality between them; and (2) the qualitative response (i.e., upregulation or downregulation) of the second entity when the first one is upregulated or downregulated. This information can be encapsulated in a signed directed graph

For any two nodes

Hypothesis scoring

Given a gene expression dataset, we may classify gene transcripts into three families: significantly upregulated, significantly downregulated, and not significantly regulated. We refer to this classification as the

Given a particular entity

In order to evaluate the goodness-of-fit of a particular hypothesis to the observed gene expression dataset, we declare a prediction to be
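Concretely, under the natural numeric encoding of the three classes, the score reported in our example table (number of correct minus incorrect predictions) is exactly the ternary dot product of the predicted and observed classification vectors. A small Python illustration on made-up data:

```python
# Hypothetical six-transcript example: up = +1, down = -1, unchanged = 0.
predicted = [1, 1, -1, 0, 1, -1]   # model-predicted regulation
observed  = [1, -1, -1, 0, 0, -1]  # experimentally observed regulation

correct   = sum(1 for p, o in zip(predicted, observed) if p != 0 and p == o)
incorrect = sum(1 for p, o in zip(predicted, observed) if p * o == -1)
score     = sum(p * o for p, o in zip(predicted, observed))  # ternary dot product

# The dot product adds +1 per matching nonzero pair and -1 per opposing pair,
# so score == correct - incorrect by construction.
```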

Statistical significance

The scores computed for each putative hypothesis provide us with an overall ranking of all hypotheses. However, a good score does not necessarily imply good explanatory power, because of possible connectivity differences between the transcript nodes of

In addition, we need to understand how significant the rank of a hypothesis is with respect to another null model, in which the gene expression data remains fixed but the causal graph is allowed to vary, only keeping basic connectivity properties. More specifically, we examine the rank of a hypothesis of interest in the family of graphs with the same sequence of positive and negative in-degrees and out-degrees as

Illustrative Example

To build intuition for the proposed method we outline an example application based on previously published experimental data (GEO accession GSE7683

Our approach provides a statistical framework for causal inference that may be particularly valuable in such a situation. As outlined above, we consider each entity in our causal graph together with a direction of perturbation as a hypothesis; based on the network model, perturbing the entity should effect changes downstream, and we assess significance of the concordance between the predicted and experimentally measured changes by computing p-values based on the Ternary Dot Product and Causal Graph randomized null models. For simplicity, in this example we only consider predicted downstream effects one step downstream of each entity. Figure

Scoring of an example hypothesis

**Scoring of an example hypothesis**. Illustration of scoring for the

Table

Top hypotheses by score and corresponding p-values on an example dataset

| **Rank** | **Hypothesis Name** | **Correct** | **Incorrect** | **Score** | **Ternary Dot Product p** | **Causal Graph p** |
|----------|--------------------------|----|---|----|---------------|---------|
| 1 | Response to Hypoxia+ | 48 | 9 | 37 | 2 × 10^{-12} | < 0.001 |
| 2 | Dexamethasone+ | 20 | 4 | 16 | 6 × 10^{-6} | < 0.001 |
| 3 | Hydrocortisone+ | 17 | 4 | 13 | 1 × 10^{-8} | < 0.001 |
| 4 | PGR+ | 12 | 1 | 11 | 6 × 10^{-8} | < 0.001 |
| 5 | SRF+ | 10 | 0 | 10 | 3 × 10^{-5} | < 0.001 |
| 6 | KLF4+ | 9 | 0 | 9 | 3 × 10^{-6} | < 0.001 |
| 7 | NR3C1+ | 12 | 4 | 8 | 7 × 10^{-4} | < 0.001 |
| 7 | Glucocorticoid+ | 12 | 4 | 8 | 8 × 10^{-5} | < 0.001 |
| 7 | CCND1+ | 9 | 1 | 8 | 3 × 10^{-4} | < 0.001 |
| 7 | Triamcinolone acetonide+ | 8 | 0 | 8 | 9 × 10^{-7} | < 0.001 |
| ... | ... | ... | ... | ... | ... | ... |
| 17 | NRF2+ | 9 | 4 | 5 | 0.18 | 0.07 |

Top hypotheses by score in an example experimental dataset of dexamethasone-stimulated chondrocytes (GEO accession GSE7683).

Importantly, hypotheses are based on overlapping but different sets of regulated transcripts. Thus, while we assess significance of each hypothesis in isolation, the evidence shared among hypotheses should be helpful in building a more global understanding. For instance, 50% of the

Only 23 of the top 50 hypotheses by score pass a significance cutoff of 0.001 for both metrics, indicating the utility of significance assessment--not just score--in discerning hypotheses worthy of further investigation. For example, one high-scoring hypothesis attains a nominal p-value on the order of 10^{-5} under one metric yet fails the other, a result that is probably spurious.

This example is not meant as a comprehensive discussion of the affected biology, but it should provide some intuition for how the proposed measures can be used. For complex biological phenotypes, many hypotheses may be reported as significant, each supported by overlapping but distinct sets of transcriptional changes as evidence. While our proposed metrics judge the significance of single hypotheses independently, the results provide a statistically well-founded substrate on which to form a more comprehensive picture of potential drivers of the observed expression changes.

Results

We divide this section into two parts corresponding to the two statistical significance questions we address: Ternary Dot Product Distribution and Causal Graph Randomization.

Ternary Dot Product Distribution

We begin by establishing notation and phrasing the problem in a slightly more abstract setting which we find helpful for investigating its mathematical structure.

Problem definition

A **classification** is a vector **u** with components in {+1, 0, -1}. We denote by **u**(+1), **u**(0), and **u**(-1) the numbers of components of **u** equal to each of these values.

We are interested in understanding the distribution of the agreement between the fixed experimental classification

Denote the parameters of

where

for

Contingency table comparing predicted and experimental classifications

|                | Predicted + | Predicted - | Predicted 0 | Row sum |
|----------------|-------------|-------------|-------------|---------|
| Experimental + | n_{++}      | n_{+-}      | n_{+0}      | r_{+}   |
| Experimental - | n_{-+}      | n_{--}      | n_{-0}      | r_{-}   |
| Experimental 0 | n_{0+}      | n_{0-}      | n_{00}      | r_{0}   |
| Column sum     | c_{+}       | c_{-}       | c_{0}       |         |

Contingency table of predicted and experimental classifications. The entry n_{στ} counts transcripts with experimental classification σ and predicted classification τ. The columns sum to c_{+}, c_{-}, and c_{0}, the numbers of predicted classifications of each type, and the rows sum to r_{+}, r_{-}, and r_{0}, the numbers of experimental classifications of each type.

The same 3 × 3 contingency table will arise from a large number of randomized classifications. This number, which we denote D[n_{++}, n_{+-}, n_{-+}, n_{--}], depends only on the top left 2 × 2 corner of the table since the other entries are determined by the constraints on row and column sums. Using multinomial coefficients, we can write

We will write D[n_{±±}] as shorthand for this quantity.

The score for a classification

We also know that the total number of possible randomized classifications, which we denote N_{tot}, is given by a multinomial coefficient.

Thus, the distribution we are seeking is a sum of the D[n_{++}, n_{+-}, n_{-+}, n_{--}] aggregated by the score s[n_{++}, n_{+-}, n_{-+}, n_{--}] and normalized by N_{tot}. Explicitly, the probability of a score

and the p-value of a score can be computed by summing the right tail of the distribution.
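The computation just described can be sketched directly in code. The following Python reference implementation (naming is ours; the paper's software is written in R) enumerates the free upper-left 2 × 2 corner, counts the classifications realizing each table with multinomial coefficients, and aggregates by score. This is the straightforward approach, not the paper's optimized algorithms.

```python
from math import comb
from fractions import Fraction
from collections import defaultdict

def multinomial(n, parts):
    """Multinomial coefficient n! / (parts[0]! * parts[1]! * ...)."""
    out, rest = 1, n
    for p in parts:
        out *= comb(rest, p)
        rest -= p
    return out

def score_distribution(r, c):
    """Exact score distribution. r = (r_plus, r_minus, r_zero): class counts of the
    randomized classification (rows); c = (c_plus, c_minus, c_zero): class counts of
    the fixed classification (columns). By symmetry the roles may be exchanged."""
    rp, rm, r0 = r
    cp, cm, c0 = c
    assert rp + rm + r0 == cp + cm + c0
    n_tot = multinomial(rp + rm + r0, r)   # total number of randomized classifications
    dist = defaultdict(Fraction)
    # Enumerate the upper-left 2x2 corner; all other entries are then determined.
    for npp in range(min(rp, cp) + 1):
        for npm in range(min(rp - npp, cm) + 1):
            for nmp in range(min(rm, cp - npp) + 1):
                for nmm in range(min(rm - nmp, cm - npm) + 1):
                    np0 = rp - npp - npm        # third column from row sums
                    nm0 = rm - nmp - nmm
                    n0p = cp - npp - nmp        # third row from column sums
                    n0m = cm - npm - nmm
                    n00 = r0 - n0p - n0m
                    if min(np0, nm0, n0p, n0m, n00) < 0:
                        continue
                    # Number of randomized classifications realizing this table:
                    ways = (multinomial(cp, (npp, nmp, n0p))
                            * multinomial(cm, (npm, nmm, n0m))
                            * multinomial(c0, (np0, nm0, n00)))
                    dist[npp + nmm - npm - nmp] += Fraction(ways, n_tot)
    return dict(dist)

dist = score_distribution((1, 1, 2), (2, 1, 1))
p_value = sum(p for s, p in dist.items() if s >= 2)   # right tail at score 2
```

Exact rational arithmetic keeps the example free of rounding concerns; the nested loops make the cost quartic in the class counts, which is precisely the inefficiency the algorithms below address.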

In the context of our illustrative example, these are the p-values given for hypotheses of interest in the "Ternary Dot Product p" column of the example table.

Algorithm

The Ternary Dot Product Distribution problem can be solved by computing each D-value individually over all feasible 2 × 2 corners, i.e., in time O(n^{4}) where n = max(r_{+}, r_{-}, c_{+}, c_{-}). While this complexity is acceptable for moderate values of n, it quickly becomes prohibitive as n grows.

Instead of computing all the D-values individually, we aggregate them over configurations with fixed k_{1} := n_{++} + n_{--} and k_{2} := n_{-+} + n_{+-}. This still makes it possible to group them by the score, which equals k_{1} - k_{2}. We can write the sum of all the D-values with fixed k_{1} and k_{2} in the form of a constant times

where the summation runs over hypergeometric-type terms whose arguments are linear combinations of the margins and of n_{++} and n_{--}. It turns out that only O(n^{3}) values of this sum are needed, and that they satisfy a recurrence computable in constant time per value, yielding an O(n^{3}) algorithm for our problem. (See Methods for a full description.)

This cubic algorithm is of theoretical interest but in practice requires exact arithmetic to obtain correct answers due to numerical instability (see Testing). We therefore developed a second algorithm that is both fast and practical, having the important advantage of working in floating-point arithmetic.

The key observation underlying our algorithm is that the vast majority of contingency tables are highly improbable (i.e., D[n_{++}, n_{+-}, n_{-+}, n_{--}]/N_{tot} ≪ 1) and thus may be safely ignored if we:

(a) need only carry out the computation to fixed precision; and

(b) do not care about the precise values of tail probabilities: it is enough to know that they are small.

Moreover, the quantities D[n_{±±}] follow an easily described law on certain families of contingency tables, thus allowing us to identify entire families of tables that can be discarded after a constant amount of computation.

Consider families of configurations in which the row and column sums of the upper-left 2 × 2 submatrix (the entries n_{±±}) are fixed. Denote these sums by a_{+}, a_{-} (rows) and b_{+}, b_{-} (columns), noting that as before, one constraint is redundant, as a_{+} + a_{-} = b_{+} + b_{-} =: m. It turns out that within each such family, D[n_{±±}] is maximized when the n_{±±} are distributed in proportion to the 2 × 2 row and column sums, i.e., n_{στ} ≈ a_{σ}b_{τ}/m

(with appropriate rounding), and moreover, the probability decreases monotonically as n_{++} is varied in either direction from the optimum. (See Methods for details and a proof.)
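This peak-then-decay behavior can be checked numerically in a stripped-down setting: with the 2 × 2 margins fixed, the number of ways to fill the corner is a hypergeometric-type weight in the single free entry, maximized at the proportional allocation and monotone on either side. A small self-contained Python check (parameter names are ours):

```python
from math import comb

# 2x2 margins: column sums cp, cm and first row sum rp (the second row sum
# is determined, since all margins share the total cp + cm).
cp, cm, rp = 30, 20, 25

# Hypergeometric-type weight of placing k items in the (+,+) cell.
w = {k: comb(cp, k) * comb(cm, rp - k)
     for k in range(max(0, rp - cm), min(rp, cp) + 1)}

mode = max(w, key=w.get)
assert mode == rp * cp // (cp + cm)   # proportional allocation: 25*30/50 = 15

ks = sorted(w)
i = ks.index(mode)
assert all(w[ks[j]] <= w[ks[j + 1]] for j in range(i))                 # rising to the mode
assert all(w[ks[j]] >= w[ks[j + 1]] for j in range(i, len(ks) - 1))    # falling after it
```

This unimodality is what licenses the early-exit strategy: once the weight drops below threshold on either side of the mode, no further entries of the family need be examined.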

Our algorithm thus proceeds as follows (see the pseudocode figure). First, compute the maximum probability D_{max} over all 3 × 3 contingency tables with row and column sums r_{σ}, c_{τ}. As in the 2 × 2 case just discussed, D_{max} is achieved when the entries are approximately proportional to the products of their row and column sums. Then iterate over the O(n^{3}) families of contingency tables with fixed upper-left 2 × 2 row and column sums a_{σ}, b_{τ}. For each such family, compute its maximum D_{fam} by setting n_{στ} ≈ a_{σ}b_{τ}/m with appropriate rounding. If D_{fam} is less than D_{max} times a chosen threshold factor (e.g., machine epsilon divided by n^{3}, though machine epsilon itself is likely sufficient for practical purposes), discard this family and proceed to the next one. Otherwise, the maximum probability for the family is non-negligible; in this case, iterate through the family upward and downward from the maximizing n_{++}, updating the aggregate probabilities of the scores s[n_{++}, n_{+-}, n_{-+}, n_{--}] obtained, until the D-values drop below the threshold times D_{max}.

Pseudocode for Ternary Dot Product algorithms

**Pseudocode for Ternary Dot Product algorithms**. Pseudocode for algorithms computing the Ternary Dot Product Distribution using thresholding on families of contingency tables.

In practice, very few 2 × 2 families are within threshold. In fact, the computation time is often governed by the O(n^{3}) initial threshold tests themselves (with fewer than n^{3} additional D-values typically requiring computation). We therefore also apply a coarser level of thresholding over "superfamilies" of tables in which only the row sums a_{σ} of the upper-left 2 × 2 submatrix are fixed, leaving two degrees of freedom. Each such superfamily is the union of a set of families we considered above, and as before, the maximal probability within a superfamily can be bounded with a constant amount of computation, allowing entire superfamilies to be discarded at once.

Testing

We tested our algorithms on a wide range of problem parameters and found that our thresholded algorithm achieves substantial speed gains across parameter distributions. Table

Run times for Ternary Dot Product Distribution algorithm

| **Problem size (n)** | **Quartic algorithm: compute all D-values** | **Thresholded algorithm** |
|----------------------|---------------------------------------------|---------------------------|
| 8    | 0.05 s    | 0.07 s  |
| 16   | 0.19 s    | 0.15 s  |
| 32   | 0.92 s    | 0.36 s  |
| 64   | 6.16 s    | 0.61 s  |
| 128  | 53.15 s   | 2.35 s  |
| 256  | 689.18 s  | 5.93 s  |
| 512  | 7864.20 s | 19.54 s |
| 1024 | > 1 d     | 85.76 s |

Run time comparison of the simple quartic Ternary Dot Product Distribution algorithm to the thresholded version for an increasing family of problems with (r_{+}, r_{-}, r_{0}, c_{+}, c_{-}) in the ratio (1, 1, 50, 2, 1), a typical usage scenario. Runs were performed on a 3.0 GHz Intel Xeon processor with 2 MB cache.

To further investigate the efficiency attained by thresholding, we computed counts of the numbers of D-values processed in two parameter regimes: one with r_{0} = 5r_{+} and one with r_{0} = 50r_{+}. The first case is relatively dense, i.e., a sizeable portion (around 30%) of the gene transcripts are significantly upregulated or downregulated. The second case is sparser; here, there are many more genes but only a few percent of them are found to be regulated. This latter case is typical in practice.

Computational complexity of Ternary Dot Product algorithms

**Computational complexity of Ternary Dot Product algorithms**. Counts of the numbers of D-values within the D_{max} threshold. The left panel shows a "dense" case, r_{0} = 5r_{+}, while the right panel shows a "sparse" case, r_{0} = 50r_{+}. For these examples we set r_{+} = r_{-} = c_{+} = c_{-} and chose a threshold factor of 10^{-16}.

The solid black curve in the figure shows the number of D-values within a given factor of D_{max}, thus placing a lower bound on the amount of work that any thresholding-based algorithm must perform. The disparity between the curves immediately demonstrates the reason our thresholding algorithms achieve speedup: only a tiny fraction of the D-values are non-negligible. In the dense case r_{0} = 5r_{+}, we see that 2 × 2 thresholding (Algorithm 1a) is probably already close to optimally efficient: the amount of work required to do the threshold checks (solid blue curve) is comparable to the total amount of work required to compute all relevant D-values. In the sparse case r_{0} = 50r_{+}, even performing 2 × 2 threshold checks leaves much room for improvement because the number of relevant D-values is far smaller; here the superfamily approach helps, reducing the overhead to O(n^{2}) 3 × 2 threshold checks (solid red line). For an analytical discussion of these phenomena and a proof that the 2 × 2 thresholding algorithm has complexity O(n^{3.5}), see Methods.

We have left our cubic algorithm out of the previous figures and discussion because, unfortunately, our tests showed that it is numerically unstable, at least in the form stated; we now briefly discuss this issue. While the cubic algorithm does yield the correct distribution when implemented in arbitrary-precision exact arithmetic, it fails when implemented in floating-point arithmetic because the range of values in the recurrence spans many orders of magnitude, so that intermediate terms cancel catastrophically.

Implementation

We implemented all of our algorithms in R. The thresholded algorithm performs O(n^{3}) floating-point operations, and using a stochastic model of rounding error, the total accumulated relative error is thus bounded by O(n^{3/2}) times machine epsilon. In practice machine epsilon is about 10^{-16}, so there is no concern.

The only caveat, as we noted initially, is that our algorithm guarantees precision relative to the maximum probability of all score values--not the probability of each particular score. In other words, very small tail probabilities are known only to the extent that they are understood to be negligible compared to probabilities from the bulk distribution; their precise values are not computed.

Causal Graph Randomization

We now turn to our second computational problem arising from statistical significance evaluation in causal graph models, that of graph randomization. We begin by defining the Causal Graph Randomization problem and placing it in context with previous work on graph randomization. We then explain the special challenges of randomizing a signed causal graph and present an algorithm that successfully overcomes these challenges in practice.

Problem definition

The basic statistical significance question motivating our study of graph randomization is the same as before: How likely is a given observation to have occurred by chance? In the preceding development we analyzed this question from the standpoint of randomizing the identities of gene transcripts classified as upregulated or downregulated in a gene expression assay; now we take the perspective of randomizing the causal graph itself. Note that the ability to efficiently sample randomized versions of the graph allows one to create an empirical distribution of any quantitative graph property of interest, in particular enabling p-value computation.

In our setup, we estimate the p-value of a hypothesis as the proportion of the randomized graphs with a better score for the hypothesis than the actual causal graph. This is the general context in which we computed the p-values listed in the "Causal Graph p" column of our earlier example table.

In order to obtain an appropriate null distribution on causal graphs, it is important to require that the randomized graphs share basic structural properties with the original causal graph, yet have enough flexibility to reflect the space of reasonable graphical models. We propose to fix the vertex set and to require the following properties of each randomized graph:

1. Vertex degrees. We require that each vertex retain the same positive and negative in- and out-degrees that it has in the original graph.

2. Simplicity. We disallow self-edges and parallel edges in the randomized graph.

3. Connectedness. We require that the randomized graph be connected.

Note that the first two properties are local and the third is global. These properties capture the most significant features of a causal graph and have also been the subject of previous study in the graph randomization literature

Challenges in causal graphs

In the case of undirected graphs, the randomization problem is typically solved by defining a Markov chain whose state space is the set of graphs satisfying the desired constraints and whose transitions are local rewiring operations: edge switches, which replace a pair of edges (a, b), (c, d) with (a, d), (c, b), and, in the directed case, triangle flips, which reverse the orientation of a directed 3-cycle. These operations preserve the degree sequence while moving through the state space.

In our situation, signed directed graphs, a natural generalization of the above randomization algorithm is to allow edge switches and triangle flips of same-sign edges. Such operations clearly preserve in- and out-degrees while modifying the edge structure of the graph, but unfortunately the sign requirement substantially constrains the set of possible transitions. We have identified several obstacles that can make parts of the state space unreachable.

Two obstacles to randomization of signed directed graphs

**Two obstacles to randomization of signed directed graphs**. A strong quadrilateral and a strong triangle. Solid lines indicate positive edges and dotted lines indicate negative edges.

The first one is the **strong quadrilateral**, shown on the left of the figure.

The second obstacle is the **strong triangle**, shown on the right of the figure, which cannot be flipped directly.

Now, while these examples show that in general it is impossible to produce all the graphs in the target family using same-sign edge switches and triangle flips alone, these obstacles can often be circumvented in practice by recruiting auxiliary edges from elsewhere in the graph. We illustrate this with a construction that flips a strong triangle.

Let the strong triangle have vertex set {a, b, c}, and suppose the graph contains three auxiliary edges (x_{1}, x_{2}), (y_{1}, y_{2}), (z_{1}, z_{2}) of suitable signs, pairwise disjoint and disjoint from {a, b, c}. The strong triangle can then be flipped by the following sequence of same-sign edge switches and triangle flips (illustrated in the figure):

Flipping a strong triangle using auxiliary edges

**Flipping a strong triangle using auxiliary edges**. The sequence of same-sign edge switches and triangle flips that flips a strong triangle: (1) Opening, (2) Flipping, (3) Closing, and (4) Restoring. Solid lines indicate positive edges and dotted lines indicate negative edges.

1. Opening: Switch each edge of the strong triangle with one of the auxiliary edges.

2. Flipping: Flip the resulting same-sign triangle.

3. Closing: Switch the auxiliary edges with the triangle edges once more.

4. Restoring: Perform two further edge switches to return the auxiliary edges to their original positions.

Algorithm

Given that causal graphs arising from biological networks are typically large and sparse, we expect that in practice the combination of same-sign edge switches and triangle flips suffices to overcome local obstacles to randomization, as observed above.

We thus propose the following algorithm for Causal Graph Randomization. Repeatedly perform the following procedure:

1. Pick two edges uniformly at random from the edge set

2. If the edges share no endpoints, perform an edge switch if it is legal; otherwise, restart.

3. If the edges share one endpoint and belong to a directed triangle, perform a triangle flip if it is legal; otherwise, restart.
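A minimal sketch of the same-sign edge switch underlying steps 1 and 2, using a hypothetical Python edge-set representation (the paper's implementation is in R); triangle flips and the periodic connectivity check required by the connectedness property are deliberately omitted:

```python
import random

def edge_switch(edges, signs, rng=random):
    """Attempt one same-sign edge switch on a signed directed graph.
    edges: set of (u, v) pairs; signs: dict mapping each edge to +1 or -1.
    Replaces (a, b), (c, d) of equal sign with (a, d), (c, b) when the result
    stays simple (no self-loops or parallel edges). Signed in- and out-degrees
    are preserved by construction. Returns True if a switch was performed."""
    (a, b), (c, d) = rng.sample(sorted(edges), 2)
    if signs[(a, b)] != signs[(c, d)]:
        return False                  # only same-sign switches preserve signed degrees
    if len({a, b, c, d}) < 4:
        return False                  # shared endpoint: would need a triangle flip
    if (a, d) in edges or (c, b) in edges:
        return False                  # would create a parallel edge
    s = signs[(a, b)]
    for e in ((a, b), (c, d)):
        edges.remove(e); del signs[e]
    for e in ((a, d), (c, b)):
        edges.add(e); signs[e] = s
    return True

# Tiny all-positive demo graph; tails and heads (hence degrees) are invariants.
random.seed(0)
edges = {(1, 2), (3, 4), (5, 6), (2, 5)}
signs = {e: 1 for e in edges}
attempts = sum(edge_switch(edges, signs) for _ in range(100))
```

Signed in- and out-degrees are invariants of every accepted switch, so no bookkeeping beyond the edge set and sign map is needed; connectivity, being global, must still be verified separately as described above.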

Note that in order for a transition to be legal, connectedness must be preserved (Property 3), which is a global property and thus slow to verify. To improve the efficiency of our algorithm, we therefore perform multiple iterations in between connectivity checks, adapting the check interval multiplicatively: each time a connectivity check succeeds, we increase the interval by a factor of 1 + δ_{+}; if it fails, we multiply it by 1 - δ_{+} and revert to the most recent verified graph.

An important final detail of the algorithm is the number of iterations to perform; this relates to the mixing time of the Markov chain. While the mixing times of chains arising from graph randomization are not known theoretically, a constant multiple of the number of edges is a common choice in practice.

Testing

We tested our algorithm on the causal graph studied in our companion paper

We also tabulated some statistics from an independent set of 79 runs with

Statistics from runs of Causal Graph Randomization algorithm

| **Structure** | **Occurrence rate** |
|----------------------|----------------|
| Strong quadrilateral | 3.76 × 10^{-4} |
| Flippable triangle   | 1.22 × 10^{-6} |
| Strong triangle      | 2.44 × 10^{-9} |

Rates of occurrence of local graph structures in 79 runs of the randomization algorithm on our test graph. A total of 5.3 billion iterations were performed during these runs.

Finally, we recorded the variation of the connectivity check interval

Implementation

We implemented our algorithm in R using the _{+ }≈ 0.131,

Discussion

Our work provides practical algorithms for assessing statistical significance in causal graphs but also raises a number of unresolved theoretical questions; we describe a few of them now.

In the Ternary Dot Product Distribution problem, we saw that the recursion used to obtain a cubic algorithm leads to cancellation of large, approximately equal numbers. This naturally brings up the following question: is the numerical instability an artifact of a poor setup of the recursion, or is it intrinsic to any recurrence computing this distribution?

Another open question is the precise computational complexity of our thresholding algorithm. In Methods we prove an O(n^{3.5}) bound on the complexity, but our empirical results suggest that this bound may not be tight. Settling the question amounts to understanding the number of values D[n_{±±}] that fall within a given multiplicative factor of D_{max}, as a function of the problem parameters.

Furthermore, it would be interesting to investigate the consequences of level stratification in regulatory networks in order to propose a more refined null model. While such a multilevel model may indeed provide more precise estimates of statistical significance, it would be much more challenging to estimate that significance and would likely require simulation rather than an analytic approach like the one in this paper.

In the Causal Graph Randomization problem, we saw that same-sign edge switches and triangle flips are insufficient to reach all possible random graphs in the state space.

On the other hand, in practical cases with large, sparse graphs, we showed that it is often possible to overcome local obstacles to randomization. This gives rise to the following question: Is there a lower bound on the size or upper bound on the edge density of the graph that would make same-sign edge switches and triangle flips sufficient?

An alternative approach to overcoming obstacles is to limit ourselves to edge switches and triangle flips, but allow several moves to be performed in sequence before the simplicity of the resulting graph is verified. Let _{s}(_{s}(^{2 }-- _{s}(

Finally, even in cases that Markov chains can be shown to generate all possible graph randomizations, their mixing time remains an open question. It is known that the Markov chain rapidly mixes in the case of

In some cases it may be possible to reduce the size of a causal graph, and thereby the resources required to solve the Causal Graph Randomization problem, by performing a transitive reduction of the graph. A transitive reduction of a graph is a minimal graph with the same transitive closure as the original graph (so a transitive reduction does not contain any edges between vertices that are connected by a different path in the graph). Transitive reduction has been successfully used in computational biology
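As a sketch of this preprocessing idea, a textbook transitive reduction for a small acyclic graph takes only a few lines; the Python below uses our own helper names, and a production version for causal graphs would additionally need to respect edge signs:

```python
def transitive_reduction(adj):
    """Remove each edge (u, v) for which an alternative u -> v path exists.
    adj: dict mapping vertex -> set of successors; assumed acyclic, so the
    reduction is unique and edges can be removed one at a time."""
    def reachable(src, dst, skip_edge):
        # Depth-first search from src to dst, ignoring the edge under test.
        stack, seen = [src], {src}
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if (u, w) == skip_edge or w in seen:
                    continue
                if w == dst:
                    return True
                seen.add(w)
                stack.append(w)
        return False

    for u in list(adj):
        for v in list(adj[u]):
            if reachable(u, v, (u, v)):
                adj[u].discard(v)   # (u, v) is implied by another path
    return adj

# The shortcut edge 1 -> 4 is implied by 1 -> 2 -> 4 and gets removed.
reduced = transitive_reduction({1: {2, 4}, 2: {4}, 4: set()})
```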

Conclusions

This paper presents the first systematic attempt at addressing the computational challenges that arise in the evaluation of the significance of results produced by a causal graph-based model. We develop two algorithms for the Ternary Dot Product Distribution problem and one algorithm for the Causal Graph Randomization problem. All the algorithms are implemented in the statistical computing language R and available on request for academic purposes. We believe that our work opens the door to further study of causal graphs from both a theoretical and practical perspective, and we hope that these algorithms will enable the integration of statistical significance computations into causal graph-related methods in biology and other areas of science.

Methods

Quartic algorithm for Ternary Dot Product Distribution

The Ternary Dot Product Distribution problem can be solved with a simple algorithm using the following relationships between the

where

This algorithm can be made numerically stable by computing an initial normalized value

Cubic algorithm for Ternary Dot Product Distribution

Setting k_{1} := n_{++} + n_{--} and k_{2} := n_{-+} + n_{+-}, with n_{++} and n_{+-} as the remaining free variables, we rewrite D[n_{±±}] as

where _{+ }:= _{+ }- (_{- }:= _{- }- (_{1 }+ _{2}) + (_{0 }:= _{1 }+ _{2}). By rearranging the factorials, we can further rewrite this expression as

where

Note that the product above only depends on _{1}, _{2}, _{1}, _{2}, _{t }_{1}, _{2},

Let us now define

where we made the following substitutions to simplify the previous expression: _{2}, _{+ }+ _{- }- _{1}, _{+ }- _{+ }+ _{- }- _{1}, _{- }- _{1 }+

By using the WZ algorithm

where the coefficients of the polynomial multipliers are given in Additional File

**Recurrence relation for Ternary Dot Product Distribution cubic algorithm**. Details of recurrence relation for


Practical algorithm for Ternary Dot Product: Mathematical details and ^{3.5}) complexity bound

Consider families of contingency matrices in which the row and column sums of the upper-left 2 × 2 submatrix (the entries n_{±±}) are fixed. Denote these sums by a_{+}, a_{-} (rows) and b_{+}, b_{-} (columns), noting that as before, one constraint is redundant, as a_{+} + a_{-} = b_{+} + b_{-} =: m.

Within each such family, the values of n_{0+}, n_{0-}, n_{+0}, n_{-0}, and n_{00} are determined by a_{+}, a_{-}, b_{+}, b_{-} (together with the overall row and column sums) and are thus independent of the free 2 × 2 entries.

Explicitly, the proportionality constant depends only on the family, and D[n_{±±}] is maximized when the n_{±±} are distributed in proportion to the 2 × 2 row and column sums, i.e., n_{στ} ≈ a_{σ}b_{τ}/m

(with appropriate rounding); moreover, the probability decreases monotonically as n_{++} moves away from this optimum in either direction.

The numerator and denominator of this ratio are both monic quadratics in n_{++}.

We now provide an argument that our algorithm performs no more than O(n^{3.5}) work, establishing the claimed bound on the complexity of the overall algorithm. Denote by n_{opt} ≈ a_{+}b_{+}/m the value of n_{++} maximizing the probability within a family.

As _{opt}, observe that the terms in the numerator of Δ

(In fact, it is not hard to see that all four terms contribute such factors, but for the purpose of asymptotics our bounds need not be tight.) Chaining these bounds together,

from which it follows that the

Authors' contributions

LC and PL developed and tested the methods and drafted the manuscript. LC implemented the algorithms in R and reviewed the literature. AE prepared the illustrative example application. BB participated in the design and coordination of the project and helped draft the manuscript. DZ conceived of the project, participated in its design and coordination, and helped draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank Amy Rossman for assistance creating the illustration of a strong triangle flip using auxiliary edges and the three anonymous reviewers for many helpful suggestions that improved the clarity of this manuscript. PL was supported by an NSF Graduate Research Fellowship. BB was supported by NIH grant GM081871.