Department of Biomedical Informatics, Columbia University, New York, NY 10032

Joint Centers for Systems Biology, Columbia University, New York, NY 10032

Institute for Cancer Genetics, Columbia University, New York, NY 10032

Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY 10032

IBM T.J. Watson Research Center, Yorktown Heights, NY 10598

Abstract

Background

Elucidating gene regulatory networks is crucial for understanding normal cell physiology and complex pathologic phenotypes. Existing computational methods for the genome-wide "reverse engineering" of such networks have been successful only for lower eukaryotes with simple genomes. Here we present

Results

We prove that ARACNE reconstructs the network exactly (asymptotically) if the effect of loops in the network topology is negligible, and we show that the algorithm works well in practice, even in the presence of numerous loops and complex topologies. We assess ARACNE's ability to reconstruct transcriptional regulatory networks using both a realistic synthetic dataset and a microarray dataset from human B cells. On synthetic datasets ARACNE achieves very low error rates and outperforms established methods, such as Relevance Networks and Bayesian Networks. Application to the deconvolution of genetic networks in human B cells demonstrates ARACNE's ability to infer validated transcriptional targets of the c-MYC proto-oncogene. We also study the effects of misestimation of mutual information on network reconstruction, and show that algorithms based on mutual information ranking are more resilient to estimation errors.

Conclusion

ARACNE shows promise in identifying direct transcriptional interactions in mammalian cellular networks, a problem that has challenged existing reverse engineering algorithms. This approach should enhance our ability to use microarray data to elucidate functional mechanisms that underlie cellular processes and to identify molecular targets of pharmacological compounds in mammalian cellular networks.

Background

Cellular phenotypes are determined by the dynamical activity of large networks of co-regulated genes. Thus dissecting the mechanisms of phenotypic selection requires elucidating the functions of the individual genes in the context of the networks in which they operate. Because gene expression is regulated by proteins, which are themselves gene products, statistical associations between gene mRNA abundance levels, while not directly proportional to activated protein concentrations, should provide clues towards uncovering gene regulatory mechanisms. Consequently, the advent of high throughput microarray technologies to simultaneously measure mRNA abundance levels across an entire genome has spawned much research aimed at using these data to construct conceptual "gene network" models to concisely describe the regulatory influences that genes exert on each other.

Genome-wide clustering of gene expression profiles

Within the last few years a number of sophisticated approaches for the reverse engineering of cellular networks (also called deconvolution) from gene expression data have emerged (reviewed in

Here we introduce

Theoretical Background

Several factors have impeded the reliable reconstruction of genome-wide mammalian networks. First, temporal gene expression data is difficult to obtain for higher eukaryotes, and cellular populations harvested from different individuals generally capture random steady states of the underlying biochemical dynamics. This precludes the use of methods that infer temporal associations and thus plausible causal relationships (reviewed in _{i}}),

where _{i}}) is the

Note that Eq. (1) does not define the potentials uniquely, and additional constraints are needed to avoid the ambiguity (see Appendix B). A reasonable approach is to specify _{1},..., _{N}) consistent with known marginals, so that constraining an

Approximations of the interaction structure

Since typical microarray sample sizes are relatively small, inferring the exponential number of potential _{i}}) = ∑_{i}), such that first-order potentials can be evaluated from the marginal probabilities, _{i}), which are estimated from experimental observations. As more data become available we should be able to reliably estimate higher order marginals and incorporate the corresponding potentials progressively, such that for _{i}, _{j}, _{k}) requires about an order of magnitude more samples. Thus the current version of ARACNE truncates Eq. (1) at the pairwise interactions level, _{ij }= 0 are declared mutually non-interacting. This includes genes that are statistically independent (i.e., _{i}, _{j}) ≈ _{i})_{j})), as well as genes that do not interact directly but are statistically dependent due to their interaction via other genes (i.e. _{i}, _{j}) ≠ _{i})_{j}), but _{ij }= 0). We note that _{i}, _{j}) = _{i})_{j}) is not a sufficient condition for _{ij }= 0. We discuss this below.

Since the number of potential pairwise interactions is quadratic in

Algorithm

Within the assumption of a two-way network, all statistical dependencies can be inferred from pairwise marginals, and no higher order analysis is needed. Although this is not always the case for biological networks, it is important to understand whether this assumption may allow the inference of a subset of the true interactions with fewer false positives. Thus we identify candidate interactions by estimating pairwise gene expression profile mutual information, I(g_{i}, g_{j}) ≡ I_{ij}, an information-theoretic measure of relatedness that is zero if and only if P(g_{i}, g_{j}) = P(g_{i})P(g_{j}). We then filter MIs using an appropriate threshold, I_{0}, computed for a specific p-value, p_{0}, under the null hypothesis of two independent genes. This step is basically equivalent to the Relevance Networks method

Thus in its second step, ARACNE removes the vast majority of indirect candidate interactions (φ_{ij} = 0) using a well-known information theoretic property, the data processing inequality (DPI, discussed in detail later), that has not been previously applied to the reverse engineering of genetic networks.

Mutual Information

_{i}) = _{i}) is the probability of each discrete state (value) of the variable (in this work, logarithms are natural). For continuous variables the entropy is infinite but the MI remains well defined and can be computed by replacing
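For discrete variables, the entropy-based definition of MI can be illustrated with a short sketch. It assumes the standard identity I(x, y) = S(x) + S(y) - S(x, y) with natural logarithms; the function names `entropy` and `mutual_information` are illustrative and not part of ARACNE.

```python
import math
from collections import Counter

def entropy(states):
    """Shannon entropy, in nats, of a sequence of discrete states."""
    n = len(states)
    return -sum((c / n) * math.log(c / n) for c in Counter(states).values())

def mutual_information(x, y):
    """I(x, y) = S(x) + S(y) - S(x, y), in nats."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

x = [0, 0, 1, 1]
print(mutual_information(x, x))             # fully dependent pair: equals S(x)
print(mutual_information(x, [0, 1, 0, 1]))  # empirically independent pair
```

In this toy sample a variable shares all of its entropy with itself, while the second pair has a joint distribution that factorizes exactly, so its MI vanishes.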

MI Estimation

We estimate MI using a computationally efficient Gaussian Kernel estimator _{i}, _{i}},

Since MI is reparameterization invariant, we copula-transform (i.e., rank-order)

For a spatially uniform ^{2}_{ij }≥ _{0}, where _{0 }is the statistical significance threshold. Similarly, the DPI (see below) only requires ranking the MIs.
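A minimal sketch of a Gaussian kernel MI estimate on copula-transformed data is given below. It is written under stated assumptions rather than as the estimator used in this work: the same kernel width `h` is used for the marginal and joint densities, the ranks are rescaled to (0, 1), and the names `copula_transform` and `kernel_mi` are hypothetical.

```python
import math

def copula_transform(values):
    """Rank-order a sample and rescale the ranks into (0, 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = (rank + 0.5) / len(values)
    return ranks

def kernel_mi(x, y, h=0.1):
    """Gaussian-kernel MI estimate in nats on copula-transformed samples."""
    u, v = copula_transform(x), copula_transform(y)
    m = len(u)
    total = 0.0
    for i in range(m):
        # Unnormalized kernel weights; the Gaussian prefactors of the joint
        # and marginal density estimates cancel in the ratio below.
        kx = [math.exp(-(u[i] - u[j]) ** 2 / (2 * h * h)) for j in range(m)]
        ky = [math.exp(-(v[i] - v[j]) ** 2 / (2 * h * h)) for j in range(m)]
        joint = sum(a * b for a, b in zip(kx, ky)) / m
        total += math.log(joint / ((sum(kx) / m) * (sum(ky) / m)))
    return total / m
```

Because only ranks enter the computation, any strictly monotone transformation of either variable leaves the estimate unchanged, which is the reparameterization invariance exploited above.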

Producing reliable estimates of the MI ranks is an easier task. From the work on estimation of MI for discrete variables

**MI and MI rank estimation errors for varying Gaussian kernel widths**. The mean absolute percent error in estimating mutual information for bivariate normal densities is compared to the percent of errors in ranking the relative mutual information values for randomly sampled pairs for which the distribution with the lower true MI value is between 70% and 99% of the distribution with the higher value. MI estimation error (dashed blue line) is highly sensitive to the choice of Gaussian kernel width used by the estimator and grows rapidly for non-optimal parameter choices. However, due to similar bias for distributions with close MI values, the error in ranking pairs of MIs (solid red line) is much less sensitive to the choice of this parameter. These averages were produced using samples from 1,000 bivariate normal densities with a random uniformly distributed correlation coefficient

Statistical Threshold for Mutual Information

Since MI is always non-negative, its evaluation from random samples gives a positive value even for variables that are, in fact, mutually independent. Therefore, we eliminate all edges for which the null hypothesis of mutually independent genes cannot be ruled out. To this extent, we randomly shuffle the expression of genes across the various microarray profiles, similar to _{0}, by empirically estimating the fraction of the estimates below _{0}. This is done for different sample sizes ^{5 }gene pairs so that reliable estimates of _{0}(^{-4}. Extrapolation to smaller p-values is done using
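The shuffling procedure can be sketched for a single gene pair as follows. This is an illustration under assumptions, not the Monte Carlo protocol used here: `binned_mi` is a coarse stand-in estimator (equal-frequency binning of ranks), `mi_threshold` is a hypothetical helper, and the threshold is read directly from the empirical null distribution without any extrapolation to smaller p-values.

```python
import math
import random
from collections import Counter

def binned_mi(x, y, bins=4):
    """Stand-in MI estimator (nats): equal-frequency binning of the ranks."""
    def bin_ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        out = [0] * len(v)
        for rank, i in enumerate(order):
            out[i] = rank * bins // len(v)
        return out
    bx, by = bin_ranks(x), bin_ranks(y)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mi_threshold(x, y, p0, n_shuffles=1000, mi=binned_mi, seed=0):
    """Null MI value exceeded by at most a fraction p0 of the shuffled estimates."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_shuffles):
        shuffled = list(y)       # shuffling one profile breaks any dependence
        rng.shuffle(shuffled)
        null.append(mi(x, shuffled))
    null.sort()
    # Index of the (1 - p0) empirical quantile of the null distribution.
    k = min(len(null) - 1, math.ceil((1 - p0) * len(null)))
    return null[k]
```

An MI estimate for the unshuffled pair that exceeds the returned value would be retained as a candidate interaction at significance level `p0`.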

Determination of mutual information statistical significance. P-values are assigned to MI thresholds using a Monte Carlo simulation for different kernel widths, sample sizes (^{5 }gene pairs so that reliable estimates are produced up to ^{-4 }(solid lines). Extrapolation to smaller p-values is done using

Data Processing Inequality

The DPI (see the figure "Examples of the data processing inequality") states that if genes g_{1} and g_{3} interact only through a third gene, g_{2} (i.e., if the interaction network is g_{1} ↔ ... ↔ g_{2} ↔ ... ↔ g_{3} and no alternative path exists between g_{1} and g_{3}), then

**Examples of the data processing inequality**. **(a)** g_{1}, g_{2}, g_{3}, and g_{4} are connected in a linear chain relationship. Although all six gene pairs will likely have enriched mutual information, the DPI will infer the most likely path of information flow. For example, g_{1} ↔ g_{3} will be eliminated because I(g_{1}, g_{2}) > I(g_{1}, g_{3}) and I(g_{2}, g_{3}) > I(g_{1}, g_{3}). g_{2} ↔ g_{4} will be eliminated because I(g_{2}, g_{3}) > I(g_{2}, g_{4}) and I(g_{3}, g_{4}) > I(g_{2}, g_{4}). g_{1} ↔ g_{4} will be eliminated in two ways: first, because I(g_{1}, g_{2}) > I(g_{1}, g_{4}) and I(g_{2}, g_{4}) > I(g_{1}, g_{4}), and then because I(g_{1}, g_{3}) > I(g_{1}, g_{4}) and I(g_{3}, g_{4}) > I(g_{1}, g_{4}). **(b)** If the underlying interactions form a tree (and MI can be measured without errors), ARACNE will reconstruct the network exactly by removing all false candidate interactions (dashed blue lines) and retaining all true interactions (solid black lines).

I(g_{1}, g_{3}) ≤ min [I(g_{1}, g_{2}); I(g_{2}, g_{3})]. (3)

Thus the least of the three MIs can come from indirect interactions only, and checking against the DPI may identify those gene pairs for which φ_{ij} = 0 even though P(g_{i}, g_{j}) ≠ P(g_{i})P(g_{j}). Correspondingly, ARACNE starts with a network graph where each I_{ij} > I_{0} is represented by an edge (ij). The algorithm then examines each gene triplet for which all three MIs are greater than I_{0} and removes the edge with the smallest value. Each triplet is analyzed irrespective of whether its edges have been marked for removal by prior DPI applications to different triplets. Thus the network reconstructed by the algorithm is independent of the order in which the triplets are examined.
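The pruning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `aracne_prune`, the list-of-lists MI matrix, and the treatment of ties (tied edges are kept) are assumptions, and the argument `eps` anticipates the DPI tolerance discussed later in this section.

```python
from itertools import combinations

def aracne_prune(mi, i0=0.0, eps=0.0):
    """Return the edges (i, j), i < j, that survive thresholding and the DPI.

    mi  -- symmetric matrix (list of lists) of pairwise MI values
    i0  -- MI significance threshold; only edges with MI > i0 are candidates
    eps -- DPI tolerance in [0, 1); eps = 0 applies the DPI strictly
    """
    n = len(mi)
    candidates = {(i, j) for i, j in combinations(range(n), 2) if mi[i][j] > i0}
    removed = set()
    for triplet in combinations(range(n), 3):
        edges = list(combinations(triplet, 2))
        # The DPI is applied only to triplets whose three MIs all exceed i0.
        if not all(e in candidates for e in edges):
            continue
        weakest = min(edges, key=lambda e: mi[e[0]][e[1]])
        others = [mi[a][b] for a, b in edges if (a, b) != weakest]
        if mi[weakest[0]][weakest[1]] < min(others) * (1 - eps):
            removed.add(weakest)  # mark only; later triplets still see this edge
    return candidates - removed

# Four genes in a chain 0 - 1 - 2 - 3, with MI decaying along the chain.
chain = [[0.0, 0.9, 0.5, 0.2],
         [0.9, 0.0, 0.8, 0.4],
         [0.5, 0.8, 0.0, 0.7],
         [0.2, 0.4, 0.7, 0.0]]
print(sorted(aracne_prune(chain)))
```

Edges are only marked during the loop and subtracted at the end, so the result does not depend on the order in which triplets are visited.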

Since this approach focuses only on the reconstruction of pairwise interaction networks, a pair of mutually independent genes, I_{ij} < I_{0}, will never be connected by an edge. Therefore, interactions represented by higher-order potentials for which the corresponding pairwise potentials are zero will not be recovered (see Discussion). Additionally, even for a second order interaction network, one may imagine a situation where the effect of a direct interaction is exactly cancelled out by indirect interactions through other node(s), resulting in φ_{ij} ≠ 0 and P(g_{i}, g_{j}) ≈ P(g_{i})P(g_{j}). This situation will not be identified by ARACNE. However, we believe that such precise cancellation is biologically unrealistic, and the following theorems specify conditions under which ARACNE will reconstruct the network exactly. Proofs of all theorems can be found in Appendix A.

Theorem 1

If MIs can be estimated with no errors, then ARACNE reconstructs the underlying interaction network exactly, provided this network is a tree and has only pairwise interactions.

However, unlike standard tree reconstruction methods (e.g. Chow and Liu

Theorem 2

The Chow-Liu (CL) maximum mutual information tree is a subnetwork of the network reconstructed by ARACNE.
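Theorem 2 can be checked numerically on small examples. The sketch below is an assumption-laden illustration: random symmetric weights stand in for error-free MI values, no MI threshold or tolerance is used, the maximum MI tree is built with Kruskal's algorithm, and all function names are hypothetical.

```python
import random
from itertools import combinations

def dpi_network(mi):
    """Edges kept after strict DPI pruning of every triplet (no MI threshold)."""
    n = len(mi)
    removed = set()
    for triplet in combinations(range(n), 3):
        edges = list(combinations(triplet, 2))
        removed.add(min(edges, key=lambda e: mi[e[0]][e[1]]))
    return set(combinations(range(n), 2)) - removed

def max_mi_tree(mi):
    """Maximum-weight spanning tree via Kruskal's algorithm with union-find."""
    n = len(mi)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    tree = set()
    for i, j in sorted(combinations(range(n), 2), key=lambda e: -mi[e[0]][e[1]]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.add((i, j))
    return tree

def random_mi(n, seed):
    """Symmetric matrix of random weights standing in for MI values."""
    rng = random.Random(seed)
    mi = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        mi[i][j] = mi[j][i] = rng.random()
    return mi

mi = random_mi(8, seed=1)
print(max_mi_tree(mi) <= dpi_network(mi))  # subset test on edge sets
```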

Theorem 3

Let _{ik }be the set of nodes forming the shortest path in the network between nodes _{ik}, _{ij }≥ _{ik}. Further, ARACNE does not produce any false negatives, and the network reconstruction is exact _{ij }≥ min(_{jk}, _{ik}).

Tree networks satisfy all conditions of Theorem 3, while topologies containing loops may or may not. In particular, networks with three-gene loops definitely violate (c) [but may still satisfy (a) and (b)], and

Finally, to minimize the impact of the variance of the MI estimator, a tolerance, _{ij }≤ _{ik}(1 -

Prediction errors as a function of DPI tolerance. The number of inferred errors, _{FP }+ _{FN}, are plotted as a function of the DPI tolerance, **(a) **the Erdös-Rényi and **(b) **the scale-free topologies. Raising ^{-4 }and a synthetic microarray size of 1,000.

Algorithmic Complexity

Because for a network of ^{3 }+ ^{2}^{2}), where ^{2}^{2}). As a result, ARACNE can efficiently analyze networks with tens of thousands of genes.

Results

We study ARACNE's performance in reconstructing a class of synthetic networks proposed by

Comparative Algorithms

A _{1},..., _{n}}, and whose edges correspond to parent-child dependencies among variables; see

Synthetic Networks

Networks Specification

We benchmark the three algorithms using synthetic transcriptional networks proposed by Mendes et al. ^{-γ }with

**Topology of the 100 gene regulatory networks proposed by Mendes**. Blue/red edges correspond to activation/inhibition. For the Erdös-Rényi topology **(a) **each gene is equally likely to be connected to every other gene, while the scale-free topology **(b) **is characterized by large interaction hubs with many connections.

The Mendes models use a multiplicative Hill kinetics to approximate transcriptional interactions:

where _{i }is the concentration (expression) of the _{I }and _{A }are the number of upstream inhibitors and activators respectively, and their concentrations are _{j }and _{l}. All other parameters are specified in

We obtain synthetic expression values of each gene _{i }in each microarray _{k }by simulating its dynamics until the system relaxes to a steady state _{i }= _{k,i }_{i }= _{k,i }_{k,i}, _{k,i }are random variables uniformly distributed in [0.0, 2.0]. Note that _{k,i }~ 0.0 corresponds to a gene knock-out, while _{k,i }~ 2.0 is a 2 fold increase in the synthesis rate. This parameter randomization models the sampling of a population of distinct cellular phenotypes at random time points (at or close to equilibrium), as is the case for the B cell experiments described later, where the efficiency of individual biochemical reactions may be different from assay to assay due to differences in temperature, nutrients, genetic mutations, etc. Although this model is a clear simplification of real biological networks, it forms a reasonably complex interaction network that captures some elements of transcriptional regulation, and an algorithm that does not perform well on this model is unlikely to perform well in a more complex case. Within this model, an interaction is unambiguously defined as a direct regulatory effect of one gene on another. Thus the performance of reverse engineering algorithms can be studied by comparing the inferred statistical interactions to the direct interactions in the model. We specifically note that, to our knowledge, this is the first attempt to benchmark network reverse engineering algorithms based on published objective criteria.

Performance metrics

Since genetic networks are sparse, potential false positives (N_{FP}), that is, identification of an irreducible statistical interaction between two genes not connected by a direct regulatory link, far exceed potential true positives (N_{TP}). Specificity, N_{TN}/(N_{FP} + N_{TN}), which is typically used in ROC analysis, is inappropriate, as even a small deviation from a value of 1 will result in large false positive numbers. Therefore, we choose two closely related metrics, precision and recall. Recall, N_{TP}/(N_{TP} + N_{FN}), indicates the fraction of true interactions correctly inferred by the algorithm, while precision, N_{TP}/(N_{TP} + N_{FP}), measures the fraction of true interactions among all inferred ones. Note that precision corresponds to the expected success rate in the experimental validation of predicted interactions. Performance will thus be assessed using Precision-Recall Curves (PRCs). PRCs for ARACNE and RNs are generated by adjusting the p-value or, equivalently, the MI threshold. ARACNE's PRC does not extend to 100% recall since the DPI eliminates some interactions even at p_{0} = 1. To reach the 100% recall, the DPI tolerance,
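Precision and recall for an inferred undirected network can be computed as in the sketch below. The function name `precision_recall` and the convention of returning 1.0 when a denominator is empty are assumptions made for illustration.

```python
def precision_recall(inferred, true):
    """Precision and recall of inferred undirected edges against true edges."""
    inferred = {frozenset(e) for e in inferred}  # (i, j) and (j, i) coincide
    true = {frozenset(e) for e in true}
    tp = len(inferred & true)
    fp = len(inferred - true)
    fn = len(true - inferred)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

inferred_edges = [(0, 1), (1, 2), (0, 3)]
true_edges = [(1, 0), (1, 2), (2, 3), (3, 4)]
print(precision_recall(inferred_edges, true_edges))
```

Sweeping the MI threshold and recomputing the pair at each value traces out a precision-recall curve.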

Performance Evaluation

As shown in Figure _{0 }~ 10^{-4}, exactly where we would expect the algorithm to begin inferring large numbers of non-statistically significant interactions for a network of this size. This suggests that a sensible value for the MI threshold, producing a near optimal result, can be selected

**Precision vs. Recall for 1,000 samples generated from the Mendes network**. **(a)** Erdös-Rényi network topology. **(b)** Scale-free topology. ARACNE's PRCs are consistently better than those of the other algorithms, and the precision reaches ~100% while maintaining high recall. Points on the PRCs for ARACNE and RNs corresponding to p_{0} = 10^{-4} (the value yielding <0.5 expected false positives for 4,950 potential interactions) are indicated with arrows.

ARACNE's high performance can be better understood by analyzing the distribution of MIs as a function of the length of the shortest path connecting each gene pair (degree of connectivity). ARACNE depends on MI being enriched for directly interacting genes and decreasing rapidly with this distance. Figure

**Distribution of mutual information for different lengths of the shortest path between genes for the scale-free topology**. Here we plot the log of the empirical probability that MI for a given separation between genes is above some value (in nats) marked on the horizontal axis. High MI values are significantly more probable for closer genes. The statistical significance threshold of 10^{-4} for the background MI distribution, corresponding to I_{0} = 0.0175 nats, is marked on the graph. As shown, this threshold retains a large number of indirect candidate interactions, and there is no threshold that would be able to separate indirect and direct interactions; a threshold that eliminates most of the former (red arrows) also eliminates the majority of the latter. This severely degrades performance of RNs. (Inset) Expanded log-log view of the MI distribution for 934 gene pairs with 3 or more intermediaries and the background distribution computed by Monte Carlo. The curves are virtually indistinguishable, indicating that the background distribution can be used to obtain reliable estimates of statistical significance thresholds for filtering genes with higher degrees of connectivity. Similar results apply for the Erdös-Rényi topology (see

MI distribution for different shortest path lengths for the Erdös-Rényi topology. Red and black arrows are explained in the legend of Figure

Recovery for varying numbers of samples generated from the Mendes networks, which contain an average of ~194 true interactions after self-loops and bidirectional edges are eliminated.

**Erdös-Rényi Topology**

| Num samples | ARACNE TP | ARACNE FP | Relevance Networks TP | Relevance Networks FP | DPI Sensitivity | DPI Precision | Bayesian Networks TP | Bayesian Networks FP |
|---|---|---|---|---|---|---|---|---|
| | 128.00 | 1.33 | 143.33 | 462.67 | 99.71% | 96.78% | 50.00 | 32.33 |
| | 124.33 | 2.67 | 139.33 | 411.00 | 99.35% | 96.46% | 45.33 | 31.00 |
| | 119.00 | 1.67 | 130.67 | 311.33 | 99.46% | 96.37% | 41.00 | 29.00 |
| | 101.00 | 4.67 | 110.00 | 182.33 | 97.44% | 95.18% | 24.67 | 25.33 |
| | 81.00 | 4.67 | 84.67 | 95.00 | 95.09% | 96.10% | 5.33 | 19.00 |

**Scale-Free Topology**

| Num samples | ARACNE TP | ARACNE FP | Relevance Networks TP | Relevance Networks FP | DPI Sensitivity | DPI Precision | Bayesian Networks TP | Bayesian Networks FP |
|---|---|---|---|---|---|---|---|---|
| | 97.67 | 2.33 | 113.33 | 234.00 | 99.00% | 93.67% | 38.67 | 17.00 |
| | 90.67 | 3.33 | 103.00 | 200.00 | 98.33% | 94.10% | 33.33 | 15.33 |
| | 80.33 | 5.33 | 91.67 | 154.67 | 96.55% | 92.95% | 27.00 | 13.33 |
| | 63.33 | 7.67 | 70.00 | 80.00 | 90.42% | 91.56% | 9.00 | 9.67 |
| | 46.33 | 3.67 | 48.00 | 49.67 | 92.62% | 96.50% | 4.00 | 6.00 |

For all sample sizes ARACNE efficiently eliminates almost all false candidate interactions inferred by RNs, as indicated by the DPI sensitivity (calculated as the percent of false positives eliminated by the DPI), with minimal reduction in true positives, as indicated by the DPI precision (calculated as the percent of false positives removed out of the total number of edges removed by the DPI). Moreover, as the sample size decreases, the number of true connections inferred by ARACNE decays gracefully while the number of false positives remains very low, whereas the performance of Bayesian Networks degrades rapidly for smaller sample sizes as the conditional probability tables become very sparsely populated. Results are calculated using a p-value of 10^{-4} for ARACNE and Relevance Networks, yielding <0.5 expected false positives for 4,950 potential interactions, and using a Dirichlet prior with equivalent sample size of one for Bayesian Networks [19]. Results are averaged over three network configurations for each topology.
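The two DPI statistics in the table follow directly from the ARACNE and Relevance Networks counts. The sketch below recomputes them for the first row of Erdös-Rényi values as a consistency check; the function names are illustrative, and the formulas are transcribed from the definitions given in the caption.

```python
def dpi_sensitivity(rn_fp, aracne_fp):
    """Percent of Relevance Networks false positives eliminated by the DPI."""
    return 100 * (rn_fp - aracne_fp) / rn_fp

def dpi_precision(rn_tp, rn_fp, aracne_tp, aracne_fp):
    """Percent of the edges removed by the DPI that were false positives."""
    fp_removed = rn_fp - aracne_fp
    tp_removed = rn_tp - aracne_tp
    return 100 * fp_removed / (fp_removed + tp_removed)

# First Erdös-Rényi row: ARACNE TP/FP = 128.00/1.33, RN TP/FP = 143.33/462.67.
print(round(dpi_sensitivity(462.67, 1.33), 2))                 # 99.71
print(round(dpi_precision(143.33, 462.67, 128.00, 1.33), 2))   # 96.78
```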

In summary, ARACNE appears to (a) achieve very high precision and substantial recall, even for few data points (125), (b) allow an optimal choice of the parameters h (Gaussian kernel width; see the figure on synthetic network reconstruction errors for varying Gaussian kernel widths) and p_{0} (statistical threshold), (c) be quite stable with respect to the choice of parameters, and (d) produce robust reconstruction of complex topologies containing many loops.

**Synthetic network reconstruction errors for varying Gaussian kernel widths**. The total number of inferred errors (N_{FP} + N_{FN}) in reconstructing the Mendes networks is stable with respect to choice of estimator kernel width, validating the observation that rankings of MIs are more stable than the MI estimates with respect to changes in this parameter (**Figure 1**). The choice of kernel width for each number of samples that minimizes the mean absolute MI estimation error for bivariate Gaussian densities (indicated with diamonds) yields optimal or near optimal reconstruction of this network for all sample sizes. Results are calculated for a statistical significance threshold of 10^{-4} for the scale-free network topology.

Human B Cells

Although large gene expression datasets such as those derived from systematic perturbations to simple organisms (e.g.,

This dataset was deconvoluted using ARACNE to generate a B cell specific regulatory network consisting of approximately 129,000 interactions. Since the c-MYC proto-oncogene emerges as one of the top 5% largest cellular hubs in the complete network and is extensively characterized in the literature as a transcription factor, we performed a first validation of the overall network quality by comparing its interactions inferred by our method with those previously identified by biochemical methods. The ^{-23 }by ^{2 }test) with respect to the expected 11% of background c-MYC targets among randomly selected genes

Discussion

ARACNE, which is motivated by statistical mechanics and based on an information theoretic approach, provides a provably exact network reconstruction under a controlled set of approximations. While we have shown that these approximations are reasonable even for complex mammalian gene networks, they may cause the algorithm to fail for some control structures. First, ARACNE will open all three-gene loops along the weakest interaction, and therefore introduce false negatives for triplets of interacting genes (although some may be preserved when a nonzero DPI threshold is used). Improvements to the algorithm are being investigated to address this condition. Second, by truncating Eq. (1) at the pairwise interactions, ARACNE will not infer statistical dependencies that are not expressed as pairwise interaction potentials (such as an XOR Boolean table for which MI between any gene pair is zero). By expanding Eq. (1) to include third and higher order potentials our formulation, in principle, can be extended to distinguish higher order interactions as well

Because mRNA abundance measurements only serve as a proxy for the interacting molecular species (i.e., activated protein concentrations), the type of physical interactions corresponding to the irreducible statistical dependencies identified by ARACNE is not always clear. For example, if the activity of a transcription factor is primarily mediated by an activating enzyme, rather than by changes in its mRNA abundance level, we expect ARACNE to identify dependencies between this enzyme and the target genes of the transcription factor. Moreover, a violation of the algorithm's hypotheses may occur for proteins involved in stable complex formation. Since it is energetically efficient for the cell to produce a stoichiometrically balanced concentration of proteins involved in stable complexes (e.g., the ribosomal units), evolution has fine-tuned the transcriptional control of these proteins so that their concentrations are balanced. Thus, regardless of the concentration of the several transcription factors (TFs) that may control their expression, the correlation between the final protein concentrations is generally higher than that between each protein and each individual TF. This violates the assumptions of Theorem 3 and produces irreducible statistical interactions between protein pairs involved in stable complex formation. Therefore, we expect some edges to correspond to protein-protein interactions, although we note that this situation would be correctly handled if higher order dependencies were analyzed.

Finally, we end with the following observation. Since ARACNE may fail for topologies with many tight loops, it is important to understand whether an analyzed topology is, in fact, locally tree like, and, therefore, the reconstruction can be trusted. We suggest two heuristics. First, loopy topologies continue to have more loops after reconstruction (results not shown). Thus an excessive number of loops in a deconvolved network should serve as a warning sign (Appendix C); more analysis is required to determine an acceptable range for this statistic. Second, as in the current analysis, predictions made by ARACNE (or, for that matter, any other computational algorithm) should be directly experimentally verified.

Conclusion

The goal of ARACNE is not to recover

ARACNE's high precision in reconstructing a synthetic network designed to simulate transcriptional interactions, as well as the inference of bona-fide targets of c-MYC, a known transcription factor, in human B cells, suggest ARACNE's promise in identifying direct transcriptional interactions with low false-positive rates in mammalian networks, an obvious challenge for all reverse engineering algorithms. More research is needed to precisely characterize other types of interactions corresponding to irreducible statistical dependencies identified by ARACNE. We suggest that predictions made by ARACNE can be used in conjunction with other data modalities such as genome-wide location data, DNA sequence information, or targeted biochemical experiments to progress towards this level of detail. We plan to investigate this possibility using a model organism platform as well as extensions to the simulation model. However, studies based on targeted perturbations to model organisms have demonstrated the promise of using conceptual "gene-gene" networks to elucidate functional mechanisms underlying cellular processes

Appendices

Appendix A – Proofs of Theorems

Theorem 1

If MIs can be estimated with no errors, then ARACNE reconstructs the underlying interaction network exactly, provided this network is a tree and has only pairwise interactions.

Proof of Theorem 1

First, notice that for every pair of nodes

Theorem 2

The Chow-Liu (CL) maximum mutual information tree is a subnetwork of the network reconstructed by ARACNE.

Proof of Theorem 2

We notice that, without a loss of generality, we can assume that the Chow-Liu tree and the ARACNE construction span all the nodes of the network. If this is not the case, that is, a few connected clusters exist (separated by edges with zero MI), then for the purpose of this theorem we can complete CL and ARACNE structures by the same edges with zero MI without formation of additional loops, till they become spanning. Now suppose that the theorem is false and there exists an edge (_{i }and _{j }that contain the _{ik}, _{jk}) >_{ij}. Without a loss of generality, let _{i}. Then replacing the (_{jk }- _{ij }> 0. Thus the original tree is not the maximum MI tree. We arrive at a contradiction, which proves the theorem.

Theorem 3

Let _{ik }be the set of nodes forming the shortest path in the network between nodes _{ik}, _{ij }≥ _{ik}. Further, ARACNE does not produce any false negatives, and the network reconstruction is exact _{ij }≥ min(_{jk}, _{ik}).

Proof of Theorem 3

To prove the absence of false positives, we notice that, for every candidate edge (_{ik}. Applying DPI to the (

Appendix B – Relations to Graphical Models and Statistical Physics

The definition of dependencies employed in the paper, which is based on the presence of a potential that couples interacting genes in the JPD,

is similar to that used in the theory of graphical models, specifically Markov Networks (MNs)

As is understood in the graphical models literature, the formulation of Equation 1 resembles some statistical mechanics problems, specifically spin glasses on random networks _{i }are binary (such discretization of expression levels is a common technique to deal with undersampling). In this case, the genes are the Ising spins, and truncations to the first, second, or the third order potentials are steps towards the mean field, Bethe, and Kikuchi variational approximations _{i}}), a variational approximation to the true JPD, _{i}}), that minimizes _{KL }(_{L }are unknown and cannot be used in averaging. On the other hand, we are here solving the inverse problem – reconstructing the network given the known true marginal distributions.

ARACNE, which truncates Equation 1 at the second order potentials, is an analog of the Bethe approximation for the direct problem. Just like this approximation and the associated belief propagation algorithm

Appendix C – Counting Loops in an Undirected Adjacency Matrix

A pairwise interaction network can be represented by an adjacency matrix _{ij}, where _{ij }= 1,0 denotes either presence or absence of the corresponding interaction. To test the effect of violation of the "locally tree-like" assumption on the performance of the algorithm, we need to be able to count the number of cycles (loops) in a given network. This is complicated by the fact that the total number of cycles in a graph is not equal to the number of independent cycles; that is the number of edges that need to be removed to transform the graph into a tree. We need to count the number of independent cycles only. Additionally, of all possible complete sets of independent cycles we are interested in identifying the one with the smallest loops (since small loops have the highest potential to violate the locally tree-like assumption). We suggest the following algorithms to solve this task approximately.

1) We prune the nodes that have 0 or 1 neighbors in the adjacency matrix

2) We transform the undirected network _{ij }≠ 0 in the original network with a node in the new network (edges _{ij }= _{jk }= 1, _{(ij),(jk) }= 1 otherwise _{(ij),(kl) }= 0.

3) We evaluate integer powers of the matrix ^{n}) > 0, a loop (or loops) of size

4) We repeat 1–3 till no more loops are found.
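The pruning in step 1 and the count of independent cycles can be sketched as follows. This illustration covers only part of the procedure: it uses the standard cyclomatic number E - V + C (edges minus nodes plus connected components) for an undirected graph, does not attempt to identify the smallest loops as steps 2 and 3 do, and the adjacency-dictionary representation and function names are assumptions.

```python
def prune_leaves(adj):
    """Repeatedly drop nodes with 0 or 1 neighbors; they cannot lie on a cycle."""
    adj = {u: set(vs) for u, vs in adj.items()}
    stack = [u for u, vs in adj.items() if len(vs) <= 1]
    while stack:
        u = stack.pop()
        if u not in adj:
            continue  # already pruned via an earlier stack entry
        for v in adj.pop(u):
            adj[v].discard(u)
            if len(adj[v]) <= 1:
                stack.append(v)
    return adj

def independent_cycles(adj):
    """Cyclomatic number E - V + C of an undirected graph {node: neighbors}."""
    n_edges = sum(len(vs) for vs in adj.values()) // 2
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        frontier = [start]
        while frontier:
            u = frontier.pop()
            for v in adj[u] - seen:
                seen.add(v)
                frontier.append(v)
    return n_edges - len(adj) + components

# A triangle (0, 1, 2) with a dangling node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sorted(prune_leaves(graph)), independent_cycles(graph))
```

Pruning leaves the count unchanged, since each removed node takes exactly one edge (or an isolated component) with it; it only shrinks the graph that later steps have to examine.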

Authors' contributions

AAM: Conducted research, designed study, participated in design of algorithm, wrote manuscript. IN: Designed theoretical framework, participated in design of algorithm, wrote manuscript. KB: Performed biochemical validation. CW: Participated in design of study. GS: Participated in design of algorithm and validation. RDF: Supervised and designed biochemical validation. AC: Designed algorithm, supervised research, wrote manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the NCI (1R01CA109755-01A1) and the NIAID (1R01AI066116-01). AAM is supported by the NLM Medical Informatics Research Training Program (5 T15 LM007079-13).