Department of Pharmacology and Systems Therapeutics, Systems Biology Center of New York (SBCNY), Mount Sinai School of Medicine, One Gustave L. Levy Place, Box 1215, New York, NY, 10029, USA

Abstract

Background

The skeleton of complex systems can be represented as networks where vertices represent entities, and edges represent the relations between these entities. Often it is impossible, or expensive, to determine the network structure by experimental validation of the binary interactions between every vertex pair. It is usually more practical to infer the network from surrogate observations. Network inference is the process by which an underlying network of relations between entities is determined from indirect evidence. While many algorithms have been developed to infer networks from quantitative data, less attention has been paid to methods which infer networks from repeated co-occurrence of entities in related sets. This type of data is ubiquitous in the field of systems biology and in other areas of complex systems research. Hence, such methods would be of great utility and value.

Results

Here we present a general method for network inference from repeated observations of sets of related entities. Given experimental observations of such sets, we infer the underlying network connecting these entities by generating an ensemble of networks consistent with the data. The frequency of occurrence of a given link throughout this ensemble is interpreted as the probability that the link is present in the underlying real network conditioned on the data. Exponential random graphs are used to generate and sample the ensemble of consistent networks, and we take an algorithmic approach to numerically execute the inference method. The effectiveness of the method is demonstrated on synthetic data before the approach is applied to problems in systems biology and systems pharmacology, as well as to the construction of a co-authorship collaboration network. We predict direct protein-protein interactions from high-throughput mass-spectrometry proteomics, integrate data from ChIP-seq and loss-of-function/gain-of-function followed by expression data to infer a network of associations between pluripotency regulators, extract a network that connects 53 cancer drugs to each other and to 34 severe adverse events by mining the FDA’s Adverse Event Reporting System (AERS), and construct a co-authorship network that connects Mount Sinai School of Medicine investigators. The predicted networks and online software to create networks from entity-set libraries are provided online at

Conclusions

The network inference method presented here can be applied to resolve different types of networks in current systems biology and systems pharmacology as well as in other fields of research.

Background

The skeleton of complex systems can be represented as a network where vertices represent entities and edges the relations between these entities. Often it is impossible, or expensive, to determine the network structure by experimental validation of all the interactions between all the vertices. It is usually more practical to infer the network from surrogate observations. Uncovering the relations between entities from indirect evidence is known as the problem of network inference or reverse engineering of networks from data. While many algorithms have been developed to infer networks from quantitative data of systems’ entities states, less attention has been placed on methods to infer networks from repeated observations of related sets. There are many cases in which groups or clusters of interrelated entities are known or can be observed experimentally. Typically such information is much more readily accessible than direct evidence of pair-wise interactions or even quantitative information about the entities under different conditions or time points. Each of these sets of related entities provides some information about the connectivity of the underlying network, and it would be of value to be able to utilize this information to resolve the connectivity of the underlying network connecting these entities. This inference process applies to a general class of inference problem of broad applicability; however, our motivation comes from problems in systems biology and systems pharmacology.

The tide of high-throughput biological data makes the inference of biological networks both more necessary and more feasible. There have been a number of success stories in the application of these methods to understand real biological phenomena. However, the multitude of components in intracellular molecular systems and their combinatorial interactions means that the number of possible networks consistent with the observed data is astronomically large. Since we cannot directly observe many components of this system at once, and methods to profile binary interactions are expensive and laborious, the network remains under-determined; there are many networks which can explain the observed data equally well. However, in recent years the ability to sequence DNA, RNA and protein, together with the accumulation of prior knowledge about functional and physical relationships between genes and proteins, has improved our ability to infer the underlying networks that govern the phenotype of mammalian cells.

At the same time, a new field is emerging called systems pharmacology. Systems pharmacology aims to integrate knowledge about drugs, drug-drug interactions, drug interactions with cells and organs, and drug relations to adverse events and desired effects in individual patients

Current network inference methods employ a number of different strategies to reduce the search space of network structure solutions. Naturally, each strategy makes certain compromises and assumptions and has particular advantages and limitations

In most high-throughput (HT) methods that collect molecular biological data from cells, the underlying network is not known. Typically subsets of related molecular components are observed. For example, groups of co-expressed genes across different samples and contexts may be identified from transcriptomics data. Another example is groups of proteins which are listed in pull-down proteomics experiments that use immunoprecipitation followed by mass spectrometry (IP/MS) for protein complex identification. The identified genes or proteins may be regarded as the vertices of the underlying gene regulatory, cell-signaling or protein-protein interaction (PPI) networks. Extraction of the underlying network from such data can be achieved in many ways. In addition, the popular method of gene set enrichment analysis (GSEA)

The method we present here makes no

Methods

Exponential random graphs

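Exponential Random Graph Models (ERGMs) are a means of generating an ensemble of networks with prescribed statistical properties with the aim of modeling real-world networks. ERGMs were introduced by Holland and Leinhardt. The ensemble is specified by a set of graph observables, $\{x_i\}$, for which the ensemble mean of each observable, $\langle x_i \rangle$, takes the empirical values. The best choice of probability distribution, $P(G)$, is the one which satisfies the empirical constraints,

$$\sum_{G} P(G)\, x_i(G) = \langle x_i \rangle, \qquad (1)$$

while admitting no further information about the model graphs, which is achieved by maximizing the Gibbs entropy,

$$S = -\sum_{G} P(G) \ln P(G).$$

This leads to a probability distribution which is the network equivalent of the Boltzmann distribution,

$$P(G) = \frac{e^{-H(G)}}{Z},$$

where

$$H(G) = \sum_{i} \theta_i\, x_i(G)$$

is the graph Hamiltonian, with $\theta_i$ the Lagrange multipliers conjugate to the constraints, and

$$Z = \sum_{G} e^{-H(G)}$$

is the partition function.

This probability distribution defines the exponential random graph model networks which obey the mean constraints of Equation 1, but which are otherwise maximally disordered.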

Dependence graphs

Pattison and Wasserman _{ij}, which is symmetric and has elements,

The ERGMs define a probability distribution over _{ij} may be regarded as a realization of a random variable which is defined as a random selection from _{ij}. The edges of

Inference approach

The inference problem addressed here is now phrased in terms of ERGMs. Given an unknown underlying network, _{u}, with vertices, _{i} consist of vertices which are empirically observed to be related in the network, this could be for example, proteins identified in a mass-spectrometry proteomics pull-down, members of a cell signaling pathway, or co-authors of a publication. The central assumption is that each of the sets _{i} identifies a locally connected subgraph of the underlying network. In these terms, the network inference problem we pose is thus: given the set of _{c} subsets, _{u}?

To give a specific example, we may consider the results from HT-IP/MS, after appropriate filtering, as identifying a locally connected region of the underlying human protein-protein interactome network _{i}, and the list of proteins identified in each pull-down experiment would correspond to one element of _{u}. We are aware that we are searching for a static configuration of the network, whereas the underlying connectivity in complex systems, including PPIs and gene-regulatory networks, is dynamic as it may change over time and under different conditions.

We define our observable graph functions, _{i}(

If we interpret each set as providing coarse local information on the connectivity of the underlying network $G_u$, such that we have a confidence, α, that the elements in each line are locally connected, then the constraints on the ensemble are the following,

in which case the maximum entropy probability distribution function _{ens}. In our studies of the properties of our inference approach on synthetic networks we shall generate data which identifies locally connected regions of the underlying network and so we shall take the value of

The above constraints leading to the probability distribution _{i} contains vertices between which a set of edges are assumed to exist in the underlying network in order to form a locally connected subgraph. Over the ERGM ensemble, the presence of each of these edges is conditionally dependent upon the others. These edges form a complete subgraph of

In the approach presented here we make no attempt to infer directionality, hence, we take the sample set,

The algorithm works by generating a random sample, of size $n_g$, of the ensemble of networks $G_{ens}$ consistent with the data. According to the assumptions of the inference, the GMT file contains a number $n_C$ of lines, each listing vertices that form a locally connected subgraph of the underlying network. Starting from an empty edge set, $E_i = \{\}$, the algorithm builds each sampled network by taking each line in turn and introducing a minimal number of random links that connect the vertices in that line. The pseudo-code for the algorithm is then:

For i = 1 to $n_g$

Randomly permute the order of the lines in the GMT file

$E_i$ = {} (start with a graph with no links)

For j = 1 to $n_C$

Randomly introduce a minimal number of edges between the vertices of line j such that they are connected, and append these edges to the set $E_i$

End For j

End For i

Calculate the mean adjacency matrix of the ensemble $G_{ens}$

A sample of random networks generated in this fashion constitutes a random sample of $G_{ens}$. The properties of this ensemble are then used to infer the underlying network $G_u$. Specifically, we calculate the mean adjacency matrix over this ensemble, each element of which corresponds to the probability of the edge being present in a uniformly random draw from the ensemble; this is interpreted as being indicative of the accumulation of information on the presence of the edge in the underlying network.
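The sampling procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the GMT file is assumed to be already parsed into a list of lists of vertex labels, and edges already present among the vertices of a line are assumed to count toward connecting that line, so that only the missing links are added.

```python
import random
from itertools import combinations


def connect_minimally(vertices, edges, rng):
    """Add the fewest random edges needed to connect `vertices`.

    Assumption: edges already in `edges` between these vertices are reused.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    candidates = list(combinations(vertices, 2))
    # Merge components that are already linked inside this line.
    for u, w in candidates:
        if frozenset((u, w)) in edges:
            parent[find(u)] = find(w)
    # Visit the pairs in random order; keep those that join two components.
    rng.shuffle(candidates)
    for u, w in candidates:
        root_u, root_w = find(u), find(w)
        if root_u != root_w:
            parent[root_u] = root_w
            edges.add(frozenset((u, w)))


def sample_mean_adjacency(lines, n_samples, seed=0):
    """Return {(u, w): fraction of sampled networks containing edge u-w}."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        order = lines[:]
        rng.shuffle(order)          # random permutation of the lines
        edges = set()               # start with a graph with no links
        for line in order:
            connect_minimally(sorted(set(line)), edges, rng)
        for edge in edges:
            counts[edge] = counts.get(edge, 0) + 1
    return {tuple(sorted(e)): c / n_samples for e, c in counts.items()}


if __name__ == "__main__":
    toy_lines = [["A", "B"], ["A", "B", "C"]]
    for pair, p in sorted(sample_mean_adjacency(toy_lines, 500).items()):
        print(pair, round(p, 2))
```

In the toy input the pair A, B is forced by the two-vertex line and so appears in every sampled network, while the edges to C are shared between the two remaining pairs.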

Analytical approximation

The algorithmic sampling of the ensemble $G_{ens}$ becomes computationally demanding when inferring networks with many vertices and large amounts of data. When applying the approach to infer biological networks from large high-throughput datasets, which, although sparse, can have thousands of vertices, we look for an efficient analytical approximation to the algorithmic solution. The approach we take is to analytically mimic the function of the fully executed algorithm that generates the networks, which are samples of $G_{ens}$. The first order approximation is to treat each set as generating an independent minimally connected Bernoulli random graph, in which each edge has an independent and equal probability of appearing. A minimally connected graph on $n$ vertices has $n-1$ of the $n(n-1)/2$ possible edges, so this probability is $2/n$. Then the superposition of all the sets may be regarded as a Bernoulli process in which each edge undergoes a series of Bernoulli trials with probabilities corresponding to the random graphs generated by each of the sets. The probability of an edge between vertices $i$ and $j$ in $G_u$ is then given by,

$$p_{ij} = 1 - \prod_{k} \left(1 - \frac{2\alpha}{n_{ij,k}}\right), \qquad (9)$$

where $n_{ij,k}$ is the size of the $k$th set to which both vertices $i$ and $j$ belong, and $\alpha$ is the confidence that the vertices of each set are locally connected.
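A short Python sketch of this approximation is given below. It assumes that each set of size n contributes an independent Bernoulli trial with success probability 2α/n to every pair of its members (the edge density of a minimally connected graph on n vertices, scaled by the confidence α), and that these trials are combined as one minus the product of the failure probabilities. The function name and the input format (a list of lists of vertex labels) are illustrative assumptions.

```python
from itertools import combinations


def edge_probabilities(lines, alpha=1.0):
    """Approximate edge probabilities from co-membership in sets.

    Assumed form: p_ij = 1 - prod_k (1 - 2 * alpha / n_k), taken over the
    sets k that contain both i and j, where n_k is the size of set k.
    """
    absent = {}  # running probability that no set has produced the edge
    for line in lines:
        members = sorted(set(line))
        n = len(members)
        if n < 2:
            continue
        q = alpha * 2.0 / n  # n - 1 tree edges spread over n(n - 1)/2 pairs
        for pair in combinations(members, 2):
            absent[pair] = absent.get(pair, 1.0) * (1.0 - q)
    return {pair: 1.0 - a for pair, a in absent.items()}


if __name__ == "__main__":
    toy_lines = [["A", "B"], ["A", "B", "C"], ["B", "C", "D", "E"]]
    for pair, p in sorted(edge_probabilities(toy_lines).items()):
        print(pair, round(p, 3))
```

Each pair is visited once per set that contains it, so the cost grows with the sum of the squared set sizes rather than with the number of sampled networks.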

Alternative co-occurrence scoring methods

To compare the above methods with other approaches that can be used to construct networks from repeated observations of co-occurrence of entities in related sets, we consider two alternatives. A simple difference of proportions test to measure co-occurrence strength is applied as follows: For two given vertices, _{i} in _{i} in the GMT file in which both _{c}, is also used as a comparison.

Bias adjustment

While artificial random networks can be evenly and randomly sampled, real world datasets contain sampling biases where some vertices are measured more often. For example, well-studied genes/proteins appear in more publications and thus commonly have more reported interactions with other genes/proteins. This inhomogeneity of information throughout the network means that the inferred interactions are as sensitive to the frequency with which particular vertices are observed as to the strength and specificity of the edges between the vertices. In order to correct for this and calculate probabilities which are only sensitive to the strength of the interactions, we generate a null distribution for each edge probability under the hypothesis of no specific interaction. This is achieved by calculating the distribution of each edge probability $p_{ij}$ in the network after any information on the network connectivity has been destroyed. The GMT file data are randomly permuted while preserving their structure: the length of each set and the frequency of each element throughout the data are conserved, and each element is allowed to appear only once in any given set/line. All the edge weights are then calculated. This process is repeated to generate the null distribution for each edge weight. By comparing the actual edge weight to this null distribution, under the hypothesis of no interaction, we obtain a p-value which quantifies the edge probabilities after correction for the sampling bias. This p-value is then used as a measure of the strength of the interaction.
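The permutation procedure can be illustrated with the following Python sketch. It is written under several assumptions that are not taken from the text: the randomization is implemented by repeatedly swapping elements between pairs of sets (one of several ways to conserve set lengths and element frequencies without creating duplicates), the edge weights use the first order approximation with probability 2α/n per set, and the p-value is the add-one empirical estimate. All function names are hypothetical.

```python
import random
from itertools import combinations


def edge_weights(lines, alpha=1.0):
    """Assumed edge weight: 1 - prod_k (1 - 2 * alpha / n_k)."""
    absent = {}
    for line in lines:
        members = sorted(set(line))
        if len(members) < 2:
            continue
        q = alpha * 2.0 / len(members)
        for pair in combinations(members, 2):
            absent[pair] = absent.get(pair, 1.0) * (1.0 - q)
    return {pair: 1.0 - a for pair, a in absent.items()}


def shuffle_preserving_structure(lines, n_swaps, rng):
    """Swap elements between random pairs of sets.

    Set lengths and element frequencies are conserved, and a swap is
    rejected if it would place the same element twice in one set.
    Requires at least two non-empty sets.
    """
    sets = [list(dict.fromkeys(line)) for line in lines]
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(sets)), 2)
        i = rng.randrange(len(sets[a]))
        j = rng.randrange(len(sets[b]))
        x, y = sets[a][i], sets[b][j]
        if x not in sets[b] and y not in sets[a]:
            sets[a][i], sets[b][j] = y, x
    return sets


def empirical_p_values(lines, n_perm=200, seed=0):
    """Fraction of permuted datasets whose edge weight reaches the observed one."""
    rng = random.Random(seed)
    observed = edge_weights(lines)
    n_swaps = 10 * sum(len(line) for line in lines)
    exceed = {pair: 0 for pair in observed}
    for _ in range(n_perm):
        null = edge_weights(shuffle_preserving_structure(lines, n_swaps, rng))
        for pair, weight in observed.items():
            if null.get(pair, 0.0) >= weight:
                exceed[pair] += 1
    return {pair: (c + 1) / (n_perm + 1) for pair, c in exceed.items()}
```

A small p-value indicates that the observed edge weight is rarely matched once the co-occurrence structure has been destroyed, which is the sense in which the score is corrected for how often each vertex is observed.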

Combining evidence from multiple GMT files to build a consensus network

We can combine several types of datasets stored in different GMT files to form a consensus network. By combining two or more different inferred network perspectives we may gain additional insight into the functional associations between related vertices. When combining data from several sources we rewrite Equation 9 for the edge probabilities as:

$$p_{ij} = 1 - \left[\prod_{k}\left(1 - \frac{2\alpha}{n_{ij,k}}\right)\right]\left[\prod_{l}\left(1 - \frac{2\beta}{n_{ij,l}}\right)\right]\cdots \qquad (10)$$

where the terms in each square bracket come from each distinct data source and the Greek letters are the corresponding confidence parameters. To compute scores that consider the bias adjustment, the distribution of edge weights from randomly permuted GMT files is generated using the same equation and the same steps described above for inferring edge probabilities for individual networks.
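As an illustration of how evidence from several GMT files might be pooled, the sketch below multiplies the per-source failure probabilities for each edge, with one confidence parameter per source. It assumes the per-set edge probability 2·confidence/n of the first order approximation; the function name and the `(lines, confidence)` input format are hypothetical.

```python
from itertools import combinations


def combined_edge_probabilities(sources):
    """Combine several parsed GMT files into one set of edge probabilities.

    `sources` is a list of (lines, confidence) pairs; every set in every
    source contributes a factor (1 - 2 * confidence / n) to each of its pairs.
    """
    absent = {}
    for lines, confidence in sources:
        for line in lines:
            members = sorted(set(line))
            if len(members) < 2:
                continue
            q = confidence * 2.0 / len(members)
            for pair in combinations(members, 2):
                absent[pair] = absent.get(pair, 1.0) * (1.0 - q)
    return {pair: 1.0 - a for pair, a in absent.items()}


if __name__ == "__main__":
    source_one = ([["A", "B", "C"]], 1.0)       # e.g. a high-confidence assay
    source_two = ([["A", "B", "C", "D"]], 0.5)  # e.g. a lower-confidence assay
    print(combined_edge_probabilities([source_one, source_two]))
```

Lowering the confidence of a source shrinks the contribution of each of its sets without removing it, so a noisy dataset can still add weight to edges that are also supported elsewhere.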

Matthews correlation coefficient to evaluate inferred networks

To evaluate the quality of the predicted edges we use the Matthews correlation coefficient (MCC). MCC is a balanced measure of the quality of binary classifications, and it is equal to unity when the classification is perfect; it is computed by the following equation:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

where FP and FN are the false positives and false negatives respectively, and TP and TN are the true positives and true negatives respectively.
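The following Python sketch shows how the MCC can be evaluated for a thresholded matrix of edge scores. The function names and the dictionary-based inputs are illustrative assumptions, and returning 0 when the denominator vanishes is a common convention rather than something stated in the text.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): MCC is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mcc_at_threshold(scores, true_edges, all_pairs, threshold):
    """Call an edge when its score reaches `threshold`, then score the calls."""
    tp = tn = fp = fn = 0
    for pair in all_pairs:
        predicted = scores.get(pair, 0.0) >= threshold
        actual = pair in true_edges
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return mcc(tp, tn, fp, fn)


if __name__ == "__main__":
    scores = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.2}
    truth = {("A", "B"), ("B", "C")}
    pairs = list(scores)
    for t in (0.1, 0.5, 0.85):
        print(t, mcc_at_threshold(scores, truth, pairs, t))
```

Sweeping the threshold over the range of scores and taking the maximum gives a single summary of how well a score matrix separates true edges from non-edges.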

Results

Inference of synthetic networks

First we investigate the quality of inference using our approach by applying it to synthetic test networks. We begin with a randomly generated connected graph serving as the underlying network _{u} that we wish to infer (Figure _{u} only using the information from _{ens}, is generated algorithmically by randomly introducing a minimum number of links in order to connect successive sets of vertices _{i}. A small selection of the calculated elements of _{ens} is shown in Figure _{u}. Due to the binary nature of the edges, a comparison can be made between the underlying network _{u} and the mean adjacency matrix, after the application of a threshold value of _{t}. A histogram of edge probabilities, _{ij}, shown in Figure _{t}. In other words, we plot the MCC where we cut the inferred adjacency matrix to decide which scores will constitute an edge. If we set a large threshold value there will be only few edges remaining in the network and thus many false negatives and lower MCC. Conversely, as the threshold is reduced to zero the network will tend to completeness and thus will include many false positives. There is a region of _{t} where the similarity between the inferred network and the underlying network is greatest. This region corresponds to the peak of the MCC curve (Figure

**Example of inferring a network from synthetic data.** (**A**) An arbitrary random graph serves as $G_u$, which can be represented by an adjacency matrix, shown in (**B**). The ensemble $G_{ens}$ is algorithmically sampled; a few of the sampled graphs are shown in (**C**). An example of the mean adjacency matrix over $G_{ens}$ is shown in (**D**).

**Histogram of interaction probabilities (pale blue bars) and their corresponding MCC (deep blue lines) as a function of $p_t$ for inferring ten synthetic networks of the same size (n = 50).**

The aim of applying the inference approach to such synthetic networks is to investigate the quality of the inference in a case where the underlying network is known. This investigation requires several steps. First, because the approach requires a random sample (of the consistent networks) we must investigate the convergence of the statistics deriving from this sample. Once the rate of convergence is known, the required sample size $n_g$ for a given degree of convergence is obtained and the method may be reliably applied. The next step is to examine the quality of inference and its dependence on the properties of the available data. Finally, with a view to the application of our method to problems in systems biology, we examine the computational complexity and running time of our approach.

Convergence of inferred networks

Our inference approach is based on the mean of a random sample of networks which are consistent with our data. The rate of convergence of this statistic depends on the properties of the data upon which the ensemble is based, and so a general convergence rate does not exist. We can, however, bound the rate of convergence by considering the worst-case scenario. In this case there is no information on the connectivity of the underlying network, so the ensemble of consistent networks is the largest possible and the rate of convergence is the slowest. Here we derive the rate of convergence of the mean in this case, which gives an upper limit on the sample size required for convergence.

In this worst-case every connected network with _{e} consistent networks, the total number of times any given edge is present, _{e} trials at a probability of success of _{e} is large enough to achieve a certain signal-to-noise ratio. Given that

and this is the formula used to determine the sample size throughout all applications of the algorithm presented here.
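The link between a target signal-to-noise ratio and the sample size can be illustrated with a simple binomial sketch in Python. It assumes that the number of sampled networks containing a given edge is binomially distributed, and that the signal-to-noise ratio is the mean of that count divided by its standard deviation; the function names, and the edge probability of 2/n used in the example for a single uninformative set of n vertices, are assumptions made for illustration and may differ from the exact expression used in the text.

```python
import math


def signal_to_noise(n_samples, p_edge):
    """SNR of a binomial count: mean / standard deviation."""
    return math.sqrt(n_samples * p_edge / (1.0 - p_edge))


def required_sample_size(p_edge, snr):
    """Smallest sample size whose binomial SNR reaches `snr`."""
    if not 0.0 < p_edge < 1.0:
        raise ValueError("p_edge must lie strictly between 0 and 1")
    return math.ceil(snr ** 2 * (1.0 - p_edge) / p_edge)


if __name__ == "__main__":
    n_vertices = 20
    p_worst = 2.0 / n_vertices  # assumed worst-case edge probability
    print(required_sample_size(p_worst, snr=10))
```

Under this model the required sample size grows with the square of the target signal-to-noise ratio and increases as the edge probability falls, which is why the least informative data set the upper limit.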

The above estimate can be used as an upper limit to the sample size $n_e$ required for the mean to converge to a given signal-to-noise ratio in the case of general GMT file data; this is because a general GMT file is more informative than the worst case, hence the ensemble of consistent networks is smaller, and a smaller sample is required for inferring the underlying network. We demonstrate this in a plot of the signal-to-noise ratio against the size of the ensemble for inference of a 20 node network (Figure

**MCC as a function of the threshold value $p_t$ applied to the mean adjacency matrix calculated for ten synthetic network inferences from $n_C$ = 180, 50, 10, where each of the connected subsets was generated by a random walk of length 3 on the respective underlying synthetic network $G_u$.**

Accuracy of inference

Having quantified the convergence of the inferred network with the sample size, the next consideration is how similar the inferred network is to the underlying network from which the data derive. The quality of the inference depends on the data contained in the GMT file. As the number of lines in the GMT file, $n_C$, increases, the information on the presence of edges accumulates and the inferred network becomes more similar to the underlying network. The length of the lines in the GMT file also influences the quality of the inference; longer lines provide coarser information on the connectivity of the network and thus are less informative. Finally, the more nodes there are in an underlying network, the greater the amount of data required to infer its structure. Here, for a given number of nodes, we examine the dependence of the quality of inference on $n_C$ and on the length of the lines in the GMT file. We begin by examining the dependence on the number of lines in the GMT file. We infer the structure of a minimally connected network for increasing $n_C$, using four different methods of inference: 1) the full algorithm, 2) the analytic approximation, 3) the chi-squared difference of proportions, and 4) simple co-occurrence. In each case we see that as $n_C$ increases, i.e. with more data, the accuracy of the inference increases and tends towards ideal inference, in which the inferred network is exactly the same as the underlying network. We also observe that the full algorithm described above is more accurate than the other methods, and that the analytic approximation is in turn more accurate than the more basic co-occurrence approaches. The accuracy increases with $n_C$ for every line length; however, the rate of increase falls as the sets in the GMT file become larger (longer lines). In this way we show that for coarser information (longer GMT file lines) more data is required to infer the network.

**The maximum value of the MCC over the full range of the threshold $p_t$ plotted against the size of the field, $n_C$.** The curves show the resolution of the underlying network as $n_C$ increases, where each set is generated by performing random walks of increasing length, as indicated in the legend. The upper figure derives from the algorithmic sampling while the lower figure derives from the analytic approximation.

**Signal-to-noise ratio (SNR) plotted against the size of the ensemble for inference of a 20-vertex network.** (**a**) Mean SNR for edges inferred from a worst-case GMT file over a range of sample sizes $n_e$ (averaged over 40 inferred networks). (**b**) Expected SNR for the worst-case GMT file. (**c**) SNR in the case of GMT files generated by performing 100 short random walks (length 3).

**Comparing algorithmic sampling to approximation and other methods.** The maximum value of the MCC over the full range of the threshold $p_t$ plotted against the size of the field, $n_C$. The error bars show the standard deviation over ten network inferences (each with 50 nodes). The four curves represent the resolution of the underlying network, $G_u$, with increasing $n_C$, when the network is inferred with the algorithmic sampling of $G_{ens}$, the approximation shown in Equation 9, simple co-occurrence counting, and co-occurrence enrichment analysis using the chi-squared proportion test as described under alternative co-occurrence scoring methods.

Dependence of the mean running time over 10 inference realizations for the inference of synthetic networks which have the parameters _{c}

Running time and computational complexity

While the fully executed algorithm is potentially useful, it is often impractical due to its computational complexity. The number of operations needed to execute the fully enumerating algorithm depends on $n_e$, $n_C$, and the length of the elements in each set, which are inherently random so we shall refer to the mean length $l_{avg}$. From the structure of the pseudo-code we expect that the number of operations should increase in proportion to each of these three quantities, $O(n_e n_C l_{avg})$, as well as with the number of vertices $n$. In the case of real data, $n_C$ is typically of the order of $10^3$ or $10^4$, and there can be thousands of vertices, so the number of operations for this method of inference can be prohibitively large. For the analytical approximation with Equation 9, described under Analytical approximation above, the number of edge probabilities to evaluate increases as $O(n^2)$. However, the number of operations required for each evaluation depends on the structure of the data within the GMT file. A comparison of running times between the fully executed algorithm and the approximation is shown in Figure 7; this shows that the analytical approximation is two orders of magnitude faster than algorithmic sampling and therefore more practical for generating networks from real datasets stored as GMT files.

In the next sections we employ the approximation for the inference of PPIs from HT-IP/MS data, construct a network of stem cell regulators from ChIP-seq and loss-of-function/gain-of-function followed by expression data, construct a network between cancer drugs and severe side effects from patient records, as well as construct a co-authorship network connecting researchers from Mount Sinai School of Medicine in New York.

Application to PPI prediction from HT-IP/MS data

The identification of binary interactions between proteins is an important task in systems biology. Initially, information on PPI networks in mammalian cells came from targeted experiments involving a small number of proteins. However, experimental techniques can now explore PPIs in human and mouse cells at large-scale. In addition, large numbers of PPIs from small-scale experimental studies are continually aggregated in publicly available databases.

In a recent study, Malovannaya et al.

A large value of $p_{i,j}$ may indicate that there is sufficient information to suggest that proteins $i$ and $j$ directly physically bind to each other. However, although large, this HT-IP/MS dataset does not contain enough data to fully resolve the underlying network of PPIs, so a small value of $p_{i,j}$ does not necessarily indicate that the pair of proteins do not directly bind, only that there is not enough information in the dataset to suggest that they do. As the network is not fully resolved, the results of this inferential process could be used to rank the binary interactions by likelihood, suggesting candidates for more targeted validation. Alternatively, the mean adjacency matrix could be used to gain a coarse-grained global view of the human nuclear co-regulation complexome. To evaluate the reliability of predicted interactions we used benchmarking to compare predicted interactions to known interactions. Benchmarking is important for determining the quality and reliability of network inference approaches, and we attempt here to evaluate our inference approach applied to this data. The typical approach is to take the union of many current curated PPI databases and treat the interactions therein as true positives. This is imperfect because these databases contain significant numbers of false positives and false negatives, which penalizes inferences that may be correctly discovering unknown interactions.

With these concerns in mind, we followed this procedure to evaluate our inference method. First we used the list of proteins identified in each pull-down to define the sets of proteins forming a connected subgraph of the underlying PPI network; this defines the subsets _{i} composing the field _{t} (Figure

**Predicted PPIs.**

**Receiver operator characteristic (ROC) curves of the mean adjacency matrix of the ensemble $G_{ens}$ created from the HT-IP/MS data inferred interactions created with the four different types of inference methods.** True positives are called based on known protein-protein interactions from published databases and publications used as the gold standard to evaluate the quality of the classification. Inset is a zoom-in of the leftmost portion of the ROC curve plots.

**The MCC as a function of the threshold value $p_t$ applied to the mean adjacency matrix calculated for the HT-IP/MS data inferred interactions with the four different types of inference methods as compared to the PPI database to evaluate the quality of the classification.**

**Histograms of scores between all pair-wise proteins within the mean adjacency matrix of the ensemble $G_{ens}$ created from the HT-IP/MS data inferred interactions created with the approximation algorithm before (top) and after (bottom) the bias adjustment.**

To further validate the inferred PPI network, we compared the ability of the predicted interactions to recover known protein complexes listed in the CORUM database

**List of 50 complexes.**

**Images of predicted 50 complexes.**

**Predicted interactions between members of the MCM complex as listed in the CORUM database and additional proteins that are predicted to strongly interact with the members of this complex.** On the left is a heatmap that visualizes the strength of the scores as predicted by the approximation method after applying the bias adjustment. On the right is a heatmap made of the same proteins where interactions are visualized as an adjacency matrix where black squares denote known interactions in the PPI database we constructed from multiple published sources.

**Predicted interactions between members of the TFIIH transcription factor complex as listed in the CORUM database and additional proteins that are predicted to strongly interact with the members of this complex.** On the left is a heatmap that visualizes the strength of the scores as predicted by the approximation method after applying the bias adjustment. On the right is a heatmap made of the same proteins where interactions are visualized as an adjacency matrix where black squares denote known interactions in the PPI database we constructed from multiple published sources.

Revealing associations between stem cell regulators

In the past few years a tremendous number of high-content experiments have profiled different aspects of mouse embryonic stem cells. Such experiments include gene-expression microarrays at different conditions, genome-wide histone modification and transcription factor binding to DNA using ChIP-seq, RNAi screens for identifying pluripotency regulators, proteomics, phosphoproteomics and microRNA profiling. While these experiments have the potential to fully uncover the regulatory networks governing stem-cell maintenance and differentiation into specific lineages, data integration across regulatory layers to extract new knowledge from such data is challenging

**Subnetwork of inferred interactions between pluripotency and self-renewal regulators as determined by applying the approximation method with the bias adjustment on the ChIP-seq dataset of profiling these factors and regulators in mouse embryonic stem cells (mESCs).**

**Subnetwork of inferred interactions between pluripotency and self-renewal regulators as determined by applying the approximation method with the bias adjustment to the LOF/GOF followed by gene expression microarrays dataset when perturbing these factors and regulators in mESCs.**

**Subnetwork of inferred interactions between pluripotency and self-renewal regulators as determined by applying the approximation method with the bias adjustment on the ChIP-seq dataset of profiling these factors and regulators in mESCs as well as the LOF/GOF followed by gene expression microarrays dataset when perturbing these factors and regulators in mESCs.** The scores were combined using Equation 10.

It is important to point out that the GMT files from both the ChIP data and the LOF/GOF gene expression data have large _{c}, i.e., many rows in each of the two GMT files. In addition, each row is relatively short, listing only a few factors or regulators. As seen from the examples on random artificial networks, such data should recover the network with high fidelity because it is likely to contain enough information. Indeed, histograms of the scores show clear segregation into high and low scores after applying the bias adjustment (Figure

Histograms of scores between all pair-wise transcription factors and regulators within the mean adjacency matrices of the ensembles G_{ens} created for generating the three networks visualized in Figures

**Histograms of scores between all pair-wise transcription factors and regulators within the mean adjacency matrices of the ensembles G_{ens}**

Identifying statistical interactions between drugs and side effects

Next, we used the network inference approach to mine statistical interactions from the FDA’s spontaneous adverse event reporting system (AERS). This database contains millions of records entered by physicians in the United States recording data from patients. Each record contains a patient record number, the drugs the patient was taking and the adverse events they experienced. From the database we first extracted the most recent one million records (April, 2012). To consolidate drug names we converted all entered drug names to their generic names using synonyms from DrugBank _{i} where we treated drugs and side-effects as connected subgraphs in the underlying drug-drug, drug/side-effect, and side-effect/side-effect statistical interaction network. Because of the computational complexity of the problem and to save execution time, we only used the most recent 50,000 records from this dataset to create the actual network we visualize (Figure
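The preprocessing described above (normalizing drug names to generic names, then treating each patient record as one set of co-occurring drugs and side-effects) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record layout, the `drug:`/`event:` prefixes, the function names, and the two-entry synonym table are hypothetical stand-ins, whereas the study used synonyms from DrugBank.

```python
from collections import OrderedDict

# Hypothetical two-entry synonym table (brand name -> generic name).
# The study drew synonyms from DrugBank; this is only a stand-in.
SYNONYMS = {"gleevec": "imatinib", "taxol": "paclitaxel"}


def to_generic(name, synonyms=SYNONYMS):
    """Map a free-text drug name to its generic name when a synonym is known."""
    key = name.strip().lower()
    return synonyms.get(key, key)


def records_to_sets(records, max_records=None):
    """Turn patient records into entity sets, one set per record.

    `records` is assumed to be ordered newest first; each record is a dict
    with the (hypothetical) keys 'id', 'drugs', and 'events'.
    """
    if max_records is not None:
        records = records[:max_records]  # keep only the most recent records
    sets = OrderedDict()
    for rec in records:
        drugs = {"drug:" + to_generic(d) for d in rec["drugs"]}
        events = {"event:" + e.strip().lower() for e in rec["events"]}
        sets[rec["id"]] = sorted(drugs | events)
    return sets


# Toy records, newest first (illustrative only, not AERS data).
toy_records = [
    {"id": "r3", "drugs": ["Gleevec"], "events": ["Nausea"]},
    {"id": "r2", "drugs": ["Taxol", "imatinib"], "events": ["Neutropenia", "nausea"]},
    {"id": "r1", "drugs": ["paclitaxel"], "events": ["Neutropenia"]},
]
entity_sets = records_to_sets(toy_records, max_records=2)
```

Prefixing each entity with its type keeps drugs and side-effects distinguishable after they are pooled into one set, which is what allows drug-drug, drug/side-effect, and side-effect/side-effect links to be separated later.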

**Statistical interactions between drugs (green), side-effects (purple) and drugs/side-effects (orange) created from patient records (rows) from the AERS database by applying the network inference method to the most recent 50,000 entries.** Color intensity represents the strength of the link. Drugs and side-effects are hierarchically clustered separately.

**Heatmap of an adjacency matrix connecting 53 cancer drugs and 32 severe side-effects.** Statistical interactions among cancer drugs (light green, n = 19: drugs that target cell signalling components; dark green, n = 34: cytotoxic drugs), among 32 severe side-effects (purple), and between drugs and side-effects (orange), created from patient records (rows) from the AERS database by applying the network inference method to the most recent 50,000 entries. Color intensity represents the strength of the link. Drugs and side-effects are hierarchically clustered separately.

Mount Sinai collaboration network

Finally, we show how the network inference approach can be applied more broadly to construct other types of networks. The GMT representation lends itself naturally to the inference of co-authorship networks, where each row in the GMT file derives from a publication and lists its authors. Using the E-search function of PubMed E-utilities, we searched for the latest (early May 2012) publications with an affiliation matching the term Mount Sinai School of Medicine. We downloaded the top 5,000 abstracts returned by the search query and extracted the author lists using the E-fetch function. The data were formatted into a GMT file with the PubMed ID of each paper as the set label and the authors of that paper as the members of the set. After the assembly of the GMT file, the approximation algorithm was applied to the data. The final network contains only edges with scores higher than 0.67 (Figure
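The GMT assembly and the final thresholding step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the placeholder `na` description column (a common GMT convention, not confirmed by the text), and all toy identifiers and scores are hypothetical. The inference step that produces the scores is not reproduced here; only the 0.67 cutoff comes from the text.

```python
def papers_to_gmt_lines(papers, description="na"):
    """Format {paper_id: [authors]} as tab-separated GMT rows.

    Each row is: set label, description column, then the set members.
    The description placeholder is an assumption about the file layout.
    """
    lines = []
    for paper_id, authors in papers.items():
        # Drop empty names and repeated authors while preserving order.
        members = list(dict.fromkeys(a.strip() for a in authors if a.strip()))
        lines.append("\t".join([str(paper_id), description] + members))
    return lines


def filter_edges(scores, threshold=0.67):
    """Keep only edges whose score is strictly higher than the threshold."""
    return {pair: s for pair, s in scores.items() if s > threshold}


# Toy input (illustrative identifiers, not real PubMed records).
toy_papers = {
    "PMID_A": ["Author One", "Author Two"],
    "PMID_B": ["Author Two", "Author Three", "Author Two"],
}
gmt_lines = papers_to_gmt_lines(toy_papers)

# Hypothetical scores standing in for the output of the approximation algorithm.
toy_scores = {
    ("Author One", "Author Two"): 0.91,
    ("Author Two", "Author Three"): 0.67,
    ("Author One", "Author Three"): 0.12,
}
kept_edges = filter_edges(toy_scores)
```

The strict comparison mirrors the wording "scores higher than 0.67", so an edge scoring exactly 0.67 is excluded.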

Co-authorship network created by considering publications as the rows in a GMT file for publications from Mount Sinai School of Medicine investigators downloaded from PubMed

**Collaboration network created by considering publications as the rows in a GMT file for publications from Mount Sinai School of Medicine investigators downloaded from PubMed.**

**Zooming into the highlighted subnetwork in Figure to show relationships between authors and the scores computed to connect them.**

Conclusions

Network inference is a process by which a network is resolved from indirect data. In cases where direct determination of the network is difficult or impossible, it is necessary to use indirect evidence, which can be obtained more easily. Here we show that collections of related entities, which are easy to accumulate, can be used to resolve the underlying network. We drew the analogy that the indirect empirical data describe a macrostate, and the networks consistent with these data are the available microstates. As empirical data accrue, more constraints apply and the microstate ensemble shrinks until the underlying network resolves. We employed the formulation of ERGMs from Park and Newman

Competing interests

The author(s) declare that they have no competing interests.

Authors’ contributions

NRC and AM designed the study and wrote the manuscript. NRC generated all figures and conducted the computational analyses and mathematical derivations. RD developed the website. MEK collected PPIs from publicly available resources. CMT processed the data from AERS. All authors read and approved the final manuscript.

Acknowledgements

This work was supported in part by NIH grants P50GM071558-03, R01DK088541-01A1, RC2LM010994-01, P01DK056492-10, RC4DK090860-01, and R01GM098316.