Computational Biology and Machine Learning, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, 97 Lisburn Road, Belfast, BT9 7BL, UK

Abstract

Background

Inferring gene regulatory networks from large-scale expression data is an important problem that has received much attention in recent years. These networks have the potential to provide insights into causal molecular interactions of biological processes. Hence, from a methodological point of view, reliable estimation methods based on observational data are needed to approach this problem practically.

Results

In this paper, we introduce a novel gene regulatory network inference (GRNI) algorithm, called C3NET. We compare C3NET with four well known methods, ARACNE, CLR, MRNET and RN, conducting in-depth numerical ensemble simulations and demonstrate also for biological expression data from

Conclusions

For systems biology to succeed in the long run, it is of crucial importance to establish methods that extract large-scale gene networks from high-throughput data that reflect the underlying causal interactions among genes or gene products. Our method can contribute to this endeavor by demonstrating that an inference algorithm with a neat design not only permits a more intuitive and possibly biological interpretation of its working mechanism but can also yield superior results.

Background

The inference of large-scale causal gene regulatory interactions is important because it can contribute to a better understanding of all aspects of normal cell physiology, development and pathogenesis

So far several methods have been suggested in the above context, inferring gene regulatory networks

This is due to the application of the DPI, which can only eliminate, but not add edges to the network. Another method similar to RN is CLR (Context Likelihood of Relatedness)

The major purpose of this paper is to introduce a new inference method. The motivation for suggesting a new method is at least threefold. First, the capabilities of previously introduced methods have been only partially investigated. This results from the fact that an inference method needs to be studied in combination with data because its performance depends crucially on the characteristics of the data. However, there is no general agreement on how to simulate data so that they capture all relevant aspects of real expression data, nor are we in possession of a true regulatory network of reasonable size representing all causal interactions actually involved in a certain physiological process. Further, we do not have access to microarray data of arbitrarily large sample sizes due to economic and experimental limitations. Hence, the principal approaches currently pursued for the statistical investigation of an inference algorithm represent a compromise acknowledging the above circumstances. In order to obtain the most thorough analysis of an inference algorithm, we analyze our method with an ensemble of simulated data and with biological expression data from microarray experiments of an organism for which, at least to a certain degree, information about the underlying regulatory network is known. Second, the inference algorithms described above tend to become more and more complex. Keeping in mind that previous results may be flawed due to the serious difficulty of obtaining a balanced statistical analysis, we step in the other direction, aiming for an inference algorithm that is simpler than most other methods. This may not only allow for a better understanding of the proposed method but also reveal something about the underlying biology itself. Third, all previous methods aim, at least theoretically, to infer the entire regulatory network for a given data set. However, in practice, no method can guarantee to achieve this for a given data set, not even for simulated data when a very large number of samples is available. One reason for this shortcoming is that observational data may not capture all dynamical interrelations that would allow a reliable estimation. For this reason, we lower the bar from the beginning by not aiming to infer the entire network; instead, our method aims to infer only the strongest interactions among covariates. We call this part of a network its conservative causal core.

The basic idea of our method, which we call C3NET, is to identify a significant maximum mutual information network, the conservative causal core, in which two genes are connected only if their shared significant mutual information value is maximal, with respect to all other genes, for at least one of the two genes. Since C3NET is an information theory based method, we compare it with ARACNE

The paper is organized as follows. In the next section we introduce our method, C3NET, and describe its working mechanism. Also, we describe our simulation set-up and the expression data we use. Then we present numerical results comparing our method with ARACNE, MRNET, RN and CLR and application of C3NET to the expression data from

Methods

In this section we introduce our inference algorithm, C3NET, describe its constituent components and present an example of its working mechanism. In addition, we motivate its introduction and discuss its biological plausibility.

In the first step of C3NET we want to eliminate nonsignificant connections among gene pairs. This can be accomplished by testing the statistical significance of pair-wise mutual information (MI) values employing resampling methods, similarly to previous methods, e.g., RN or ARACNE

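Practically, the mutual information values need to be estimated from the data by using an appropriate estimator that closely approximates the theoretical population value. Technical details of this issue are discussed at the end of the section 'Simulated and expression data'. Starting from a fully connected matrix $C$ with $C_{ij} = 1$ for all genes $i$ and $j$, we test the estimated mutual information value $I_{ij}$ of each gene pair for statistical significance. Because $I_{ij} = I_{ji}$, each pair needs to be tested only once. If the null hypothesis $H_0: I_{ij} = 0$ cannot be rejected, the connection between genes $i$ and $j$ is removed by setting $C_{ij} = C_{ji} = 0$. The genes that remain connected to gene $i$ after this step form its set of significant neighbors, $N_s(i)$.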
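The significance test of the first step can be illustrated with a minimal sketch. It assumes a simple histogram (plug-in) estimator of the mutual information and a permutation test as the resampling scheme; the function names, the number of bins and the number of permutations are hypothetical choices made for this illustration and are not taken from the original study.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in (histogram) estimate of the mutual information in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # empty cells contribute nothing
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mi_permutation_pvalue(x, y, n_perm=500, bins=8, seed=0):
    """Share of shuffled data sets whose MI reaches the observed MI."""
    rng = np.random.default_rng(seed)
    observed = mutual_information(x, y, bins)
    # Shuffling y destroys any dependence on x, which samples the null
    # hypothesis of zero mutual information.
    null = np.array([mutual_information(x, rng.permutation(y), bins)
                     for _ in range(n_perm)])
    # The add-one correction keeps the estimated p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y_dep = x + 0.3 * rng.normal(size=300)   # strongly dependent on x
y_ind = rng.normal(size=300)             # generated independently of x
p_dep = mi_permutation_pvalue(x, y_dep)
p_ind = mi_permutation_pvalue(x, y_ind)
```

In this toy setting the dependent pair receives the smallest p-value the test can produce, whereas the p-value of the independent pair varies with the random draw. A gene pair would keep its connection only if its p-value falls below the chosen significance level.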

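**Algorithm 1** Principal steps of our inference algorithm C3NET.

1: $C_{ij} = 1$ for all $i, j$

2: $A_{ij} = 0$ for all $i, j$

3: estimate mutual information $I_{ij}$ for all pairs $(i, j)$

4: **repeat**

5: Set $C_{ij} = 0$ if $I_{ij}$ is not statistically significant

6: **until** all pairs $(i, j)$ are tested

7: **for all** genes $i$ **do**

8: $N_s(i) = \{j \neq i : C_{ij} = 1\}$

9: **if** $N_s(i) = \emptyset$ **then**

10: $j_c(i) = \emptyset$

11: **else**

12: $j_c(i) = \arg\max_{j \in N_s(i)} I_{ij}$

13: **endif**

14: **end for**

15: **for all** genes $i$ **do**

16: **if** $j_c(i) \neq \emptyset$ **then**

17: $A_{i j_c(i)} = A_{j_c(i) i} = 1$

18: **endif**

19: **end for**

20: **return** adjacency matrix $A$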

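In the second step, each gene $i$ is connected to the significant neighbor with which it shares the maximum mutual information value. This connection is identified by

$$j_c(i) = \arg\max_{j \in N_s(i)} I_{ij},$$

where $N_s(i)$ is the set of significant neighbors of gene $i$ and $I_{ij}$ is the mutual information between genes $i$ and $j$.

In the case $N_s(i) = \emptyset$, that is, $C_{ij} = 0$ for all $j$, no neighbor $j_c(i)$ exists and gene $i$ adds no edge to the network.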
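The two steps described above can be sketched in a few lines. This is a minimal illustration rather than the reference implementation: it assumes that a symmetric matrix of estimated mutual information values is already available and that the significance test has been reduced to a single threshold. The function name `c3net_core` and the toy values are hypothetical.

```python
import numpy as np

def c3net_core(mi, threshold):
    """Return a symmetric 0/1 adjacency matrix from a symmetric MI matrix."""
    mi = np.asarray(mi, dtype=float)
    n = mi.shape[0]
    # Step 1: keep only pairs whose MI exceeds the significance threshold.
    significant = mi > threshold
    np.fill_diagonal(significant, False)        # no self-connections
    masked = np.where(significant, mi, -np.inf)
    adjacency = np.zeros((n, n), dtype=int)
    # Step 2: each gene adds at most one edge, to its strongest neighbor.
    for i in range(n):
        if not significant[i].any():
            continue                            # no significant neighbor
        j = int(np.argmax(masked[i]))
        adjacency[i, j] = adjacency[j, i] = 1   # MI is symmetric: undirected
    return adjacency

# Toy MI matrix for four genes (hypothetical values).
mi = np.array([[0.0, 0.9, 0.5, 0.1],
               [0.9, 0.0, 0.6, 0.1],
               [0.5, 0.6, 0.0, 0.1],
               [0.1, 0.1, 0.1, 0.0]])
adjacency = c3net_core(mi, threshold=0.3)
```

For these toy values genes 0 and 1 select each other, gene 2 selects gene 1, and gene 3 has no significant neighbor, so the result contains the two edges 0-1 and 1-2. The significant pair 0-2 is not included because it is the maximum for neither of its genes.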

**Visualization of the principal working steps of C3NET and the fact that the final network can have an arbitrary structure**.

For each of the four genes we determine its connection with the neighboring gene that has maximum mutual information and is also statistically significant, resulting in one selected neighbor per gene and a network containing exactly the edges added by each node. Since MI does not provide directional information, due to its symmetry in its arguments, the resulting adjacency matrix is symmetric and the inferred network is undirected.

From Fig. _{j}
_{c}
_{c}

In addition to the statistical justification sketched above, the working mechanism of C3NET also has a very appealing interpretation from a biological point of view. Genes that are expressed in a cell have to interact with at least one other gene or gene product, because otherwise they could be knocked out without noteworthy effect on the cell's physiology. That means active genes must have at least one connection with other genes in order to contribute to the biological function of the cell. This interaction is targeted by C3NET. On the other hand, if a gene is not expressed in a specific cell type and the measurements merely reflect noise, the significance test applied in the first step of C3NET prevents the assignment of obviously false positive connections, because in such a case the mutual information values are not statistically significant.

In order to clarify differences between C3NET and other algorithms, we want to discuss some of these. MRNET is based on the

A characteristic of C3NET that distinguishes it from all other methods is that it can infer at most as many edges as there are genes. The reason for this is that the maximization step allows each gene to add at most one edge to another gene. All other methods are capable of inferring, potentially, more edges than genes. Put differently, C3NET does not aim at inferring the entire network underlying gene regulation; instead, it aims at its core structure and, hence, is more conservative than all other methods. The purpose of this paper is to introduce C3NET and to investigate the capabilities of our method by providing a systematic comparison with other inference methods.

Complexity

The computational complexity of all methods used in this paper, except for C3NET, has been discussed previously. In the following, $n$ denotes the number of genes. The complexity of RN and CLR is $O(n^2)$ since only pairwise interactions are evaluated. The complexity of ARACNE is $O(n^3)$ because all triplets of genes need to be evaluated for the data processing inequality. The complexity of MRNET is between $O(n^2)$ and $O(n^3)$ because of the feature selection step. The complexity of C3NET is $O(n^2)$ because only matrices of size $n \times n$ need to be processed.

Simulated and expression data

In order to analyze our proposed inference algorithm by comparing it with the performance of other methods we use simulated as well as expression data from microarray experiments. Due to the fact that the knowledge about biological regulatory networks is still far from being complete, we use simulated data because for these data we know the underlying (true) regulatory network exactly. This allows a detailed and accurate analysis. We complement our simulation study with biological expression data to demonstrate that the assumptions made for our simulations are realistic enough to extrapolate these results to biological data sets.

The error measure we use to assess the performance of an inference algorithm is the F-score, $F = 2pr/(p + r)$, where $p$ denotes the precision and $r$ the recall. The threshold $I_0$ for the mutual information values is obtained by maximizing the F-score. The two biological networks we use in our simulation study represent subnetworks of the transcriptional regulatory network (TRN) of
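As an illustration of the error measure, the following sketch computes the F-score of an inferred undirected network against a known reference network and selects, from a list of candidates, the mutual information threshold that maximizes it. The function names and toy values are hypothetical, and for simplicity the threshold is applied directly to the mutual information matrix.

```python
import numpy as np

def f_score(true_adj, inferred_adj):
    """F-score of an inferred undirected network against a reference."""
    true_adj = np.asarray(true_adj)
    inferred_adj = np.asarray(inferred_adj)
    iu = np.triu_indices_from(true_adj, k=1)   # count each gene pair once
    truth = true_adj[iu].astype(bool)
    pred = inferred_adj[iu].astype(bool)
    tp = int(np.sum(truth & pred))             # true positives
    fp = int(np.sum(~truth & pred))            # false positives
    fn = int(np.sum(truth & ~pred))            # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(mi, true_adj, candidates):
    """Candidate threshold with the highest F-score, and that F-score."""
    mi = np.asarray(mi)
    scores = [f_score(true_adj, mi > t) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy reference network with edges 0-1 and 1-2 (hypothetical values).
true_adj = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 0]])
mi = np.array([[0.0, 0.9, 0.5, 0.1],
               [0.9, 0.0, 0.6, 0.1],
               [0.5, 0.6, 0.0, 0.1],
               [0.1, 0.1, 0.1, 0.0]])
best_t, best_f = best_threshold(mi, true_adj, [0.05, 0.3, 0.55, 0.8])
```

Selecting the threshold in this way requires the true network and is therefore only possible for simulated data, where the underlying network is known exactly.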

For each network _{k}

Here

The biological expression data we use in our study is a data set of

Results and Discussion

We start our numerical analysis of C3NET by using simulated ensemble data. After that we investigate C3NET with expression data from

Simulated data

We compare the performance of C3NET with four of the most prominent inference algorithms, ARACNE

**Boxplots of F-scores for C3NET (orange), ARACNE (gray), MRNET (blue), RN (red) and CLR (green)**. Dark color (left boxplot) corresponds to sample size 50, light color (right boxplot) to sample size 200. A subnetwork of Yeast GRN is used for the simulations. Ensemble size is

Summary of F-scores (max, min, mean and median) for C3NET, ARACNE and MRNET obtained from our simulations.

| Network | Statistic | **C3NET** | **ARACNE** | **MRNET** |
|---|---|---|---|---|
| Yeast$_{200}$ | max | 0.5478 | 0.4919 | 0.4927 |
| | min | 0.336 | 0.2058 | 0.336 |
| | median | 0.4628 | 0.3836 | 0.4455 |
| | mean | 0.4628 | 0.3795 | 0.4410 |
| Yeast$_{50}$ | max | 0.4782 | 0.3983 | 0.4585 |
| | min | 0.2844 | 0.1854 | 0.2879 |
| | median | 0.3859 | 0.3166 | 0.3698 |
| | mean | 0.3848 | 0.3161 | 0.3683 |
| E. coli | max | 0.6046 | 0.4973 | 0.5608 |
| | min | 0.4131 | 0.1866 | 0.3512 |
| | median | 0.5308 | 0.3803 | 0.500 |
| | mean | 0.5269 | 0.3758 | 0.4948 |

The sample size is 1000 for

The boxplots in Fig.

**Boxplots of the average mutual information values or z-scores per significant edge for C3NET (orange), ARACNE (gray), MRNET (blue), RN (red) and CLR (green)**. Dark color (left boxplot) corresponds to sample size 50, light color (right boxplot) to sample size 200. A subnetwork of Yeast GRN is used for the simulations. Ensemble size is

In order to study the influence of the underlying network structure we repeat our analysis, this time, using a subnetwork of

**Boxplots for the F-scores for C3NET (orange), ARACNE (gray) and MRNET (blue)**. A subnetwork of the TRN of

In Fig.

**Subnetwork of yeast consisting of 100 genes, sample size is 200**. Edge colors are obtained from simulations of 300 data sets. The color of each edge reflects its mean TPR. Specifically, for black edges,

**Subnetwork of E. coli consisting of 100 genes, sample size is 1000**. Edge colors are obtained in a similar way as for yeast. Ensemble size is 300.

Expression data from

Next, we apply C3NET to expression data from

Following a similar approach for CLR, as described in

For the significance test of the mutual information values we obtain a threshold value of 0.414. Application of C3NET results in a total of 99 interactions of which

Fig.

**Inferred E. coli network by C3NET**. Pink genes correspond to transcription factors and gray genes to regulated genes. Black edges indicate true positive results whereas red edges correspond to false positives.

In this network, the largest hub inferred by C3NET is fliA, an RNA polymerase. FliA is a minor sigma factor responsible for the initiation of transcription and involved in motility. The second largest hub in the inferred network is Lrp. The leucine-responsive protein (Lrp) is a transcription regulator widely distributed throughout archaea and eubacteria

Table

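Interactions predicted by C3NET, shown as red edges in Fig. 7, declared as false positives according to the reference network

| **regulator gene** | **regulated gene** | **literature** |
|---|---|---|
| *confirmed* | | |
| gadE | gadB | |
| gadE | hdeD | |
| gadE | yhiD | |
| lrp | pntA | |
| fliA | tsr | |
| flhD | flhC | |
| *predicted interactions* | | |
| dnaA | amiB | |
| dnaA | rnpA | |
| fliA | flgA | |
| lrp | artJ | |
| lrp | aroP | indirect interaction via TyrR |
| lexA | araB | |
| lexA | araD | |
| lexA | araE | |
| lrp | pntB | |
| tdcR | yiaM | |
| tdcR | bglG | |
| csgD | trpD | |
| csgD | trpC | |
| zur | glmU | |
| purR | aroH | |
| fnr | dinF | |
| cbl | treB | |
| gadE | slp | |
| gadE | dps | |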

Among these 25 interactions, 6 are supported by the literature as being in fact true positives.

GadE is an essential transcriptional activator of the glutamate decarboxylase (GAD) system which is reported to be the most efficient acid resistance (AR) mechanism in

In addition to these five transcription regulations we find support for a different type of interaction, namely a protein-protein binding. In

Taking these newly confirmed interactions into account, the precision of C3NET increases to 0.81. Finally, we want to report that
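As a rough consistency check, the stated precision can be reproduced from the counts given in the text. The calculation assumes that the 25 interactions declared as false positives are the only ones among the 99 inferred interactions that are absent from the reference network, and that the 6 interactions supported by the literature are recounted as true positives:

$$
\text{precision}_{\text{reference}} = \frac{99 - 25}{99} = \frac{74}{99} \approx 0.75,
\qquad
\text{precision}_{\text{updated}} = \frac{99 - 25 + 6}{99} = \frac{80}{99} \approx 0.81 .
$$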

Conclusions

In this paper we introduced a novel unsupervised GRNI method, called C3NET, in order to infer causal regulatory networks. We investigated the performance of C3NET by conducting in-depth simulations using 900 synthetic data sets in combination with two different subnetworks from yeast and

The conservative approach of C3NET, allowing each gene to contribute (add) at most one edge to the inferred network, appears to exploit the estimates of mutual information values significantly better than previous methods. The simplicity of our approach demonstrates that it is not always favorable to increase the complexity of an inference procedure in order to increase its performance. More important is a concise design that takes the nature and constraints of the underlying problem into account. Also, the investigation of an inference method using simulated ensemble data is strongly advised to obtain a clear assessment of such a method, because the results obtained for individual data sets may be atypical. In contrast, ensemble data reveal the entire spectrum of behavior an inference method can exhibit. Hence, an important result from our study is the insight that a neatly structured algorithm can perform better than other methods that are more complex. This is favorable not only because it allows a better understanding of the inference procedure itself but also because it usually leads to more robust results, especially when the sample size is small.

Although our method was developed for the inference of gene regulatory networks from expression data, it may also find application in other fields that aim at inferring causal relations among covariates, because the requirements for the data are moderate. For example, C3NET could be applied to the inference of brain connectivity networks

Authors' contributions

GA and FES designed the method, performed the analysis and interpreted the results. FES conceived and coordinated the study. GA and FES wrote the manuscript. All authors read and approved the final manuscript.

Appendix

For our numerical simulations we used R

Acknowledgements

We would like to thank Shailesh Tripiati for help in visualizing the networks and Dirk Husmeier, Alexander Thompson and David Timson for fruitful discussions on various aspects of the paper. This project is supported by the Department for Employment and Learning through its "Strengthening the all-Island Research Base" initiative.