Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, USA

Biostatistics, School of Public Health, University of California, Los Angeles, California, USA

Abstract

Background

Co-expression measures are often used to define networks among genes. Mutual information (MI) is often used as a generalized correlation measure. It is not clear how much MI adds beyond standard (robust) correlation measures or regression model based association measures. Further, it is important to assess what transformations of these and other co-expression measures lead to biologically meaningful modules (clusters of genes).

Results

We provide a comprehensive comparison between mutual information and several correlation measures in 8 empirical data sets and in simulations. We also study different approaches for transforming an adjacency matrix, e.g. using the topological overlap measure. Overall, we confirm close relationships between MI and correlation in all data sets which reflects the fact that most gene pairs satisfy linear or monotonic relationships. We discuss rare situations when the two measures disagree. We also compare correlation and MI based approaches when it comes to defining co-expression network modules. We show that a robust measure of correlation (the biweight midcorrelation transformed via the topological overlap transformation) leads to modules that are superior to MI based modules and maximal information coefficient (MIC) based modules in terms of gene ontology enrichment. We present a function that relates correlation to mutual information which can be used to approximate the mutual information from the corresponding correlation coefficient. We propose the use of polynomial or spline regression models as an alternative to MI for capturing non-linear relationships between quantitative variables.

Conclusion

The biweight midcorrelation outperforms MI in terms of elucidating gene pairwise relationships. Coupled with the topological overlap matrix transformation, it often leads to more significantly enriched co-expression modules. Spline and polynomial networks form attractive alternatives to MI in case of non-linear relationships. Our results indicate that MI networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data.

Background

Co-expression methods are widely used for analyzing gene expression data and other high dimensional “omics” data. Most co-expression measures fall into one of two categories: correlation coefficients or mutual information measures. MI measures have attractive information-theoretic interpretations and can be used to measure non-linear associations. Although MI is well defined for discrete or categorical variables, it is non-trivial to estimate the mutual information between quantitative variables, and corresponding permutation tests can be computationally intensive. In contrast, the correlation coefficient and other model based association measures are ideally suited for relating quantitative variables. Model based association measures have obvious statistical advantages including ease of calculation, straightforward statistical testing procedures, and the ability to include additional covariates in the analysis. Researchers trained in statistics often measure gene co-expression by the correlation coefficient. Computer scientists, trained in information theory, tend to use a mutual information (MI) based measure. Thus far, the majority of published articles use the correlation coefficient as the co-expression measure.

Several articles have used simulations and real data to compare the two co-expression measures when clustering gene expression data. Allen et al. have found that correlation based network inference method WGCNA

While previous comparisons involved the Pearson correlation, we provide a more comprehensive comparison that considers i) different types of correlation coefficients, e.g. the biweight midcorrelation (bicor), ii) different approaches for constructing MI based and correlation based networks, iii) different ways of transforming a network adjacency matrix (e.g. the topological overlap reviewed below

This article presents the following results. First, what is probably the most comprehensive empirical comparison to date is used to evaluate which pairwise association measure leads to the biologically most meaningful network modules (clusters) in terms of functional enrichment with gene ontology (GO) terms. Second, polynomial regression and spline regression methods are evaluated for defining non-linear association measures between gene pairs. Third, simulation studies are used to validate a functional relationship (cor-MI function) between correlation and mutual information when the two variables satisfy a linear relationship. Our comprehensive empirical studies illustrate that the cor-MI function can be used to approximate the relationship between mutual information and correlation in real data sets, which indicates that in many situations the MI measure is not worth the trouble. Gene pairs where the two association measures disagree are investigated to determine whether technical artifacts lead to the incongruence.

Overall, we find that the bicor based measure is an attractive co-expression measure, particularly when limited sample size does not permit the detection of non-linear relationships. Our theoretical results, simulations, and 8 different gene expression data sets show that MI is often inferior to correlation based approaches in terms of elucidating gene pairwise relationships and identifying co-expression modules. A signed correlation network transformed via the topological overlap matrix transformation often leads to the most significant functional enrichment of modules. Polynomial and spline regression model based statistical approaches are promising alternatives to MI for measuring non-linear relationships.

Association measure and network adjacency

An association measure is used to estimate the relationships between two random variables. For example, correlation is a commonly used association measure. There are different types of correlations. While the Pearson correlation, which measures the extent of a linear relationship, is the most widely used correlation measure, the following two more robust correlation measures are often used. First, the Spearman correlation is based on ranks, and measures the extent of a monotonic relationship between

Association measures can be transformed into network adjacencies. For $n$ variables $x_1,\ldots,x_n$, an adjacency matrix $A = (A_{ij})$ is an

An association network is defined as a network whose nodes correspond to random variables and whose adjacency matrix is based on the association measure between pairs of variables. For $n$ variables $(x_1,\ldots,x_n)$, we start by defining an association measure that is applied to all $n^2$ variable pairs, resulting in an

Then, one needs to specify how the association matrix

As for step 2, if

where the power **
x
**

Additional details of correlation based adjacencies (unweighted or weighted, unsigned or signed) are described in Materials and Methods.
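To make the step from a correlation to a weighted adjacency concrete, the following minimal Python sketch uses the power transformations that are standard in weighted correlation network analysis: $|cor|^\beta$ for an unsigned network and $(0.5 + 0.5\,cor)^\beta$ for a signed network. Whether these forms coincide exactly with the equations in Materials and Methods is an assumption, as are the function names and the default powers (6 and 12), which are illustrative choices only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def unsigned_adjacency(cor, beta=6):
    """Unsigned weighted adjacency: strong negative correlations count as connections."""
    return abs(cor) ** beta

def signed_adjacency(cor, beta=12):
    """Signed weighted adjacency: negative correlations map to values near 0."""
    return (0.5 + 0.5 * cor) ** beta

# Hypothetical expression profiles of two strongly anti-correlated genes.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_neg = [5.1, 3.9, 3.2, 1.8, 1.0]

r = pearson(x, y_neg)
a_unsigned = unsigned_adjacency(r)  # close to 1
a_signed = signed_adjacency(r)      # close to 0
```

The example illustrates the practical difference between the two network types: the anti-correlated pair is strongly connected in the unsigned network but essentially unconnected in the signed one.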

Network adjacency based on co-expression measures

When dealing with gene expression data, $x_i$ denotes the expression levels of the i-th gene (or probe) across multiple samples. In this article, we assume that the components of $x_i$ correspond to random independent samples. Co-expression measures can be used to define co-expression networks in which the nodes correspond to genes. The adjacencies $A_{ij}$ encode the similarity between the expression profiles of genes $i$ and $j$.

Mutual information networks based on categorical variables

Assume two random samples _{1},…,_{
R
}. The mutual information (MI) is defined as:

where _{
r
}) is the frequency of level **simple relationship exists between the mutual information (Eq. 8) and the likelihood ratio test statistic** (described in Additional file

**Detailed methods descriptions.** In this document, we provide detailed information on entropy, mutual information, likelihood ratio test statistics and p-value calculation of correlation coefficients.


This relationship has many applications. First, it can be used to prove that the mutual information takes on non-negative values. Second, it can be used to calculate an asymptotic p-value for the mutual information. Third, it points to a way for defining a mutual information measure that adjusts for additional conditioning variables _{1},_{2},… Specifically, one can use a multivariate
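The relationship between MI and the likelihood ratio statistic can be checked numerically. The sketch below relies on the textbook identity for the G-test of independence in a contingency table: with plug-in frequencies and natural logarithms, $G = 2\,m\,MI$, where $m$ is the number of observations. This is the standard identity and is assumed, not confirmed, to be the form given in the additional file; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(dx, dy):
    """Plug-in mutual information (natural log) of two categorical sequences."""
    m = len(dx)
    joint, px, py = Counter(zip(dx, dy)), Counter(dx), Counter(dy)
    # p_ab * log(p_ab / (p_a * p_b)), written in terms of counts
    return sum((c / m) * math.log(c * m / (px[a] * py[b]))
               for (a, b), c in joint.items())

def g_statistic(dx, dy):
    """Likelihood ratio (G) statistic: 2 * sum(observed * log(observed / expected))."""
    m = len(dx)
    joint, px, py = Counter(zip(dx, dy)), Counter(dx), Counter(dy)
    return 2.0 * sum(c * math.log(c / (px[a] * py[b] / m))
                     for (a, b), c in joint.items())

# Toy categorical samples (illustrative only).
dx = ["a", "a", "b", "b", "a", "b", "a", "b"]
dy = ["u", "u", "v", "v", "u", "v", "v", "u"]

mi = mutual_information(dx, dy)
g = g_statistic(dx, dy)
m = len(dx)
# Under independence, g is asymptotically chi-square distributed with
# (levels_x - 1) * (levels_y - 1) degrees of freedom.
```

Because $G$ is a sum of the form used in a likelihood ratio test, non-negativity of MI and an asymptotic p-value both follow from this identity.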

As discussed below, numerous ways have been suggested for constructing an adjacency matrix based on MI. Here we describe an approach that results in a weighted adjacency matrix. Consider $n$ variables $x_1, x_2, \ldots, x_n$. Their mutual information matrix $MI(x_i, x_j)$ is a similarity matrix

where

Using Eq. 6 with $UpperBound_{ij} = (H(x_i) + H(x_j))/2$ (Eq. 12), where $H$ denotes the entropy, results in the

A transformation of $A^{MI,SymmetricUncertainty}$ leads to the

One can easily prove that _{i} and _{j} will be small if ^{UniversalVersion1} satisfies the properties of a distance including the triangle inequality.

Another adjacency matrix is based on the upper bound implied by inequality 13. We define the

The name reflects the fact that the dissimilarity $1 - A^{MI,UniversalVersion2}$ is also a universal distance measure. Although $A^{MI,UniversalVersion1}$ and $A^{MI,UniversalVersion2}$ are in general different, we find very high Spearman correlations (

Many alternative approaches exist for defining MI based networks, e.g. ARACNE

Mutual information networks based on discretized numeric variables

In its original inception, the mutual information measure was only defined for discrete or categorical variables, see e.g. _{
l
} falls:

The number of bins,

In our subsequent studies, we calculate an MI-based adjacency matrix using the following three steps. First, numeric vectors of gene expression profiles are discretized according to the equal-width discretization method with the default number of bins. Second, the mutual information $MI_{ij}$ is calculated between the discretized vectors based on Eq. 10 and the Miller Madow entropy estimation method (detailed in the Additional file). Third, the MI matrix is transformed into one of the adjacencies $A^{MI,SymmetricUncertainty}$ (Eq. 14), $A^{MI,UniversalVersion1}$ (Eq. 15), $A^{MI,UniversalVersion2}$ (Eq. 16).
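The three steps can be sketched in a few lines of Python. The sketch assumes that the default number of bins is the rounded square root of the sample size, that MI is computed as $H(x) + H(y) - H(x, y)$, and that the Miller Madow estimator adds $(K-1)/(2m)$ to the plug-in entropy, where $K$ is the number of non-empty bins; the last is the standard definition, the first two are assumptions. The three normalizations returned are the symmetric uncertainty and two Kraskov-style ratios (MI over the larger marginal entropy, MI over the joint entropy); which ratio corresponds to which "universal" version in Eq. 15 and Eq. 16 is deliberately not asserted here. All function names are hypothetical.

```python
import math
from collections import Counter

def discretize_equal_width(x, n_bins=None):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    m = len(x)
    if n_bins is None:
        n_bins = max(1, int(round(math.sqrt(m))))  # assumed default
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0] * m
    width = (hi - lo) / n_bins
    # the maximum falls on the right edge and is clipped into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in x]

def entropy_miller_madow(labels):
    """Plug-in entropy (natural log) plus the Miller Madow bias correction."""
    m = len(labels)
    counts = Counter(labels)
    h_ml = -sum((c / m) * math.log(c / m) for c in counts.values())
    return h_ml + (len(counts) - 1) / (2.0 * m)

def mi_adjacencies(x, y, n_bins=None):
    """MI between discretized vectors and three normalized versions of it.

    Assumes that neither vector is constant (otherwise an entropy is zero).
    """
    dx = discretize_equal_width(x, n_bins)
    dy = discretize_equal_width(y, n_bins)
    hx, hy = entropy_miller_madow(dx), entropy_miller_madow(dy)
    hxy = entropy_miller_madow(list(zip(dx, dy)))
    mi = hx + hy - hxy
    return {
        "mi": mi,
        "symmetric_uncertainty": 2.0 * mi / (hx + hy),
        "mi_over_max_entropy": mi / max(hx, hy),
        "mi_over_joint_entropy": mi / hxy,
    }

# A perfectly linear pair: all three normalized measures equal 1.
x = [i / 10 for i in range(100)]
y = [2 * v + 1 for v in x]
adj = mi_adjacencies(x, y)
```

For a full adjacency matrix one would apply `mi_adjacencies` to every pair of gene expression profiles; the sketch handles a single pair to keep the logic visible.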

Results

An equation relating

As described previously, the mutual information based adjacency $A^{MI,UniversalVersion2}$ can be accurately approximated as follows:

where the “cor-MI” function

depends on the following two parameters

In general, one can easily show that $F^{cor-MI}$(

Eq. 18 was stated in terms of the Pearson correlation, but it also applies for bicor as can be seen from our simulation studies.

Simulations where

Here we use simulation studies to illustrate that $F^{cor-MI}$ (Eq. 19) can be used for predicting or approximating $A^{MI,UniversalVersion2}$ from the corresponding correlation coefficients (Eq. 18). Specifically, we simulate 2000 pairs of sample vectors $A^{MI,UniversalVersion2}$ (Eq. 16) on the basis of $A^{MI,UniversalVersion2}$ (Figure $A^{MI,UniversalVersion2}$ is practically indistinguishable from $A^{MI,SymmetricUncertainty}$. This suggests that the cor-MI function can also be used to predict $A^{MI,SymmetricUncertainty}$ on the basis of the correlation measure. Figure ($A^{MI,UniversalVersion1}$ and $A^{MI,UniversalVersion2}$ are different from each other but satisfy a monotonically increasing relationship.
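A small simulation in the same spirit can be sketched as follows. It draws pairs of vectors from a bivariate normal distribution with a chosen correlation, discretizes them with equal-width bins, and computes the plug-in mutual information. The sample size (500), the number of bins (8), the grid of correlations and the seed are illustrative assumptions and not the settings used in the article, and the sketch does not implement the cor-MI function itself.

```python
import math
import random
from collections import Counter

def equal_width_bins(x, n_bins):
    """Equal-width discretization over the observed range."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in x]

def plugin_mi(dx, dy):
    """Plug-in mutual information (natural log) of two discretized vectors."""
    m = len(dx)
    joint, px, py = Counter(zip(dx, dy)), Counter(dx), Counter(dy)
    return sum((c / m) * math.log(c * m / (px[a] * py[b]))
               for (a, b), c in joint.items())

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_pair(rho, m, rng):
    """Draw m observations from a standard bivariate normal with correlation rho."""
    x = [rng.gauss(0.0, 1.0) for _ in range(m)]
    y = [rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for a in x]
    return x, y

rng = random.Random(1)  # fixed seed for reproducibility
results = []
for rho in (0.0, 0.3, 0.6, 0.9):
    x, y = simulate_pair(rho, m=500, rng=rng)
    mi = plugin_mi(equal_width_bins(x, 8), equal_width_bins(y, 8))
    results.append((rho, abs(pearson(x, y)), mi))
```

Plotting the third column of `results` against the second for many simulated pairs gives the kind of monotonic cloud that a cor-MI type function is meant to describe.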


**Relating mutual information based adjacencies to the Pearson correlation and biweight midcorrelation in simulation.** Each point corresponds to a pair of numeric vectors $A^{MI,UniversalVersion1}$, $A^{MI,UniversalVersion2}$, Pearson correlation and biweight midcorrelation, respectively. **(A)** MI-based adjacency $A^{MI,UniversalVersion2}$ versus absolute Pearson correlation. Spearman correlation of the two measures and the corresponding p-value are shown at the top, implying a strong monotonic relationship. The red line shows the predicted $A^{MI,UniversalVersion2}$ according to $F^{cor-MI}$ (Eq. 18). Note that the prediction function is highly accurate in simulation. **(B)** Observed $A^{MI,UniversalVersion2}$ versus its predicted value. The straight line has slope 1 and intercept 0. **(C)** Observed Pearson correlation (x-axis) and the corresponding bicor values (y-axis). The straight line has slope 1 and intercept 0. These 2 measurements are practically indistinguishable when x and y are normally distributed. **(D)** $A^{MI,UniversalVersion2}$ versus bicor. Spearman correlation and p-value of the 2 measurements are presented at the top, and predicted $A^{MI,UniversalVersion2}$ values are shown as the red line. **(E)** $A^{MI,UniversalVersion2}$ versus $A^{MI,SymmetricUncertainty}$. **(F)** $A^{MI,UniversalVersion2}$ versus $A^{MI,UniversalVersion1}$.

Empirical studies involving 8 gene expression data sets

Our simulation results show that both the robust biweight midcorrelation and the Pearson correlation can be used as input of $F^{cor-MI}$ for predicting $A^{MI,UniversalVersion2}$ when the underlying variables satisfy pairwise bivariate normal relationships. However, it is not clear whether $F^{cor-MI}$ can also be used to relate correlation and mutual information in real data applications. In this section, we report 8 empirical studies of the relationship between MI and the robust correlation measure bicor. To focus the analysis on genes that are likely to reflect biological variation and to reduce computational burden, we selected the 3000 genes with highest variance across the microarray samples for each data set. Descriptions of the data sets can be found in Materials and Methods.

We first calculate bicor and $A^{MI,UniversalVersion2}$ for all gene pairs in each data set. The two co-expression measures show strong monotonic relationships in most data sets (Figure $A^{MI,UniversalVersion2}$ from bicor based on $F^{cor-MI}$ (Eq. 18). Our predictions are closely related to true $A^{MI,UniversalVersion2}$ values (Figure $A^{MI,UniversalVersion2}$ (Spearman correlation 0 $A^{MI,UniversalVersion2}$ prediction (Pearson correlation 0 $F^{cor-MI}$. In summary, our examples indicate that for most gene pairs, $A^{MI,UniversalVersion2}$ (Eq. 16) is a monotonic function (cor-MI) of the absolute value of bicor. This finding likely reflects the fact that the vast majority of gene pairs satisfy straight line relationships. This approximation improves with increasing sample size


**Comparison of correlation and mutual information based co-expression measures in 8 empirical data sets.** Absolute value of bicor versus $A^{MI,UniversalVersion2}$ for all probe pairs in each data set. The Spearman correlation and corresponding p-value between the two measures are shown at the top. The two measures show different levels of monotonic relationships across data sets. The red curve predicts $A^{MI,UniversalVersion2}$ from bicor based on Eq. 18. The blue circle highlights the probe pair with the highest $A^{MI,UniversalVersion2}$ z-score among those with insignificant bicor z-scores (less than 1 $A^{MI,UniversalVersion2}$ z-scores (less than 1


**Comparison of predicted and observed $A^{MI,UniversalVersion2}$ in 8 empirical data sets.** In all data sets, the prediction from bicor based on Eq. 18 and the observed $A^{MI,UniversalVersion2}$ are highly correlated (the Pearson correlation and corresponding p-value are shown at the top). The line y=x is added. Blue and red circles have the same meaning as in Figure

Although $F^{cor-MI}$ reveals a close relationship between bicor and $A^{MI,UniversalVersion2}$ for most gene pairs, there are cases where the two association measures strongly disagree. In the following, we present scatter plots to visualize the relationships between pairs of genes where MI found a significant relationship while bicor did not and vice versa. To facilitate a comparison between bicor and MI, we standardized each association measure across pairs, which resulted in the Z scores denoted by _{ij} was large but _{ij} was low and vice versa. The resulting pairs correspond to the blue and red circles in Figures $A^{MI,UniversalVersion2}$ but insignificant bicor values. Note that the resulting dependencies seem haphazard and may not reflect real biological dependencies. For example, the gene pair in the brain cancer data set exhibits no clear relationship, as correctly implied by bicor, while the significant MI value is driven by an array outlier with extremely high expression for both genes. In the SAFHS data, the gene pair exhibits an unusual pattern that is more likely to be the result of batch effects than of biological signals. The mouse liver data set displays a pairwise pattern that is neither commonly seen nor easily explained. The ND data set shows no obvious patterns at all, making mutual information less trustworthy. On the contrary, gene pairs with significant value of Z.bicor but insignificant $A^{MI,UniversalVersion2}$.


**Gene expression of example probe pairs for which the correlation and mutual information based measures disagree.** **(A)** Gene expression of probe pairs highlighted by blue circles in Figure **(B)** Gene expression of probe pairs highlighted by red circles in Figure $A^{MI,UniversalVersion2}$ values and z-scores of the latter two measures are shown at the top. Mutual information is susceptible to outliers, sometimes detects unusual patterns that are hard to explain, and often misses linear relations that are captured by bicor.

In summary, bicor usually detects linear relationships between gene pairs accurately while mutual information is susceptible to outliers, and sometimes identifies pairs that exhibit patterns unlikely to be of biological origin or that exhibit no clear dependency at all. We note that MI results tend to be more meaningful when dealing with a large number of observations (say

**Empirical analysis using large number of genes in the mouse adipose and ND data sets.** Page one is an empirical analysis using all 23568 genes without restricting to 3000 genes for the mouse adipose data set. (A) Absolute value of bicor versus $A^{MI,UniversalVersion2}$. One million randomly sampled gene pairs are plotted to reduce computational burden. The two measures show a good monotonic relationship. The red curve predicts $A^{MI,UniversalVersion2}$ from bicor. The blue circle highlights the probe pair with the highest $A^{MI,UniversalVersion2}$ z-score among those with insignificant bicor z-scores (less than 1 $A^{MI,UniversalVersion2}$ z-scores (less than 1 $A^{MI,UniversalVersion2}$ are highly correlated. As in (A), one million randomly sampled gene pairs are plotted. The line y=x is added. (C) Gene expression of probe pairs highlighted by blue circles. (D) Gene expression of probe pairs highlighted by red circles.

Page two is the same analysis for ND data set using 10000 randomly selected genes rather than 3000 genes with highest variance.


Gene ontology enrichment analysis of co-expression modules defined by different networks

Gene co-expression networks typically exhibit modular structure in the sense that genes can be grouped into modules (clusters) comprised of highly interconnected genes (i.e., within-module adjacencies are high). The network modules often have a biological interpretation in the sense that the modules are highly enriched in genes with a common functional annotation (gene ontology categories, cell type markers, etc)

In order to provide an unbiased comparison, we use the same clustering algorithm for module assignment for all networks. Toward this end, we use a module detection approach that has been used in hundreds of publications: modules are defined as branches of the hierarchical tree that results from using 1−


**Module identification based on various network inference methods in simulation with non-linear gene-gene relationships.** The data set is composed of 200 genes across 200 samples. 3 true modules are designed. Two of them, labeled with colors turquoise and blue, contain linear and non-linear (quadratic) gene-gene relationships. For each adjacency, the clustering tree and module colors are shown. True simulated module assignment is shown by the first color band underneath each tree. On top of each panel is the Rand index between inferred and simulated module assignments.

The 10 different adjacencies considered here are described in the last 2 columns of Table $A^{MI,UniversalVersion2}$ with those resulting from 3 bicor based networks: unsigned adjacency (unsignedA, Eq. 29), signed adjacency (signedA, Eq. 28) and Topological Overlap Matrix (TOM, Eq. 30) based on signed adjacency. GO enrichment p-values of modules in the 8 real data applications are summarized as barplots in Figure $A^{MI,UniversalVersion2}$. Note that the signed correlation network coupled with the topological overlap transformation exhibits the most significant GO enrichment p-values in all data sets, and the difference is statistically significant (

**Comparison of MIC and correlation based co-expression measures.** Comparison of MIC and correlation in our empirical gene expression data sets except SAFHS. This is an extension of Figure



**Gene ontology enrichment analysis comparing $A^{MI,UniversalVersion2}$ with bicor based adjacencies in 8 empirical data sets.** The 5 best GO enrichment p-values from all modules identified using each adjacency are log transformed, pooled together and shown as barplots. Error bars stand for 95% confidence intervals. On top of each panel is a p-value based on a multi-group comparison test. TOM outperforms the others in all 8 data sets.


**Gene ontology enrichment analysis comparing TOM with MI based adjacencies in 8 empirical data sets.** 5 best GO enrichment p-values from all modules identified using each adjacency are log transformed, pooled together and shown as barplots. Error bars stand for 95% confidence intervals. On top of each panel is a p-value based on multi-group comparison test. TOM outperforms the others in 5 data sets. ARACNE(

| Network type | Used here | Examples | Variable types | Ease of estimation | Utility: GRN | Utility: Reduce | Utility: Direct | Utility: Time | Utility: Nonlin. | Utility: Sign | Adjacencies discussed in this article | Used in GO enrichment analysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Correlation network** | Yes | WGCNA | Numeric | Easy | Yes | Yes | No | Maybe | No | Yes | unsignedA | Yes |
| | | | | | | | | | | | signedA | Yes |
| | | | | | | | | | | | TOM | Yes |
| **Polynomial or spline regression network** | Yes | WGCNA | Numeric | Moderate | Yes | Yes | No | Maybe | Yes | No | $R^2$ (polynomial) | No |
| | | | | | | | | | | | $R^2$ (spline) | No |
| **Mutual information network** | Yes | ARACNE | Discretized numeric, categorical | Moderate | Yes | Not clear | No | Maybe | Yes | No | ASU | No |
| | | | | | | | | | | | AUV1 | No |
| | | | | | | | | | | | AUV2 | Yes |
| | | | | | | | | | | | ARACNE | Yes |
| | | | | | | | | | | | ARACNE0.2 | Yes |
| | | | | | | | | | | | ARACNE0.5 | Yes |
| | | | | | | | | | | | CLR | Yes |
| | | | | | | | | | | | MRNET | Yes |
| | | | | | | | | | | | RELNET | Yes |
| | | | | | | | | | | | MIC | Yes |
| **Boolean network** | No | Boolean network | Dichotomized numeric | Moderate | Yes | Not clear | Yes | Yes | NA | NA | No | No |
| **Probabilistic network** | No | Bayesian network | Any | Hard | Yes | Not clear | Yes | Yes | Yes | Yes | No | No |

For each network method, the table reports what kinds of biological insights can be gained and what kind of data can be analyzed. The columns prefixed with “Utility” report the utility for modeling. Column “GRN” indicates whether the network has been (or can be) used for studying gene regulatory networks. Column “Reduce” indicates whether the method has been used for reducing high dimensional data (e.g. via modules and their representatives). Column “Direct” indicates whether the network can encode directional information. Column “Time” indicates whether the network method is suited for studying time series data. Column “Nonlin.” indicates whether the network can capture non-linear relationships between pairs of variables (represented as nodes). Column “Sign” indicates whether the network adjacency provides information on the sign of the relationship between two variables, e.g. a correlation coefficient can take on positive and negative values. The table entry “NA” stands for not applicable. Adjacencies discussed in this article: unsignedA: unsigned bicor; signedA: signed bicor; TOM: TOM transformed signed bicor; ASU: $A^{MI,SymmetricUncertainty}$; AUV1: $A^{MI,UniversalVersion1}$; AUV2: $A^{MI,UniversalVersion2}$; ARACNE: ARACNE,

**Overall, these unbiased comparisons show that signed correlation networks coupled with the topological overlap transformation outperform the commonly used mutual information based algorithms when it comes to GO enrichment of modules**.
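Since the topological overlap transformation is central to this result, a minimal sketch may help. It uses the commonly cited form of the topological overlap measure, $TOM_{ij} = (\sum_{u \ne i,j} a_{iu} a_{uj} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$ with connectivity $k_i = \sum_{u \ne i} a_{iu}$; that this matches Eq. 30 exactly is an assumption, and the function name and the toy adjacency matrix are hypothetical.

```python
def topological_overlap(adj):
    """Topological overlap matrix of a symmetric weighted adjacency matrix.

    adj is a list of lists with entries in [0, 1]; diagonal entries are ignored.
    """
    n = len(adj)
    # connectivity of each node, excluding the diagonal
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # connection strength shared through common neighbors u
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom

# Toy network: nodes 1 and 2 are weakly connected directly (0.1)
# but both are strongly connected to node 0.
adj = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.1],
    [0.8, 0.1, 1.0],
]
tom = topological_overlap(adj)
```

In the toy network the overlap between nodes 1 and 2 is much larger than their direct adjacency because they share a strong common neighbor, which is the property that makes the transformed network favor tightly interconnected modules.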

Polynomial and spline regression models as alternatives to mutual information

A widely noted advantage of mutual information is that it can detect general, possibly non-linear, dependence relationships. However, estimation of mutual information poses multiple challenges ranging from computational complexity to dependency on parameters and difficulties with small sample sizes. Standard polynomial and spline regression models can also detect non-linear relationships between variables. While perhaps less general than MI, relatively simple polynomial and spline regression models avoid many of the challenges of estimating MI while adequately modeling a broad range of non-linear relationships. In addition to being computationally simpler and faster, regression models also make available standard statistical tests and model fitting indices. Thus, in this section we examine polynomial and spline regression as alternatives to MI for capturing non-linear relationships between gene expression profiles. We define association measures based on polynomial and spline regression models and study their performance.

Networks based on polynomial and spline regression models

Consider two random variables

The model fitting index ^{2}(^{2}(^{2}(^{2}(

Now consider a set of $n$ variables $x_1,\ldots,x_n$. One can then calculate pairwise model fitting indices for $x_i$ and $x_j$. To define an adjacency matrix, we symmetrize

Spline regression models are also known as local polynomial regression models _{
i
},_{
j
}. We use the following rule of thumb for the number of knots: if

Compared to spline regression, polynomial regression models have a potential shortcoming: the model fit can be adversely affected by outlying observations. A single outlying observation $(x_u, y_u)$ can “bend” the fitting curve in the wrong direction, i.e. adversely affect the estimates of the
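The polynomial regression based association measure can be sketched as follows: fit $y$ as a polynomial in $x$ by least squares, record the model fitting index $R^2$, repeat with the roles of $x$ and $y$ exchanged, and combine the two values into one symmetric number. The symmetrization rule used in the article is not recoverable here, so the `combine` argument (defaulting to the maximum) is an explicit assumption, as are the function names and the default degree of 3. The sketch covers polynomial regression only, not splines.

```python
def solve_linear(a, b):
    """Solve a * z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z

def poly_r2(x, y, degree=3):
    """R^2 of the least squares fit y ~ 1 + x + ... + x^degree."""
    powers = [[v ** p for p in range(degree + 1)] for v in x]
    size = degree + 1
    # normal equations (X'X) coef = X'y
    xtx = [[sum(row[i] * row[j] for row in powers) for j in range(size)]
           for i in range(size)]
    xty = [sum(row[i] * t for row, t in zip(powers, y)) for i in range(size)]
    coef = solve_linear(xtx, xty)
    fitted = [sum(c * p for c, p in zip(coef, row)) for row in powers]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def symmetrized_r2(x, y, degree=3, combine=max):
    """Combine R^2(y | x) and R^2(x | y); the choice of `combine` is an assumption."""
    return combine(poly_r2(x, y, degree), poly_r2(y, x, degree))

# Quadratic relationship on a symmetric grid: invisible to linear correlation.
x = [i / 10 - 2 for i in range(41)]
y = [v ** 2 for v in x]
r2_y_given_x = poly_r2(x, y)   # essentially 1
r2_x_given_y = poly_r2(y, x)   # essentially 0
a_xy = symmetrized_r2(x, y)
```

The example shows why a symmetrization step is needed: the fit is nearly perfect in one direction and nearly useless in the other, so the two directional indices must be reconciled before they can serve as an undirected adjacency.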

Figure


**Fitting polynomial and spline regression models to measure non-linear relationships.** **(A-B)** A pair of simulated data ^{2})^{2}. The red curve shows the fit of a polynomial regression model with degree **(C-D)** Comparisons of regression models and mutual information based co-expression measures in the ND data set. Co-expression of probe pairs is measured with polynomial (d = 3)/cubic spline regressions (x-axis) and mutual information $A^{MI,UniversalVersion2}$ (y-axis). The Spearman correlation and p-value of the two measures are shown at the top. **(E-F)** Comparisons in the mouse muscle data set. $A^{MI,UniversalVersion2}$ has a stronger correlation with regression models than with bicor, indicating that the first two measures can capture certain common non-linear patterns.

Relationship between regression and MI based networks

Previously, we discussed the relationship between correlation and mutual information based adjacencies in simulations where ^{2 }result in similar adjacencies in our applications (refer to Additional file

**Compare polynomial and spline regression models to correlation or mutual information based co-expression measures in simulation.** Each point corresponds to a pair of numeric vectors. (A) $R^2$ from polynomial regression symmetrized by Eq. 5 versus absolute Pearson correlation values. The two measures are indistinguishable since the data are simulated to exhibit linear relationships. (B) $R^2$ from polynomial regression symmetrized by Eq. 5 versus $A^{MI,UniversalVersion2}$. The red line predicts $A^{MI,UniversalVersion2}$ from $R^2$. (C-D) Same plots for spline regression models.


**Polynomial and spline regression models for estimating non-linear relationships in real data application.** In this document, we use polynomial and spline regression models to estimate non-linear relationships in real data applications.


In addition, our empirical data show that regression models and mutual information adjacency $A^{MI,UniversalVersion2}$ are highly correlated, and the relationship is stronger than that between bicor and $A^{MI,UniversalVersion2}$ (Figure $A^{MI,UniversalVersion2}$ and regression models discover some common gene pairwise non-linear relations that cannot be identified by correlations. The Neurological Disease (ND) and mouse muscle sets are shown in Figure

Simulations for module identification in data with non-linear relationships

Our empirical studies show that most gene pairs satisfy linear relationships, which implies that correlation based network methods perform well in practice. But one can of course simulate data where non-linear association measures (such as MI, spline $R^2$) outperform correlation measures when it comes to module detection. To illustrate this point, we simulated data with non-linear gene-gene relationships. Here we simulated 200 genes in 3 network modules across 200 samples. Two of the simulated modules, labeled for convenience by the colors turquoise and blue, contain linear and non-linear (quadratic) gene-gene relationships (Figure


**Rand indices in simulations with various number of observations.** Simulation sample size versus Rand indices between inferred and simulated module assignments from different network inference methods. Rand indices increase as the simulation data set contains more samples. Non-linear measures, especially polynomial and spline regression models, outperform other measures as sample size increases.
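For readers unfamiliar with the Rand index, the sketch below computes the unadjusted version: the fraction of gene pairs on which two module assignments agree, i.e. pairs placed in the same module by both assignments or in different modules by both. Whether the article uses this unadjusted form or an adjusted variant is not stated in this section, so the choice is an assumption; the function name and the toy labels are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (unadjusted Rand index)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Simulated (true) module colors for six genes and two inferred assignments.
true_modules = ["turquoise"] * 3 + ["blue"] * 3
inferred_perfect = [1, 1, 1, 2, 2, 2]   # same partition, different label names
inferred_shifted = [1, 1, 2, 2, 2, 2]   # one gene assigned to the wrong module

ri_perfect = rand_index(true_modules, inferred_perfect)
ri_shifted = rand_index(true_modules, inferred_shifted)
```

The index depends only on the partition and not on the label names, which is why module colors and integer labels can be compared directly.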

Overview of network methods and alternatives

A thorough review of network methods is beyond our scope and we point the reader to the many review articles

While it is beyond our scope to evaluate network inference methods for time series data (reviewed in

A large part of GRN research focuses on the accurate assessment of individual network edges, e.g.

Discussion

This article presents the following theoretical and methodological results: i) it reviews the relationship between the MI and a likelihood ratio test statistic in case of two categorical variables, ii) it presents a novel empirical formula for relating correlation to MI when the two variables satisfy a linear relationship, and iii) it describes how to use polynomial and spline regression models for defining pairwise co-expression measures that can detect non-linear relationships.

Mutual information has several appealing information theoretic properties. A widely recognized advantage of mutual information over correlation is that it allows one to detect non-linear relationships. This can be attractive in particular when dealing with time series data

For categorical variables, mutual information is (asymptotically) equivalent to other widely used statistical association measures such as the likelihood ratio statistic or the Pearson chi-square test. In this case, all of these measures (including MI) are arguably optimal association measures. Interpreting MI as a likelihood ratio test statistic facilitates a straightforward approach for adjusting the association measure for additional covariates.

We and others

The correlation coefficient is an attractive alternative to the MI for the following reasons. First, the correlation can be accurately estimated with relatively few observations and it does not require the estimation of the (joint) frequency distribution. Estimating the joint density needed for calculating MI typically requires larger sample sizes. Second, the correlation does not depend on hidden parameter choices. In contrast, MI estimation methods involve (hidden) parameter choices, e.g. the number of bins when a discretization method is being used. Third, the correlation allows one to quickly calculate p-values and false discovery rates since asymptotic tests are available (Additional file

Our empirical studies show that a signed weighted correlation network transformed via the topological overlap matrix transformation often leads to the most significant functional enrichment of modules. The recently developed maximal information coefficient

While defining mutual information for categorical variables is relatively straightforward, no consensus seems to exist in the literature on how to define mutual information for continuous variables. A major limitation of our study is that we only studied MI measures based on discretized continuous variables. For example, the cor-MI function for relating correlation to MI only applies when an equal width discretization method is used with

A second limitation concerns our gene ontology analysis of modules identified in networks based on various association measures in which we found that the correlation based topological overlap measure (TOM) leads to co-expression modules that are more highly enriched with GO terms than those of alternative approaches. A potential problem with our approach is that the enrichment p-values often strongly depend on (increase with) module sizes, and TOM tends to lead to larger modules. To address this concern, in Additional file

**The relationship between module size and gene ontology enrichment p-values in 8 real data applications.** In each panel, module size (x-axis) is plotted against −log10 GO enrichment p-values (y-axis) as dots. Loess regression lines are provided to show the trend. Red and black represent network modules constructed using TOM and $A^{MI,UniversalVersion2}$ based measures, respectively. In most data sets, the enrichment of modules defined by TOM is better than that of comparably sized modules defined by $A^{MI,UniversalVersion2}$.


A third limitation concerns our use of the bicor correlation measure as opposed to alternatives (e.g. Pearson or Spearman correlation). In our study we find that all 3 correlation measures lead to very similar findings (Additional file

**Comparison of bicor, Pearson correlation and Spearman correlation based signed adjacency in 8 empirical data sets.** Each panel shows the 5 best (−log10 transformed) gene ontology enrichment p-values of all modules identified using each type of adjacency. Error bars indicate 95% confidence intervals. The p-value at the top of each panel is based on a multi-group comparison test. All three types of correlation are similar in terms of GO enrichment.


Conclusions

Our simulation and empirical studies suggest that mutual information can safely be replaced by linear regression based association measures (e.g. bicor) in case of stationary gene expression measures (which are represented by quantitative variables). To capture general monotonic relationships between such variables, one can use the Spearman correlation. To capture more complicated dependencies, one can use symmetrized model fitting statistics from a polynomial or spline regression model. Regression based association measures have the advantage of allowing one to include covariates (conditioning variables). In case of categorical variables, mutual information is an appropriate choice since it is equivalent to an association measure (likelihood ratio test statistic) of a generalized linear regression model but categorical variables rarely occur in the context of modeling relationships between gene products.

Materials and Methods

Empirical gene expression data sets description

**Brain cancer data set.** This data set was composed of 55 microarray samples of glioblastoma (brain cancer) patients. Gene expression profiling was performed with Affymetrix high-density oligonucleotide microarrays. A detailed description can be found in

**SAFHS data set.** This data set

**ND data set.** This blood lymphocyte data set consisted of 346 samples from patients with neurological diseases. Illumina HumanRef-8 v3.0 Expression BeadChips were used to measure their gene expression profiles.

**Yeast data set.** The yeast microarray data set was composed of 44 samples from the Saccharomyces Genome Database (

**Tissue-specific mouse data sets.** This study uses 4 tissue-specific gene expression data sets from a large $F_2$ mouse intercross (B × H) previously described in

Definition of Biweight Midcorrelation

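Biweight midcorrelation (bicor) is considered to be a good alternative to Pearson correlation since it is more robust to outliers. To define the bicor of two numeric vectors $x = (x_1,\ldots,x_m)$ and $y = (y_1,\ldots,y_m)$, one first defines $u_i$ and $v_i$ with $i = 1,\ldots,m$:

$$u_i = \frac{x_i - \mathrm{med}(x)}{9\,\mathrm{mad}(x)}, \qquad v_i = \frac{y_i - \mathrm{med}(y)}{9\,\mathrm{mad}(y)}$$

where $\mathrm{med}(x)$ is the median of $x$ and $\mathrm{mad}(x)$ is the median absolute deviation of $x$. This leads to the definition of the weight $w_i^{(x)}$ for $x_i$, which is,

$$w_i^{(x)} = (1 - u_i^2)^2 \, I(1 - |u_i|)$$

where the indicator $I(1-|u_i|)$ takes on value 1 if $1-|u_i| > 0$ and 0 otherwise. Therefore, $w_i^{(x)}$ ranges from 0 to 1: it decreases as $x_i$ gets away from $\mathrm{med}(x)$, and it equals 0 when $x_i$ differs from $\mathrm{med}(x)$ by more than $9\,\mathrm{mad}(x)$. An analogous weight $w_i^{(y)}$ is defined for $y_i$ using $v_i$. Given the weights, we can define biweight midcorrelation of $x$ and $y$ as:

$$\mathrm{bicor}(x,y) = \frac{\sum_{i=1}^{m} (x_i - \mathrm{med}(x))\, w_i^{(x)}\, (y_i - \mathrm{med}(y))\, w_i^{(y)}}{\sqrt{\sum_{j=1}^{m} \left[(x_j - \mathrm{med}(x))\, w_j^{(x)}\right]^2}\; \sqrt{\sum_{k=1}^{m} \left[(y_k - \mathrm{med}(y))\, w_k^{(y)}\right]^2}}$$

A modified version of biweight midcorrelation is implemented as function `bicor` in the WGCNA R package. Its argument `maxPOutliers` caps the maximum proportion of outliers, i.e. observations with weight $w_i = 0$. Practically, we find that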
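To make the definition concrete, here is a minimal Python sketch of the unmodified biweight midcorrelation, assuming the standard form: deviations from the median are scaled by 9 times the (unscaled) median absolute deviation, and observations beyond that distance receive weight 0. It is not the WGCNA implementation and has no counterpart to the outlier-proportion cap described above; the helper name `_weighted_deviations` and the example vectors are hypothetical.

```python
import math
from statistics import median


def _weighted_deviations(v):
    """Return (v_i - med(v)) * w_i for each entry, using biweight weights."""
    med = median(v)
    mad = median([abs(e - med) for e in v])
    if mad == 0:
        # Assumption of this sketch: refuse to proceed instead of falling back
        # to another correlation measure.
        raise ValueError("median absolute deviation is zero; bicor undefined")
    out = []
    for e in v:
        u = (e - med) / (9.0 * mad)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0  # weight 0 for outliers
        out.append((e - med) * w)
    return out


def bicor(x, y):
    """Biweight midcorrelation of two equal-length numeric sequences."""
    a = _weighted_deviations(x)
    b = _weighted_deviations(y)
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return num / den


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 4, 6, 8, 10, 12, 14, 16, 100]  # last value is a gross outlier
print(bicor(x, x))  # a vector is perfectly correlated with itself
print(bicor(x, y))  # the outlier receives weight 0 and is ignored
```

In the second call the outlying value 100 lies more than 9 MADs from the median of `y`, so its weight is 0 and the estimate is driven by the remaining, perfectly linear, observations.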

Types of correlation based gene co-expression networks

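Given the expression profiles $x_i$ and $x_j$ of genes $i$ and $j$, the co-expression similarity $s_{ij}$ between genes $i$ and $j$ can be defined as the absolute value of their correlation coefficient:

$$s_{ij} = |\mathrm{cor}(x_i, x_j)|$$

An **unweighted network adjacency** $a_{ij}$ between gene expression profiles $x_i$ and $x_j$ can be defined by hard thresholding the co-expression similarity $s_{ij}$:

$$a_{ij} = \begin{cases} 1 & \text{if } s_{ij} \ge \tau \\ 0 & \text{otherwise} \end{cases}$$

where $\tau$ is the hard threshold parameter.

To preserve the continuous nature of the co-expression information, we define the **weighted network adjacency** between 2 genes as a power of the absolute value of the correlation coefficient

$$a_{ij} = s_{ij}^{\beta} = |\mathrm{cor}(x_i, x_j)|^{\beta}$$

with $\beta \ge 1$.

An important choice in the construction of a correlation network concerns the treatment of strong negative correlations. In **signed networks** negatively correlated nodes are considered unconnected. In contrast, in **unsigned networks** nodes with high negative correlations are considered connected (with the same strength as nodes with high positive correlations). A signed weighted adjacency can be defined by

$$a_{ij}^{signed} = \left(0.5 + 0.5\,\mathrm{cor}(x_i, x_j)\right)^{\beta}$$

and an unsigned adjacency by

$$a_{ij}^{unsigned} = |\mathrm{cor}(x_i, x_j)|^{\beta}$$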
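The three ways of turning a correlation into an adjacency can be sketched in a few lines of Python. This assumes the usual definitions: a hard threshold on the absolute correlation, the absolute correlation raised to a power, and (0.5 + 0.5·cor) raised to a power for signed networks. The function names, the threshold 0.5, and the powers 6 and 12 are illustrative assumptions, not values prescribed by this article.

```python
def unweighted_adjacency(cor, tau):
    """Hard thresholding: 1 if |cor| reaches the threshold tau, else 0."""
    return 1 if abs(cor) >= tau else 0


def unsigned_adjacency(cor, beta):
    """Weighted, unsigned: strong negative correlations count as connections."""
    return abs(cor) ** beta


def signed_adjacency(cor, beta):
    """Weighted, signed: negatively correlated nodes are (nearly) unconnected."""
    return (0.5 + 0.5 * cor) ** beta


cor = -0.8  # a strong negative correlation between two hypothetical genes
print(unweighted_adjacency(cor, tau=0.5))  # connected under hard thresholding
print(unsigned_adjacency(cor, beta=6))     # 0.8 ** 6, a sizeable adjacency
print(signed_adjacency(cor, beta=12))      # 0.1 ** 12, effectively zero
```

The example shows why the signed/unsigned choice matters: the same gene pair is clearly connected in an unsigned network and essentially disconnected in a signed one.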

Adjacency function based on topological overlap

The topological overlap matrix (TOM) based adjacency function $AF_{TOM}$ maps an original adjacency matrix $A^{original}$ to the corresponding topological overlap matrix, i.e.

$$AF_{TOM}(A^{original})_{ij} = \frac{\sum_{l \ne i,j} A^{original}_{il} A^{original}_{lj} + A^{original}_{ij}}{\min\left(\sum_{l \ne i} A^{original}_{il},\ \sum_{l \ne j} A^{original}_{jl}\right) - A^{original}_{ij} + 1}$$

The TOM based adjacency function $AF_{TOM}$ is particularly useful when the entries of $A^{original}$ are sparse (many zeroes) or susceptible to noise. It replaces the original adjacencies by a measure of interconnectedness that is based on shared neighbors. The topological overlap measure can serve as a filter that decreases the effect of spurious or weak connections and it can lead to more robust networks.
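The shared-neighbor idea can be illustrated with a small Python sketch. It assumes the commonly used form of the topological overlap: the sum of products of adjacencies to shared neighbors plus the direct adjacency, divided by the smaller of the two connectivities minus the direct adjacency plus one. The function name and the 4-node example network are hypothetical, and the diagonal is set to 1 by convention.

```python
def topological_overlap(adj):
    """Topological overlap matrix of a symmetric adjacency matrix (list of lists)."""
    n = len(adj)
    # Connectivity of each node: sum of adjacencies to all other nodes.
    conn = [sum(adj[i][l] for l in range(n) if l != i) for i in range(n)]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(adj[i][l] * adj[l][j] for l in range(n) if l not in (i, j))
            tom[i][j] = (shared + adj[i][j]) / (min(conn[i], conn[j]) - adj[i][j] + 1.0)
    return tom


# Nodes 0 and 1 are not directly connected but share both neighbors (2 and 3).
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
tom = topological_overlap(adj)
print(tom[0][1])  # 2/3: high overlap despite a zero direct adjacency
print(tom[0][2])  # 1/2: directly connected, but no shared neighbors
```

The unconnected pair (0, 1) ends up with a larger overlap than the directly connected pair (0, 2), which is the filtering effect described above.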

Mutual-information based network inference methods

There are 4 commonly used mutual-information based network inference methods: RELNET, CLR, MRNET and ARACNE. In order to identify pairwise interactions between numeric variables $x_i$, $x_j$, all methods start by estimating mutual information $MI(x_i, x_j)$.
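As an illustration of this first step, the following Python sketch estimates MI between two numeric vectors after equal-width discretization, using the plug-in estimate with natural logarithms. The default number of bins (the integer part of the square root of the sample size) is an assumption of this sketch, as are the function names; other discretization and estimation choices lead to different MI values.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals over its range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant vector: a single bin
    width = (hi - lo) / n_bins
    # The maximum falls on the upper boundary, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mi_discretized(x, y, n_bins=None):
    """Plug-in MI (in nats) of two numeric vectors after equal-width binning."""
    m = len(x)
    if n_bins is None:
        n_bins = max(2, int(math.sqrt(m)))  # assumed default, not from the article
    dx = equal_width_bins(x, n_bins)
    dy = equal_width_bins(y, n_bins)
    joint = Counter(zip(dx, dy))
    px, py = Counter(dx), Counter(dy)
    return sum(
        (c / m) * math.log(c * m / (px[a] * py[b])) for (a, b), c in joint.items()
    )


x = list(range(16))
y = [2 * v + 1 for v in x]          # exact linear function of x
print(mi_discretized(x, y))         # equals log(4): 4 equally filled bins
print(mi_discretized(x, [5] * 16))  # 0.0: a constant carries no information
```

The dependence of the result on `n_bins` is an instance of the hidden parameter choices discussed in the Discussion.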

RELNET

The relevance network (RELNET) approach

CLR

The CLR algorithm defines a score $z_i$ given the mutual information $MI(x_i, x_j)$ and the sample mean $\mu_i$ and standard deviation $\sigma_i$ of the empirical distribution of mutual information $MI(x_i, x_k)$,

$$z_i = \max\left(0,\ \frac{MI(x_i, x_j) - \mu_i}{\sigma_i}\right)$$

$z_j$ can be defined analogously. In terms of $z_i$, $z_j$, the score used in the CLR algorithm can be expressed as

$$z_{ij} = \sqrt{z_i^2 + z_j^2}$$
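A minimal Python sketch of the CLR scoring step follows. It assumes that each z-score is the MI value standardized by the mean and sample standard deviation of the other MI values in the same row, truncated at zero, and that the pair score is the square root of the sum of the two squared z-scores. Excluding the diagonal and using the sample (n − 1) standard deviation are assumptions of this sketch; the function name and the MI matrix are hypothetical.

```python
import math
from statistics import mean, stdev


def clr_scores(mi):
    """CLR scores from a symmetric MI matrix (needs at least 3 variables)."""
    n = len(mi)
    mu, sigma = [], []
    for i in range(n):
        others = [mi[i][k] for k in range(n) if k != i]  # off-diagonal row entries
        mu.append(mean(others))
        sigma.append(stdev(others))
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # z-scores are truncated at 0; a zero spread yields z = 0 here.
            z_i = max(0.0, (mi[i][j] - mu[i]) / sigma[i]) if sigma[i] > 0 else 0.0
            z_j = max(0.0, (mi[i][j] - mu[j]) / sigma[j]) if sigma[j] > 0 else 0.0
            scores[i][j] = scores[j][i] = math.sqrt(z_i ** 2 + z_j ** 2)
    return scores


mi = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.2],
    [0.1, 0.2, 0.0],
]
scores = clr_scores(mi)
print(scores[0][1])  # the pair that stands out against both backgrounds
print(scores[0][2])  # 0.0: below average for both genes
```

The score is relative: an MI value only counts if it is large compared with the other MI values of the same two genes.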

MRNET

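MRNET infers a network using the maximum relevance/minimum redundancy (MRMR) feature selection method, applied to each variable in turn as the target $y$. The method starts by selecting the variable $x_i$ having the highest mutual information with target $y$. Next, given a set $S$ of selected variables, the criterion updates $S$ by choosing the variable $x_j$ that maximizes $u_j - r_j$ where $u_j$ is a relevance term and $r_j$ is a redundancy term. In particular,

$$u_j = MI(x_j, y), \qquad r_j = \frac{1}{|S|} \sum_{x_k \in S} MI(x_j, x_k)$$

The score of each pair $x_i$ and $x_j$ will be the maximum score of the one computed when $x_i$ is the target and the one computed when $x_j$ is the target.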
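The greedy relevance-minus-redundancy selection can be sketched as follows. This is an illustrative Python version under stated assumptions, not the reference MRNET implementation: selection continues until every variable has been ranked, the score of a variable is the value of relevance minus redundancy at the moment it is selected, and negative scores are kept as they are. The function names and the MI matrix are hypothetical.

```python
def mrmr_scores(mi, target):
    """Greedy MRMR ranking of all variables for one target; returns {var: score}."""
    n = len(mi)
    candidates = [v for v in range(n) if v != target]
    selected, scores = [], {}
    while candidates:
        best, best_score = None, None
        for j in candidates:
            relevance = mi[j][target]
            # Redundancy: mean MI with the already selected variables (0 if none).
            redundancy = sum(mi[j][k] for k in selected) / len(selected) if selected else 0.0
            score = relevance - redundancy
            if best_score is None or score > best_score:
                best, best_score = j, score
        selected.append(best)
        candidates.remove(best)
        scores[best] = best_score
    return scores


def mrnet(mi):
    """Pair score: maximum of the two MRMR scores obtained with either gene as target."""
    n = len(mi)
    per_target = [mrmr_scores(mi, t) for t in range(n)]
    net = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            net[i][j] = net[j][i] = max(per_target[j][i], per_target[i][j])
    return net


mi = [
    [0.0, 0.8, 0.3],
    [0.8, 0.0, 0.7],
    [0.3, 0.7, 0.0],
]
net = mrnet(mi)
print(net[0][1], net[1][2], net[0][2])
```

In this example the pair (0, 2) receives a negative score because its modest MI is outweighed by the redundancy with variable 1, which both share strong MI with.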

ARACNE

The ARACNE approach is based on the **data processing inequality** (DPI). The DPI applied to association networks states that if variables $x_i$ and $x_j$ interact only through a third variable $x_k$, then

$$MI(x_i, x_j) \le \min\left(MI(x_i, x_k),\ MI(x_k, x_j)\right)$$

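ARACNE starts with a network graph where each pair of nodes whose mutual information $MI_{ij}$ exceeds a preset cutoff is connected by an edge. It then removes the edge between $i$ and $j$ if, for some third node $k$, $MI(x_i, x_j) < \min(MI(x_i, x_k), MI(x_k, x_j))$ and the difference $\min(MI(x_i, x_k), MI(x_k, x_j)) - MI(x_i, x_j)$ lies above a threshold $\epsilon$.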

The tolerance threshold
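The pruning step implied by the DPI can be sketched in Python as follows. The sketch assumes that, within every triplet of nodes, the weakest edge is removed when it falls below the smaller of the other two MI values by more than a tolerance `eps`, and that all removal decisions are made on the original matrix before any edge is deleted. The function name, the parameter name `eps`, and the example matrix are hypothetical.

```python
def aracne_prune(mi, eps=0.0):
    """Return a copy of the MI matrix with DPI-violating (indirect) edges set to 0."""
    n = len(mi)
    to_remove = set()
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                # Edge (i, j) is the weakest of the triplet by more than eps.
                if mi[i][j] < min(mi[i][k], mi[k][j]) - eps:
                    to_remove.add((i, j))
                    break
    pruned = [row[:] for row in mi]
    for i, j in to_remove:
        pruned[i][j] = pruned[j][i] = 0.0
    return pruned


# Chain 0 - 1 - 2: the weak 0-2 association is explained by the path through 1.
mi = [
    [0.0, 0.8, 0.3],
    [0.8, 0.0, 0.7],
    [0.3, 0.7, 0.0],
]
pruned = aracne_prune(mi)
print(pruned[0][2])                        # 0.0: removed as indirect
print(aracne_prune(mi, eps=0.5)[0][2])     # 0.3: kept under a large tolerance
```

A larger tolerance keeps more edges, which trades fewer false removals against more indirect interactions left in the network.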

The outputs from RELNET, CLR, MRNET or ARACNE are association matrices. They can be transformed into corresponding adjacencies based on the algorithm discussed in the Introduction.

MIC

Another mutual information based method is the recently proposed maximal information coefficient (MIC)

Fitting indices of polynomial regression models

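While networks based on the Pearson correlation can only capture linear co-expression patterns, there is clear evidence for non-linear co-expression relationships in transcriptional regulatory networks. A polynomial regression model of degree $d$ relates $y$ to $x$ through

$$E(y) = \beta_0 \mathbf{1} + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d = M\beta$$

where $M = [\mathbf{1}, x, x^2, \ldots, x^d]$ is the model matrix and $\beta = (\beta_0, \ldots, \beta_d)^{\tau}$ is the parameter vector.

One can show that the least squares estimate of the parameter vector $\beta$ is

$$\hat{\beta} = (M^{\tau} M)^{-} M^{\tau} y$$

where $^{-}$ denotes the (pseudo) inverse, and $^{\tau}$ denotes the transpose of a matrix.

Given the fitted values $\hat{y} = M\hat{\beta}$, one can calculate the model fitting index $R^2$ as:

$$R^2 = \mathrm{cor}(y, \hat{y})^2$$

In the context of a regression model, $R^2$ is also known as the proportion of variation of y explained by the model.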
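The following Python sketch fits a polynomial by least squares and reports the proportion of variation explained. It is a minimal illustration under stated assumptions: the normal equations are solved by Gaussian elimination, which requires the model matrix to have full column rank (no pseudo-inverse is used), and $R^2$ is computed as one minus the ratio of residual to total sum of squares. The function names and the example data are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def poly_r2(x, y, degree=3):
    """R^2 of the least squares fit of y on 1, x, ..., x^degree."""
    p = degree + 1
    design = [[xi ** k for k in range(p)] for xi in x]
    mtm = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    mty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(p)]
    beta = solve(mtm, mty)
    fitted = [sum(bk * v for bk, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]          # purely quadratic, zero linear correlation
print(poly_r2(x, y, degree=1))   # a straight line explains none of the variation
print(poly_r2(x, y, degree=2))   # the quadratic model explains all of it
print(poly_r2(y, x, degree=2))   # regressing x on y explains none of it
```

The last two lines show that $R^2$ depends on which variable is treated as the response, which is why a symmetrization step is needed before it can serve as a pairwise co-expression measure.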

Spline regression model construction

To investigate the relationship between variable

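To ensure that the fit between adjacent sub-intervals joins smoothly at the knots, one uses the **hockey stick function** $(\cdot)_+$ to transform $x$:

$$(s)_+ = \begin{cases} s & \text{if } s \ge 0 \\ 0 & \text{if } s < 0 \end{cases}$$

This function can also be applied to the components of a vector, e.g. $(x)_+$ denotes a vector whose negative components have been set to zero. So $(x - knot\,\mathbf{1})_+$ is a vector whose u-th component equals $x_u - knot$ if $x_u - knot \ge 0$ and 0 otherwise.

We are now ready to describe the **cubic spline regression model**, which fits polynomials of degree 3 to sub-intervals. The general form of a cubic spline with 2 knots is as follows

$$E(y) = \beta_0 \mathbf{1} + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - knot_1 \mathbf{1})_+^3 + \beta_5 (x - knot_2 \mathbf{1})_+^3$$

The knot parameters (numbers) $knot_1, knot_2, \ldots$ are chosen before estimating the parameter values. Analogous to polynomial regression, $R^2$ can be calculated as the association measure between $x$ and $y$.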
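The construction of the spline model matrix can be sketched in Python as follows. The sketch assumes the truncated power form of a cubic spline: columns 1, x, x², x³ plus one cubed hockey stick term per knot. It only builds the model matrix; the coefficients and $R^2$ would then be obtained by ordinary least squares on these columns. The function names, the knot positions and the example values are hypothetical.

```python
def hockey_stick(s):
    """Hockey stick function: s if s >= 0, otherwise 0."""
    return s if s >= 0 else 0.0


def cubic_spline_row(x, knots):
    """One row of the cubic spline model matrix for a single observation x."""
    row = [1.0, x, x ** 2, x ** 3]
    # One truncated cubic term per knot; it is zero to the left of the knot.
    row.extend(hockey_stick(x - knot) ** 3 for knot in knots)
    return row


def cubic_spline_design(xs, knots):
    """Model matrix with one row per observation and 4 + len(knots) columns."""
    return [cubic_spline_row(x, knots) for x in xs]


design = cubic_spline_design([0.0, 1.5, 3.0], knots=(1.0, 2.0))
for row in design:
    print(row)
```

For the observation 1.5 only the first knot term is active (0.5³ = 0.125), while for 3.0 both are (2³ = 8 and 1³ = 1), so each additional knot lets the cubic change shape beyond that point.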

Other networks

Boolean network

Availability of software

**Project name:** Adjacency matrix for non-linear relationships

Project home page:

**Operating system(s):** Platform independent

**Programming language:** R

**Licence:** GNU GPL 3

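The following functions described in this article have been implemented in the WGCNA R package. Functions `adjacency.polyReg` and `adjacency.splineReg` calculate polynomial and spline regression $R^2$ based adjacencies. Users can specify the $R^2$ symmetrization method. Function `mutualInfoAdjacency` calculates the mutual information based adjacencies $A^{MI,SymmetricUncertainty}$ (Eq. 14), $A^{MI,UniversalVersion1}$ (Eq. 15) and $A^{MI,UniversalVersion2}$ (Eq. 16). Function `AFcorMI` implements the $F^{cor-MI}$ prediction function (Eq. 18) for relating correlation with mutual information.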

Abbreviations

MI: Mutual information; Bicor: Biweight midcorrelation; MIC: Maximal information coefficient; ARACNE: Algorithm for the reconstruction of accurate cellular networks; GO: Gene ontology; LRT: Likelihood ratio test; TOM: Topological overlap matrix; WGCNA: Weighted correlation network analysis.

Competing interests

We declare no conflict of interest.

Authors’ contributions

LS and SH performed the research; LS, SH, and PL wrote the paper and developed R software functions. SH designed the research. All authors read and approved the final manuscript.

Acknowledgements

We acknowledge grant support from 1R01 DA030913-01, P50CA092131, R01NS058980, and the UCLA CTSI.