Department of Statistics, Ludwig-Maximilians-Universität München, Ludwigstraße 33, D-80539 München, Germany

Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany

Abstract

Background

The use of correlation networks is widespread in the analysis of gene expression and proteomics data, even though it is known that correlations not only confound direct and indirect associations but also provide no means to distinguish between cause and effect. For "causal" analysis typically the inference of a directed graphical model is required. However, this is rather difficult due to the curse of dimensionality.

Results

We propose a simple heuristic for the statistical learning of a high-dimensional "causal" network. The method first converts a correlation network into a partial correlation graph. Subsequently, a partial ordering of the nodes is established by multiple testing of the log-ratio of standardized partial variances. This allows identifying a directed acyclic causal network as a subgraph of the partial correlation network. We illustrate the approach by analyzing a large-scale plant gene expression data set.

Conclusion

The proposed approach is a heuristic algorithm that is based on a number of approximations, such as substituting lower order partial correlations by full order partial correlations. Nevertheless, for small samples and for sparse networks the algorithm not only yields sensible first order approximations of the causal structure in high-dimensional genomic data but is also computationally highly efficient.

Availability and Requirements

The method is implemented in the "GeneNet" R package (version 1.2.0), available from CRAN and from

Background

Correlation networks are widely used to explore and visualize high-dimensional data, for instance in finance

However, for shedding light on the causal processes underlying the observed data, correlation networks are only of limited use. This is due to the fact that correlations not only confound direct and indirect associations but also provide no means to distinguish between response variables and covariates (and thus between cause and effect).

Therefore, causal analysis requires tools different from correlation networks: much of the work in this area has focused on Bayesian networks

There already exist numerous methods for learning DAGs from observational data – see for instance the summarizing review in

In this paper we follow a simple heuristic strategy that proceeds in two steps:

• First, the correlation network is transformed into a partial correlation network, which is essentially an undirected graph that displays the direct linear associations only. This type of network model is also known under the names of graphical Gaussian model (GGM), concentration graph, covariance selection graph, conditional independence graph (CIG), or Markov random field. Note that there is a simple relationship between correlation and partial correlation. Moreover, in recent years there has been much progress with regard to statistical methodology for learning large-scale partial correlation graphs from small samples [e.g.,

• Second, the undirected GGM is converted into a partially directed graph, by establishing a partial ordering of the nodes and directing edges accordingly.

Note that this algorithm is similar to the PC algorithm in that edges are being removed from the independence graph to obtain the underlying DAG. However, our criterion for eliminating an edge is distinctly different from that of the PC algorithm.

The remainder of the paper is organized as follows. First, we describe the methodology. Second, we consider its statistical interpretation and further properties. Subsequently, we illustrate the approach by analyzing an 800 gene data set from a large-scale plant expression experiment.

Methods

Theoretical basis

Consider a linear regression with Y as the response and X_{1}, ..., X_{k}, ..., X_{K} as covariates. We assume that Y and the X_{k} are random variables with means E(Y) and E(X_{k}) and with covariances cov(Y, X_{k}) and cov(X_{k}, X_{l}). The best linear predictor of Y,

y* = E(Y) + ∑_{k} β_{k} (x_{k} - E(X_{k})),

minimizes the mean squared error E[(Y - y*)²]; its coefficients are given by

β_{k} = -ω_{yk} / ω_{yy},     (1)

where ω_{yk} and ω_{yy} are elements of the concentration matrix Ω = Σ^{-1}, and Σ is the joint covariance matrix of (Y, X_{1}, ..., X_{K}).

The partial correlation is the correlation that remains between two variables if the effect of all other variables has been regressed away. Likewise, the partial variance is the variance that remains if the influences of all other variables are taken into account. Table 1 summarizes the formulas for computing these quantities.

Formulas for computing partial variances and partial correlations

Definition | True value | Estimate
Covariance matrix: cov(X_{k}, X_{l}) = σ_{kl} | **Σ** = (σ_{kl}) | **S** = (s_{kl})
Concentration matrix: **Ω** = **Σ**^{-1} | **Ω** = (ω_{kl}) | **Ω̂** = (ω̂_{kl})
Variances: var(X_{k}) = σ_{kk} | σ_{kk} | s_{kk}
Partial variances: var(X_{k}|X_{≠k}) = 1/ω_{kk} | 1/ω_{kk} | 1/ω̂_{kk}
Correlations: corr(X_{k}, X_{l}) = ρ_{kl} = σ_{kl} (σ_{kk} σ_{ll})^{-1/2} | **P** = (ρ_{kl}) | **R** = (r_{kl})
Partial correlations: corr(X_{k}, X_{l}|X_{≠k, l}) = ρ̃_{kl} = -ω_{kl} (ω_{kk} ω_{ll})^{-1/2} | **P̃** = (ρ̃_{kl}) | **R̃** = (r̃_{kl})

Indices k and l run over all K variables.
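The formulas in the table can be exercised directly with a small numpy sketch. The covariance matrix below is a made-up, positive definite example; all quantities follow the table's definitions (partial quantities via the inverted covariance matrix):

```python
import numpy as np

# Hypothetical 3x3 covariance matrix Sigma (positive definite, made-up values)
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])

# Concentration matrix: Omega = Sigma^{-1}
Omega = np.linalg.inv(Sigma)

# Partial variances: var(X_k | X_{!=k}) = 1 / omega_kk
partial_var = 1.0 / np.diag(Omega)

# Partial correlations: -omega_kl / sqrt(omega_kk * omega_ll)
d = np.sqrt(np.diag(Omega))
pcor = -Omega / np.outer(d, d)
np.fill_diagonal(pcor, 1.0)

# Ordinary correlations for comparison: sigma_kl / sqrt(sigma_kk * sigma_ll)
s = np.sqrt(np.diag(Sigma))
cor = Sigma / np.outer(s, s)
```

Note that each partial variance never exceeds the corresponding marginal variance, since conditioning on the remaining variables can only reduce unexplained variability.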

From Equation 1 it is immediately clear that the complete linear system, and thus all coefficients β_{k}, are fully determined by the joint covariance matrix of Y and X_{1}, ..., X_{K}.

We emphasize that Equation 1 has a direct relation with the usual ordinary least squares (OLS) estimator of the regression coefficients: the OLS estimate is recovered if the empirical covariance matrix is plugged into Equation 1. However, Equation 1 also remains valid if other estimates of the covariance matrix are used, such as penalized or shrinkage estimators (for this reason there is no hat on β in Equation 1).

For the following it is important that Equation 1 can be further rewritten by introducing a scale factor. Specifically, by abbreviating the standardized partial variance SPV_{k} = var(X_{k}|X_{≠k})/var(X_{k}), we can decompose the regression coefficient into the simple product

β_{k} = ρ̃_{yk} × √(SPV_{y}/SPV_{k}) × √(σ_{yy}/σ_{kk}).     (2)

Note that SPV_{y} and SPV_{k} take on values from 0 to 1. All three factors have an immediate and intuitive interpretation:

• The partial correlation ρ̃_{yk} determines the sign of β_{k}. If the partial correlation between Y and X_{k} vanishes, then β_{k} = 0 and X_{k} drops out of the regression.

• The ratio SPV_{y}/SPV_{k} compares the relative reduction of the variances of Y and X_{k} due to the respective other covariates. In the algorithm outlined below a test of log(SPV_{y}/SPV_{k}) = 0 is used to establish a partial ordering of the nodes.

• The factor √(σ_{yy}/σ_{kk}) is a simple scale factor that adjusts for the units of measurement of Y and X_{k}.

The product of the first two factors thus constitutes the scale-invariant part of the regression coefficient.
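Both Equation 1 and the decomposition into the product of partial correlation, SPV ratio, and scale factor can be checked numerically against OLS. The data and coefficients below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.standard_normal((n, 3)) @ rng.standard_normal((3, 3))  # correlated covariates
y = X @ np.array([0.5, -1.0, 0.25]) + rng.standard_normal(n)

# Joint covariance matrix of (Y, X_1, X_2, X_3); index 0 is the response
J = np.cov(np.column_stack([y, X]), rowvar=False)
Omega = np.linalg.inv(J)  # joint concentration matrix

# OLS coefficients (intercept handled by centering)
Xc, yc = X - X.mean(0), y - y.mean()
beta_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# Equation 1: beta_k = -omega_yk / omega_yy
beta_eq1 = -Omega[0, 1:] / Omega[0, 0]

# Equation 2: beta_k = pcor_yk * sqrt(SPV_y/SPV_k) * sqrt(var_y/var_k)
d = np.sqrt(np.diag(Omega))
pcor_y = -Omega[0, 1:] / (d[0] * d[1:])     # partial correlations with Y
spv = 1.0 / (np.diag(Omega) * np.diag(J))   # standardized partial variances
beta_eq2 = pcor_y * np.sqrt(spv[0] / spv[1:]) * np.sqrt(J[0, 0] / np.diag(J)[1:])
```

All three coefficient vectors agree to numerical precision, which confirms that the decomposition is an exact algebraic identity, not an approximation.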

In this context it is also helpful to recall the diverse statistical interpretations of SPV:

• SPV is the proportion of the variance of a variable that remains unexplained after regressing it on all other variables.

• For the OLS estimator SPV is equal to 1 - R^{2}, where R^{2} is the usual coefficient of determination of that regression.

• SPV is the inverse of the corresponding diagonal element of the inverse of the correlation matrix.

• SPV may also be estimated by 1/VIF, where VIF is the usual variance inflation factor.
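These equivalences are easy to verify numerically. The sketch below uses simulated data (made up for illustration) and compares SPV computed from the inverse correlation matrix against 1 - R^{2} from explicit regressions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 4
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, k))  # correlated variables
R = np.corrcoef(X, rowvar=False)  # empirical correlation matrix

# (a) SPV as the inverse diagonal of the inverse correlation matrix
spv_inv = 1.0 / np.diag(np.linalg.inv(R))

# (b) SPV as 1 - R^2 from regressing each standardized variable on all others
Z = (X - X.mean(0)) / X.std(0)
spv_r2 = np.empty(k)
for j in range(k):
    others = np.delete(Z, j, axis=1)
    beta, *_ = np.linalg.lstsq(others, Z[:, j], rcond=None)
    spv_r2[j] = (Z[:, j] - others @ beta).var()  # residual variance = 1 - R^2

# (c) 1/VIF is the same quantity: VIF_j = 1 / (1 - R^2_j)
vif = 1.0 / spv_inv
```

The two computations agree exactly, since both describe the unexplained variance fraction of each variable given all others.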

Heuristic algorithm for discovering approximate causal networks

The above decomposition (Equation 2) suggests the following simple strategy for statistical learning of causal networks. First, by multiple testing of the partial correlations an undirected network containing only the significant direct linear associations is determined. Second, by multiple testing of the log-ratios of standardized partial variances a partial ordering of the nodes is established, which is then used to direct the edges of the undirected graph.

In more detail, we propose the following five-step algorithm:

1. First, it is essential to determine an accurate and positive definite estimate **R** of the correlation matrix. Only if the sample size is large, with many more observations than variables, may the empirical correlation matrix be used for this purpose; otherwise a regularized estimate, such as a shrinkage estimator, is required.

2. From the estimated correlations we compute the partial variances and partial correlations (see Table 1).

3. Subsequently, we infer the partial correlation graph following the algorithm described in

4. In a similar fashion we then conduct multiple testing of all log(SPV_{k}/SPV_{l}) ratios. For this purpose the observed log-ratios are modeled as a two-component mixture

f(x) = η_{0} f_{0}(x) + η_{A} f_{A}(x)

of a null density f_{0} and an alternative density f_{A}. Assuming that most log-ratios belong to the null component, the mixture weights η_{0} and η_{A} = 1 - η_{0}, as well as the variance parameter of f_{0}, can be estimated from the data; this in turn yields (local) false discovery rates for all pairwise comparisons.

5. Finally, a partially directed network is constructed as follows. All edges in the partial correlation graph whose endpoints have a significant log(SPV_{k}/SPV_{l}) ratio are directed, pointing from the node with the larger standardized partial variance to the node with the smaller one; all other edges remain undirected.
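The five steps above can be sketched in code. The following is a simplified stand-in, not the GeneNet implementation: hard thresholds (`pcor_thresh` and `logratio_thresh` are hypothetical parameters) replace the local-fdr multiple testing of steps 3 and 4, and the empirical correlation matrix replaces the shrinkage estimate of step 1 (adequate only when there are many more observations than variables):

```python
import numpy as np

def partially_directed_network(X, pcor_thresh=0.2, logratio_thresh=0.3):
    """Simplified sketch of the five-step heuristic (illustrative thresholds)."""
    R = np.corrcoef(X, rowvar=False)           # step 1: correlation estimate
    Omega = np.linalg.inv(R)                   # step 2: concentration matrix
    d = np.sqrt(np.diag(Omega))
    pcor = -Omega / np.outer(d, d)             # partial correlations
    spv = 1.0 / np.diag(Omega)                 # standardized partial variances
                                               # (R has unit diagonal, so the
                                               #  partial variance equals SPV)
    p = R.shape[0]
    edges = []
    for k in range(p):
        for l in range(k + 1, p):
            if abs(pcor[k, l]) < pcor_thresh:  # step 3: keep "significant" edges
                continue
            logratio = np.log(spv[k] / spv[l]) # step 4: compare SPVs
            if logratio > logratio_thresh:     # step 5: larger SPV -> smaller SPV
                edges.append((k, l, "->"))
            elif logratio < -logratio_thresh:
                edges.append((k, l, "<-"))
            else:
                edges.append((k, l, "--"))
    return edges
```

With these made-up thresholds, pairs whose partial correlation is near zero are dropped, and the remaining edges are oriented from the less well explained node to the better explained one, or left undirected when the SPVs are too close.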

Results and discussion

Interpretation of the resulting graph

The above algorithm returns a partially directed partial correlation graph, whose directed edges form a causal network.

This procedure can be motivated by the following connection between the partial correlation graph and a system of linear equations, where each node is in turn taken as a response variable and regressed against all other remaining nodes. In this setting the partial correlation coefficient is the signed geometric mean of the two regression coefficients obtained by regressing X_{k} on X_{l} (plus the remaining nodes) and vice versa, ρ̃_{kl} = sign(β_{kl}) √(β_{kl} β_{lk}).
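The geometric-mean identity can be verified numerically on made-up data, applying Equation 1 node-wise to the joint concentration matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # made-up data
Omega = np.linalg.inv(np.cov(X, rowvar=False))

k, l = 0, 2
# Regression coefficient of X_l when X_k is regressed on all remaining nodes,
# and vice versa (Equation 1 applied node-wise)
beta_kl = -Omega[k, l] / Omega[k, k]
beta_lk = -Omega[l, k] / Omega[l, l]

# Partial correlation as the signed geometric mean of the two coefficients
pcor_kl = -Omega[k, l] / np.sqrt(Omega[k, k] * Omega[l, l])
geo_mean = np.sign(beta_kl) * np.sqrt(beta_kl * beta_lk)
```

Since the two coefficients always share the same sign, the product under the square root is non-negative and the identity holds exactly.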

Reconstruction efficiency and approximations underlying the algorithm

Topology of the network

The proposed algorithm is an extension of the GGM inference approach of

However, it is well known that a directed Bayesian network and the corresponding undirected graph are not necessarily topologically identical: in the undirected graph for computing the partial correlations one conditions on all other nodes whereas in the directed graph one conditions only on a subset of nodes, in order to avoid conditioning "on the future" (i.e. on the dependent nodes). Therefore, it is critical to evaluate to what extent full order partial correlations are reasonable approximations for lower order partial correlations. This has already been investigated intensively by

Node ordering

A second approximation implicit in our algorithm concerns the determination of the ordering of the nodes, which is done by multiple testing of pairwise ratios of standardized partial variances. We have conducted a number of numerical simulations (data not shown) that indicate that for randomly simulated DAGs the ordering of the nodes is indeed well reflected in the partial variances, as expected.

However, from variable selection in linear models it is also known that the partial variance (or the related R^{2}) may not always be a reliable indicator of variable importance. Nevertheless, the partial ordering of nodes according to SPV and the implicit model selection in the underlying regressions is a very different procedure from the standard variable selection approaches, in which the increase or decrease of R^{2} is taken as an indicator of whether or not a variable is to be included, or a decomposition of R^{2} is sought.

It is also noteworthy that, as we impose directionality from the less well explained variable (large SPV, "exogenous", "independent") to the one with relatively lower SPV (well explained, "endogenous", "dependent" variable), we effectively choose the direction with the relatively stronger explanatory effect.

Further properties of the heuristic algorithm and of the resulting graphs

The simple heuristic network discovery algorithm exhibits a number of further properties worth noting:

1. The estimated partially directed network cannot contain any (partially) directed cycles. For instance, it is not possible for a graph to contain a directed path from a node A via a node B back to A, as this would require SPV_{A} > SPV_{B} > SPV_{A}, which is a contradiction. As a consequence, the subgraph containing the directed edges only is also acyclic (and hence a DAG).

2. The assignment of directionality is transitive. If there is a directed edge from A to B and another from B to C, then any edge between A and C can only be directed from A to C.

3. As the algorithm relies on correlations as input, causal processes that produce the same correlation matrix lead to the same inferred graph, and hence are indistinguishable. The existence of such equivalence classes is well known for SEMs

4. The proposed algorithm is scale-invariant by construction. Hence, a (linear) change in any of the units of the data has no effect on the overall estimated partially directed network and the implied causal relations.

5. We emphasize that the partially directed network is directed only where the data warrant it: an edge remains undirected whenever the log-ratio of the standardized partial variances of its two endpoints is not significant.

6. The computational complexity of the algorithm is O(p^{3}), where p is the number of nodes. Hence, it is no more expensive than computing the partial correlation graph, and thus allows for the estimation of networks containing on the order of thousands of nodes and more.

Analysis of a plant expression data set

To illustrate our algorithm for discovering causal structure, we applied the approach to a real world data example. Specifically, we reanalyzed expression time series resulting from an experiment investigating the impact of the diurnal cycle on the starch metabolism of Arabidopsis thaliana.

The data are gene expression time series measurements collected at 11 different time points (0, 1, 2, 4, 8, 12, 13, 14, 16, 20, and 24 hours after the start of the experiment). The corresponding calibrated signal intensities for 22,814 genes/probe sets and for two biological replicates are available from the NASCArrays repository, experiment no. 60

In order to estimate the correlation matrix for the 800 genes described by the data set we employed the dynamical correlation shrinkage estimator of
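The dynamical shrinkage estimator itself is not reproduced here, but the general idea of a regularized correlation estimate can be sketched along the lines of the Schäfer-Strimmer shrinkage estimator (shrinkage toward the identity, analytic shrinkage intensity). The following is a simplified approximation for illustration, not the estimator used in the paper:

```python
import numpy as np

def shrinkage_correlation(X):
    """Shrinkage correlation estimate R* = lam*I + (1 - lam)*R.

    The intensity follows the analytic Schafer-Strimmer formula:
    lam = sum var(r_kl) / sum r_kl^2 over off-diagonal entries,
    clipped to [0, 1]. Simplified sketch for illustration only.
    """
    n, p = X.shape
    Z = (X - X.mean(0)) / X.std(0, ddof=1)
    R = Z.T @ Z / (n - 1)                 # empirical correlation matrix
    W = np.einsum('ik,il->ikl', Z, Z)     # products z_ik * z_il per observation
    var_r = n / (n - 1) ** 3 * ((W - W.mean(0)) ** 2).sum(0)
    off = ~np.eye(p, dtype=bool)
    lam = np.clip(var_r[off].sum() / (R[off] ** 2).sum(), 0.0, 1.0)
    return lam * np.eye(p) + (1 - lam) * R, lam
```

Because the smallest eigenvalue of the shrunken matrix is at least lam, the estimate is positive definite even when there are fewer observations than variables, where the empirical correlation matrix is singular.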

Correlation network inferred from the plant expression data.

This is in great contrast to the partially directed partial correlation graph, which for this specific data set was obtained by multiple testing of the factor log(SPV_{k}/SPV_{l}).

Distribution of the log(SPV_{k}/SPV_{l}) ratios, together with the fitted null component (η_{0} = 0.8995) and alternative component (η_{A} = 0.1005) of the mixture model.

To construct the network, we then projected the significant directions (i.e. the significant log SPV ratios) onto the partial correlation graph.

The resulting partially causal network is shown in Figure

Partially causal network inferred from the plant expression data.

We also see that the partially directed network contains both directed and undirected edges. This is a distinct advantage of the present approach: unlike, e.g., a vector autoregressive model, it does not force a direction upon every association, but leaves an edge undirected when the data do not support an ordering.

Finally, in order to investigate the stability of the inferred partial causal network, we randomly removed data points from the sample, and repeatedly reconstructed the network from the reduced data set. In all cases the general topological structure of the network remained intact, which indicates that this is a signal inherent in the data. This is also confirmed by the analysis using vector autoregressions

Conclusion

Methods for exploring causal structures in high-dimensional data are growing in importance, particularly in the study of complex biological, medical and financial systems. As a first (and often only) analysis step these data are explored using correlation networks.

Here we have suggested a simple heuristic algorithm that, starting from a (positive definite) correlation matrix, infers a partially directed network that in turn allows generating causal hypotheses of how the data were generated. Our approach is approximate, but it allows analysis of high-dimensional, small-sample data, and its computational cost is very modest. Thus, our heuristic is likely to be applicable whenever a correlation network is computed, and therefore is suitable for screening large-scale data sets for causal structure.

Nevertheless, there are several lines along which this method could be extended. For instance, non-linear effects could be accounted for by employing entropy criteria, or by using higher order moments.

Note that the PC algorithm is more refined than our algorithm, primarily due to additional steps that aim at removing spurious edges (i.e. those edges that are induced between otherwise uncorrelated parent nodes by conditioning on a common child node). However, these iterative refinements may be very time consuming, in particular for high-dimensional graphs.

In contrast, our procedure is non-iterative and therefore both computationally and algorithmically (nearly) as simple as a correlation network. Nevertheless, it still enables the discovery of partially directed processes underlying the data.

In summary, we recommend our approach as a procedure for exploratory screening for causal mechanisms. The resulting hypotheses may then form the basis for more refined analyses, such as full Bayesian network modeling.

Authors' contributions

Both authors participated in the development of the methodology and wrote the manuscript. RO carried out all analyses. Both authors approved the final version of the manuscript.

Availability and requirements

The method is implemented in the "GeneNet" R package (version 1.2.0), available from CRAN and from

Acknowledgements

This work was in part supported by an "Emmy Noether" excellence grant of the Deutsche Forschungsgemeinschaft (to K.S.).