Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Methodology article

Estimating causal effects with a non-paranormal method for the design of efficient intervention experiments

Reiji Teramoto*, Chiaki Saito and Shin-ichi Funahashi

Author Affiliations

Department for Research, Forerunner Pharma Research, Co., Ltd, Yokohama, Japan

For all author emails, please log on.

BMC Bioinformatics 2014, 15:228  doi:10.1186/1471-2105-15-228

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/15/228


Received:12 March 2014
Accepted:24 June 2014
Published:30 June 2014

© 2014 Teramoto et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background

Knockdown or overexpression of genes is widely used to identify genes that play important roles in many aspects of cellular functions and phenotypes. Because next-generation sequencing generates high-throughput data that allow us to detect genes, it is important to identify genes that drive functional and phenotypic changes of cells. However, conventional methods rely heavily on the assumption of normality and they often give incorrect results when the assumption is not true. To relax the Gaussian assumption in causal inference, we introduce the non-paranormal method to test conditional independence in the PC-algorithm. Then, we present the non-paranormal intervention-calculus when the directed acyclic graph (DAG) is absent (NPN-IDA), which incorporates the cumulative nature of effects through a cascaded pathway via causal inference for ranking causal genes against a phenotype with the non-paranormal method for estimating DAGs.

Results

We demonstrate that causal inference with the non-paranormal method significantly improves the performance in estimating DAGs on synthetic data in comparison with the original PC-algorithm. Moreover, we show that NPN-IDA outperforms the conventional methods in exploring regulators of the flowering time in Arabidopsis thaliana and regulators that control the browning of white adipocytes in mice. Our results show that performance improvement in estimating DAGs contributes to an accurate estimation of causal effects.

Conclusions

Although the simplest alternative procedure was used, our proposed method enables us to design efficient intervention experiments and can be applied to a wide range of research purposes, including drug discovery, because of its generality.

Keywords:
Non-paranormal; Gaussian assumption; Causal effect; Intervention-calculus; Directed acyclic graph; Machine learning; Causal inference; Experiment design

Background

Intervention experiments, e.g., knockdown or overexpression, are commonly conducted to identify genes that determine cell fates such as differentiation [1], induction of pluripotency [2], and direct reprogramming [3]. Those experiments are now indispensable in biological and medical research. Although intervention experiments identify a causal gene responsible for a phenotype of interest, they are time-consuming and cost-intense. Therefore, it is very important to prioritize and focus on causal genes with high confidence. However, it is difficult to infer causal effects only from observational data. This task coincides with inferring causal effects that are established in Statistics. Note that in this problem setting, a causal effect is different from a regression-type effect of association [4]. In fact, previous studies suggested that representative regression methods such as lasso and elastic net are not appropriate for our goal [4-6].

Recently, there has been much progress to address this problem by employing the intervention-calculus when the directed acyclic graph (DAG) is absent (IDA) [4-6] for the design of efficient intervention experiments. IDA combines two methods: (1) inference the unknown underlying DAG model from observational data by the PC-algorithm [7] and (2) estimating causal effects on the basis of DAG using intervention-calculus; furthermore, it provides estimated lower bounds of total causal effects from observational data. The PC-algorithm is computationally feasible and appropriate for high-dimensional settings. In addition, it has the desirable property to achieve high computational efficiency as a function of sparseness of the true underlying DAG model.

In spite of these advantages, the PC-algorithm requires the Gaussianity assumption to use partial correlations to test conditional independence, and this assumption is not necessarily true in real data sets. Because the normality assumption is restrictive and conclusions inferred under this assumption could be misleading, it is desirable to relax the Gaussian assumption.

On the other hand, non-paranormal methods that use a semiparametric Gaussian copula have been proposed for estimating sparse undirected graphs and exhibit significant improvement in the performance because the normality assumption is relaxed [8,9]. The main idea of the non-paranormal method is to exploit the nonparametric correlation coefficient instead of Pearson’s correlation coefficient for estimation. Although this is the simplest alternative procedure, the non-paranormal graphical model could be a viable alternative for the Gaussian graphical model.

Consequently, we present non-paranormal IDA (NPN-IDA), which uses nonparametric partial correlations to test conditional independencies in the PC-algorithm for intervention-calculus. In our method, the Gaussian assumption in the PC-algorithm is naturally relaxed by using nonparametric partial correlation. Although the non-paranormal method has been successfully applied to estimating undirected graphs in previous studies, we show that it works well for estimating DAGs in the PC-algorithm. Next, we applied our method to Arabidopsis thaliana microarray data and mouse microarray data to demonstrate that NPN-IDA outperforms IDA in exploring regulators of the flowering time in A. thaliana and regulators that control the browning of white adipocytes in mice.

In summary, the three main contributions of this work are: (1) introduction of a non-paranormal method for inference of the unknown underlying DAG model from observational data in the expansive framework of the PC-algorithm, (2) combination of the non-paranormal method and the PC-algorithm significantly improves the performance in estimating DAGs on synthetic data, and (3) NPN-IDA is effective in exploring regulators that control specific phenotypes of interest.

Methods

We first introduce the IDA procedure. IDA consists of (1) inference of the unknown underlying DAG model from observational data by PC-algorithm and (2) estimation of causal effects based on the DAG using intervention-calculus. Then, we introduce the non-paranormal method for PC-algorithm. Finally, we present the combination of the non-paranormal method for PC-algorithm and estimating causal effects as NPN-IDA algorithm.

Inference DAGs with the PC-algorithm

Let G = (V, E) be a graph consisting of vertices V and a set of edges E ⊆ V × V. In our context, the vertices represent random variables X1, …, Xp, and Y. The edges represent relationships between pairs of these variables. It is possible that some DAGs fulfill the Markov condition and encode the same list of conditional independencies. Two DAGs are observationally equivalent only if they have the same skeleton and same sets of v-structures, i.e., two converging arrows whose tails are connected by an arrow. In this way, DAGs can be partitioned into equivalent classes, where all members are observationally equivalent and represent the same conditional independence. In a given conditional independence set of DAGs, one can only determine a DAG up to its equivalence class. The equivalence class is called completed partially directed acyclic graph (CPDAG). It has the same skeleton as every DAG in the equivalence class and directed edges only where all DAGs in the equivalence class have the same directed edge. Arrows that point into one direction for some DAGs in the equivalence class and in the other direction for other DAGs in the equivalence class are represented by undirected edges (Figure 1). By assuming that random variables are multivariate normally distributed, conditional independencies can be inferred from a partial correlation between Xi and Xj given a set of other variables S that equals zero:

thumbnailFigure 1. Example of CPDAG. CPDAG G with the DAGs G1, G4 that are in its equivalence class.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M1">View MathML</a>

(1)

We then used the sample version of the PC-algorithm to estimate the corresponding CPDAG. This involves multiple testing for Fisher’s Z-transformed partial correlations,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M2">View MathML</a>

Because <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M3">View MathML</a> has a N(0, (n − |S| − 3)− 1) distribution if ρij|S = 0, we conclude that ρij|S ≠ 0 if

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M4">View MathML</a>

where Φ is the standard normal distribution function and α is a tuning parameter, which can be interpreted as the significance level of a single partial correlation test. Choosing an appropriate value for α is difficult but, for example, can be done with the Bayesian information criterion.

First, the PC-algorithm generates a skeleton on the basis of conditional independencies. The outline of the PC-algorithm is shown in Figure 2. The complete PC-algorithm is described in detail in a precious work [7]. Note that the PC-algorithm employs partial correlation to test conditional independency.

thumbnailFigure 2. PC-algorithm for generating the skeleton.

Estimating causal effects using intervention-calculus

Again, we considered p + 1 random variables X1, … Xp, Y (also referred to as X1, … Xp, Xp + 1). Any distribution that is generated from a DAG with independent error is called Markovian with respect to the DAG. Therefore, any Markovian distribution can be factorized as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M5">View MathML</a>

To represent the effect of an intervention of a set of variables, a do operator is introduced. We denoted the distribution of Y that would occur if the treatment condition <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M6">View MathML</a> was enforced uniformly over the population via some intervention as <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M7">View MathML</a>. For a Markovian model, the distribution generated by an intervention <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M8">View MathML</a> on the set of variables X1, …, Xp + 1 is given by the following truncated factorization formula:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M9">View MathML</a>

(2)

Where f(xj|paj) are the pre-intervention conditional distributions. Note that this formula employs the DAG structure to write the interventional distribution on the left-hand side in terms of pre-intervention conditional distributions on the right-hand side. The distribution of Y = Xp + 1 after an intervention <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M10">View MathML</a> can be obtained by marginalizing out x1, …, xp in equation (2). This can be written as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M11">View MathML</a>

where f(·) and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M12">View MathML</a> represent pre-intervention distributions. We can summarize the distribution generated by the intervention by its mean

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M13">View MathML</a>

and define the causal effect of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M14">View MathML</a> on Y by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M15">View MathML</a>

(3)

Under the assumption that X1, … Xp, Y are jointly Gaussian, it is easy to compute the causal effects. Gaussianity implies that <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M16">View MathML</a> is linear in <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M17">View MathML</a> and pai That is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M18">View MathML</a>

for some values, γ0, γi, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M19">View MathML</a>, where |pai| is the cardinality of the set pai. Then, we derive

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M20">View MathML</a>

which is linear in <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M21">View MathML</a>. The causal effect of Xi on Y when Y ∉ pai can be computed with (3) yielding γi, which is simply the regression coefficient of Xi in the regression of Y on Xi and pai. When Y ∈ pai, the causal effect becomes zero, because Y is a direct cause of Xi. Therefore, the causal effect of Xi on Y is given by the following equation:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M22">View MathML</a>

where Y ∼ Xi + pai is a shorthand notation for the linear regression of Y on Xi and pai. Consequently, in the Gaussian case, the causal effect does not depend on the value of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M23">View MathML</a> and can be interpreted as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M24">View MathML</a>

for any value of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M25">View MathML</a>.

Next, we will describe the IDA algorithm in detail. For simplicity, we only show the case in which the PC-algorithm gives the correct CPDAG. Details of the sample version of the IDA algorithm can be found in previously conducted studies [5-7]. On the basis of CPDAG, which is inferred from the PC-algorithm, we can compute the causal effect for every DAG in the equivalence class, which consists of multisets. Multisets differ from normal sets in that the multiplicity of the elements is taken into account. The IDA algorithm is given in Figure 3.

thumbnailFigure 3. IDA algorithm.

For computing the causal effects θij of Xi on Y in DAG Gj, the IDA algorithm simply computes the regression coefficient of Xi in a regression of Y on Xi and pai, which is denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M26">View MathML</a>. This procedure is performed for every DAG in the equivalence class yielding a vector of length j of possible causal effects, where j is the number of DAGs in the equivalence class. Computing the effect of every X1, …, Xp on Y yields a matrix Θ with θij entries. We describe the illustrative procedure of IDA in Figure 4. When we obtain a single value for the estimated causal effect, we can take the minimal absolute value of the multiset as a lower bound from Θ for the true absolute intervention effect. This procedure intends to reduce the number of false positives. From a practical point of view, because the number of false positive should be kept as low as possible, it is desirable to provide conservative results.

thumbnailFigure 4. Illustrative procedure of IDA.

Non-paranormal method for NPN-IDA

In the PC-algorithm, the conditional independency is inferred from a sample partial correlation <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M27">View MathML</a> between Xi and Xj given a set of other variables S. As described in the section of the PC-algorithm, a sample partial correlation coefficient can be obtained by calculating a sample Pearson’s correlation coefficient. This fact enables us to relax the Gaussianity assumption of the PC-algorithm by using the non-paranormal method. On the basis of [9], we derive the non-paranormal version of the PC-algorithm (NPN-PC algorithm). Let f = (f1, …, fp) be a set of monotonic univariate functions and let ∑ 0 ∈ p × p be a positive definite correlation matrix with diag(∑0) = 1. We suppose that the p-dimensional random variable X = (X1, …, Xp)T has a non-paranormal distribution X ∼ NPNp(f, ∑ 0) if f(X) ≡ (f1(X1), …, fp(Xp))T ∼ N(0, ∑ 0). For continuous distributions, the non-paranormal family is equivalent to the Gaussian copula family [8]. It has been shown that the non-paranormal family is much richer than the normal family. The conditional independence is encoded by Ω0 = (∑0)− 1. Therefore, we can write the following formula given a set of other variables S:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M28">View MathML</a>

(4)

For Gaussian copula distributions, the following important lemma connects Spearman’s correlation coefficient <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M29">View MathML</a> to the underlying Pearson’s correlation coefficient [8,9].

Lemma 1.[10] By assuming X ∼ NPN(f, ∑ 0), we have <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M30">View MathML</a>.

According to the lemma, we can define the following estimator <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M31">View MathML</a> for the unknown correlation matrix ∑ 0: <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M32">View MathML</a> for i ≠ j, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M33">View MathML</a>. Finally, if we define <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M34">View MathML</a>, we obtain the following formula that connects Spearman’s correlation coefficient and the non-paranormal partial correlation coefficient <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M35">View MathML</a>.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M36">View MathML</a>

Instead of ρij|S in (1), we employ <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M37">View MathML</a> to test conditional independence for estimating CPDAG in the PC-algorithm. For simplicity, we denoted the method that combines the NPN-PC algorithm and IDA as NPN-IDA. We describe the pseudo code of the NPN-IDA algorithm in Figure 5.

thumbnailFigure 5. NPN-IDA algorithm.

Results

Experimental settings

Two experiments were conducted for different purposes. The first purpose was to evaluate the performance of the NPN-PC algorithm on synthetic data when the Gaussianity assumption is not true. According to a previous study [7], the four metrics of performance, i.e., true positive rate (TPR), false positive rate (FPR), true discovery rate (TDR), and structural hamming distance (SHD), are used. TPR, FPR, and TDR are defined as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M38">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M39">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/228/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/228/mathml/M40">View MathML</a>

where gaps indicate the pairs of vertex that are not linked.

SHD is the number of edge insertions, deletions, and flips to transfer the estimated DAG into the true DAG [7]. A large SHD indicates a poor fit, whereas a small SHD indicates a good fit.

To simulate the case that the Gaussian assumption is not true, we generated multivariate data with dependency structure by a given DAG with nodes corresponding to random variables and added standard Cauchy distribution in partial samples using the rmvDAG function in the pcalg package. Figure 6 depicts the normal distribution and the Cauchy distribution. As shown in Figure 6, the standard Cauchy distribution is tail-heavier than the standard normal distribution, and is quite different from the standard normal distribution.

thumbnailFigure 6. Comparison of standard normal distribution and standard Cauchy distribution.

The second purpose was to evaluate the performance of NPN-IDA when applied to predicting regulators of the flowering time as phenotype of interest using an A. thaliana microarray data set and regulators that control the browning of white adipocytes in mice using mouse microarray data. These two data sets are entirely different in terms of species and target variables. Therefore, if we obtain the consistent results thorough performance comparison of the methods, the consequence will be solid. We implemented the R language and employed the pcalg package for the PC-algorithm and IDA [11].

Performance evaluation of the NPN-PC algorithm

We evaluated the performance under all combinations of settings listed in Table 1. The mixing rate indicates the percentage of samples whose error distribution was drawn from the standard Cauchy distribution. The higher the mixing rate, the less accurate is the Gaussianity assumption. To evaluate this hypothesis thoroughly, we repeated this experiment 50 times and averaged the values of TPR, FPR, TDR, and SHD. To explain the essence of these experiments, we show the representative results of the settings (α = 10− 4, cp = 0.005) in Table 2. All experimental results are shown in Additional file 1. To estimate the causal effects, estimated DAGs inferred by NPN-PC were employed. Because we considered the situation that the Gaussianity assumption is violated, we determined the significance level α on the basis of the average SHD when the mixing rate was 1. Average SHDs under the different probabilities of connecting one node to another node are shown in Figure 7. Because the average SHD was very small, we employed estimated DAGs when the significance level α was set to 10−4 to further estimate the causal effects.

Table 1. Parameter setting for performance evaluation of the NPN-PC algorithm

Table 2. Performance comparison between PC-algorithm and NPN-PC algorithm

Additional file 1. Experimental results of all combined parameter settings for performance comparison between the NPN-PC algorithm and the PC-algorithm.

Format: XLSX Size: 66KB Download fileOpen Data

thumbnailFigure 7. Average SHD for p = 100 and 200 under different conditions of significance levels. (a) (c) cp = 0.005, (b) (d) cp = 0.01.

Performance evaluation of the NPN-IDA algorithm

Exploring regulators of the flowering time in A. thaliana

We employed the microarray data set of A. thaliana and the flowering time used in a previous study [7]. The data set consisted of 21326 probes and 47 samples. We regarded the nine known regulators of the flowering time (DWF4, FLC, FRI, RPA2B, SOC1, PDH-E1 BETA, ATGH9B5, LRR, and OTLD1) as true-positive genes. Because IDA requires a significant amount of computation time, we selected genes for further estimation that had higher absolute correlation coefficients against flowering time until a predefined number was reached (correlation screening). In this study, the numbers of remaining genes obtained by correlation screening were set to 500, 1000, 1500, and 2000. According to the description in the previous section, we set the significance level α to 10−4 for estimating DAGs. Figure 8 shows the enrichment curves under the different conditions of correlation screening. Vertical axes represent the average numbers of true positives and horizontal axes indicate the ranks of causal effects. To quantitatively compare the performances of IDA and NPN-IDA, we also summarized the area under the enrichment curves (AUCs) under the different conditions of correlation screening in Table 3.

thumbnailFigure 8. Enrichment curves on exploring regulators of the flowering time in Arabidopsis thaliana. (a) Correlation screening is set to 500, (b) correlation screening is set to 1000, (c) correlation screening is set to 1500, and (d) correlation screening is set to 2000.

Table 3. AUC values under the different conditions of correlation screening on exploring regulators of the flowering time in Arabidopsis thaliana

Exploring the regulators that control the browning of white adipocytes in mice

We employed the mouse microarray data set of white adipose tissue (WAT) obtained from a previous study [12]. The data set consisted of 43681 probes and 349 WAT samples. According to a previous review [13], we regarded Ucp1 as marker of brown adipose tissue (BAT). We selected genes that had higher absolute correlation coefficients against Ucp1 until a predefined number was reached. The numbers of remaining genes were set to 2000, 3000, and 4000. Table 4 shows the known regulators of the differentiation of WAT to BAT for the different conditions of correlation screening. Note that there are no true positive genes when the number of remaining genes obtained by correlation screening is below 1000.

Table 4. Known regulators of the differentiation of WAT to BAT

Because there are two probes for Tbx15, i.e., merck-NM_009323_at and merck-NM_01154_a_at, we regarded the probe that had the larger causal effect as Tbx15. Figure 9 shows the enrichment curves under different conditions of correlation screening. Vertical axes represent the average numbers of true positives and horizontal axes indicate the ranks of causal effects. The AUCs under the different conditions of correlation screening are summarized in Table 5.

thumbnailFigure 9. Enrichment curves on exploring the regulators that control the browning of white adipocytes. (a) Correlation screening is 2000, (b) correlation screening is 3000, and (c) correlation screening is 4000.

Table 5. AUC values under the different conditions of correlation screening on exploring the regulators that control the browning of white adipocytes in mice

Discussion

From Table 2, it appears that the NPN-PC algorithm consistently outperforms the PC-algorithm with regard to all performance metrics, TPR, FPR, TDR, and SHD. Furthermore, as the mixing rate increases, the performance values of the PC-algorithm decrease. This result clearly shows that the PC-algorithm does not work when the Gaussianity assumption is not true. In contrast, the NPN-PC algorithm works well when mixing rate is high. In other words, the NPN-PC algorithm does not require the Gaussianity assumption of data distribution. In terms of FPR, it appears that the FPR of the NPN-PC algorithm is strictly suppressed under all set values. In the NPN-PC algorithm, all performance metrics improved when the number of samples was 100 compared to when the number of samples was 50. From these observations, it can be concluded that the NPN-PC algorithm is robust and does not depend on data distribution, unlike the PC-algorithm. This characteristic is very attractive from the practical point of view.

From Table 3 and Figure 8, NPN-IDA consistently outperformed IDA on exploring regulators of the flowering time in A. thaliana. In particular, when the correlation screenings were set to 1000 and 1500, the difference in the AUCs between NPN-IDA and IDA increased. When the correlation screenings were set to 500, both NPN-IDA and IDA worked well. This result indicates that the known regulators were sufficiently recovered against the flowering time within genes that have high absolute correlation coefficients. Therefore, although we do not know whether the unknown regulators have high absolute correlation coefficients against the flowering time, it would be a good strategy from a practical perspective to explore novel regulators from genes that have high absolute correlation coefficients against the flowering time.

From Table 5 and Figure 9, NPN-IDA consistently outperformed IDA on exploring regulators that control the browning of white adipocytes in mice. In particular, when the correlation screenings were set to 3000 and 4000, the difference in the AUCs between NPN-IDA and IDA increased. These results suggest that NPN-IDA successfully enriches known regulators at the top ranks when the number of available genes increases. Consequently, our two experiments clearly demonstrated that NPN-IDA performs better than IDA.

To discuss whether the Gaussian assumption is true or not in the data sets used in this study, we applied the Shapiro-Wilk test to microarray data and a target variable of interest, which tests the null hypothesis that a sample came from a normally distributed population [14]. We show a histogram of the p-values of the Shapiro-Wilk test for target phenotypes of interest (flowering time in A. thaliana and gene expression of Ucp1 in mice) and gene expression levels in Figure 10. We also show individual histograms of phenotypes of interest and expression levels of the known regulators in Figure 11; the respective p-values of the Shapiro-Wilk test are summarized in Table 6. From Figure 10, it appears that the p-values of most genes were <0.05 for both A. thaliana and mouse WAT microarray data. In other words, the normality assumption was violated in most genes. These results justify the use of NPN-IDA rather than IDA, because the latter method requires a normal distribution. With regard to the A. thaliana data, it appears that the p-values for flowering time, FLC, FRI, RPA2B, ATGH9B5, and LRR were <0.05 (Table 6); as shown in Figure 11, their distributions were skewed. With regard to the mouse WAT data, the p-values of all genes were very small (Table 6). As shown in Figure 12, their distributions were also skewed. These results strongly suggest that NPN-IDA, which does not require the Gaussian distribution, works better than IDA.

thumbnailFigure 10. Histogram of p-values of the Shapiro-Wilk test. (a)Arabidopsis thaliana microarray data and (b) mouse WAT microarray data (logarithmic scale).

thumbnailFigure 11. Distributions of flowering time and expression levels of the nine known regulators of the flowering time. Each panel is a histogram of the flowering time and each gene’s expression.

Table 6. p-values of the Shapiro-Wilk test of target variables and known regulators

thumbnailFigure 12. Distributions of expression levels of the BAT marker gene Ucp1 and the five known genes.Tbx has two different probes. Each panel is a histogram of each gene’s expression.

Although the method presented here performs significantly better than IDA, one might consider the difference in the performance as too small. However, within the known regulators of flowering time in A. thaliana, four genes (PDH-E1 BETA, ATGH9B5, LRR, and OTLD1) were experimentally validated using the IDA-based method [6]. Therefore, one should take into account that some of the known regulators were obtained on the basis of the Gaussian assumption. Thus, it is possible that the improvement achieved with NPN-IDA is greater than the experimental results shown in this study.

In summary, the two main results of our experimental study are: (1) the NPN-PC algorithm works better than the PC-algorithm in estimating DAGs on synthetic data, and (2) NPN-IDA performs better than does IDA in predicting regulators of the flowering time in A. thaliana and regulators of the differentiation of WAT to BAT in mice on the basis of microarray data. From these two results, we conclude that the performance to estimate DAGs contributes to the accurate prediction of causal effects.

From a practical point of view, because regulators that control the browning of white adipocytes have not been identified only from microarray data of WAT so far, it might be worthwhile to demonstrate this possibility for the first time using our novel method.

For further performance enhancement, we consider that NPN-IDA could be embedded into CStaR (causal stability ranking) [6] in the future. CStaR uses IDA with stability selection [6,15] and the performance was greatly improved compared to plain IDA. By simply replacing IDA with NPN-IDA in the estimation process for causal effects, it would be easy to combine NPN-IDA with CStaR.

Conclusions

We presented NPN-IDA, which uses nonparametric partial correlations, to test conditional independencies in the PC-algorithm for intervention-calculus. To reveal the contribution of estimating DAGs, we evaluated the NPN-PC algorithm to estimate DAGs with the non-paranormal method. The two main results of our experimental study are: (1) the NPN-PC algorithm works better than the PC-algorithm in estimating DAGs on synthetic data, and (2) NPN-IDA performs better than IDA does in predicting regulators of the flowering time in A. thaliana and regulators of the differentiation of WAT to BAT in mice. Considering these two results, we concluded that the performance to estimate DAGs contributes to the accurate prediction of causal effects.

Thus, the simplest alternative procedure we used for our proposed method enables us to design efficient intervention experiments and can be applied to a wide range of research purposes, including drug discovery and medicine, because of its generality.

Availability of supporting data

The microarray data sets used in this study are deposited in Additional file 2.

Additional file 2. The mouse microarray data set of microarray data set of white adipose tissue (WAT) and the microarray data set of A. thaliana and the flowering time.

Format: ZIP Size: 18MB Download fileOpen Data

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RT proposed the method, implemented, conduct experiments and wrote the manuscript. CS edited the tables. SF supervised the research project. All authors read and approved the final manuscript.

Acknowledgements

We are grateful to Tatsumi Yamazaki, the president of Forerunner Pharma Research, for his support and encouragement throughout the study. We also thank Takumi Nagata and Kenji Yoshida for their valuable discussions.

References

  1. Ahfeldt T, Schinzel RT, Lee YK, Hendrickson D, Kaplan A, Lum DH, Camahort R, Xia F, Shay J, Rhee EP, Clish CB, Deo RC, Shen T, Lau FH, Cowley A, Mowrer G, Al-Siddiqi H, Nahrendorf M, Musunuru K, Gerszten RE, Rinn JL, Cowan CA: Programming human pluriopotent stem cells into white and brown adipocytes.

    Nat Cell Biol 2012, 14:209-219. OpenURL

  2. Takahashi K, Tanabe K, Ohnuki M, Narita M, Ichisaka T, Tomoda K, Yamanaka S: Induction of pluripotent stem cells from adult fibroblasts by defined factors.

    Cell 2007, 13:861-872. OpenURL

  3. Ring KL, Tong LM, Balestra ME, Javier R, Andrew-Zwilling Y, Li G, Walker D, Zhang WR, Kreitzer AC, Huang Y: Direct reprogramming of mouse and human fibroblasts into multipotent neural stem cells with a single factor.

    Cell Stem Cell 2012, 11:100-2109. OpenURL

  4. Maathuis MH, Kalisch MK, Buhlmann P: Estimating high-dimensional intervention effects from observational data.

    Ann Stat 2009, 37:3133-3164. OpenURL

  5. Maathuis MH, Colombo D, Kalisch MK, Buhlmann P: Predicting causal effects in large-scale systems from observational data.

    Nat Methods 2010, 7:247-248. OpenURL

  6. Stekhoven DJ, Moraes I, Sveinbjornsson G, Hennig L, Maathuis MH, Buhlmann P: Causal stability ranking.

    Bioinformatics 2012, 28:2819-2823. OpenURL

  7. Kalisch M, Buhlmann P: Estimating high-dimensional directed acyclic graphs with the PC-algorithm.

    J Machine Learning Res 2007, 8:613-636. OpenURL

  8. Liu H, Laffery J, Wasserman L: The nonparanormal: semiparametric estimation of high dimensional undirected graphs.

    J Machine Learning Res 2009, 10:2295-2328. OpenURL

  9. Liu H, Han F, Yuan M, Laffery J, Wasserman L: The Nonparanomal SKEPTIC. In Proceedings of the 29th International Conference of Machine Learning. Edinburg, Scotland, UK; 2012:1415-1422. OpenURL

  10. Kruskal WH: Ordinal measures of association.

    J Am Stat Assoc 1958, 53:814-861. OpenURL

  11. Kalisch M, Maechler M, Colombo D, Maathuis MH, Buhlmann P: Causal inference using graphical models with R package pcalg.

    J Stat Software 2012, 47:1-26. OpenURL

  12. Muise ES, Souza S, Chi A, Tan Y, Zhao X, Liu F, Dallas-Yang Q, Wu M, Sarr T, Zhu L, Guo H, Li Z, Li W, Hu W, Jiang G, Paweletz CP, Hendrickson RC, Thompson JR, Mu J, Berger JP, Mehmet H: Downstream signaling pathways in mouse adipose tissues following acute in vivo administration of fibroblast growth factor 21.

    PLoS One 2013, 8:e73011. OpenURL

  13. Lo KA, Sun L: Turning WAT into BAT: a review on regulators controlling the browning of white adipocytes.

    Biosci Rep 2013, 33:711-719. OpenURL

  14. Shapiro SS, Wilk MB: An analysis of variance test for normality (complete samples).

    Biometrika 1965, 52:591-611. OpenURL

  15. Meinshausen N, Buhlmann P: Stability selection.

    J Roy Stat Soc B 2010, 72:417-473. OpenURL