Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Belgium

Computational Biology and Functional Genomics Laboratory, Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, USA

Breast Cancer Translational Research Laboratory, Department of Medical Oncology, Institut Jules Bordet, Université Libre de Bruxelles, Belgium

Abstract

Background

Traditional strategies for selecting variables in high-dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. While these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes. The latter is essentially related to the fact that high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting.

Results

We show in a synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition we show, in a meta-analysis of six publicly available breast cancer microarray datasets, that the improvement also occurs in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection.

Conclusions

Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.

Background

Supervised analysis of genomic datasets (e.g., gene expression microarrays or comparative genomic hybridization arrays) with a large number of features and a comparatively small number of samples requires the adoption of either regularization or feature selection strategies.

It is well established that the detection of causal patterns cannot be carried out in a bivariate (single-input single-output) context and that at least a trivariate setting has to be considered.

The contributions of this paper can be summarized as follows. First, we introduce a new causal filter based on the interaction information and we show how to estimate this quantity in a multiple-input multiple-output setting. Second, we assess the capacity of such a filter to prioritize causal variables by using a synthetic case study. Third, we measure from an accuracy and a biological point of view the performance of such a causal filter in a number of prognostic studies in breast cancer. We advocate that a multiple-input multiple-output approach is particularly relevant in clinical studies, where it is common that more than a single target variable is collected. This is the case of prognostic studies of breast cancer patients, where several clinical indices, including patients' tumor size and histological grade, are collected together with the survival of the patients and the gene expressions of their tumor. It is worth noting that, in spite of their availability, these additional phenotypes are usually not taken into consideration, since statistical studies focus on survival prediction and adopt single-output methods.

This paper describes an original multiple-input multiple-output score which combines a conventional relevance term with a causal term. This additional term quantifies the causal role of the features and allows the prioritization of causal variables in the resulting ranking. We carried out a synthetic study, where the set of causal dependencies is known, which shows that causal variables are highly ranked once this score is adopted. We performed a meta-analysis of six publicly available breast cancer microarray datasets to assess the improvement in accuracy of our causal relevance score over the conventional ranking. The related discussion also shows that it is possible to carry out a biological interpretation of the role of the selected variables, which makes it possible to discriminate between potentially causal and relevant, yet non-causal, features. The source code, documentation and data are open-source and publicly available.

Methods

Mutual information and interaction

Let us consider a multiple-input multiple-output (MIMO) classification problem characterized by a set **X** = {**x**_{i}, i = 1, ..., n} of input variables and a set **Y** = {**y**_{j}, j = 1, ..., m} of target variables, with **y**_{1} as the primary target. Rather than predicting **y**_{1} alone, we want to take advantage of the causal information which can be extracted from multiple targets. We begin by reviewing some notions of information theory by considering three random (boldface) variables, notably two inputs **x**_{1}, **x**_{2} and the primary target **y**_{1}. The mutual information between **x**_{1} and **x**_{2} is defined in terms of their probability density functions p(x_1), p(x_2) and p(x_1, x_2) as

I(x_1; x_2) = ∫∫ p(x_1, x_2) log [ p(x_1, x_2) / ( p(x_1) p(x_2) ) ] dx_1 dx_2     (1)

where I(x_1; x_2) measures the dependency between **x**_{1} and **x**_{2} and is also called two-way interaction. If **x**_{1} and **x**_{2} are Gaussian distributed, the following relation holds

I(x_1; x_2) = -(1/2) log(1 - ρ^2)     (2)

where ρ is the Pearson correlation coefficient between **x**_{1} and **x**_{2}.
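As a quick numerical check, the Gaussian relation (2) can be estimated by plugging the sample Pearson correlation into the formula. A minimal sketch (the function name and the synthetic data are illustrative, not from the original study):

```python
import numpy as np

def gaussian_mi(x1, x2):
    # Eq. (2): I(x1; x2) = -1/2 * log(1 - rho^2) under a bivariate Gaussian,
    # with rho the Pearson correlation coefficient.
    rho = np.corrcoef(x1, x2)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
# x2 has correlation 0.8 with x1 by construction (0.8^2 + 0.6^2 = 1)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=5000)
print(gaussian_mi(x1, x2))  # close to -0.5 * log(1 - 0.64), i.e. about 0.51 nats
```

The estimate converges to the analytical value as the sample size grows, which is the property exploited below when all mutual information terms are computed under the Gaussian approximation.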

Let us now consider the target **y**_{1}, too. The conditional mutual information I(x_1; x_2 | y_1) between **x**_{1} and **x**_{2}, once **y**_{1} is given, is defined by

I(x_1; x_2 | y_1) = ∫∫∫ p(x_1, x_2, y_1) log [ p(x_1, x_2 | y_1) / ( p(x_1 | y_1) p(x_2 | y_1) ) ] dx_1 dx_2 dy_1     (3)

The conditional mutual information is null iff **x**_{1} and **x**_{2} are conditionally independent given **y**_{1}. The change of dependence between **x**_{1} and **x**_{2} due to the knowledge of **y**_{1} is measured by the three-way interaction

I(x_1; x_2; y_1) = I(x_1; x_2) - I(x_1; x_2 | y_1)     (4)

This measure quantifies the amount of mutual dependence that cannot be explained by bivariate interactions. When it is different from zero, we say that **x**_{1}, **x**_{2} and **y**_{1} three-interact. A non-zero interaction can be either negative, which denotes a synergy or complementarity between the variables, or positive, which indicates redundancy. Because of the symmetry of the interaction, we also have

I(x_1; x_2; y_1) = I(x_1; y_1) - I(x_1; y_1 | x_2) = I(x_2; y_1) - I(x_2; y_1 | x_1)

By (4) we derive

I(x_1; y_1 | x_2) = I(x_1; y_1) - I(x_1; x_2; y_1)     (5)

Since the joint information of **x**_{1} and **x**_{2} to **y**_{1} can be written as

I((x_1, x_2); y_1) = I(x_2; y_1) + I(x_1; y_1 | x_2)     (6)

it follows that by adding I(x_2; y_1) to both sides of (5) we obtain

I((x_1, x_2); y_1) = I(x_1; y_1) + I(x_2; y_1) - I(x_1; x_2; y_1)

Note that the above relationships hold also when either **x**_{1} or **x**_{2} is a vectorial random variable.
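For discrete variables, all the quantities above can be estimated from empirical frequencies. The sketch below (an illustration, not the estimator used in the paper) recovers the sign conventions of Eq. (4): a collider (XOR) yields a negative interaction, while two noisy copies of a common cause yield a positive one.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    # Empirical joint entropy (in nats) of one or more discrete columns.
    counts = Counter(zip(*cols))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log(p)))

def mi(a, b):
    # Mutual information I(a; b), Eq. (1), from discrete samples.
    return entropy(a) + entropy(b) - entropy(a, b)

def cond_mi(a, b, c):
    # Conditional mutual information I(a; b | c), Eq. (3).
    return entropy(a, c) + entropy(b, c) - entropy(a, b, c) - entropy(c)

def interaction(a, b, c):
    # Three-way interaction I(a; b; c) = I(a; b) - I(a; b | c), Eq. (4):
    # negative -> synergy (collider), positive -> redundancy.
    return mi(a, b) - cond_mi(a, b, c)

rng = np.random.default_rng(1)
x1 = rng.integers(0, 2, 10000)
x2 = rng.integers(0, 2, 10000)
y = x1 ^ x2                      # y is a common effect (collider) of x1 and x2
print(interaction(x1, x2, y))    # close to -log(2): explaining away, negative

c = rng.integers(0, 2, 10000)
a = c ^ (rng.random(10000) < 0.1).astype(int)   # noisy copies of a common cause
b = c ^ (rng.random(10000) < 0.1).astype(int)
print(interaction(a, b, c))      # positive: the dependence is explained by c
```

The XOR example reproduces the explaining-away effect discussed below: the parents are marginally independent but become dependent once the collider is observed.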

Feature selection, causality and interaction

Consider a multiple-class classification problem where **x** ∈ **X** ⊂ ℝ^n is the input vector and **y**_{1} is the target class. The Max-Dependency approach to feature selection looks for the subset **X*** of d variables maximizing the dependency with the target:

X*_d = arg max_{X_S ⊂ X, |X_S| = d} I(X_S; y_1)     (7)

where the score of a subset **X**_{S} of variables is given by the mutual information it brings to the target. In other words, for a given number d of variables, the selected subset is the one which maximizes the joint information about **y**_{1}.

If we want to carry out the maximization (7), both an estimation of the mutual information and a search over the subsets of **X** are required. As far as the search is concerned, according to the Cover and Van Campenhout theorem, only an exhaustive exploration of all the subsets guarantees the optimal solution; since this is infeasible for large n, incremental (forward) strategies are commonly adopted.

Given a set **X**_{S} of already selected features, a forward step adds the k-th feature which maximizes the increase of the dependency

x*_k = arg max_{x_k ∈ X \ X_S} I((X_S, x_k); y_1)

where (**X**_{S}, **x**_{k}) stands for the set of variables resulting from the union of **X**_{S} and **x**_{k}. Since for large subsets the estimation of this multivariate term is ill-conditioned, the simplest alternative is the univariate criterion

x*_k = arg max_{x_k ∈ X \ X_S} I(x_k; y_1)

which leads to a ranking of the variables according to their mutual information with the target. More advanced approaches rely on bivariate decompositions of the form

x*_k = arg max_{x_k ∈ X \ X_S} Σ_{x_i ∈ X_S} I((x_i, x_k); y_1)

where the term I((x_i, x_k); y_1) quantifies the information that **x**_{i} and **x**_{k} contain jointly about **y**_{1}.

However, a feature selection procedure targeting the Max-Dependency is not able in general to discriminate between causal and non-causal dependencies. For instance, in a selection procedure applied to a dataset derived from a causal process like the one in Figure, effects or descendants of effects such as **x**_{3} or **x**_{4} could be more highly ranked than the direct causes **x**_{1} and **x**_{2}.

Single-output case with different causal patterns: (i) **x**_{1}, **x**_{2 }and **y**_{1}; (ii) **x**_{5 }and **y**_{1}; (iii) **y**_{1}, **x**_{3}, **x**_{4}; and (iv) **x**_{1}, **y**_{1}, **x**_{4}

**Single-output case with different causal patterns: (i) common effect (explaining away) configuration involving x_{1}, x_{2 }and y_{1}; (ii) spouse configuration involving x_{5 }and y_{1}; (iii) common cause configuration involving y_{1}, x_{3 }and x_{4}; and (iv) causal chain configuration involving x_{1}, y_{1 }and x_{4}**.

Here we propose to modify the conventional score by adding a term which accounts for the causal role of the features.

Interaction and causal dependency

This section aims to establish the link between information theory and causality. Causality is at the same time an essential and imprecise notion in scientific discovery. In order to avoid any ambiguity, here we adopt the formalism of causal Bayesian networks, which is a sound and convenient framework for reasoning about causality between random variables: a directed edge from a node **x**_{i} to a node **x**_{j} means that **x**_{i} directly causes **x**_{j}. In formal terms we assume that the Causal Markov, Causal Faithfulness and Causal Sufficiency conditions hold.

Let us consider a triplet made of two inputs **x**_{i}, **x**_{j} and one target **y**_{1}. One configuration corresponds to a graph where only **x**_{i} and **x**_{j} are linked, and it is well known in the literature that for a system of two variables the causal structure is not distinguishable. Another configuration corresponds to a fully connected graph, and in this case the lack of independencies implies that the direction of the arrows cannot be determined. The remaining configurations can be illustrated and detected by studying the relationship between the interaction I(x_i; x_j; y_1) and the causal patterns of the triplet, like the ones sketched in Figure.

A negative interaction I(x_i; x_j; y_1) means that the knowledge of the value of **y**_{1} increases the amount of dependence between **x**_{i} and **x**_{j}; this situation occurs in the presence of a collider. According to the label of the collider we can have two cases: i) the common effect configuration involving **x**_{1}, **x**_{2} and **y**_{1}, also known as the explaining-away configuration; ii) the spouse configuration involving **x**_{3}, **x**_{5} and **y**_{1} in Figure (where **x**_{3} is the common descendant of **y**_{1} and **x**_{5}). This is a consequence of the fact that, if we assume Causal Faithfulness, the graph structure entails that the two parents are independent (null mutual information) but conditionally dependent (conditional mutual information bigger than zero). Note that both configurations are characterized by the presence of a collider.

On the contrary, a positive interaction I(x_i; x_j; y_1) between **x**_{i} and **x**_{j} means that the knowledge of **y**_{1} decreases the amount of dependence. This situation occurs in two cases: i) the common cause configuration, where **x**_{3} and **x**_{4} become independent once the value of the common cause **y**_{1} is known, as illustrated in Figure; ii) the causal chain configuration, where one of the two variables (say, **x**_{1}) is the cause and the other (say, **x**_{4}) is the effect of **y**_{1}. This is due to the fact that the graph entails the dependence between **x**_{i} and **x**_{j} as well as their conditional independence (null conditional mutual information).

So far we have considered a single-output configuration. However, causal patterns can be better identified if we consider a multiple-output configuration, for instance the two-output configuration sketched in Figure. If **y**_{1} and **y**_{2} are two outputs representing different observations of the same phenomenon (for example a disease), we expect that the causal configurations concerning the first output appear also for the second one. This is a reasonable assumption in breast cancer clinical studies, where the measured phenotypes (e.g., size and histological grade of the tumor) can be considered as different manifestations of the state of the tumor.

Two-output case with different causal patterns: (i) **x**_{3}, **y**_{1 }and **y**_{2}; (ii) **y**_{2 }and **x**_{6}; (iii) **x**_{1}, **y**_{1 }and **y**_{2}; and (iv) **x**_{1}, **y**_{2 }and **x**_{7}

**Two-output case with different causal patterns: (i) common effect configuration involving x_{3}, y_{1 }and y_{2}; (ii) spouse configuration involving y_{2 }and x_{6}; (iii) common cause configuration involving x_{1}, y_{1 }and y_{2}; and (iv) causal chain configuration involving x_{1}, y_{2 }and x_{7}**.

Let us consider for instance the inputs **x**_{1} and **x**_{2} and the two targets **y**_{1} and **y**_{2}: the common effect pattern involving **x**_{1}, **x**_{2} and **y**_{1} holds also for the triplet **x**_{1}, **x**_{2} and **y**_{2}. The same happens for the common cause pattern, which involves both the triplet **x**_{3}, **x**_{4}, **y**_{1} and the triplet **x**_{3}, **x**_{4}, **y**_{2}. The presence of multiple outputs can therefore make the identification of a causal pattern more robust, especially in data configurations characterized by a very large number of variables.

In the following we will take advantage of these considerations to design a causal filter able to extract from observed data causal dependencies between variables.

The MIMO causal filter

The link between causality and interaction discussed in the previous section suggests that, if we want to detect causality without estimating large-variate dependencies, we may search for patterns like the one sketched in Figure, where a pair of inputs plays a causal role on a pair of outputs. Such a pattern is characterized by two conditions:

Two-inputs two-outputs causal pattern

**Two-inputs two-outputs causal pattern**.

1. the interaction I(x_1; x_2; y_1) is negative;

2. the interaction I(x_1; x_2; y_2) is negative.

In what follows we implement this idea into a MIMO causal filter where input variables belonging to causal patterns like the one in Figure are prioritized.

For the pair of inputs **x**_{1} and **x**_{2} and the pair of outputs **y**_{1} and **y**_{2}, we define a structural score

C(x_1, x_2) = -[ I(x_1; x_2; y_1) + I(x_1; x_2; y_2) ]

which is composed of two multiple-input interaction terms. The magnitude of this score depends on whether **x**_{1} and **x**_{2} play a joint causal role on **y**_{1} and **y**_{2}, or in other words, on how well the pattern in Figure fits the data: the higher the score C(x_1, x_2), the higher the evidence that the pair **x**_{1}, **x**_{2} is a cause of **y**_{1} and **y**_{2}. This score plays a similar role to the score that is maximized in the structural identification of Bayesian networks: C(x_1, x_2) measures the likelihood of the data for a structural pattern where the pair **x**_{1}, **x**_{2} has a causal role.

In the case of a bivariate output (m = 2), the resulting selection criterion combines the relevance of a candidate **x**_{k} with its average causal score over **x**_{1} and **x**_{2}:

x*_k = arg max_{x_k ∈ X \ X_S} [ (1 - λ) I(x_k; y_1) + λ s_C(x_k) ]

where

s_C(x_k) = (1 / (|X_S| m)) Σ_{x_i ∈ X_S} Σ_{j=1}^{m} -I(x_i; x_k; y_j)

In other terms, this formulation suggests to add, at the (d+1)-th step, among all the remaining variables, the one which has the best combination of relevance and causality, where the causal term is obtained by averaging over the selected variables and the considered outputs. Note that in the case of λ = 0 the approach boils down to the conventional ranking.

Similarly to what is done in regularization approaches, the parameter λ ≥ 0 controls the trade-off between the relevance term and the causal term.
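The resulting filter can be sketched as follows, using the Gaussian approximation (2) and first-order partial correlations to estimate the conditional terms. This is a minimal illustration of the idea, not the released implementation: the function names, the 1/(|X_S|·m) normalization of the causal term and the synthetic generative model are assumptions made here for the example.

```python
import numpy as np

def gauss_mi_from_r(r):
    # Mutual information under the Gaussian approximation of Eq. (2).
    return -0.5 * np.log(np.clip(1.0 - r ** 2, 1e-12, None))

def partial_corr(r_ab, r_ac, r_bc):
    # First-order partial correlation of a and b given c.
    return (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def interaction_gauss(a, b, c):
    # Gaussian estimate of I(a; b; c) = I(a; b) - I(a; b | c), Eq. (4).
    r_ab = np.corrcoef(a, b)[0, 1]
    r_ac = np.corrcoef(a, c)[0, 1]
    r_bc = np.corrcoef(b, c)[0, 1]
    return gauss_mi_from_r(r_ab) - gauss_mi_from_r(partial_corr(r_ab, r_ac, r_bc))

def mimo_causal_rank(X, Y, lam, d):
    # Forward selection with score (1 - lam) * relevance + lam * causal term;
    # the causal term averages -I(x_i; x_k; y_j) over selected x_i and outputs y_j.
    n = X.shape[1]
    relevance = np.array([gauss_mi_from_r(np.corrcoef(X[:, k], Y[:, 0])[0, 1])
                          for k in range(n)])
    selected = []
    while len(selected) < d:
        best, best_score = None, -np.inf
        for k in range(n):
            if k in selected:
                continue
            causal = 0.0
            if selected:
                causal = np.mean([-interaction_gauss(X[:, i], X[:, k], Y[:, j])
                                  for i in selected for j in range(Y.shape[1])])
            score = (1 - lam) * relevance[k] + lam * causal
            if score > best_score:
                best, best_score = k, score
        selected.append(best)
    return selected

# Synthetic sketch: c1, c2 are direct causes of both outputs; eff is an effect of y1.
rng = np.random.default_rng(2)
m = 20000
c1, c2 = rng.normal(size=m), rng.normal(size=m)
y1 = c1 + c2 + 0.5 * rng.normal(size=m)
y2 = c1 + c2 + 0.5 * rng.normal(size=m)
eff = y1 + 0.3 * rng.normal(size=m)
X = np.column_stack([c1, c2, eff, rng.normal(size=m)])
Y = np.column_stack([y1, y2])
print(mimo_causal_rank(X, Y, lam=0.0, d=2))  # pure relevance: effect (index 2) first
```

On this toy problem the pure-relevance filter (λ = 0) ranks the effect first, while the interaction terms are negative for the pair of causes and positive for cause-effect pairs, which is exactly the signal the causal term exploits.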

Results

In this section we perform two experiments to assess the role of the causation term in the feature selection process. The first one is based on a number of synthetic datasets generated by simulating a causal Bayesian network while the second relies on public microarray breast cancer datasets to assess the approach in a real data setting.

Synthetic data

This experiment focuses on the prioritization of causes in a set of classification tasks defined on the basis of simulated data generated from the causal structure depicted in Figure. The three red nodes play the role of the targets **y**_{1}, **y**_{2} and **y**_{3} of the classification task and are discretized into binary values. Note that all measures are centered and scaled in order to have zero mean and unit standard deviation; this allows for a better understanding of the impact of the noise amplitude on the ranking.

Bayesian causal network used for synthetic experiment

**Bayesian causal network used for synthetic experiment**. The green node 9 denotes the non-observable variable. The three red nodes denote the targets of the multiple-output classification problem. The isolated nodes (30-40) represent a set of 11 independent variables.

The quality of our causal prioritization strategy is assessed by measuring the average ranking of the direct causes (nodes 4-8) and the percentage of times that the direct causes are ranked among the first 5 variables. These two measures (together with a 90% confidence interval) for different values of λ are shown in Figure.

Synthetic data experiment: average ranking of direct causes for different values of λ as a function of the noise standard deviation

**Synthetic data experiment: average ranking of direct causes for different values of λ as a function of the noise standard deviation**. Dotted lines are used to denote the 90% confidence interval estimated on the basis of 150 trials.

Synthetic data experiment: probability of selecting a direct cause among the first 5 ranked variables for different values of λ as a function of the noise standard deviation

**Synthetic data experiment: probability of selecting a direct cause among the first 5 ranked variables for different values of λ as a function of the noise standard deviation**. Dotted lines are used to denote the 90% confidence interval estimated on the basis of 150 trials.
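The two evaluation measures used above are straightforward to compute from a ranking; a small helper (names are illustrative) makes the definition concrete:

```python
import numpy as np

def cause_rank_stats(ranking, causes, top=5):
    # Average 1-based position of the direct causes in a ranking, and the
    # fraction of causes appearing among the first `top` variables.
    pos = [ranking.index(c) + 1 for c in causes]
    return float(np.mean(pos)), float(np.mean([p <= top for p in pos]))

# Toy ranking of 6 variables; the direct causes are variables 4 and 1.
avg, frac = cause_rank_stats(ranking=[7, 2, 4, 9, 5, 1], causes=[4, 1], top=5)
print(avg, frac)  # 4.5 0.5 (cause 4 is ranked 3rd, cause 1 is ranked 6th)
```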

Real expression data

The real data experiment consists of 6 public microarray datasets derived from breast cancer clinical studies (Table).

Affymetrix microarray datasets and related clinical studies where the gene expression data were originally published

| Dataset | Patients | Reference |
|---|---|---|
| UPP | 251 (110) | |
| STK | 159 | |
| VDX | 344 | |
| UNT | 137 (92) | |
| MAINZ | 200 | |
| TRANSBIG | 198 | |

Duplicated patients between studies have been removed in two studies, UPP and UNT; the numbers of remaining unique patients are reported in brackets. All the datasets have been generated with Affymetrix technology and normalized using fRMA.

All the microarray studies analyzed hereafter are characterized by the collection of gene expression data (the inputs **X**), the survival of the patients (the primary target **y**_{1}) and 2 additional clinical (secondary) variables about the state of the tumor, namely the histological grade and the tumor size. These clinical variables are well known by clinicians to be highly relevant for prognosis, since large tumors of high grade are usually aggressive and lead to poor prognosis. Each experiment was conducted in a meta-analytical setting according to two validation schemes:

• Holdout: we carried out 100 training-and-test repetitions where, for each repetition, the training set is composed of half of the samples of each dataset and the test set is composed of the remaining ones.

• Leave-one-dataset-out: for each dataset, the features used for classification are selected without considering the patients of the dataset itself. Once the selection is over, 100 holdout repetitions are used to assess the generalization power of the selected set of features.

All the mutual information terms are computed by using the Gaussian approximation (2). This allows the meta-analysis integration at the correlation level by means of a weighted estimation approach.
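The integration at the correlation level can be illustrated with a sample-size-weighted (Fisher z) pooling of per-study correlations. This generic sketch is an assumption for illustration, not necessarily the exact weighted estimator adopted in the paper:

```python
import numpy as np

def pooled_corr(r_list, n_list):
    # Pool per-study correlations by Fisher z-transform with the classical
    # n - 3 weights, then transform back to the correlation scale.
    z = np.arctanh(np.asarray(r_list, dtype=float))
    w = np.asarray(n_list, dtype=float) - 3.0
    return float(np.tanh(np.sum(w * z) / np.sum(w)))

# Hypothetical per-study correlations for one gene/target pair,
# with the UPP, STK and VDX sample sizes from the table above.
print(round(pooled_corr([0.5, 0.6, 0.4], [251, 159, 344]), 3))  # about 0.479
```

The pooled value always lies between the smallest and largest per-study correlations and is dominated by the larger studies, which is the intended behavior of a meta-analytical estimate.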

The quality of the selection is represented by the accuracy of a Naive Bayes classifier measured by four different criteria: the Area Under the ROC curve (AUC), the Root Mean Squared Error (RMSE), the SAR score (Squared error, Accuracy, and ROC) and the F-measure.

Holdout: accuracy criteria (to be maximized) for different numbers of selected features; each sub-table corresponds to one signature size.

| Criterion | λ = 0 | λ = 0.2 | λ = 0.4 | λ = 0.6 | λ = 0.8 | λ = 0.9 | λ = 1 | λ = 2 |
|---|---|---|---|---|---|---|---|---|
| AUC | 0.688 | 0.688 | 0.694 | 0.699 | 0.703 | 0.704 | 0.705 | 0.707 |
| 1-RMSE | 0.460 | 0.466 | 0.481 | 0.493 | 0.504 | 0.510 | 0.515 | 0.542 |
| SAR | 0.559 | 0.561 | 0.569 | 0.575 | 0.580 | 0.583 | 0.585 | 0.595 |
| F | 0.255 | 0.254 | 0.260 | 0.262 | 0.265 | 0.265 | 0.266 | 0.274 |
| W-L | | 1-0 | 3-0 | 5-0 | 6-0 | 5-0 | 5-0 | 5-0 |

| Criterion | λ = 0 | λ = 0.2 | λ = 0.4 | λ = 0.6 | λ = 0.8 | λ = 0.9 | λ = 1 | λ = 2 |
|---|---|---|---|---|---|---|---|---|
| AUC | 0.693 | 0.698 | 0.702 | 0.706 | 0.709 | 0.710 | 0.711 | 0.715 |
| 1-RMSE | 0.451 | 0.458 | 0.465 | 0.471 | 0.477 | 0.479 | 0.482 | 0.503 |
| SAR | 0.552 | 0.556 | 0.562 | 0.567 | 0.571 | 0.572 | 0.574 | 0.583 |
| F | 0.263 | 0.265 | 0.268 | 0.270 | 0.272 | 0.271 | 0.273 | 0.277 |
| W-L | | 2-0 | 3-0 | 3-0 | 2-0 | 2-0 | 3-0 | 4-0 |

| Criterion | λ = 0 | λ = 0.2 | λ = 0.4 | λ = 0.6 | λ = 0.8 | λ = 0.9 | λ = 1 | λ = 2 |
|---|---|---|---|---|---|---|---|---|
| AUC | 0.699 | 0.704 | 0.708 | 0.711 | 0.714 | 0.715 | 0.715 | 0.716 |
| 1-RMSE | 0.454 | 0.457 | 0.459 | 0.463 | 0.467 | 0.470 | 0.472 | 0.487 |
| SAR | 0.545 | 0.549 | 0.553 | 0.557 | 0.561 | 0.563 | 0.564 | 0.573 |
| F | 0.272 | 0.271 | 0.272 | 0.274 | 0.274 | 0.274 | 0.275 | 0.284 |
| W-L | | 1-0 | 1-0 | 1-0 | 2-0 | 3-0 | 4-1 | 4-1 |

AUC = Area Under the Curve; 1-RMSE = one minus Root Mean Squared Error; SAR = Squared error, Accuracy, and ROC; F = F-measure; W-L = Win-Loss, reporting the number of datasets for which the causal filter is significantly more (W) or less (L) accurate than the conventional ranking filter according both to the McNemar test (p-value < 0.05 adjusted for multiple testing by Holm's method) and the Wilcoxon paired test on squared errors (p-value < 0.05 adjusted for multiple testing by Holm's method).

Leave-one-dataset-out: accuracy criteria (to be maximized) for different numbers of selected features; each sub-table corresponds to one signature size.

| Criterion | λ = 0 | λ = 0.2 | λ = 0.4 | λ = 0.6 | λ = 0.8 | λ = 0.9 | λ = 1 | λ = 2 |
|---|---|---|---|---|---|---|---|---|
| AUC | 0.678 | 0.674 | 0.678 | 0.680 | 0.682 | 0.682 | 0.680 | 0.669 |
| 1-RMSE | 0.447 | 0.448 | 0.467 | 0.469 | 0.482 | 0.528 | 0.544 | 0.556 |
| SAR | 0.553 | 0.552 | 0.560 | 0.561 | 0.566 | 0.582 | 0.586 | 0.586 |
| F | 0.280 | 0.275 | 0.275 | 0.281 | 0.279 | 0.283 | 0.287 | 0.276 |
| W-L | | 1-1 | 5-1 | 2-0 | 4-0 | 5-0 | 4-0 | 4-0 |

| Criterion | λ = 0 | λ = 0.2 | λ = 0.4 | λ = 0.6 | λ = 0.8 | λ = 0.9 | λ = 1 | λ = 2 |
|---|---|---|---|---|---|---|---|---|
| AUC | 0.681 | 0.687 | 0.692 | 0.693 | 0.698 | 0.700 | 0.700 | 0.693 |
| 1-RMSE | 0.428 | 0.438 | 0.453 | 0.457 | 0.464 | 0.473 | 0.490 | 0.516 |
| SAR | 0.542 | 0.551 | 0.559 | 0.561 | 0.565 | 0.569 | 0.576 | 0.582 |
| F | 0.284 | 0.284 | 0.281 | 0.281 | 0.285 | 0.291 | 0.298 | 0.303 |
| W-L | | 3-0 | 4-0 | 5-1 | 3-0 | 5-0 | 4-0 | 6-0 |

| Criterion | λ = 0 | λ = 0.2 | λ = 0.4 | λ = 0.6 | λ = 0.8 | λ = 0.9 | λ = 1 | λ = 2 |
|---|---|---|---|---|---|---|---|---|
| AUC | 0.687 | 0.694 | 0.704 | 0.708 | 0.711 | 0.706 | 0.708 | 0.676 |
| 1-RMSE | 0.430 | 0.436 | 0.449 | 0.457 | 0.463 | 0.463 | 0.476 | 0.477 |
| SAR | 0.537 | 0.545 | 0.556 | 0.562 | 0.566 | 0.565 | 0.571 | 0.561 |
| F | 0.290 | 0.292 | 0.294 | 0.296 | 0.299 | 0.294 | 0.304 | 0.288 |
| W-L | | 1-0 | 4-0 | 6-0 | 4-0 | 4-0 | 5-0 | 5-1 |

AUC = Area Under the Curve; 1-RMSE = one minus Root Mean Squared Error; SAR = Squared error, Accuracy, and ROC; F = F-measure; W-L = Win-Loss, reporting the number of datasets for which the causal filter is significantly more (W) or less (L) accurate than the conventional ranking filter according both to the McNemar test (p-value < 0.05 adjusted for multiple testing by Holm's method) and the Wilcoxon paired test on squared errors (p-value < 0.05 adjusted for multiple testing by Holm's method).

Discussion

In the previous section we reported the accuracy results of the traditional ranking approach and our novel method based on a causal relevance score. Here we discuss the added value of our causal approach both from a quantitative and qualitative perspective.

The performance measured in cross-validation suggests that the incorporation of a causal term leads to a significant improvement of classification accuracy. This improvement is observed for different validation configurations and different sizes of the prognostic gene signature. From these results we can conclude that (i) causal feature selection is of interest also from a prediction perspective and (ii) relevant (prognostic) information is contained in the secondary output variables (in our case tumor size and histological grade). Although the absolute improvement is only moderate (3% to 6% depending on the validation configurations and performance estimates), the use of our causal ranking strategy in more sophisticated modeling approaches for prognosis might yield further gains.

The other advantage of our approach is that the introduction of a causality term leads to an interpretation of the causal role of the selected genes. We illustrate this characteristic in Figure.

Most enriched GO terms with respect to λ according to a pre-ranked gene set enrichment analysis (GSEA): (A) GO terms enriched in the conventional ranking and having a high degree of causality for tumorigenesis; (B) GO terms increasingly enriched with respect to larger λ, suggesting they are putative causes for tumorigenesis; (C) GO terms decreasingly enriched with respect to larger λ, suggesting they are putative effects for tumorigenesis

**Most enriched GO terms with respect to λ according to a pre-ranked gene set enrichment analysis (GSEA): (A) GO terms enriched in the conventional ranking and having a high degree of causality for tumorigenesis; (B) GO terms increasingly enriched with respect to larger λ, suggesting they are putative causes for tumorigenesis; (C) GO terms decreasingly enriched with respect to larger λ, suggesting they are putative effects for tumorigenesis**. The normalized enrichment score (NES) depends on the genome-wide ranking of the genes, which in turn depends on λ. The larger the NES of a GO term, the stronger the association of this gene set with survival; the sign of the NES reflects the direction of association of the GO term with survival, a positive score meaning that over-expression of the genes implies worse survival, and inversely.

Genes that remain among the top-ranked ones for increasing λ can be considered as relevant (they contain predictive information about survival) and causal. Genes whose rank increases for increasing λ are putative causes: they have less relevance than other genes (for example, those being direct effects) but they are potentially causal. These genes would have been missed by conventional ranking, where they would appear as false negatives if we interpret the outcome of conventional ranking in causal terms. Genes whose rank decreases for increasing λ are putative effects, in the sense that they are relevant but probably not causal. This set of genes could be erroneously considered as causal, and represents false positives if we interpret the outcome of conventional ranking in causal terms.

Since genes are not acting in isolation but rather in pathways, we analyzed the gene rankings in terms of gene set enrichment by means of a pre-ranked gene set enrichment analysis (GSEA).

We computed the NES for multiple genome-wide rankings generated with increasing values of λ, and display the results in Figure.

**Spreadsheet containing the normalized enrichment scores with respect to increasing λ as computed by preranked GSEA (****gsea_res_all.csv****)**.


**Archive containing the output files computed by the preranked GSEA for λ ∈ {0.1,0.2,0.3,0.4,0.5} (****GSEA_MIMO_part1.zip****)**.


**Archive containing the output files computed by the preranked GSEA for λ ∈ {0.6,0.7,0.8,0.9,1.0,2.0} (****GSEA_MIMO_part2.zip****)**.


Figure

The last group of GO terms is less enriched when the degree of causality increases, and the vast majority of the corresponding genes are related to cell cycle and proliferation (Figure).

Our approach makes it possible to identify biological processes that may be direct causes of cancer. These processes are likely to be missed by conventional methods. Given the promising performance of our approach, we plan to integrate our method into analytical frameworks efficiently combining the available clinical and genomic data.

Conclusions

It is well known in statistics that correlation does not imply causation or, in more general terms, that features that are relevant or strongly relevant for predicting a target are not necessarily direct causes. Direct effects are typical examples of variables that provide information about a target without having any causal role. In a data-driven approach to gene selection it is therefore more and more important to discriminate not only between relevant and non-relevant variables but also, within the subset of relevant variables, between direct or indirect causes and effects. This paper proposes a computationally affordable strategy to infer causal patterns that takes advantage of multiple outputs. Experimental results in terms of accuracy and clinical interpretation show the added value deriving from the inclusion of a causal term into conventional ranking.

Abbreviations

AUC: Area Under the ROC Curve; DISR: Double Input Symmetrical Relevance; GO: Gene Ontology; GSEA: Gene Set Enrichment Analysis; MIMO: multiple-input multiple-output; NES: Normalized Enrichment Score; RMSE: Root Mean Squared Error; ROC: Receiver Operating Characteristics; SAR: Squared error, Accuracy, and ROC score; W-L: Win-Loss.

Authors' contributions

GB and BHK were responsible for the design and execution of the study, data analysis and interpretation. CD and CS participated in the data analysis and interpretation. GB and BHK were responsible for writing the manuscript; JQ supervised the study. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the ARC project "Discovery of the molecular pathways regulating pancreatic beta cell dysfunction and apoptosis in diabetes using functional genomics and bioinformatics" funded by the Communauté Française de Belgique (GB), the US National Institutes of Health (NCI/NIH/DHHS: 5U19CA148065-02, BHK and JQ), by the Belgian National Foundation for Research FNRS (CD, CS), the MEDIC Foundation (CS).