Abstract
Background
Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. If these techniques may be effective in generalization accuracy they often do not reveal direct causes. The latter is essentially related to the fact that high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a singleinput singleoutput to a multipleinput multipleoutput setting.
Results
We show in synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition we show, in a metaanalysis study of six publicly available breast cancer microarray datasets, that the improvement occurs also in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection.
Conclusions
Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.
Background
Supervised analysis of genomic datasets (gene expression microarray or comparative genomic hybridization array for instance) with a large number of features and a respectively small number of samples requires the adoption of either regularization or feature selection strategies [1]. The most common feature selection strategies select or rank the variables according to a relevance score. In ranking, for instance, the score of each variable is the univariate association with the target returned by a measure of relevance, like mutual information, correlation, or pvalue. If on one hand the ranking is widely used for its simple implementation and its low complexity, on the other hand it suffers from wellknown limitations. A drawback is that ranking relies on univariate terms and as such it cannot take into consideration higherorder interaction terms or redundancy between features [2]. Another limitation is that ranking techniques are not able to distinguish between causes and effects. This is due to the fact that univariate correlation (or relevance) does not imply causation [3]. This problem is not solved in multivariate feature selection approaches since their cost function typically takes into consideration accuracy but disregards causal aspects. Nowadays the importance of bringing causality into play when designing feature selection methods is more widely acknowledged in the bioinformatics and the machine learning communities [4,5]. This is typically the case in microarray classification, where the goal is, for example, to distinguish between tumor classes or predict the effects of therapies on the basis of gene expression profiles [6]. In these settings the number of input variables, represented by the number of gene probes, is huge (typically several thousands) while the number of samples, represented by the patients' tumors, is very limited (a few hundreds) making the selection of relevant genes a challenging task. Moreover the inference of causal relationships between variables plays a major role in the context of genomic studies since more and more biologists and medical doctors expect data analysis to provide not only accurate prediction models (for prognostic purposes) but also insights into the mechanisms associated with disease and appropriate therapeutic targets.
It is well established that the detection of causal patterns cannot be carried out in a bivariate (singleinput singleoutput) context and that at least a trivariate setting has to be considered [7]. This is put into evidence by the literature on graphical models where arc orientation relies on notions of conditional independence (requiring at least three terms) [8] and by the work on information theoretic methods for network inference [9]. In particular this paper will focus on the notion of feature interaction, a threeway mutual information that differs from zero when group of attributes are complementary [10]. The role of interaction in feature selection has already been discussed in the machine learning literature. Jakulin proposed an heuristic based on interaction for selecting attributes within the naive Bayesian classifier [11]. Meyer et al. proposed a filter algorithm which relies on the maximization of an information theoretic criterion, denoted Double Input Symmetrical Relevance (DISR), which implicitly takes into account the interaction, or complementarity between variables, in the choice of the features [12]. Watkinson et al. used a notion of synergy related to feature interaction to assign a score to a pair of genes and then measured the degree of confidence that one of the genes regulates the other [9]. A causal filter algorithm which computes interaction between inputs has been recently proposed in [5]. However it is unclear whether these techniques are capable of recovering the set of features that are both relevant and causal, in highdimensional problems, such as in microarray analysis.
The contributions of this paper can be summarized as follows. First we introduce a new causal filter based on the interaction information and we show how to estimate this quantity in a multipleinput multipleoutput setting. Second we assess the capacity of such filter to prioritize causal variables by using a synthetic case study. Third we measure from an accuracy and a biological point of view the performance of such causal filter in a number of prognostic studies in breast cancer. We advocate that a multipleinput multipleoutput approach is particularly relevant in clinical studies where it is common that more than a single target variable is collected. This is the case of prognostic studies of breast cancer patients where several clinical indices, including patients' tumor size and histological grade, are collected together with the survival of the patients and the gene expressions of their tumor. It is worth to note that, in spite of their availability, these additional phenotypes are usually not taken into consideration since statistical studies focus on survival prediction and adopt singleoutput methods.
This paper describes an original multipleinput multipleoutput score which combines a conventional relevance term with a causal term. This additional term quantifies the causal role of the features and allows the prioritization of causal variables in the resulting ranking. We carried out a synthetic study, where the set of causal dependencies is known, which shows that causal variables are highly ranked once this score is adopted. We performed a metaanalysis of six publicly available breast cancer microarray datasets to assess the improvement of using our causal relevance score in terms of accuracy over the conventional ranking. The related discussion shows also that it is possible to carry out a biological interpretation of the role of selected variables which allows to discriminate between potentially causal and relevant, yet non causal, features. The source code, documentation and data are opensource and publicly available from http://mlg.ulb.ac.be/software/ webcite and http://compbio.dfci.harvard.edu/pubs/mimocausal/ webcite.
Methods
Mutual information and interaction
Let us consider a multipleinput multipleoutput (MIMO) classification problem characterized by n input variables X = {x_{i}, i = 1,..., n} and m targets Y = {y_{j}, j = 1,..., m} where is continuous and . Let us denote y_{1 }as the primary target and the remaining m  1 outputs as secondary targets. We make this distinction since, though we assume that the goal of classification is to predict y_{1}, we want to take advantage of the causal information which can be extracted by multiple targets. We begin by reviewing some notions of information theory by considering three random (boldface) variables, notably two inputs x_{1}, x_{2 }and the primary target y_{1}. The mutual information [13] between the continuous variables x_{1 }and x_{2 }is defined in terms of their probabilistic density functions p(x_{1}), p(x_{2}) and p(x_{1}, x_{2}) as
where H is the entropy and the convention is adopted. This quantity measures the amount of stochastic dependence between x_{1 }and x_{2 }and is also called twoway interaction [11]. Note that, if x_{1 }and x_{2 }are Gaussian distributed the following relation holds
where ρ is the Pearson correlation coefficient.
Let us now consider the target y_{1}, too. The conditional mutual information I(x_{1}; x_{2}y_{1}) [13] between x_{1 }and x_{2}, once y_{1 }is given, is defined by
The conditional mutual information is null iff x_{1 }and x_{2 }are conditionally independent given y_{1}. The change of dependence between x_{1 }and x_{2 }due to the knowledge of y_{1 }is measured by the threeway interaction information defined in [14] as
This measure quantifies the amount of mutual dependence that cannot be explained by bivariate interactions. When it is different from zero, we say that x_{1}, x_{2 }and y_{1 }threeinteract. A nonzero interaction can be either negative, which denotes a synergy or complementarity between the variables, or positive, which indicates redundancy. Because of the symmetry of the H operator in (3), we have
By (4) we derive
Since the joint information of x_{1 }and x_{2 }to y_{1 }can be written as
it follows that by adding I(x_{2}; y) to both sides of (5) we obtain
Note that the above relationships hold also when either x_{1 }or x_{2 }are vectorial random variables.
Feature selection, causality and interaction
Consider a multipleclass classification problem where x ∈ X ⊂ ℝ^{n }is the nvariate input and is the primary target variable. Let A = {1,..., n} be the set of indices of the n inputs. Let us formulate the feature selection problem as the problem of finding the subset X* of v > 0 variables such that
where the score of a subset X_{S }of variables is given by the mutual information it brings to the target. In other words, for a given number v of variables the optimal feature set is the one that maximizes the information about the target. Note that this formulation of the feature selection problem, also known as MaxDependency [12,15], is classifierindependent.
If we want to carry out the maximization (7), both an estimation of I and a search strategy in the space of subsets of X are required. As far as the search is concerned, according to the Cover and Van Campenhout theorem [16], to be assured of finding the optimal feature set of size v, all feature subsets should be assessed. Given the infeasibility of exhaustive approaches for large n, we will consider here only forward selection search approaches. Forward selection starts with an empty set of variables and incrementally updates the solution by adding the variable that is expected to bring the best improvement (according to a given criterion). The hillclimbing search selects a subset of v < n variables in v steps by exploring only configurations. For this reason the forward approach is commonly adopted in filter approaches for classification problems with high dimensionality [17,18].
If v = 1 the optimal set returned by (7) is composed of the most relevant variable, that is the one carrying the highest mutual information to y. For v > 1, we need to provide an incremental solution to (7) in order to obtain, given a set of d variables, the (d + 1)^{th }feature which maximizes the increase of the dependency
where (X_{S}, x_{k}) stands for the set of variables resulting from the union of X_{S }and x_{k}. Since for large d the term requires the computation of multivariate mutual information, its estimation is often prone to illconditioning and large variance. This led to the adoption of low variate approximations in literature, like the univariate approximation
which leads to a ranking of the variables according to their mutual information with the target. More advanced approaches rely on bivariate decompositions [12] like
where quantifies the amount of information that x_{i }and x_{k }contain jointly about y_{1}.
However a feature selection procedure targeting the MaxDependency is not able in general to discriminate between causal and non causal dependencies. For instance in a selection procedure applied to a dataset derived from a causal process like the one in Figure 1, the effect x_{4 }could be more highly ranked than the direct causes x_{1 }and x_{2}.
Figure 1. Singleoutput case with different causal patterns: (i) common effect or explaining away effect configuration involving x_{1}, x_{2 }and y_{1}; (ii) spouse configuration involving x_{5 }and y_{1}; (iii) common cause configuration involving y_{1}, x_{3}, x_{4}; and (iv) causal chain configuration involving x_{1}, y_{1}, x_{4}.
Here we propose to modify the conventional score into a causal score able to keep into consideration the causal information returned by the adoption of a multiple output configuration. This is made possible by integrating in the score an interaction term which is strictly related to the notion of causal dependency.
Interaction and causal dependency
This section aims to establish the link between information theory and causality. Causality is at the same time an essential and imprecise notion in scientific discovery. In order to avoid any ambiguity, here we adopt the formalism of causal Bayesian network which is a sound and convenient framework for reasoning about causality between random variables [8]. This means that all causal dependencies between variables are expressed by a directed acyclic graph where the existence of an oriented edge from a node x_{i }to a node x_{j }means that x_{i }directly causes x_{j}. In formal terms we assume that the Causal Markov condition, the Causal Faithfulness and the Causal Sufficiency conditions hold [4]. Several works in literature showed that the structure of a causal graph can, to some extent, be inferred from observational data. The vast majority of these works rely on statistical tests of conditional independence [19]. Here we present a way to reason about causality which do not use independence tests but estimate an information theory score to prioritize potential causes.
Let us consider a triplet made of two inputs x_{i}, x_{j }and one target y_{1}. As discussed in [4] six possible configurations of directed acyclic graphs involving three variables can occur. One configuration is trivial and corresponds to a completely unconnected graph. One configuration corresponds to a single arrow chain (for example only x_{i }and x_{j }are linked) and it is well known in literature that for a system of two variables the causal structure is not distinguishable. Another configuration corresponds to a fully connected graph and in this case the lack of independencies implies that the direction of the arrows cannot be determined. The remaining configurations can be illustrated and detected by studying the relationship [5] between the sign of I(x_{i}; x_{j}; y_{1}) and causal patterns of the triplet, like the ones sketched in Figure 1.
A negative interaction I(x_{i}; x_{j}; y_{1}) means that the knowledge of the value y_{1 }increases the amount of dependence between x_{i }and x_{j}; this situation occurs in the presence of a collider. According to the label of the collider we can have two cases: i) the common effect configuration (for example the pattern involving x_{1}, x_{2 }and y_{1}, also known as the explainingaway effect) and ii) the spouse configuration (the pattern involving x_{3}, x_{5 }and y_{1 }in Figure 1 where x_{3 }is the common descendant of y_{1 }and x_{5}). This is a consequence of the fact that, if we assume Causal Faithfulness, the graph structure entails that the two parents are independent (null mutual information) but conditionally dependent (conditional mutual information bigger than zero). Note also that both configurations are characterized by the presence of a collider.
On the contrary a positive interaction I(x_{i}; x_{j}; y_{1}) between x_{i }and x_{j }means that the knowledge of y_{1 }decreases the amount of dependence. This situation occur in two cases: i) the common cause configuration (for example, two dependent effects x_{3 }and x_{4 }become independent once the value of the common cause y_{1 }is known as illustrated in Figure 1) and ii) the causal chain configuration where one of the variables (let say, x_{1}) is the cause and the other (let say, x_{4}) is the effect of y_{1}. This is due to the fact that the graph entails the dependence between x_{i }and x_{j }as well as their conditional independence (null conditional mutual information).
So far we have considered a singleoutput configuration. However causal patterns can be better identified if we consider a multipleoutput configuration, for instance the two output configuration sketched in Figure 2. If y_{1 }and y_{2 }are two outputs representing different observations of the same phenomenon (for example a disease) we expect that the causal configurations concerning the first output appear also for the second one. This is a reasonable assumption in breast cancer clinical studies where the measured phenotypes (size and histological grade of the tumor for instance) can be considered as different manifestations of the state of the tumor.
Figure 2. Twooutput case with different causal patterns: (i) common effect configuration involving x_{3}, y_{1 }and y_{2}; (ii) spouse configuration involving y_{2 }and x_{6}; (iii) common cause configuration involving x_{1}, y_{1 }and y_{2}; and (iv) causal chain configuration involving x_{1}, y_{2 }and x_{7}.
Let us consider for instance the inputs x_{1 }and x_{2 }and the two targets y_{1 }and y_{2}: the common effect configurations between x_{1 }and x_{2 }and y_{1 }holds also for the triplet x_{1 }and x_{2 }and y_{2}. The same happens for the common cause pattern involving both the triplet x_{3}, x_{4}, y_{1 }and x_{3}, x_{4}, y_{2}. The presence of multiple outputs can therefore make more robust the identification of a causal pattern, especially in data configurations characterized by a very large number of variables.
In the following we will take advantage of these considerations to design a causal filter able to extract from observed data causal dependencies between variables.
The MIMO causal filter
The link between causality and interaction discussed in the previous section suggests that, if we want to detect causality without estimating large variate dependencies, we may search for patterns like the one sketched in Figure 3. This dependency pattern is characterized by two causal inputs and two outputs and can be detected when the following two conditions are satisfied:
Figure 3. Twoinputs twooutputs causal pattern.
1 the interaction I(x_{1}; x_{2}; y_{1}) is negative
2 the interaction I(x_{1}; x_{2}; y_{2}) is negative
In what follows we implement this idea into a MIMO causal filter where input variables belonging to causal patterns like the one in Figure 3 are prioritized.
For the pair of inputs x_{1 }and x_{2 }and the pair of outputs y_{1 }and y_{2}, we define a structural score
which is composed of two multipleinput interaction terms. The magnitude of this score depends on whether x_{1 }and x_{2 }jointly play a joint causal role on y_{1 }and y_{2}, or in other words, the pattern in Figure 3 is encountered. This means that the higher the term C(x_{1}, x_{2}), the higher is the evidence that the pair x_{1}, x_{2 }be a cause of y_{1 }ad y_{2}. This score plays a similar role to the score that is maximized in structural identification of Bayesian networks [20]. If in that case the score measures the likelihood of the data for a given graph structure, here the quantity C(x_{1}, x_{2}) measures the likelihood of the data for a structural pattern where the pair x_{1}, x_{2 }has a causal role.
In the case of bivariate output (m = 2) we propose then a causal version of the univariate score which accounts both for the relevance and the causal role of a pair of input variables x_{1 }and x_{2}
where λ > 0 stands for the degree of causality imposed to the selection. If we adopt the filter approximation (10) the incremental formula takes the form
In other terms this formulation suggests to add at the (d + 1)^{th }step, among all the remaining variables, the one which has the better combination of relevance and causality, where the causal term is obtained by averaging over the selected variables and the considered outputs. Note that in the case of m > 2 targets the structural score (11) is obtained by averaging the interaction terms over the m variables.
Similarly to what is done in regularization approaches [21] where specific configurations (typically those with higher complexity) are penalized by adding a complexity term to the one measuring the error, the causality parameter λ in (13) is expected to penalize input variables with no causal role (positive interaction). Note that for λ = 0 the selection rule (13) boils down to the rule (9). The following section will study the impact of the causality term on the accuracy and the stability of a filter algorithm implementing the rule (13).
Results
In this section we perform two experiments to assess the role of the causation term in the feature selection process. The first one is based on a number of synthetic datasets generated by simulating a causal Bayesian network while the second relies on public microarray breast cancer datasets to assess the approach in a real data setting.
Synthetic data
This experiment focuses on the prioritization of causes in a set of classification tasks defined on the basis of simulated data generated by the causal structure depicted in Figure 4. Note that this causal structure aims to represent in a very simplified manner a stochastic dependency characterized by a number of indirect (nodes 13) and direct causes (nodes 48), a latent non measurable variable (node 9), one observable primary target (node 10), two secondary targets (nodes 1112), a set of additional effects (nodes 1329) and a number of independent and irrelevant variables (nodes 3040). In order to set up an analogy with the real data experiments of the following subsection, we could make the assumption that the latent variable represents the cancer progression, the three targets denote a set of observable measures depending on the cancer state (patients' prognosis, size and histological grade of the tumor for instance), and that all other variables represent the expression of genes whose activity could play a causal role, be determined as an effect of the disease or be completely irrelevant. It is worth to note that also in the presence of a hidden variable the interaction between marginally independent causes given an effect is negative. This is due to the fact that conditioning on the hidden variable or on one of his children is equivalent in terms of dseparation between the variables [8] and consequently is equivalent in terms of the sign of the interaction. We simulate a number of multivariate datasets from the causal structure in Figure 4 and for each of them we rank the inputs of the MIMO classification problem by using the conventional ranking approach based on mutual information (Equation (9)) and our novel approach based on causality (Equation (13)). The stochastic dependency between parents and descendants of the network is modeled by a linear regression where the parameters are uniformly sampled in [2, 2] and the noise distribution is Gaussian with zero mean and standard deviation σ. We carry out a series of experiments, each characterized by 150 datasets and an increasing noise standard deviation ranging between 0.01 and 0.4. All the variables are continuous apart from the variables 10, 11, and 12, which correspond to the targets y_{1}, y_{2}, and y_{3 }of the classification task and are discretized to two binary values. Note that all measures are centered and scaled in order to have a zero mean and unit standard deviation; this allows for a better understanding of the impact of the noise amplitude on the ranking.
Figure 4. Bayesian causal network used for synthetic experiment. The green node 9 denotes the non observable variable. The three red nodes denote the targets of the multipleoutput classification problem. The isolated node (3040) represents a set of 11 independent variables.
The quality of our causal prioritization strategy is assessed by measuring the average ranking of the direct causes (nodes 48) and the percentage of time that the direct causes are ranked among the first 5 variables. These two measures (together with a 90% confidence interval) for different values of λ are shown in Figure 5 and 6 respectively. These plots show that by increasing the value of λ, the average ranking position of direct causes decreases (direct causes are better prioritized) and that the percentage of correct selection increases (among the first ranked variables we find the direct causes with higher probability). The improvement occurs in a consistent manner for different values of the noise standard deviation though the detection of causal terms become less accurate as the noise increases. Note also that the very bad performance of the ranking (λ = 0) strategy (0% rate of correct selection) derives from the very large number of effects which tend to be ranked before the real causes.
Figure 5. Synthetic data experiment: average ranking of direct causes for different values of λ as a function of the noise standard deviation. Dotted lines are used to denote the 90% confidence interval estimated on the basis of 150 trials.
Figure 6. Synthetic data experiment: probability of selecting a direct cause among the first 5 ranked variables for different values of λ as a function of the noise standard deviation. Dotted lines are used to denote the 90% confidence interval estimated on the basis of 150 trials.
Real expression data
The real data experiment consists of 6 public microarray datasets derived from breast cancer clinical studies (Table 1) in order to compare the generalization accuracy of the selection returned by the conventional ranking approach based on mutual information (Equation (9)) with the accuracy of the selection returned by our novel approach based on causality (Equation (13)).
Table 1. Affymetrix microarray datasets and related clinical study where the gene expression have been originally published
All the microarray studies analyzed hereafter are characterized by the collection of gene expression data (the inputs X representing n = 13,091 unique genes), the survival data (the primary target y_{1}) and 2 additional clinical (secondary) variables about the state of the tumor, namely the histological grade and the tumor size. These clinical variables are well known by clinicians to be highly relevant for prognosis since large tumors of high grade are usually aggressive and lead to poor prognosis. Each experiment was conducted in a metaanalytical [22] and crossvalidation [23] framework, that is the set of variables are selected by relying on the samples of several datasets and the validation is performed on a set of samples not used for the selection. In order to adopt a classification framework, the survival of the patients was transformed in a binary class such as low or high risk of the patients given their clinical outcome at five years as in [24]. We conducted two sets of metaanalysis validation experiments to compare the conventional ranking approach (λ = 0 case) and our causal version for different values of λ:
• Holdout: we carried out 100 trainingandtest repetitions where for each repetition the training set is composed of half of the samples of each dataset and the test is composed of the remaining ones.
• Leaveonedatasetout where for each dataset the features used for classification are selected without considering the patients of the dataset itself. Once the selection is over, 100 holdout repetitions are used to assess the generalization power of the selected set of features.
All the mutual information terms are computed by using the Gaussian approximation (2). This allows the metaanalysis integration at the correlation level by means of the weighted estimation approach proposed by [22]. All the experiments were repeated for three sizes of the gene signature (number of selected features): v = 20, 50, 100.
The quality of the selection is represented by the accuracy of a Naive Bayes classifier measured by four different criteria: the Area Under the ROC curve (AUC), the Root Mean Squared Error (RMSE), the SAR (Squared error, Accuracy, and ROC score introduced by [25]) and the precisionrecall F score measure [26]. Table 2 reports for the holdout experiment the value of the four performance criteria for different values of v and λ. Table 3 refers to the leaveonedatasetout experiments for v = 20, v = 50, and v = 100, respectively. Note that the WL (WinLoss) line reports the number of datasets for which the causal filter is significantly more (W) or less (L) accurate than the ranking filter according both to the McNemar test [27] (pvalue < 0.05 adjusted for multiple testing by Holm's method [28]) and the Wilcoxon paired test on squared errors (pvalue < 0.05 adjusted for multiple testing by Holm's method).
Discussion
In the previous section we reported the accuracy results of the traditional ranking approach and our novel method based on a causal relevance score. Here we discuss the added value of our causal approach both from a quantitative and qualitative perspective.
The performance measured in crossvalidation suggests that the incorporation of a causal term leads to a significant improvement of classification accuracy. This improvement is observed for different validation configurations and different sizes of the prognostic gene signature. From these results we can conclude that (i) causal feature selection is interesting also for a prediction perspective and (ii) relevant (prognostic) information is contained into secondary output variables (in our case tumor size and histological grade). Although the absolute improvement is only moderate (3% to 6% depending on the validation configurations and performance estimates), the use of our causal ranking strategy in more sophisticated modeling approach for prognosis, such as in [29], may help develop more clinically relevant prognostic classifiers in breast cancer.
The other advantage of our approach is that the introduction of a causality term leads to an interpretation of the causal role of the selected genes. We illustrate this characteristic in Figure 7 by comparing, through Gene Ontological (GO) terms, gene rankings with increasing degree of causality using a preranked gene set enrichment analysis (GSEA) [30]. By quantifying how the causal rank of genes diverges from the conventional one (λ = 0) with respect to λ we can identify the gene sets that are potential causes or effects of breast cancer.
Figure 7. Most enriched GO terms with respect to λ according to a preranked gene set enrichment analysis (GSEA): (A) GO terms enriched in the conventional ranking and having a high degree of causality for tumorigenesis; (B) GO terms increasingly enriched with respect to larger λ, suggesting they are putative causes for tumorigenesis; (C) GO terms decreasingly enriched with respect to larger λ, suggesting they are putative effects for tumorigenesis. The normalized enrichment score (NES) depends on the genomeranking of the genes, which in turn depends on λ. Larger the NES of a GO term, stronger the association of this gene set with survival; the sign of NES reflects the direction of association of the GO term with survival, a positive score meaning that overexpression of the genes implies worst survival and inversely.
Genes that remains among the top ranked ones for increasing λ can be considered as relevant (they contain predictive information about survival) and causal. Genes whose rank increases for increasing λ are putative causes: they have less relevance than other genes (for example, those being direct effects) but they are potentially causal. These genes would have been missed by conventional ranking, where they would appear as false negatives if we interpret the outcome of conventional ranking in causal terms. Genes whose rank decreases for increasing λ are putative effects in the sense that they are relevant but probably not causal. This set of genes could be erroneously considered as causal, and represent false positives if we interpret the outcome of conventional ranking in causal terms.
Since genes are not acting in isolation but rather in pathways, we analyzed the gene rankings in terms of gene set enrichment. As described in [30], the normalized enrichment score (NES) computed in GSEA enables quantification of the strength of association of a gene set (GO term) with a phenotye of interest, here poor or good prognosis (survival). In more details, given a list of genes L ranked by their prognostic relevance and an a priori defined set of genes S (for example genes sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom; gene sets associated with the prognosis phenotype tend to show the latter distribution. NES reflects the degree to which a gene set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a runningsum statistic when a gene in S is encountered and decreasing it when genes not in S are encountered. The magnitude of the increment depends on the statistic used to rank the genes in L. In our study the statistic of a gene is simply its rank (the most relevant genes have the largest ranks) and its sign depends on the association of its expression with survival: positive sign if overexpression is associated with poor survival and inversely. The score is the maximum deviation from zero encountered in the "walk"; it corresponds to a weighted KolmogorovSmirnovlike statistic [30,31]. Finally the score is normalized for each gene set to account for the size of the gene set, yielding a NES.
We computed NES for multiple genomewide rankings generated with increasing values of λ, and displayed in Figure 7 the score of the 3 most enriched GO terms which are identified as being potentially (A) both causes and effects, (B) causes, and (C) effects of breast tumorigenesis (GSEA results for all the GO terms are provided in Additional File 1, 2 and 3). The first group of GO terms that show similar enrichment scores independently of their level of causality are implicated in cell movement and division, cellular respiration and regulation of cell cycle (Figure 7A). The first GO term involves genes encoding for the Rho family of GTPases proteins that are among key regulators of actin and microtubule cytoskeleton [32] and are often overexpressed in human breast cancers [33]. Bromberg et al. showed that, when affected by RNF5, this family of proteins may cause dysregulation of cell proliferation to promote tumor progression [34]. The second GO term represents the coenzyme metabolic process which includes proteins showed to be early indicators of breast cancer [35]; perturbation of these coenzymes might cause cancers by compromising the structure of important enzyme complexes implicated in mitochondrial functions [35]. Genes involved in the third GO term "regulation cyclindependent protein kinase activity" are key players in cell cycle regulation and inhibition of such kinases proved to block proliferation of human breast cancer cells [36]. Moreover, Moore et al. recently highlighted the role of cyclindependent kinases as progesterone activators that could give raise to tumors and sustain their progression in breast cancer [37].
Additional file 1. Spreadsheet containing the normalized enrichment scores with respect to increasing λ as computed by preranked GSEA (gsea_res_all.csv).
Format: CSV Size: 9KB Download file
Additional file 2. Archive containing the output files computed by the preranked GSEA for λ ∈ {0.1,0.2,0.3,0.4,0.5} (GSEA_MIMO_part1.zip).
Format: ZIP Size: 10.6MB Download file or display content in a new window
Additional file 3. Archive containing the output files computed by the preranked GSEA for λ ∈ {0.6,0.7,0.8,0.9,1.0,2.0} (GSEA_MIMO_part2.zip).
Format: ZIP Size: 10.6MB Download file or display content in a new window
Figure 7B displays the GO terms that are increasingly enriched with their degree of causality, involving genes that are putative causes of the tumorigenesis affecting patients' survival; these genes might have been missed by the conventional ranking approach (λ = 0). Counterintuitively, the three GO terms in this category are related to the immune system what is sought to be more an effect of the tumor growth as lymphocytes strike cancer cells as they proliferate. However, several independent research groups showed that frequent usage of aspirin significantly decrease the longterm risk of cancer death by correcting immune system dysfunction [38,39], findings that have been confirmed in breast cancer [40], what supports that the immune system might have a causal role in tumorigenesis. There is strong evidence of interplay between immune system and tumors since solid tumors are commonly infiltrated by immune cells; in contrast to infiltration of cells responsible for chronic inflammation, the presence of high numbers of lymphocytes, especially T cells, has been reported to be an indicator of good prognosis in many cancers [41], what concours with the sign of the enrichment (negative enrichment; Figure 7B). We and others have reported that gene expression signatures representing the immune response process were associated with a better prognosis in particular subtypes of breast cancer [29,42,43].
The last group of GO terms are less enriched when the degree of causality increases and the vast majority of the corresponding genes are related to cellcycle and proliferation (Figure 7C). Cellcycle and proliferationrelated genes, such as for example Ki67, have been used for many decades to describe breast tumors: High levels of Ki67 have been correlated with worse prognosis and are also known to be associated with high tumor grade and negativity of estrogen receptor status [44,45]. We and others have shown that a quantitative measurement of proliferation genes using mRNA gene expression could provide an accurate assessment of prognosis of breast cancer patients [43,46,47]. We also have shown that only one of those genes, AURKA, which is significantly enriched in this case in the M phase GO term, was sufficient to recapitulate the prognostic performance of different prognostic signatures [48]. However the enrichment of these proliferationrelated genes seems to be a downstream effect of the breast tumorigenesis instead of its cause.
Our approach allows to identify biological processes that may be direct causes of cancer. These processes are likely to be missed by conventional methods. Given the promising performance of our approach, we plan to integrate our method in analytical frameworks combining efficiently the available clinical data and a priori biological knowledge, potentially retrieved from biomedical literature [49] or pathway database [50], in order to unravel gene sets or network of genes causal of cancer patients' survival.
Conclusions
It is well known in statistics that correlation does not imply causation or, in more general terms, that features that are relevant or strongly relevant for predicting a target are not necessarily direct causes. Direct effects are typical examples of variables that provide information about a target without having any causal role. In a datadriven approach to gene selection it is therefore more and more important to discriminate not only between relevant and nonrelevant variables but also, within the subset of relevant variables, to discriminate between direct or indirect causes and effects. This paper proposes a computationally affordable strategy to infer causal patterns that take advantage of multiple outputs. Experimental results in terms of accuracy and clinical interpretation show the added value deriving from the inclusion of a causal term into conventional ranking.
Abbreviations
AUC: Area Under the ROC Curve; DISR: Double Input Symmetrical Relevance; GO: Gene Ontology; GSEA: Gene Set Enrichment Analysis; MIMO: multipleinput multipleoutput; NES: Normalized Enrichment Score; RMSE: Root Mean Squared Error; ROC: Receiver Operating Characteristics; SAR: Squared error, Accuracy, and ROC score; WL: WinLoss.
Authors' contributions
GB and BHK were responsible for the design and execution of the study, data analysis and interpretation. CD and CS participated to the data analysis and interpretation. GB and BHK were responsible for writing the manuscript; JQ supervised the study. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by the ARC project "Discovery of the molecular pathways regulating pancreatic beta cell dysfunction and apoptosis in diabetes using functional genomics and bioinformatics" funded by the Communauté Française de Belgique (GB), the US National Institutes of Health (NCI/NIH/DHHS: 5U19CA14806502, BHK and JQ), by the Belgian National Foundation for Research FNRS (CD, CS), the MEDIC Foundation (CS).
References

Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics.
Bioinformatics 2007, 23:25072517. PubMed Abstract  Publisher Full Text

Guyon I, Elisseeff A: An introduction to variable and feature selection.

Shipley B: Cause and Correlation in Biology. Cambridge University Press; 2000.

Guyon I, Aliferis C, Elisseeff A: Computational Methods of Feature Selection. Chapman and Hall; 2007:6386.
chap. Causal Feature Selection

Bontempi G, Meyer P: Causal filter selection in microarray data. In Proceedings of the 27th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA; 2010.

Xing EP, Jordan MI, Karp RM: Feature Selection for HighDimensional Genomic Microarray Data. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA; 2001:601608.

British Journal of Philosophy of Science 1985, 36:273289. Publisher Full Text

Koller D, Friedman N: Probabilistic graphical models. The MIT Press; 2009.

Watkinson J, Liang K, Wang X, Zheng T, Anastassiou D: Inference of regulatory gene interactions from expression data using threeway mutual information.
Annals of NY Academy of Sciences 2009, 1158:302313. Publisher Full Text

Freitas AA: Understanding the Crucial Role of Attribute Interaction in Data Mining.

Jakulin A: Machine Learning Based on Attribute Interactions. PhD thesis. University of Ljubliana, Faculty of Computer and Information Science; 2005.

Meyer P, Schretter C, Bontempi G: InformationTheoretic Feature Selection in Microarray Data using Variable Complementarity.
IEEE Journal of Selected Topics in Signal Processing 2008, 2:261274.

Cover TM, Thomas JA: Elements of Information Theory. New York: John Wiley; 1990.

Peng H, Long F, Ding C: Feature Selection Based On Mutual Information: Criteria of MaxDependency,MaxRelevance, and MinRedundancy.
IEEE Transactions on Pattern Analysis and Machine Intelligence 2005, 27(8):12261238. PubMed Abstract  Publisher Full Text

Devroye L, Györfi L, Lugosi G: A Probabilistic Theory of Pattern Recognition. Springer Verlag; 1996.

Fleuret F: Fast Binary Feature Selection with Conditional Mutual Information.

Peng H, Long F: An efficient maxdependency algorithm for gene selection.
Proceedings of the 36th Symposium on the Interface: Computational Biology and Bioinformatics 2004.

Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos X: Local causal and Markov Blanket induction for causal discovery and feature selection for classification. Part I.

Pearl J: [http://www.amazon.com/CausalityReasoningInferenceJudeaPearl/dp/0521773628] webcite
Causality: Models, Reasoning, and Inference. Cambridge University Pres; 2000.

Engl HW: Regularization of inverse problems. Kluwer Academic Publishers Group; 1996.

Hedges L, Olkin I: Statistical Methods for MetaAnalysis. [http://www.jstor.org/pss/2289186] webcite
Journal of the American Statistical Association 1987, 82(397):350351. Publisher Full Text

Stone M: Crossvalidatory choice and assessment of statistical predictions.
Journal of the Royal Statistical Society B 1974, 36:111147.

van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhiven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer.
Nature 2002, 415:530536. PubMed Abstract  Publisher Full Text

Caruana R, NiculescuMizil A: Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria.

van Rijsbergen CJ: Information Retrieval. Butterworth; 1979.

Dietterich GT: Approximate statistical tests for comparing supervised classification learning algorithms. [http://dx.doi.org/10.1162/089976698300017197] webcite
Neural Comput 1998, 10:18951923. PubMed Abstract  Publisher Full Text

Holm S: A simple sequentially rejective multiple test procedure.

HaibeKains B, Desmedt C, Rothe F, Piccart M, Sotiriou C, Bontempi G: A fuzzy gene expressionbased computational approach improves breast cancer prognostication. [http://genomebiology.com/2010/11/2/R18] webcite
Genome Biology 2010, 11(2):R18. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledgebased approach for interpreting genomewide expression profiles. [http://www.pnas.org/content/102/43/15545.abstract] webcite
Proceedings of the National Academy of Sciences of the United States of America 2005, 102(43):1554515550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hollander M, Wolfe DA: Nonparametric statistical inference. New York: John Wiley and Sons; 1973.

Jaffe AB, Hall A: Rho GTPases: biochemistry and biology.
Annu Rev Cell Dev Biol 2005, 21:24769. PubMed Abstract  Publisher Full Text

Burbelo P, Wellstein A, Pestell RG: Altered Rho GTPase signaling pathways in breast cancer cells.
Breast Cancer Res Treat 2004, 84:438. PubMed Abstract  Publisher Full Text

Bromberg KD, Kluger HM, Delaunay A, Abbas S, DiVito KA, Krajewski S, Ronai Z: Increased expression of the E3 ubiquitin ligase RNF5 is associated with decreased survival in breast cancer.
Cancer Res 2007, 67(17):81729. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Heikal AA: Intracellular coenzymes as natural biomarkers for metabolic activities and mitochondrial anomalies.
Biomark Med 2010, 4(2):24163. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yenugonda VM, Deb TB, Grindrod SC, Dakshanamurthy S, Yang Y, Paige M, Brown ML: Fluorescent cyclindependent kinase inhibitors block the proliferation of human breast cancer cells.
Bioorg Med Chem 2011, 19(8):271425. PubMed Abstract  Publisher Full Text

Moore NL, Weigel NL: Regulation of progesterone receptor activity by cyclin dependent kinases 1 and 2 occurs in part by phosphorylation of the SRC1 carboxylterminus.

Rothwell PM, Fowkes FGR, Belch JFF, Ogawa H, Warlow CP, Meade TW: Effect of daily aspirin on longterm risk of death due to cancer: analysis of individual patient data from randomised trials.
Lancet 2011, 377(9759):3141. PubMed Abstract  Publisher Full Text

De Santo C, Serafini P, Marigo I, Dolcetti L, Bolla M, Del Soldato P, Melani C, Guiducci C, Colombo MP, Iezzi M, Musiani P, Zanovello P, Bronte V: Nitroaspirin corrects immune dysfunction in tumorbearing hosts and promotes tumor eradication by cancer vaccination.
Proc Natl Acad Sci USA 2005, 102(11):418590. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Brasky TM, Bonner MR, Moysich KB, Ambrosone CB, Nie J, Tao MH, Edge SB, Kallakury BVS, Marian C, Goerlitz DS, Trevisan M, Shields PG, Freudenheim JL: Nonsteroidal antiinflammatory drugs (NSAIDs) and breast cancer risk: differences by molecular subtype.

Hsu DS, Kim MK, Balakumaran BS, Acharya CR, Anders CK, Clay T, Lyerly HK, Drake CG, Morse MA, Febbo PG: Immune Signatures Predict Prognosis in Localized Cancer.
Cancer Invest 2010, 28(7):765773. PubMed Abstract  Publisher Full Text

Teschendorff A, Miremadi A, Pinder S, Ellis I, Caldas C: An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. [http://genomebiology.com/2007/8/8/R157] webcite
Genome Biology 2007, 8(8):R157. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Desmedt C, HaibeKains B, Wirapati P, Buyse M, Larsimont D, Bontempi G, Delorenzi M, Piccart M, Sotiriou C: Biological Processes Associated with Breast Cancer Clinical Outcome Depend on the Molecular Subtypes. [http://clincancerres.aacrjournals.org/cgi/content/abstract/14/16/5158] webcite
Clin Cancer Res 2008, 14(16):51585165. PubMed Abstract  Publisher Full Text

Barnard NJ, Hall PA, Lemoine NR, Kadar N: Proliferative index in breast carcinoma determined in situ by Ki67 immunostaining and its relationship to clinical and pathological variables.
J Pathol 1987, 152(4):28795. PubMed Abstract  Publisher Full Text

Locker AP, Birrell K, Bell JA, Nicholson RI, Elston CW, Blamey RW, Ellis IO: Ki67 immunoreactivity in breast carcinoma: relationships to prognostic variables and short term survival.
Eur J Surg Oncol 1992, 18(3):2249. PubMed Abstract

HaibeKains B, Desmedt C, Piette F, Buyse M, Cardoso F, van't Veer L, Piccart M, Bontempi G, Sotiriou C: Comparison of prognostic gene expression signatures for breast cancer. [http://www.biomedcentral.com/14712164/9/394] webcite
BMC Genomics 2008, 9:394. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Wirapati P, Sotiriou C, Kunkel S, Farmer P, Pradervand S, HaibeKains B, Desmedt C, Ignatiadis M, Sengstag T, Schutz F, Goldstein D, Piccart M, Delorenzi M: Metaanalysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. [http://breastcancerresearch.com/content/10/4/R65] webcite
Breast Cancer Research 2008, 10(4):R65. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

HaibeKains B, Desmedt C, Sotiriou C, Bontempi G: A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/19/2200] webcite
Bioinformatics 2008, 24(19):22002208. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Heidorn PB, Palmer CL, Wright D: Biological information specialists for biological informatics.
J Biomed Discov Collab 2007, 2:1. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes.
Nucleic Acids Res 2000, 28:2730. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

McCall MN, Bolstad BM, Irizarry RA: Frozen robust multiarray analysis (fRMA).
Biostatistics 2010, 11(2):24253. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Miller LD, Smeds J, George J, Vega VB, Vergara L, Pioner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival.
PNAS 2005, 102(38):1355013555. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pawitan Y, Bjohle J, Amler L, Borg A, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedren S, Bergh J: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two populationbased cohorts.
Breast Cancer Research 2005, 7(6):953964. BioMed Central Full Text

Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, van Gelder MEM, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: GeneExpression Profiles to Predict Distant Metastasis of LymphNodeNegative Primary Breast Cancer.
Lancet 2005, 365(9460):671679. PubMed Abstract  Publisher Full Text

Minn AJ, Gupta GP, Padua D, Bos P, Nguyen DX, Nuyten D, Kreike B, Zhang Y, Wang Y, Ishwaran H, Foekens JA, van de Vijver M, Massague J: Lung metastasis genes couple breast tumor size and metastatic spread. [http://www.pnas.org/cgi/content/abstract/104/16/6740] webcite
Proceedings of the National Academy of Sciences 2007, 104(16):67406745. Publisher Full Text

Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, HaibeKains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. [http://jnci.oxfordjournals.org/cgi/content/abstract/jnci;98/4/262] webcite
J Natl Cancer Inst 2006, 98(4):262272. PubMed Abstract  Publisher Full Text

Schmidt M, Bohm D, von Torne C, Steiner E, Puhl A, Pilch H, Lehr HA, Hengstler JG, Kolbl H, Gehrmann M: The Humoral Immune System Has a Key Prognostic Impact in NodeNegative Breast Cancer. [http://cancerres.aacrjournals.org/cgi/content/abstract/68/13/5405] webcite
Cancer Res 2008, 68(13):54055413. PubMed Abstract  Publisher Full Text

Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, HaibeKains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JG, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C: Strong Time Dependence of the 76Gene Prognostic Signature for NodeNegative Breast Cancer Patients in the TRANSBIG Multicenter Independent Validation Series. [http://clincancerres.aacrjournals.org/cgi/content/abstract/13/11/3207] webcite
Clin Cancer Res 2007, 13(11):32073214. PubMed Abstract  Publisher Full Text