Use of Attribute Driven Incremental Discretization and Logic Learning Machine to build a prognostic classifier for neuroblastoma patients

Cangelosi, Davide; Muselli, Marco; Parodi, Stefano; Blengio, Fabiola; Becherini, Pamela; Versteeg, Rogier; Conte, Massimo; Varesio, Luigi

doi:10.1186/1471-2105-15-S5-S4

Volume 15 Supplement 5

Italian Society of Bioinformatics (BITS): Annual Meeting 2013: Bioinformatics

Research
Open access
Published: 06 May 2014

BMC Bioinformatics volume 15, Article number: S4 (2014)

Abstract

Background

A cancer patient's outcome is written, in part, in the gene expression profile of the tumor. We previously identified a 62-probe-set signature (NB-hypo) to identify tissue hypoxia in neuroblastoma tumors and showed that NB-hypo stratified neuroblastoma patients into good and poor outcome groups [1]. It was important to develop a prognostic classifier to cluster patients into risk groups benefiting from defined therapeutic approaches. Novel classification and data discretization approaches can be instrumental for the generation of accurate predictors and robust tools for clinical decision support. We explored the application to gene expression data of Rulex, a novel software suite including the Attribute Driven Incremental Discretization technique for transforming continuous variables into simplified discrete ones and the Logic Learning Machine model for intelligible rule generation.

Results

We applied Rulex components to the problem of predicting the outcome of neuroblastoma patients on the basis of the 62 probe sets of the NB-hypo gene expression signature. The resulting classifier consisted of 9 rules, most of which use two conditions on the relative expression of 11 probe sets. These rules were very effective predictors, as shown in an independent validation set, demonstrating the validity of the LLM algorithm applied to microarray data and patients' classification. The LLM performed as efficiently as Prediction Analysis of Microarray and Support Vector Machine, and outperformed other learning algorithms such as C4.5. Rulex carried out a feature selection by selecting a new signature (NB-hypo-II) of 11 probe sets that turned out to be the most relevant in predicting outcome among the 62 of the NB-hypo signature. Rules are easily interpretable as they involve only a few conditions.

Furthermore, we demonstrate that the application of a weighted classification associated with the rules improves the classification of poorly represented classes.

Conclusions

Our findings provided evidence that the application of Rulex to the expression values of NB-hypo signature created a set of accurate, high quality, consistent and interpretable rules for the prediction of neuroblastoma patients' outcome. We identified the Rulex weighted classification as a flexible tool that can support clinical decisions. For these reasons, we consider Rulex to be a useful tool for cancer classification from microarray gene expression data.

Background

Neuroblastoma (NB) is the most common solid pediatric tumor, deriving from ganglionic lineage precursors of the sympathetic nervous system [2]. It shows notable heterogeneity of clinical behavior, ranging from rapid progression, associated with metastatic spread and poor clinical outcome, to spontaneous or therapy-induced regression into benign ganglioneuroma. Age at diagnosis, stage and amplification of the N-myc proto-oncogene (MYCN) are clinical and molecular risk factors that the International Neuroblastoma Risk Group (INRG) utilized to classify patients into high, intermediate and low risk subgroups on which current therapeutic strategy is based. About fifty percent of high-risk patients die despite treatment, making the exploration of new and more effective strategies for improving stratification mandatory [3].

The availability of genomic profiles improved our prognostic ability in many types of cancers [4]. Several groups used gene expression-based approaches to stratify NB patients. Prognostic gene signatures were described [5–11] and classifiers were proposed to predict the risk class and/or patients' outcome [5–13]. We and other scientific groups have identified tumor hypoxia as a critical component of neuroblastoma progression [14–16]. Hypoxia is a condition of low oxygen tension occurring in poorly vascularized areas of the tumor which has profound effects on cell growth, genotype selection, susceptibility to apoptosis, resistance to radio- and chemotherapy, tumor angiogenesis, epithelial to mesenchymal transition and propagation of cancer stem cells [17–20]. Hypoxia activates specific genes encoding angiogenic, metabolic and metastatic factors [18, 21] and contributes to the acquisition of the aggressive tumor phenotype [18, 22, 23]. We have used gene expression profiling to assess the hypoxic status of NB cells and we have derived a robust 62-probe-set NB hypoxia signature (NB-hypo) [14, 24], which was found to be an independent risk factor for neuroblastoma patients [1].

The use of gene expression data for tumor classification is hindered by the intrinsic variability of the microarray data deriving from technical and biological variability. These limitations can be overcome by analyzing the results through algorithms capable of discretizing the gene expression data into broad ranges of values rather than considering the absolute values of probe set expression. In the present work, we focus on the discretization approach to deal with gene expression data for patients' stratification.

Classification is central to the stratification of cancer patients into risk groups and several statistical and machine learning techniques have been proposed to deal with this issue [25]. We are interested in classification methods capable of constructing models described by a set of explicit rules, because such rules can be translated immediately into the clinical setting and are easy to interpret and to check for consistency and robustness [15, 26]. A rule is a statement in the form "if <premise> then <consequence>" where the premise is a logic product (AND) of conditions on the attributes of the problem and the consequence indicates the predicted output. The most widely used rule generation techniques belong to two broad paradigms: decision trees and methods based on Boolean function synthesis.
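The rule format described above can be illustrated with a minimal Python sketch. It is not the Rulex representation: the `Rule` class, the `Condition` tuple, the operators and the probe names and cutoffs are all hypothetical and chosen only to show a premise evaluated as a logical AND of conditions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A condition is (attribute name, operator, threshold); hypothetical layout.
Condition = Tuple[str, str, float]


@dataclass
class Rule:
    premise: List[Condition]  # conditions joined by logical AND
    consequence: str          # predicted output class

    def is_satisfied_by(self, sample: Dict[str, float]) -> bool:
        """Return True when every condition of the premise holds for the sample."""
        for attribute, operator, threshold in self.premise:
            value = sample[attribute]
            if operator == ">":
                holds = value > threshold
            elif operator == "<=":
                holds = value <= threshold
            else:
                raise ValueError(f"unsupported operator: {operator}")
            if not holds:
                return False
        return True


# Made-up rule: "if probe_A > 100 AND probe_B <= 50 then poor".
rule = Rule(premise=[("probe_A", ">", 100.0), ("probe_B", "<=", 50.0)],
            consequence="poor")

print(rule.is_satisfied_by({"probe_A": 150.0, "probe_B": 20.0}))
print(rule.is_satisfied_by({"probe_A": 150.0, "probe_B": 80.0}))
```

The first sample meets both conditions and the second fails the condition on `probe_B`, so only the first one would be assigned the consequence of this rule.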

The decision tree approach implements discriminant policies where differences between output classes are the driver for the construction of the model. These algorithms iteratively divide the dataset into smaller subsets according to a divide-and-conquer strategy, giving rise to a tree structure from which an explicit set of rules can be easily retrieved. At each iteration a part of the training set is split into two or more subsets to obtain non-overlapping portions belonging to the same output class [27]. Decision tree methods provide simple rules and require a reduced amount of computational resources. However, the accuracy of the models is often poor. The divide-and-conquer approach prevents the applicability of these models to relatively small datasets, which would be progressively fractionated into very small, poorly indicative subsets.

Methods based on Boolean function synthesis adopt an aggregative policy where some patterns belonging to the same output class are clustered to produce an explicit rule at any iteration. Suitable heuristic algorithms [28–30] are employed to generate rules exhibiting the highest covering and the lowest error; a tradeoff between these two objectives has been obtained by applying the Shadow Clustering (SC) technique [28], which leads to final models, called Logic Learning Machines (LLM), exhibiting good accuracy. The aggregative policy can also consider patterns already included in previously built rules; therefore, SC generally produces overlapping rules that characterize each output class better than the divide-and-conquer strategy. Clustering samples of the same kind makes it possible to extract knowledge regarding similarities of the members of a given class rather than information on their differences. This is very useful in most applications and leads to models showing higher generalization ability, as shown by trials performed with SC [31, 32].

LLM algorithms prevent the excessive fragmentation problem typical of the divide-and-conquer approach, but require an intelligent strategy for managing the conflicts that occur when one instance satisfies more than one rule classifying opposite outcomes. LLM is a novel and efficient implementation of the Switching Neural Network (SNN) model [33] trained through an optimized version of the SC algorithm. LLM, SNN and SC have been successfully used in different applications: from reliability evaluation of complex systems [34] to prediction of social phenomena [35], from bulk electric assessment [36] to analysis of biomedical data [15, 31, 32, 37, 38].

The ability to generate models described by explicit rules has several advantages in extracting important knowledge from available data. Identification of prognostic factors in tumor diseases [15, 37] as well as selection of relevant features in microarray experiments [31] are only two of the valuable targets achieved through the application of LLM and SNN. In this analysis, to improve the accuracy of the model generated by LLM, a recent innovative preprocessing method, called Attribute Driven Incremental Discretization (ADID) [39], has been employed. ADID is an efficient data discretization algorithm capable of transforming continuous attributes into discrete ones by inserting a collection of separation points (cutoffs) for each variable. The core of ADID consists of an incremental algorithm that iteratively adds the cutoff obtaining the highest value of a quality measure based on the capability of separating patterns of different classes. Smart updating procedures enable ADID to efficiently get a (sub)optimal discretization. Usually, ADID produces a minimal set of cutoffs for separating all the patterns of different classes. ADID and LLM algorithms are implemented in Rulex 2.0 [40], a software suite developed and commercialized by Impara srl that has been utilized for the present work.
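The incremental idea behind this kind of discretization can be sketched as a greedy loop. The code below is a simplified illustration and not the ADID algorithm of [39]: the quality measure (number of opposite-class sample pairs newly separated by a cutoff), the candidate cutoffs (midpoints between consecutive distinct values) and the function name `greedy_cutoffs` are assumptions made for this example.

```python
from itertools import combinations


def greedy_cutoffs(X, y):
    """Greedily add (attribute index, cutoff) pairs until samples of
    different classes no longer share the same discrete cell."""
    n_attr = len(X[0])
    # Pairs of samples with different labels that are not yet separated.
    unseparated = {(i, j) for i, j in combinations(range(len(X)), 2)
                   if y[i] != y[j]}
    # Candidate cutoffs: midpoints between consecutive distinct values.
    candidates = []
    for a in range(n_attr):
        values = sorted({row[a] for row in X})
        candidates += [(a, (lo + hi) / 2) for lo, hi in zip(values, values[1:])]

    chosen = []
    while unseparated:
        def gain(candidate):
            a, c = candidate
            # Number of still-unseparated pairs that this cutoff would split.
            return sum((X[i][a] <= c) != (X[j][a] <= c) for i, j in unseparated)

        best = max(candidates, key=gain, default=None)
        if best is None or gain(best) == 0:
            break  # the remaining pairs cannot be separated by any cutoff
        a, c = best
        unseparated = {(i, j) for i, j in unseparated
                       if (X[i][a] <= c) == (X[j][a] <= c)}
        chosen.append(best)
    return chosen


# Toy data: two attributes, four samples; values are made up.
X = [[1.0, 5.0], [2.0, 6.0], [8.0, 5.5], [9.0, 6.5]]
y = ["good", "good", "poor", "poor"]
print(greedy_cutoffs(X, y))
```

On this toy data a single cutoff on the first attribute separates every good sample from every poor one, so the loop stops after one step. This mirrors, in a much simpler setting, the preference for a small set of cutoffs described above.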

By blending the generalization and feature selection strength of LLM and the efficiency of ADID in mapping continuous variables into a discrete domain with the stratification power of the NB-hypo signature, we obtained an accurate predictor of NB patients' outcome and a robust tool for supporting clinical decisions. In the present work, we applied Rulex 2.0 components to the problem of classifying and predicting the outcome of neuroblastoma patients on the basis of hypoxia-specific gene expression data. We demonstrate that our approach generates an excellent discretization of gene expression data resulting in a classifier predicting NB patients' outcome. Furthermore, we show the flexibility of this approach, endowed with the ability to steer the outcome towards clinically oriented specific questions.

Results

Rulex model

We analyzed gene expression of 182 neuroblastoma tumors profiled by the Affymetrix platform. The characteristics of the NB patients are shown in Table 1 and are comparable to those previously described [6]. We selected this dataset because the gene expression profile of the primary tumor, performed by microarray, was available for each patient. "Good" or "poor" outcome are defined, from here on, as the patient's status "alive" or "dead" 5 years after diagnosis, respectively.

Table 1 Characteristics of 182 neuroblastoma patients included in the study.

We previously described the NB-hypo 62 probe sets signature that represents the hypoxic response of neuroblastoma cells [14, 24]. Here we used this signature to develop a hypoxia-based classifier predicting the patients' status, utilizing ADID to convert the continuous probe set values into discrete attributes and the LLM algorithm to generate classification rules. Both techniques are implemented in Rulex 2.0. The first assessment of the classifier was done on the training set of 109 randomly chosen patients, while the remaining 73 patients were utilized to validate the predictions (Figure 1). The outcome of the classifier is a collection of rules, in the form if <premise> then <consequence>, where the premise includes conditions based on the probe set values and the consequence is the patient status. Rulex 2.0 uses these rules collectively for outcome prediction on the validation set.

The generation of the classification rules requires a discretization step because this simplifies the selection of the cut-off values of the probe sets expression (Figure 1). The discretization yielded one cutoff value for each probe set which was sufficient for modeling the outcome. The use of a single cutoff dichotomizes the probe set attributes in low or high expression, drastically reducing the influence of the technical and biological variability present in the models associated with the absolute values of probe sets expression.

Furthermore, a test on the maximum error allowed for a rule was defined. The final classification rules were trained with the optimal value of 25% associated with the maximal mean accuracy of 87%, determined by 10 fold cross validation analysis (Figure 1). The procedure generated 9 rules, numbered from 1 to 9 (Rule ID) in Table 2 and based on conditions containing high or low probe sets expression value (above or below the set threshold respectively). Rule premises are limited to two conditions with the exception of rule 7 that has only one condition. Six rules predict good outcome and 3 poor outcome; they will be considered together in scoring the class attribution of new patients of the validation set. This is the optimal scenario proposed by Rulex 2.0 to utilize the 62 probe sets NB-hypo signature for classifying patients' outcome.
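The selection of a parameter such as the maximum error allowed for a rule by 10 fold cross validation can be sketched as follows. This is a generic illustration and not the Rulex procedure: the fold construction (interleaved, without shuffling), the function names and the `toy_evaluate` stand-in are assumptions, and the stand-in is contrived to peak at 0.25 so that the example is deterministic.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split sample indices into k disjoint folds (interleaved, no shuffling)."""
    return [list(range(start, n_samples, k)) for start in range(k)]


def select_by_cross_validation(candidates, evaluate, n_samples, k=10):
    """Return the candidate value with the highest mean accuracy over k folds.

    `evaluate(value, train_idx, test_idx)` is a user-supplied function that
    trains on train_idx with the given parameter value and returns the
    accuracy measured on test_idx.
    """
    folds = k_fold_indices(n_samples, k)
    mean_accuracy = {}
    for value in candidates:
        scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            scores.append(evaluate(value, train_idx, test_idx))
        mean_accuracy[value] = mean(scores)
    best = max(mean_accuracy, key=mean_accuracy.get)
    return best, mean_accuracy[best]


def toy_evaluate(max_error, train_idx, test_idx):
    # Stand-in for "train a rule model and score it"; not a real classifier.
    return 1.0 - abs(max_error - 0.25)


best_value, best_score = select_by_cross_validation(
    [0.05, 0.15, 0.25, 0.35], toy_evaluate, n_samples=109)
print(best_value, best_score)
```

In a real analysis `evaluate` would train the rule generator on the training indices with the candidate maximum error and return the accuracy on the held-out fold.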

Table 2 Classification rules.

Three parameters, shown in Table 2, estimate the quality of the rules: 1) covering, measuring the generality, 2) error, measuring the ambiguity and 3) Fisher's p-value, measuring the significance of each rule. The statistical significance of each rule by Fisher's exact test was very high (p < 0.001), providing strong evidence of the excellent quality of the rules. The covering ranged among rules from 48% (rule 6) to 80% (rule 7) for the good outcome class and from 57% (rule 9) to 92% (rule 7) for the poor outcome class. Error ranged from 3.5% (rule 1) to 14% (rules 2 and 5) for the good outcome class and from 7.4% (rule 9) to 17% (rule 7) for the poor outcome class. These rules have interesting features that will be addressed in detail. The first consideration is that the overall covering of the rules classifying good and poor outcome adds up to 380% and 209%, respectively, indicating overlap among rules. This is a characteristic of the LLM method, which implements an aggregative, rather than fragmentation, policy as illustrated in the Materials and Methods section. However, overlap among the rules can lead to a conflict if the probe set values of a patient satisfy two or more rules predicting opposite outcomes. To investigate whether overlap among our rules can be a source of conflicts, we plotted in Figure 2 each patient's membership to the nine rules. The plot clusters the rules classifying good and poor outcome and the patients belonging to good or poor outcome classes. The results show that each patient is covered by at least one rule. Overlaps exist and occur mainly among rules predicting the same outcome (Figure 2B and 2C), which do not lead to classification conflicts. However, 30 patients (27% of the dataset) were covered by rules pertaining to opposite outcomes; they are the patients in quadrant 2A (19 patients) or 2D (11 patients). Rulex 2.0 overcomes this problem by assigning a class to a new sample through a fast procedure that evaluates all the rules satisfied by the sample and their covering, thus generating a consensus outcome, as detailed in the Materials and Methods section.
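The covering and error of a rule, and a covering-based consensus among conflicting rules, can be sketched in Python. The definitions used here are assumptions for illustration: covering is taken as the fraction of samples of the rule's class that satisfy the rule, error as the fraction of samples of the other class that satisfy it, and the consensus simply sums the covering of the satisfied rules per class. The actual Rulex procedure is the one detailed in the Materials and Methods section and may differ.

```python
def covering_and_error(rule_fires, labels, rule_class):
    """rule_fires[i] is True when sample i satisfies the rule premise."""
    pairs = list(zip(rule_fires, labels))
    tp = sum(fires and label == rule_class for fires, label in pairs)
    fn = sum((not fires) and label == rule_class for fires, label in pairs)
    fp = sum(fires and label != rule_class for fires, label in pairs)
    tn = sum((not fires) and label != rule_class for fires, label in pairs)
    covering = tp / (tp + fn)  # share of the rule's own class that it covers
    error = fp / (fp + tn)     # share of the other class that it also covers
    return covering, error


def consensus(satisfied_rules):
    """satisfied_rules: (predicted class, covering) for each rule that the
    sample satisfies; the class with the largest summed covering wins."""
    scores = {}
    for predicted_class, covering in satisfied_rules:
        scores[predicted_class] = scores.get(predicted_class, 0.0) + covering
    return max(scores, key=scores.get)


# Toy example with made-up values.
fires = [True, True, False, True, False]
labels = ["good", "good", "good", "poor", "poor"]
print(covering_and_error(fires, labels, "good"))

print(consensus([("good", 0.48), ("poor", 0.57)]))
print(consensus([("good", 0.48), ("good", 0.30), ("poor", 0.57)]))
```

In the second consensus call two good outcome rules together outweigh one poor outcome rule, which shows how overlapping rules of the same class can reinforce each other. The sketch assumes that both classes are present in the data and that every sample satisfies at least one rule, as reported for the training set above.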

A second characteristic of the classification rules is that they include only 11 out of 62 probe sets of the original NB-hypo signature. The relationship among probe sets and rules is shown in Table 3. Rulex 2.0 operated a second feature selection on the original 62 probe sets optimized for functioning in a binary profile of low and high probe set expression and gave rise to a modified hypoxia signature that we name NB-hypo-II.

Table 3 Probe sets characteristics of the new NB-hypo-II signature.

It was of interest to assess the relative importance of each probe set in the classification scheme leading to identification of poor and good outcome patients. Negative or positive values indicate low or high expression associated with the predicted outcome and the relative relevance is measured by the absolute value. Ranking the probe sets may be of relevance to pick the genes for further validation on an alternative platform. It is interesting to note that probe set 217356_s_at is the most relevant for the classification of good and poor outcome patients.

A third interesting feature of the rules is the dichotomization of the expression values of each probe set, even though Rulex 2.0 had no constraints on the discretization and could have produced multiple cutoffs or overlapping expression values in different rules. We were intrigued by these results and decided to examine the relationship between the cutoff expression values identified by Rulex and those obtained by Kaplan-Meier analysis. The Kaplan-Meier algorithm calculates all possible cut-off points of a given probe set in a cohort of patients and selects the one maximizing the distinction between good and poor outcome. The results, shown in Table 4, demonstrated a generally good concordance between the cutoff values of the Kaplan-Meier analysis and those generated by Rulex. In particular, the two measures always differed by less than ±50% in magnitude, but only in some cases (probe sets 2, 3, 4, 5, and 8 for both overall and relapse free survival, and probe sets 6 and 7 for overall survival) was the difference lower than 25%. These discrepancies, even if rather small, are probably related to the capability of ADID to exploit the complex multivariate correlation among probe sets. Furthermore, we also found a concordance in the relationship between high/low probe set expression and outcome in Rulex derived rules and Kaplan-Meier plots (Table 4).

Table 4 Expression cut-offs from the Kaplan-Meier and from the rules.

Outcome prediction

The ability of the rules in Table 2 to predict patients' outcome was tested on an independent dataset of 73 patients. Results are expressed as accuracy, recall and precision, which assess the performance in classifying good outcome patients, and as specificity and negative predictive value (NPV), which assess the performance in classifying poor outcome patients. The direct evaluation of the performance of the 9 rules on the validation set is reported under ADID in Table 5, under ADID+LLM in Table 6, under the base configuration in Table 7 and under no dataset modification in Additional file 1. The results demonstrate a good accuracy, comparable to that reported for other algorithms [13]. Furthermore, the classification of good outcome was superior to that of poor outcome patients, as shown for example by a recall of 90% compared with a specificity of 57%.
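The five performance measures can be computed from a confusion matrix as in the following sketch, which takes good outcome as the positive class in line with the description above. The function name and the counts in the example are made up for illustration and are not the results of the study.

```python
def performance(tp, fn, fp, tn):
    """Confusion-matrix metrics with good outcome as the positive class.

    tp: good outcome patients predicted good
    fn: good outcome patients predicted poor
    fp: poor outcome patients predicted good
    tn: poor outcome patients predicted poor
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": tp / (tp + fn),       # good outcome correctly identified
        "precision": tp / (tp + fp),    # predicted good that are truly good
        "specificity": tn / (tn + fp),  # poor outcome correctly identified
        "npv": tn / (tn + fn),          # predicted poor that are truly poor
    }


# Hypothetical counts, not taken from the validation set of the paper.
metrics = performance(tp=40, fn=10, fp=5, tn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these definitions, recall and precision describe how well good outcome patients are recognized, while specificity and NPV describe the poor outcome class.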

Table 5 Comparison among discretization algorithms' performance.

Table 6 Performance of classification algorithms.

Table 7 Performance comparison among the configurations in the weighted classification on the test set.

PVCA analysis [41] was utilized to estimate the potential variability of experimental effects including batch. The analysis revealed that batch effect explained a moderate 21% of the overall variation in our dataset and a Frozen Surrogate Variable Analysis (FSVA) was employed for removing batch effect. The application of FSVA reduced batch effect to less than 0.05% of the total variation (data not shown).

We compared the performances achieved by ADID and LLM on the batch-adjusted dataset and those on the original dataset (no dataset modification) to measure the impact of batch effect on classifier performances. Performance obtained with the adjusted dataset turns out to be very similar to that obtained with original data as shown in Additional file 1 demonstrating that batch effect had negligible impact on the performances. Therefore, the dataset with no modifications was utilized for subsequent analysis.

We then compared the performance of the ADID discretization approach with that of commonly used discretization algorithms, namely: entropy based (EntMDL [42]), Modified Chi Square [43], ROC based (highest Youden index [44]), and Equal frequency (i.e. median expression for each feature). Results detailed in Table 5 showed that the discretization performed by ADID produced better accuracy than the others (80% vs. 68%-77%).
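Two of the univariate baselines named above, the median (equal frequency) cutoff and the cutoff with the highest Youden index, can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: it assumes that values above the cutoff indicate the positive class, and the function names and toy values are hypothetical.

```python
from statistics import median


def median_cutoff(values):
    """Equal frequency split into two bins: the median of the feature."""
    return median(values)


def youden_cutoff(values, is_positive):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1,
    predicting positive when value > cutoff."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(values)):
        tp = sum(v > cutoff and p for v, p in zip(values, is_positive))
        tn = sum(v <= cutoff and not p for v, p in zip(values, is_positive))
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # strict comparison keeps the lowest cutoff on ties
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy expression values for one probe set; made up for illustration.
values = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0]
is_positive = [False, False, False, True, True, True]

print(median_cutoff(values))
print(youden_cutoff(values, is_positive))
```

Both methods look at one feature at a time, which is the sense in which the text contrasts them with the multivariate ADID approach.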

We also compared the performance of LLM and those of Decision Tree [45], Support Vector Machines (SVM) [46], and Prediction Analysis of Microarrays (PAM) [47], to evaluate the ability of LLM to predict patients' outcome with respect to other standard supervised learning methods. Results in Table 6 revealed that ADID and LLM were able to predict patients' outcome with better performances with respect to the decision tree classifier. The performances of LLM, PAM and SVM classifiers were comparable.

Overall, the performance was good, but unbalanced datasets tend to bias the performance towards the most represented class [48]. In our dataset the good outcome patients were more frequent (26% poor and 74% good outcome). Therefore, we explored the possibility of utilizing a weighted classification system (WCS) [48–53] to improve the classification of poorly represented classes or to force the algorithm to maximize the performance on selected outcomes. The performance of the base configuration was taken as reference.

We performed a weighted classification accounting for the unbalanced class representation in the dataset (Table 7, configuration W26_74). The weight was calculated as the inverse proportion of the number of patients belonging to each class, which puts about 3 times more weight on the poor outcome class. The major improvement over the base configuration was in specificity, whereas all other parameters were similar or worse. Configuration W1_1000 was similar to W26_74, but set the weight of poor outcome 1000 times higher than that of the good one. Interestingly, its performance was very close to that of configuration W26_74 despite the disparity in the weight applied, suggesting that small changes in the relative weight of poor outcome are sufficient to optimize the results. In conclusion, increased weight on poor outcome augmented the percentage of correctly classified poor outcome patients even though a smaller number of patients were included in this class. This correction may be relevant when maximization of the specificity is of primary importance, as in the case of using a prudent therapy.
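The inverse proportion weighting can be worked through with the class frequencies reported above (26% poor, 74% good). The sketch below also shows one way in which class weights could enter a consensus among conflicting rules, by multiplying the covering of each satisfied rule by the weight of its class. That weighted vote is an assumption made for illustration and not the documented Rulex procedure; the function names and covering values are hypothetical.

```python
def inverse_proportion_weights(class_fractions):
    """Weight each class by the inverse of its share of the dataset."""
    return {label: 1.0 / fraction for label, fraction in class_fractions.items()}


def weighted_consensus(satisfied_rules, weights):
    """satisfied_rules: (predicted class, covering) for each satisfied rule.
    Each rule votes with its covering multiplied by the weight of its class."""
    scores = {}
    for predicted_class, covering in satisfied_rules:
        vote = weights[predicted_class] * covering
        scores[predicted_class] = scores.get(predicted_class, 0.0) + vote
    return max(scores, key=scores.get)


weights = inverse_proportion_weights({"poor": 0.26, "good": 0.74})
ratio = weights["poor"] / weights["good"]  # 0.74 / 0.26, roughly 2.85
print(round(ratio, 2))

# A sample satisfying two conflicting rules (covering values are made up).
conflict = [("good", 0.80), ("poor", 0.57)]
print(weighted_consensus(conflict, {"good": 1.0, "poor": 1.0}))
print(weighted_consensus(conflict, weights))

# When all satisfied rules agree, the weights cannot change the result.
agreeing = [("good", 0.80), ("good", 0.48)]
print(weighted_consensus(agreeing, weights))
```

The ratio of about 2.85 corresponds to the "about 3 times more weight" on the poor outcome class mentioned above. In this sketch the weights only change the assignment of samples covered by conflicting rules.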

In contrast, configuration W1000_1 sets the weight of good outcome 1000 times higher than that of poor outcome. The performance parameters were similar to or higher than those of the base configuration with the exception of specificity, which was quite low, a situation that appears symmetrical to that observed previously. The recall is almost absolute, indicating the exceptional ability to classify good outcome. The drawback of this configuration is a very low percentage of correct poor outcome classification, 35% of all poor outcome patients. This configuration may be useful in the case of using an aggressive therapy.

In conclusion, WCS can improve performance parameters of classification of poor or good outcome patients and may be particularly relevant in a situation where the dataset contains a major unbalance between classes and/or when clinical decisions may require minimizing false positives or false negatives.

Discussion

Our study is based on gene expression data derived through the analysis of primary neuroblastoma tumors by microarray on the Affymetrix platform. We focused on the expression of 62 probe sets comprising the NB-hypo signature that we have previously shown to represent the hypoxia status of neuroblastoma cells [24]. The association of hypoxia with poor prognosis in neuroblastoma patients was previously demonstrated [16, 54]. We studied a cohort of 182 NB patients characterized by clinical and molecular data, addressing the question of the potential prognostic value of this signature. The Rulex 2.0 suite was used to train a model on a set of patients and validate it on an independent one. We demonstrate that the Attribute Driven Incremental Discretization and Logic Learning Machine algorithms, implemented in Rulex 2.0, generated a robust set of rules predicting the outcome of neuroblastoma patients using expression values of 11 hypoxia-specific probe sets extracted from the NB-hypo gene expression profile.

Outcome prediction of NB patients was reported by several groups using a combination of different risk factors and utilizing various algorithms [5, 6, 12, 55, 56]. Several groups have used gene expression-based approaches to stratify neuroblastoma patients and prognostic gene signatures have been described often based on the absolute values of the probe sets after appropriate normalization [5–11, 13]. Affymetrix GeneChip microarrays are the most widely used high-throughput technology to measure gene expression, and a wide variety of preprocessing methods have been developed to transform probe intensities reported by a microarray scanner into gene expression estimates [57]. However, variations from one experiment to another [58] may increase data variability and complicate the interpretation of expression analysis based on absolute gene expression values. We addressed the problem by applying the Attribute Driven Incremental Discretization algorithm [39] that maps continuous gene expression values into discrete attributes. Interestingly, the algorithm applied to our dataset showed that the introduction of a single cutoff was sufficient to create two expression patterns, operationally defined as low and high, capable of describing the probe set status accurately enough for effective patients classification. This approach minimizes the variability and errors associated with the use of absolute values to interpret microarray gene expression data.

The validity of the discretization implemented by Rulex 2.0 was further documented by an empirical validation where we calculated the optimal cutoff value for each of the 11 probe sets tested in a Kaplan-Meier analysis of the patients' survival. It is noteworthy that such analysis utilized the survival time of the patient as opposed to the 5 year survival considered by Rulex 2.0. Nevertheless, we demonstrated that the cutoff values calculated by the two approaches were rather similar, thus supporting the robustness of the ADID algorithm in identifying relevant discrete groups of expression values. From a technical point of view, ADID is a multivariate method searching for the minimum number of cutoffs that separate patients belonging to different classes. On the other hand, the Kaplan-Meier scan is a univariate technique that aims to identify the value of the probe set that maximizes the distance among the survival times of the resulting groups. It should be noted that these two approaches are independent of each other since they are based on different algorithms and different classifications.

Only 11 out of 62 probe sets of the original signature were considered by LLM for building the classifier. This selection has a biological meaning. In fact, the original 62 probe sets NB-hypo signature was obtained following a biology driven approach [1] in which the prior knowledge on tumor hypoxia was the basis for the analysis and the signature was derived from hypoxic neuroblastoma cell lines [14]. Hence, NB-hypo is optimized for detecting tumor hypoxia. The importance of hypoxia in conditioning tumor aggressiveness is documented by an extensive literature [17, 19, 20, 22, 59]. However, NB-hypo was not optimized to predict outcome, which depends on factors other than hypoxia. Rulex 2.0 performed a feature selection by identifying the 11 probe sets that were the most relevant in predicting outcome among those of the NB-hypo signature.

One key feature of LLM is to implement an aggregative policy leading to the situation in which one patient can be covered by more than one rule. This leads to the advantage of avoiding dataset fragmentation typical of the divide-and-conquer paradigm. Furthermore, the robustness of the resulting model is increased; in fact, if a patient satisfies more than one rule for the same output class, the probability of a correct classification is higher.

The same outcome is generally predicted by every rule verified by a given patient. However, there are situations in which a patient satisfies rules associated with opposite outcomes, thus generating a potential conflict. Rulex 2.0 overcomes this problem by adopting a procedure for assigning a specific class on the basis of the characteristics of the verified rules. A conflict should not necessarily be considered as a limit of the proposed approach, but it could reflect a source of ambiguity present in the dataset. If this were the case, any method building models from data would always reflect this ambiguity.

The generation of a predictive classifier based on gene expression obtained from different institutions raised the question of a possible batch effect in the data. We utilized the Frozen Surrogate Variable Analysis method, a batch effect removal method capable of estimating the training batch and using it as a reference for adjusting the batch effect of other batches, in particular those for which no information about the outcome is known. This was not possible with other known batch removal methods such as Combating Batch Effects (ComBat) [60], which adjusts the expression values of both training and test batch [61]. We compared the performances achieved by ADID and LLM on the batch-adjusted dataset and those on the original dataset (no dataset modification) to measure the impact of batch effect on classifier performances. Performance obtained with the adjusted dataset showed that batch effect had negligible impact on performances. Previous studies observed that the application of batch effect removal methods for prediction does not necessarily result in a positive or negative impact [61]. Furthermore, batch effect removal methods may remove the true biologically based signal [61]. For these reasons, the analysis of the present manuscript was performed on the original dataset excluding any modification for batch removal.

The 9 rules generated by ADID and LLM achieved a good accuracy on an independent validation set. We compared the accuracy of our new classifier with that of the top performing classifiers for NB patients' outcome prediction listed in [13] to study the concordance of the performance achieved by the 9 rules with that of other previously published classifiers. The classifiers were generated on different signatures and algorithms and the accuracy reported ranged from 80% to 87%. Note that those classifiers were validated on a different test set; they utilized different algorithms and signatures. We concluded that the accuracy of our rules and that of other techniques reported in literature were concordant.

ADID is an innovative discretization method that was used in combination with LLM for classification purposes. In the present study, ADID outperformed other discretization algorithms based on univariate analysis, indicating its capability to exploit the complex correlation structure commonly encountered in biomedical studies, including gene expression data sets. Moreover, ADID-LLM performed better than the decision tree classifier. This was somewhat predictable because the lower performance of decision trees with respect to other approaches has been pointed out in the literature [15]. The LLM rules and the SVM and PAM classifiers achieved similar performance. The good performance achieved by ADID and LLM, together with the explicit representation of the knowledge extracted from the data provided by the rules, demonstrated the utility of ADID and LLM in patients' outcome prediction.

Our dataset suffers from class imbalance, as do many other datasets [48–53]: the good outcome class is over-represented with respect to the poor outcome class. Rulex 2.0 implements a novel algorithmic strategy that allows different weights to be assigned to the outcomes, biasing the class assignment towards the class of interest. It should be noted that weighting affects class assignment only for patients that satisfy conflicting rules. We utilized the weighting approach to represent situations in which either poor or good outcome was favored, or to address the imbalance between good and poor outcome patients in our dataset. We found that, in the absence of predefined weights, the algorithm achieves a good performance that is somewhat unbalanced towards better classification of good outcome patients. By changing the weights, we could steer the prediction towards high precision in classifying poor outcome patients or towards higher specificity. This tool may be of practical importance in the decision making process of clinicians who are confronted with difficult therapeutic choices.

Conclusions

We provided the first demonstration of the applicability of the data discretization and rule generation methods implemented in Rulex 2.0 to the analysis of microarray data and the generation of a prognostic classifier. Rulex automatically derived a new signature, NB-hypo II, which is instrumental in predicting the outcome of NB patients. The performance achieved by Rulex is comparable to, and in some cases better than, that of other known data discretization and classification methods. Furthermore, the easy interpretability of the rules and the possibility of employing weighted classification make Rulex 2.0 a flexible and useful tool to support clinical decisions and therapy assignment.

Methods

Patients

A total of 182 neuroblastoma patients were enrolled on the basis of the availability of a gene expression profile obtained with the Affymetrix GeneChip HG-U133plus2.0. Eighty-eight patients were collected by the Academic Medical Center (AMC; Amsterdam, Netherlands) [1, 62]; 21 patients were collected by the University Children's Hospital, Essen, Germany and were treated according to the German Neuroblastoma trials, either NB97 or NB2004; 51 patients were collected at Hiroshima University Hospital or affiliated hospitals and were treated according to the Japanese neuroblastoma protocols [63]; 22 patients were collected at Gaslini Institute (Genoa, Italy) and were treated according to Italian AIEOP protocols. The data are stored in the R2 microarray analysis and visualization platform (AMC and Essen patients) or at the BIT-neuroblastoma Biobank of the Gaslini Institute. The investigators who deposited data in the R2 repository agreed to their use for this work. In addition, we utilized data available in the public Gene Expression Omnibus database under accession number GSE16237 for the Hiroshima patients [63]. Informed consent was obtained in accordance with the institutional policies in use in each country. In every dataset, median follow-up was longer than 5 years; tumor stage was defined as stage 1, 2, 3, 4, or 4s according to the International Neuroblastoma Staging System (INSS); MYCN status was recorded as normal or amplified; and two age groups were considered, namely age at diagnosis less than 12 months and greater than or equal to 12 months. Good and poor outcome were defined as the patient being alive or dead, respectively, 5 years after diagnosis. The characteristics of the patients are shown in Table 1.

Batch effect measure and removal

The principal variance component analysis (PVCA) approach [41] was used to estimate the variability attributable to experimental effects, including batch. The pvca package implemented in R was utilized to perform the analysis with a pre-defined threshold of 60%. The analysis included the Age at diagnosis, MYCN amplification, INSS stage, Outcome and Institute variables. The estimation of experimental effects was performed before and after the batch effect removal.

The frozen surrogate variable analysis (FSVA) implemented in the sva package [64] was utilized for removing the batch effect from the training and the test sets. The parametric prior method and the Institute batch variable were set up for the analysis.

Gene expression analysis

Gene expression profiles for the 182 tumors were obtained by microarray experiments using the Affymetrix GeneChip HG-U133plus2.0, and the data were processed with the MAS5.0 software according to the Affymetrix guidelines.

Preprocessing step

To describe the procedure adopted to discretize the values assumed by the probe sets, some basic notation must be introduced. In a classification problem, d-dimensional examples $x \in X \subset ℜ^{d}$ are to be assigned to one of q possible classes, labeled by the values of a categorical output y. Starting from a training set S including n pairs (x_i, y_i), i = 1, ..., n, deriving from previous observations, techniques for solving classification problems aim to generate a model g(x), called a classifier, that provides the correct answer y = g(x) for most input patterns x. Concerning the components x_j, two different situations can arise:

1. ordered variables: x_j varies within an interval [a,b] of the real axis and an ordering relationship exists among its values.
2. nominal (categorical) variables: x_j can assume only the values contained in a finite set and there is no ordering relationship among them.

A discretization algorithm has the aim of deriving for each ordered variable x_j a (possibly empty) set of cutoffs γ_jk, with k = 1, ..., t_j, such that for every pair x_u, x_v of input vectors in the training set belonging to different classes (y_u ≠ y_v) their discretized counterparts z_u, z_v have at least one different component.

Denote with ρ_j the vector which contains all the α_j distinct values for the input variable x_j in the training set, arranged in ascending order, i.e. ρ_jl < ρ_j,l+1 for each l = 1, ..., α_j − 1. Then, we can consider a set of binary values τ_jl, with j = 1, ..., d and l = 1, ..., α_j − 1, asserting if a separation must be set for the j-th variable between its l-th and (l+1)-th values:

τ_{jl} = \begin{cases} 1, & \text{if } γ_{j} \text{ contains } ρ_{jl} \\ 0, & \text{otherwise} \end{cases}

The total number of cutoffs introduced is then given by

\sum_{j = 1}^{d} \sum_{l = 1}^{α_{j} - 1} τ_{j l}

which must be minimized under the constraint that examples x_u and x_v belonging to different classes have to be separated by at least one cutoff. To this aim, let X_juv be the set of indexes l such that ρ_jl lies between x_uj and x_vj:

X_{juv} = \{ l \mid x_{uj} < ρ_{jl} < x_{vj} \}

Then, the discretization problem can be stated as:

\min_{τ} \sum_{j = 1}^{d} \sum_{l = 1}^{α_{j} - 1} τ_{jl}

\text{subject to } \sum_{j = 1}^{d} \sum_{l \in X_{juv}} τ_{jl} \geq 1 \text{ for each } u, v \text{ such that } y_{u} \neq y_{v} \qquad (1)

To improve the separation ability of the resulting set of cutoffs the constraint in (1) can be reinforced by imposing that

\sum_{j = 1}^{d} \sum_{l \in X_{juv}} τ_{jl} \geq s

for some s ≥ 1. Intensive trials on real-world datasets have shown that a good value for s is s = 0.2d; this choice has been adopted in all the analyses performed in the present paper.
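As a minimal sketch of the constraint in (1) and of its reinforced version, the Python functions below count, for every pair of training examples with different classes, how many selected cutoffs separate them and check that the count is at least s. The function names, the toy data and the representation of a cutoff as a (variable index, threshold) pair are illustrative assumptions and are not taken from Rulex.

```python
from itertools import combinations

def separating_cutoffs(x_u, x_v, cutoffs):
    """Count cutoffs (j, threshold) lying strictly between x_u[j] and x_v[j]."""
    count = 0
    for j, threshold in cutoffs:
        low, high = sorted((x_u[j], x_v[j]))
        if low < threshold < high:
            count += 1
    return count

def is_feasible(X, y, cutoffs, s=1):
    """True if every pair of examples with different classes
    is separated by at least s of the selected cutoffs."""
    for u, v in combinations(range(len(X)), 2):
        if y[u] != y[v] and separating_cutoffs(X[u], X[v], cutoffs) < s:
            return False
    return True

# Toy data (hypothetical): 4 examples, 2 ordered variables, 2 classes.
X = [[1.0, 5.0], [2.0, 6.0], [3.0, 1.0], [4.0, 2.0]]
y = [0, 0, 1, 1]
cutoffs = [(0, 2.5), (1, 3.5)]

print(is_feasible(X, y, cutoffs, s=1))  # constraint in (1)
print(is_feasible(X, y, cutoffs, s=2))  # reinforced constraint with s = 2
```

Raising s demands redundancy: each pair of differently labeled examples must remain separated even if some cutoffs are later dropped or shifted.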

Since solving the programming problem in (1) can require an excessive computational cost, the Attribute Driven Incremental Discretization (ADID) procedure [39] adopts a near-optimal greedy approach. At each iteration, it adds the cutoff that obtains the highest value of a quality measure based on the capability of separating patterns belonging to different classes. Smart updating procedures enable ADID to efficiently attain a (sub)optimal discretization.

After the set of candidate cutoffs is produced, a subsequent phase refines their position. This updating task significantly increases the robustness of the final discretization.
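The greedy idea can be illustrated with a short Python sketch. It is not the ADID algorithm itself: the quality measure, the updating procedures and the refinement phase of ADID are not specified in this section, so the sketch substitutes a plain set-cover style rule (pick the cutoff that separates the largest number of still unseparated pairs) and places candidate cutoffs at midpoints between consecutive distinct values. All names are hypothetical.

```python
from itertools import combinations

def candidate_cutoffs(X):
    """Midpoints between consecutive distinct values of each variable."""
    candidates = []
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        candidates += [(j, (a + b) / 2) for a, b in zip(values, values[1:])]
    return candidates

def separates(cut, x_u, x_v):
    """True if the cutoff lies strictly between the two values of its variable."""
    j, threshold = cut
    return min(x_u[j], x_v[j]) < threshold < max(x_u[j], x_v[j])

def greedy_discretization(X, y):
    """Add, one at a time, the cutoff separating the most unseparated pairs."""
    pairs = {(u, v) for u, v in combinations(range(len(X)), 2) if y[u] != y[v]}
    candidates = candidate_cutoffs(X)
    chosen = []
    while pairs and candidates:
        def gain(cut):
            return sum(1 for u, v in pairs if separates(cut, X[u], X[v]))
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # remaining pairs have identical inputs and cannot be separated
        pairs = {(u, v) for u, v in pairs if not separates(best, X[u], X[v])}
        chosen.append(best)
    return chosen

# Toy data (hypothetical): one cutoff on the first variable is enough.
X = [[1.0, 5.0], [2.0, 6.0], [3.0, 1.0], [4.0, 2.0]]
y = [0, 0, 1, 1]
print(greedy_discretization(X, y))
```

Variables that never receive a cutoff are effectively discarded, which is how a discretizer of this kind can also act as a feature selector.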

Classification by ADID and LLM implemented in Rulex 2.0

A classification model was built on the expression values of the 62 probe sets constituting the NB-hypo signature [1]. The model was generated and its performance assessed by splitting the dataset into a training set, comprising 60% of the whole patient cohort, and a test set comprising the remaining 40%. To build the classifier, a Rulex 2.0 process was designed. The process used a discretizer component that adopts the Attribute Driven Incremental Discretization (ADID) procedure [39] and a classification component that adopts a rule generation method called Logic Learning Machine (LLM). Entropy based (EntMDL [42]), Modified Chi Square [43], ROC based (highest Youden index [44]), and Equal frequency (i.e. median expression for each feature) components were executed as alternative discretization methods. To design the most accurate classifier, one important parameter of the LLM component was evaluated: the maximum error allowed on the training set, which defines the maximum percentage of examples covered by a rule whose class differs from the class of the rule. The parameter values evaluated were 0%, 5%, 10%, 15%, 20%, 25%, and 30%. For each parameter value, a 10 times repeated 10-fold cross validation analysis was performed and the classification performances were collected. The parameter value that obtained the best mean classification accuracy was selected to train the final gene expression based classifier on the whole training set utilizing the aforementioned Rulex components. The Rulex software suite is commercialized by Impara srl [40].

Decision Tree [45], Support Vector Machines (SVM) [46], and Prediction Analysis of Microarrays (PAM) [47] were run on the same training and test sets for reference.
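The selection of the maximum error parameter by 10 times repeated 10-fold cross validation can be sketched in Python as follows. Rulex is a commercial suite and its interface is not described here, so the classifier is abstracted as a hypothetical callback `fit_predict(X_train, y_train, X_test, max_error)`; the plain (non-stratified) fold construction, the seed and the stub classifier are likewise assumptions made only for illustration.

```python
import random

def repeated_kfold(n, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) for `repeats` shuffled k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, folds[f]

def select_max_error(X, y, fit_predict, grid=(0, 5, 10, 15, 20, 25, 30)):
    """Return the grid value with the best mean cross-validated accuracy."""
    scores = {}
    for max_error in grid:
        accuracies = []
        for train, test in repeated_kfold(len(X)):
            preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in test], max_error)
            correct = sum(1 for p, i in zip(preds, test) if p == y[i])
            accuracies.append(correct / len(test))
        scores[max_error] = sum(accuracies) / len(accuracies)
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical stub standing in for the real classifier:
# it predicts correctly only when max_error is 10.
def stub_fit_predict(X_train, y_train, X_test, max_error):
    return [row[0] if max_error == 10 else 1 - row[0] for row in X_test]

X = [[i % 2] for i in range(20)]
y = [i % 2 for i in range(20)]
best, scores = select_max_error(X, y, stub_fit_predict)
print(best)
```

Only the training set enters this loop; the test set is kept aside until the final classifier has been trained with the selected value.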

Performance evaluation in predicting patients' outcome

To evaluate the prediction performance of the classifiers we used the following metrics: accuracy, recall, precision, specificity and negative predictive value (NPV), considering good outcome patients as positive instances and poor outcome patients as negative instances. Accuracy is the proportion of correctly predicted examples in the overall number of instances. Recall is the proportion of correctly predicted positive examples against all the positive instances of the dataset. Precision is the proportion of correctly classified positive examples against all the predicted positive instances. Specificity is the proportion of correctly predicted negative examples against all the negative instances of the dataset. NPV is the proportion of correctly classified negative examples against all the predicted negative instances.
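The five metrics follow directly from the confusion matrix. The Python sketch below computes them with good outcome as the positive class, as stated above; the function names, the label strings and the toy predictions are illustrative assumptions.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(y_true, y_pred, positive="good"):
    """Accuracy, recall, precision, specificity and NPV for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
    }

# Toy example (hypothetical): 6 good and 4 poor outcome patients.
y_true = ["good"] * 6 + ["poor"] * 4
y_pred = ["good"] * 5 + ["poor"] + ["poor"] * 3 + ["good"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With an over-represented good outcome class, accuracy and recall can look favorable while specificity and NPV, which describe the poor outcome patients, lag behind; reporting all five makes that visible.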

Rule quality measures

Rule generation methods constitute a subset of classification techniques that generate explicit models g(x) described by a set of m rules r_k, k = 1, ..., m, in the if-then form:

if <premise> then <consequence>

where <premise> is the logical product (and) of m_k conditions c_kl, with l = 1, ..., m_k, on the components x_j, whereas <consequence> gives a class assignment $y = ỹ$ for the output. In general, a condition c_kl in the premise involving an ordered variable x_j has one of the forms x_j > λ, x_j ≤ μ, or λ < x_j ≤ μ, where λ and μ are two real values, whereas a nominal variable x_k leads to membership conditions $x_{k} \in \{α, δ, σ\}$, where α, δ, σ are admissible values for the k-th component of x.

For instance, if x₁ is an ordered variable in the domain {1, ..., 100} and x₂ is a nominal component assuming values in the set {red, green, blue}, a possible rule r₁ is

\text{if } x_{1} > 40 \text{ and } x_{2} \in \{red, blue\} \text{ then } y = 0

where 0 denotes one of the q possible assignments (classes).
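A minimal Python sketch of how such a rule can be represented and evaluated is given below. The dictionary layout, the attribute keys `x1` and `x2` and the helper `satisfies` are illustrative assumptions and do not reflect the internal representation used by Rulex.

```python
# Rule r1: if x1 > 40 and x2 in {red, blue} then y = 0
rule_r1 = {
    "premise": [
        lambda x: x["x1"] > 40,                 # condition on an ordered variable
        lambda x: x["x2"] in {"red", "blue"},   # membership condition on a nominal variable
    ],
    "consequence": 0,
}

def satisfies(rule, x):
    """True if the instance x meets every condition in the premise (logical and)."""
    return all(condition(x) for condition in rule["premise"])

print(satisfies(rule_r1, {"x1": 55, "x2": "red"}))    # both conditions hold
print(satisfies(rule_r1, {"x1": 55, "x2": "green"}))  # membership condition fails
```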

According to the output value included in their consequence part, the m rules r_k describing a given model g(x) can be subdivided into q groups G₁, G₂, ..., G_q. Considering the training set S, any rule r ∈ G_l is characterized by four quantities: the numbers TP(r) and FP(r) of examples (x_i, y_i) with y_i = y_l and y_i ≠ y_l, respectively, that satisfy all the conditions in the premise of r, and the numbers FN(r) and TN(r) of examples (x_i, y_i) with y_i = y_l and y_i ≠ y_l, respectively, that do not satisfy at least one of the conditions in the premise of r.

The quality of a rule was measured utilizing the following quantities. Given a rule r, we define the covering C(r), the error E(r), and the precision P(r) according to the following formulas:

C(r) = \frac{TP(r)}{TP(r) + FN(r)}, \quad E(r) = \frac{FP(r)}{FP(r) + TN(r)}, \quad P(r) = \frac{TP(r)}{TP(r) + FP(r)}

The covering of a rule is the fraction of training examples belonging to the target class that satisfy the rule. The error of a rule is the fraction of training examples not belonging to the target class that nevertheless satisfy the rule. The precision of a rule is the fraction of training examples satisfying the premise of the rule that belong to the target class. The greater the covering and the precision, the higher the generality and the correctness of the corresponding rule.
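The three quantities can be computed from the four counts TP(r), FP(r), FN(r) and TN(r) as in the Python sketch below. The function names and the toy data are illustrative assumptions; the sketch also assumes that each denominator is non-zero, i.e. that the rule covers at least one example and that both the target class and the other classes are present in the training set.

```python
def rule_confusion(covered, y, target):
    """Counts for a rule: covered[i] is True when example i satisfies the premise."""
    tp = sum(1 for c, label in zip(covered, y) if c and label == target)
    fp = sum(1 for c, label in zip(covered, y) if c and label != target)
    fn = sum(1 for c, label in zip(covered, y) if not c and label == target)
    tn = sum(1 for c, label in zip(covered, y) if not c and label != target)
    return tp, fp, fn, tn

def rule_quality(tp, fp, fn, tn):
    """Covering C(r), error E(r) and precision P(r) of a rule."""
    covering = tp / (tp + fn)
    error = fp / (fp + tn)
    precision = tp / (tp + fp)
    return covering, error, precision

# Toy training set (hypothetical): 4 examples of the target class 1, 4 of class 0.
covered = [True, True, True, False, False, False, True, False]
y = [1, 1, 1, 1, 0, 0, 0, 0]
counts = rule_confusion(covered, y, target=1)
print(counts)
print(rule_quality(*counts))
```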

To test the statistical significance of the rules we used Fisher's exact test (FET) as implemented in the software package R. Any rule having P < 0.05 was considered significant.

Relevance measure and ranking of the probe sets

To measure the importance of the features included in the rules and to rank the features accordingly, we utilized a measure called the relevance R(c) of a condition c. Consider the rule r' obtained by removing the condition c from a rule r. Since the premise of r' is less stringent, E(r') ≥ E(r), so that the quantity R(c) = (E(r') − E(r))C(r) can be used as a measure of relevance for the condition c of interest.

Since each condition c refers to a specific component of x, we define the relevance R_v(x_j) for every input variable x_j as follows:

R_{v} (x_{j}) = 1 - \prod_{k} (1 - R (c_{k l}))

where the product is computed over the rules r_k that include a condition c_kl on the variable x_j.
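Both relevance measures are simple enough to sketch in Python. The function names and the numerical values below are hypothetical and serve only to show how the formulas combine.

```python
from math import prod

def condition_relevance(error_without, error_with, covering):
    """R(c) = (E(r') - E(r)) * C(r), where r' is the rule r without condition c."""
    return (error_without - error_with) * covering

def variable_relevance(condition_relevances):
    """R_v(x_j) = 1 - prod_k (1 - R(c_kl)) over the conditions involving x_j."""
    return 1 - prod(1 - r for r in condition_relevances)

# Hypothetical rule: removing the condition raises the error from 0.10 to 0.30,
# and the rule has covering 0.5.
r_c = condition_relevance(error_without=0.30, error_with=0.10, covering=0.5)
print(r_c)

# Hypothetical variable appearing in two rules with relevances 0.1 and 0.2.
print(variable_relevance([0.1, 0.2]))
```

The product form keeps R_v(x_j) between 0 and 1 when each R(c) lies in that range, and a variable that appears in no rule gets relevance 0.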

Denote with V_kl the attribute involved in the condition c_kl of the rule r_k and with S_kl the subset of values of V_kl for which the condition c_kl is verified. If V_kl is an ordered attribute and the condition c_kl is V_kl ≤ a for some value a∈S_kl, then the contribution to R_v(x_j) is negative. Hence, by adding the superscript − (resp. +) to denote the attribute V_kl with negative (resp. positive) contribution, we can write R_v(x_j) for an ordered input variable x_j in the following way:

R_{v} (x_{j}) = \prod_{V_{kl}^{-}} (1 - R (c_{kl})) - \prod_{V_{kl}^{+}} (1 - R (c_{kl}))

where the first (resp. second) product is computed over the rules r_k that include a condition c_kl leading to a negative (resp. positive) contribution for the variable x_j.
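The sign convention can be checked with a small Python sketch: the product over conditions with a negative contribution comes first and the product over conditions with a positive contribution is subtracted from it, an empty product being equal to 1. The function name and the values are hypothetical.

```python
from math import prod

def signed_variable_relevance(negative, positive):
    """Signed R_v(x_j) for an ordered variable.

    negative: relevances R(c) of conditions with a negative contribution
    positive: relevances R(c) of conditions with a positive contribution
    """
    return prod(1 - r for r in negative) - prod(1 - r for r in positive)

# Only positive contributions: same value as the unsigned relevance.
print(signed_variable_relevance(negative=[], positive=[0.1, 0.2]))
# Only negative contributions: same magnitude, opposite sign.
print(signed_variable_relevance(negative=[0.1, 0.2], positive=[]))
```

With this convention the sign indicates whether the conditions on the variable push, on balance, in the positive or in the negative direction.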

Output assignment for a new instance

When the model g(x) described by the set of m rules r_k, k = 1, ..., m, is employed to classify a new instance x, the <premise> part of each rule is examined to verify if the components of x satisfy the conditions included in it. Denote with Q the subset of rules whose <premise> part is satisfied by x; then, the following three different situations can occur:

1. The set Q includes only rules having the same output value ỹ in their <consequence> part; in this case the class ỹ is assigned to the instance x.
2. The set Q contains rules having different output values in their <consequence> part; it follows that Q can be partitioned into q disjoint subsets Q_i (some of which can be empty) including the rules r pertaining to the i-th class. In this case, every attribute x_j can be assigned a measure of consistency t_ij given by the maximum of the relevance R(c) for the conditions c involving the attribute x_j and included in the <premise> part of the rules in Q_i. Then, the instance x is assigned the class $ỹ$ associated with the following maximum:
$ỹ = \underset{i = 1, \dots, q}{argmax} \sum_{j = 1}^{d} t_{i j}$
3. The set Q is empty, i.e. no rule is satisfied by the instance x; in this case the set Q₋₁ containing the subset of rules whose <premise> part is satisfied by x except for one condition is considered and points 1 and 2 are again tested with Q = Q₋₁. If Q is again empty, the set Q₋₂ containing the subset of rules whose <premise> part is satisfied by x except for two conditions is considered, and so on.

The conflicting case 2 can be controlled in Rulex 2.0 by assigning a set of weights w_i to the output classes; in this way the assignment rule of case 2 can be written as

ỹ = \underset{i = 1, \dots, q}{argmax} \sum_{j = 1}^{d} t_{i j} w_{i}

and we can speak of weighted classification.
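The three cases and the effect of the weights can be sketched in Python as follows. This is an illustration under stated assumptions, not the Rulex implementation: rules are stored as dictionaries whose conditions are (attribute, test, relevance) triples, the probe set names `g1` and `g2`, the thresholds and the relevance values are invented, ties are broken by the order of `classes`, and when rules are relaxed (sets Q₋₁, Q₋₂, ...) all conditions of a relaxed rule contribute to the consistency measure.

```python
def failed_conditions(rule, x):
    """Number of conditions in the premise that x does not satisfy."""
    return sum(1 for attr, test, _ in rule["conditions"] if not test(x[attr]))

def classify(x, rules, classes, weights=None):
    """Assign a class to x following cases 1-3, with optional class weights."""
    weights = weights or {c: 1.0 for c in classes}
    longest = max(len(r["conditions"]) for r in rules)
    for k in range(longest + 1):      # k = 0 gives Q, k = 1 gives Q_-1, and so on
        Q = [r for r in rules if failed_conditions(r, x) == k]
        if Q:
            break
    outputs = {r["class"] for r in Q}
    if len(outputs) == 1:             # case 1: all rules in Q agree
        return outputs.pop()
    scores = {}
    for c in classes:                 # case 2: conflicting rules
        t = {}                        # t[attr] = max relevance over rules of class c
        for r in Q:
            if r["class"] == c:
                for attr, _, relevance in r["conditions"]:
                    t[attr] = max(t.get(attr, 0.0), relevance)
        scores[c] = weights[c] * sum(t.values())
    return max(classes, key=lambda c: scores[c])

# Hypothetical rules on two probe sets g1 and g2.
rules = [
    {"class": "good", "conditions": [("g1", lambda v: v <= 5.0, 0.30)]},
    {"class": "poor", "conditions": [("g2", lambda v: v > 7.0, 0.20)]},
    {"class": "poor", "conditions": [("g1", lambda v: v > 5.0, 0.25),
                                     ("g2", lambda v: v > 7.0, 0.10)]},
]
classes = ["good", "poor"]
patient = {"g1": 4.0, "g2": 8.0}      # satisfies one good and one poor rule

print(classify(patient, rules, classes))
print(classify(patient, rules, classes, weights={"good": 1.0, "poor": 2.0}))
```

In this toy example the conflicting patient is assigned to the good outcome class without weights and to the poor outcome class once poor outcome is weighted twice as much, which is the steering effect described for weighted classification. Patients satisfying only concordant rules are unaffected by the weights.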

Abbreviations

INSS: International Neuroblastoma Staging System
FET: Fisher's exact test
NPV: negative predictive value
INRG: International Neuroblastoma Risk Group
LLM: Logic Learning Machine
SNN: Switching Neural Networks
SC: Shadow Clustering
TP: true positives
FP: false positives
TN: true negatives
FN: false negatives
NB: neuroblastoma
ADID: Attribute Driven Incremental Discretization
WCS: weighted classification system
PVCA: principal variance component analysis
SVA: surrogate variable analysis
FSVA: frozen surrogate variable analysis

References

Fardin P, Barla A, Mosci S, Rosasco L, Verri A, Versteeg R, Caron HN, Molenaar JJ, Ora I, Eva A: A biology-driven approach identifies the hypoxia gene signature as a predictor of the outcome of neuroblastoma patients. Molecular Cancer. 2010, 9: 185-10.1186/1476-4598-9-185.
Thiele CJ: Neuroblastoma. Human Cell Culture. Edited by: Master JRW, Palsson B. 1999, London: Kluwer Academic, 21-22.
Haupt R, Garaventa A, Gambini C, Parodi S, Cangemi G, Casale F, Viscardi E, Bianchi M, Prete A, Jenkner A: Improved survival of children with neuroblastoma between 1979 and 2005: a report of the Italian Neuroblastoma Registry. J Clin Oncol. 2010, 28: 2331-2338. 10.1200/JCO.2009.24.8351.
Doroshow JH: Selecting systemic cancer therapy one patient at a time: is there a role for molecular profiling of individual patients with advanced solid tumors?. J Clin Oncol. 2010, 28: 4869-4871. 10.1200/JCO.2010.31.1472.
Wei J, Greer B, Westermann F, Steinberg S, Son C, Chen Q, Whiteford C, Bilke S, Krasnoselsky A, Cenacchi N: Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma. Cancer Res. 2004, 64: 6883-6891. 10.1158/0008-5472.CAN-04-0695.
Schramm A, Schulte JH, Klein-Hitpass L, Havers W, Sieverts H, Berwanger B, Christiansen H, Warnat P, Brors B, Eils J: Prediction of clinical outcome and biological characterization of neuroblastoma by expression profiling. Oncogene. 2005, 24: 7902-7912. 10.1038/sj.onc.1208936.
Ohira M, Oba S, Nakamura Y, Isogai E, Kaneko S, Nakagawa A, Hirata T, Kubo H, Goto T, Yamada S: Expression profiling using a tumor-specific cDNA microarray predicts the prognosis of intermediate risk neuroblastomas. Cancer Cell. 2005, 7: 337-350. 10.1016/j.ccr.2005.03.019.
Oberthuer A, Berthold F, Warnat P, Hero B, Kahlert Y, Spitz R, Ernestus K, Konig R, Haas S, Eils R: Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. J Clin Oncol. 2006, 24: 5070-5078. 10.1200/JCO.2006.06.1879.
Fischer M, Oberthuer A, Brors B, Kahlert Y, Skowron M, Voth H, Warnat P, Ernestus K, Hero B, Berthold F: Differential expression of neuronal genes defines subtypes of disseminated neuroblastoma with favorable and unfavorable outcome. Clin Cancer Res. 2006, 12: 5118-5128. 10.1158/1078-0432.CCR-06-0985.
Vermeulen J, De Preter K, Naranjo A, Vercruysse L, Van Roy N, Hellemans J, Swerts K, Bravo S, Scaruffi P, Tonini GP: Predicting outcomes for children with neuroblastoma using a multigene-expression signature: a retrospective SIOPEN/COG/GPOH study. Lancet Oncol. 2009, 10: 663-671. 10.1016/S1470-2045(09)70154-8.
De Preter K, Vermeulen J, Brors B, Delattre O, Eggert A, Fischer M, Janoueix-Lerosey I, Lavarino C, Maris JM, Mora J: Accurate Outcome Prediction in Neuroblastoma across Independent Data Sets Using a Multigene Signature. Clin Cancer Res. 2010, 16: 1532-1541. 10.1158/1078-0432.CCR-09-2607.
Oberthuer A, Hero B, Berthold F, Juraeva D, Faldum A, Kahlert Y, Asgharzadeh S, Seeger R, Scaruffi P, Tonini GP: Prognostic Impact of Gene Expression-Based Classification for Neuroblastoma. J Clin Oncol. 2010, 28: 3506-3515. 10.1200/JCO.2009.27.3367.
Cornero A, Acquaviva M, Fardin P, Versteeg R, Schramm A, Eva A, Bosco MC, Blengio F, Barzaghi S, Varesio L: Design of a multi-signature ensemble classifier predicting neuroblastoma patients' outcome. BMC Bioinformatics. 2012, 13 (Suppl 4): S13-10.1186/1471-2105-13-S4-S13.
Fardin P, Barla A, Mosci S, Rosasco L, Verri A, Varesio L: The l1-l2 regularization framework unmasks the hypoxia signature hidden in the transcriptome of a set of heterogeneous neuroblastoma cell lines. BMC Genomics. 2009, 10: 474-10.1186/1471-2164-10-474.
Cangelosi D, Blengio F, Versteeg R, Eggert A, Garaventa A, Gambini C, Conte M, Eva A, Muselli M, Varesio L: Logic Learning Machine creates explicit and stable rules stratifying neuroblastoma patients. BMC Bioinformatics. 2013, 14 (Suppl 7): S12-10.1186/1471-2105-14-S7-S12.
Pietras A, Johnsson AS, Pahlman S: The HIF-2alpha-driven pseudo-hypoxic phenotype in tumor aggressiveness, differentiation, and vascularization. Curr Top Microbiol Immunol. 2010, 345: 1-20.
Semenza GL: Regulation of cancer cell metabolism by hypoxia-inducible factor 1. Semin Cancer Biol. 2009, 19: 12-16. 10.1016/j.semcancer.2008.11.009.
Carmeliet P, Dor Y, Herbert JM, Fukumura D, Brusselmans K, Dewerchin M, Neeman M, Bono F, Abramovitch R, Maxwell P: Role of HIF-1alpha in hypoxia-mediated apoptosis, cell proliferation and tumour angiogenesis. Nature. 1998, 394: 485-490. 10.1038/28867.
Lin Q, Yun Z: Impact of the hypoxic tumor microenvironment on the regulation of cancer stem cell characteristics. Cancer Biol Ther. 2010, 9: 949-956. 10.4161/cbt.9.12.12347.
Lu X, Kang Y: Hypoxia and hypoxia-inducible factors: master regulators of metastasis. Clin Cancer Res. 2010, 16: 5928-5935. 10.1158/1078-0432.CCR-10-1360.
Chan DA, Giaccia AJ: Hypoxia, gene expression, and metastasis. Cancer Metastasis Rev. 2007, 26: 333-339. 10.1007/s10555-007-9063-1.
Harris AL: Hypoxia--a key regulatory factor in tumour growth. Nat Rev Cancer. 2002, 2: 38-47. 10.1038/nrc704.
Rankin EB, Giaccia AJ: The role of hypoxia-inducible factors in tumorigenesis. Cell Death Differ. 2008, 15: 678-685. 10.1038/cdd.2008.21.
Fardin P, Cornero A, Barla A, Mosci S, Acquaviva M, Rosasco L, Gambini C, Verri A, Varesio L: Identification of Multiple Hypoxia Signatures in Neuroblastoma Cell Lines by l(1)-l(2) Regularization and Data Reduction. Journal of Biomedicine and Biotechnology. 2010
Kotsiantis SB, Zaharakis ID, Pintelas PE: Machine learning: a review of classification and combining techniques. Artif Intell Rev. 2006, 26: 159-190. 10.1007/s10462-007-9052-3.
Tan AC, Naiman DQ, Xu LF, Winslow RL, Geman D: Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics. 2005, 21: 3896-3904. 10.1093/bioinformatics/bti631.
Fürnkranz J: Separate-and-conquer rule learning. Artificial Intelligence Review. 1999, 13: 3-54. 10.1023/A:1006524209794.
Muselli M, Ferrari E: Coupling Logical Analysis of Data and Shadow Clustering for Partially Defined Positive Boolean Function Reconstruction. IEEE Transactions on Knowledge and Data Engineering. 2011, 23: 37-50.
Muselli M, Liberati D: Binary rule generation via Hamming Clustering. IEEE Transactions on Knowledge and Data Engineering. 2002, 14: 1258-1268. 10.1109/TKDE.2002.1047766.
Boros E, Hammer P, Ibaraki T, Kogan A, Muchnik I: An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering. 2000, 12: 292-306.
Muselli M, Costacurta M, Ruffino F: Evaluating switching neural networks through artificial and real gene expression data. Artif Intell Med. 2009, 45: 163-171. 10.1016/j.artmed.2008.08.002.
Mangerini R, Romano P, Facchiano A, Damonte G, Muselli M, Rocco M, Boccardo F, Profumo A: The application of atmospheric pressure matrix-assisted laser desorption/ionization to the analysis of long-term cryopreserved serum peptidome. Anal Biochem. 2011, 417: 174-181. 10.1016/j.ab.2011.06.021.
Muselli M: Switching Neural Networks: A New Connectionist Model for Classification. WIRN/NAIS 2005 Volume 3931. Edited by: Apolloni B, Marinaro M, Nicosia G, Tagliaferri R. 2006, Berlin: Springer-Verlag, 23-30.
Rocco CM, Muselli M: Approximate multi-state reliability expressions using a new machine learning technique. Reliability Engineering and System Safety. 2005, 89: 261-270. 10.1016/j.ress.2004.08.023.
Zambrano O, Rocco CM, Muselli M: Estimating female labor force participation through statistical and machine learning methods: A comparison. Computational Intelligence in Economics and Finance Volume 2. Edited by: Shu-Heng C, Paul P W, Tzu-Wen K. 2007, Berlin: Springer-Verlag, 93-106.
Rocco CM, Muselli M: Machine learning models for bulk electric system well-being assessment. 12th Conference of the Spanish Association for Artificial Intelligence. 2007, CAEPIA
Paoli G, Muselli M, Bellazzi R, Corvo R, Liberati D, Foppiano F: Hamming clustering techniques for the identification of prognostic indices in patients with advanced head and neck cancer treated with radiation therapy. Med Biol Eng Comput. 2000, 38: 483-486. 10.1007/BF02345741.
Ferro P, Forlani A, Muselli M, Pfeffer U: Alternative splicing of the human estrogen receptor alpha primary transcript: mechanisms of exon skipping. Int J Mol Med. 2003, 12: 355-363.
Ferrari E, Muselli M: Maximizing pattern separation in discretizing continuous features for classification purposes. The 2010 International Joint Conference on Neural Networks (IJCNN). 2010, 2010: 1-8. 18 July
Rulex software suite. [http://www.impara-ai.com]
Boedigheimer MJ, Wolfinger RD, Bass MB, Bushel PR, Chou JW, Cooper MF: Sources of variation in baseline gene expression levels from toxicogenomics study control animals across multiple laboratories. BMC Genomics. 2008, 9: 285-10.1186/1471-2164-9-285.
Kohavi R, Sahami M: Error-Based and Entropy-Based Discretization of Continuous Features. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. 1996, AAAI Press, 114-119.
Tay FEH, Shen L: A Modified Chi2 Algorithm for Discretization. IEEE Transactions on Knowledge and Data Engineering. 2002, 14: 666-670. 10.1109/TKDE.2002.1000349.
Perkins NJ, Schisterman EF: The inconsistency of "optimal" cutpoints obtained using two criteria based on the receiver operating characteristic curve. Am J Epidemiol. 2006, 163: 670-675. 10.1093/aje/kwj063.
Quinlan JR: C4.5: programs for machine learning. 1993, Morgan Kaufmann Publishers Inc
Chang C, Lin C: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2: 1-27.
Tibshirani RF, Hastie TF, Narasimhan BF, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA. 2002, 99: 6567-6572. 10.1073/pnas.082099299.
Blagus RF, Lusa L: Class prediction for high-dimensional class-imbalanced data. BMC Bioinformatics. 2010, 11: 523-10.1186/1471-2105-11-523.
Takemura AF, Shimizu AF, Hamamoto K: A cost-sensitive extension of AdaBoost with markov random field priors for automated segmentation of breast tumors in ultrasonic images. Int J Comput Assist Radiol Surg. 2010, 5: 537-547. 10.1007/s11548-010-0411-1.
Vidrighin CF, Potolea R: ProICET: a cost-sensitive system for prostate cancer data. Health informatics journal. 2008, 14: 297-307. 10.1177/1460458208096558.
Sun TF, Zhang RF, Wang JF, Li XF, Guo X: Computer-aided diagnosis for early-stage lung cancer based on longitudinal and balanced data. PLoS ONE. 2013, 8: e63559-10.1371/journal.pone.0063559.
Doyle S, Monaco JF, Feldman MF, Tomaszewski JF, Madabhushi A: An active learning based classification strategy for the minority class problem: application to histopathology annotation. BMC Bioinformatics. 2011, 12: 424-10.1186/1471-2105-12-424.
Teramoto R: Balanced gradient boosting from imbalanced data for clinical outcome prediction. Stat Appl Genet Mol Biol. 2009, 8: 1544-6115. (Electronic)
Maxwell P, Pugh C, Ratcliffe P: Activation of the HIF pathway in cancer. Current Opinion in Genetics & Development. 2001, 11: 293-299. 10.1016/S0959-437X(00)00193-3.
Oberthuer A, Warnat P, Kahlert Y, Westermann F, Spitz R, Brors B, Hero B, Eils R, Schwab M, Berthold F: Classification of neuroblastoma patients by published gene-expression markers reveals a low sensitivity for unfavorable courses of MYCN non-amplified disease. Cancer Letters. 2007, 250: 250-267. 10.1016/j.canlet.2006.10.016.
Schramm A, Mierswa I, Kaderali L, Morik K, Eggert A, Schulte JH: Reanalysis of neuroblastoma expression profiling data using improved methodology and extended follow-up increases validity of outcome prediction. Cancer Lett. 2009, 282: 55-62. 10.1016/j.canlet.2009.02.052.
Mccall MN, Almudevar A: Affymetrix GeneChip microarray preprocessing for multivariate analyses. Briefings in Bioinformatics. 2012, 13: 536-546. 10.1093/bib/bbr072.
Upton GJG, Harrison AP: Motif effects in Affymetrix GeneChips seriously affect probe intensities. Nucleic Acids Res. 2012, 40: 9705-9716. 10.1093/nar/gks717.
Brown JM, Wilson WR: Exploiting tumour hypoxia in cancer treatment. Nat Rev Cancer. 2004, 4: 437-447. 10.1038/nrc1367.
Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007, 8: 118-127. 10.1093/biostatistics/kxj037.
Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J. 2010, 10: 278-291. 10.1038/tpj.2010.57.
Huang S, Laoukili J, Epping MT, Koster J, Holzel M, Westerman BA, Nijkamp W, Hata A, Asgharzadeh S, Seeger RC: ZNF423 is critically required for retinoic acid-induced differentiation and is a marker of neuroblastoma outcome. Cancer Cell. 2009, 15: 328-340. 10.1016/j.ccr.2009.02.023.
Ohtaki M, Otani K, Hiyama K, Kamei N, Satoh K, Hiyama E: A robust method for estimating gene expression states using Affymetrix microarray probe level data. BMC Bioinformatics. 2010, 11: 183-10.1186/1471-2105-11-183.
Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD: The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012, 28: 882-883. 10.1093/bioinformatics/bts034.


Acknowledgements

The work was supported by the Fondazione Italiana per la Lotta al Neuroblastoma, the Associazione Italiana per la Ricerca sul Cancro, the Società Italiana Glicogenosi, the Fondazione Umberto Veronesi, the Ministero della Salute Italiano and the Italian Flagship Project "InterOmics". The authors would like to thank the Italian Association of Pediatric Hematology/Oncology (AIEOP) for tumor sample collection and Dr. Erika Montani for her valuable support with the statistical and graphical routines of Rulex 2.0. DC and FB hold fellowships from the Fondazione Italiana per la Lotta al Neuroblastoma.

Declarations

The publication charge for this article was paid by a grant from the Fondazione Italiana per la Lotta al Neuroblastoma.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 5, 2014: Italian Society of Bioinformatics (BITS): Annual Meeting 2013. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S5

Author information

Authors and Affiliations

Laboratory of Molecular Biology, Gaslini Institute, Largo Gaslini 5, 16147, Genoa, Italy
Davide Cangelosi, Fabiola Blengio, Pamela Becherini & Luigi Varesio
Institute of Electronics, Computer and Telecommunication Engineering, National Research Council of Italy, Genoa, 16149, Italy
Marco Muselli & Stefano Parodi
Department of Human Genetics, Academic Medical Center, University of Amsterdam, Meibergdreef 15, Amsterdam, 1100, The Netherlands
Rogier Versteeg
Department of Hematology-Oncology, Gaslini Institute, Largo Gaslini 5, Genoa, 16147, Italy
Massimo Conte

Corresponding author

Correspondence to Luigi Varesio.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DC conceived the project, performed the statistical analysis and drafted the manuscript. MM suggested the use of LLM, designed some of the experiments, designed the Rulex software and helped to draft the manuscript. SP performed computer experiments and helped to draft the manuscript. RV and MC participated in the development of the project. FB and PB carried out the microarray data analysis. LV supervised the study and wrote the manuscript.

Marco Muselli and Luigi Varesio contributed equally to this work.

Electronic supplementary material

12859_2014_6372_MOESM1_ESM.pdf

Additional file 1: Batch effect and LLM prediction performance. The file contains a table (Table 1) showing the influence of batch effect on LLM prediction performance, measured by accuracy, recall, precision, specificity and NPV. Performance is comparable after removing the batch effect from the dataset. (PDF 206 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Cangelosi, D., Muselli, M., Parodi, S. et al. Use of Attribute Driven Incremental Discretization and Logic Learning Machine to build a prognostic classifier for neuroblastoma patients. BMC Bioinformatics 15 (Suppl 5), S4 (2014). https://doi.org/10.1186/1471-2105-15-S5-S4


Published: 06 May 2014
DOI: https://doi.org/10.1186/1471-2105-15-S5-S4
