NUS Graduate School for Integrative Sciences and Engineering, Singapore 117597

School of Computing, National University of Singapore, Singapore 117417

Abstract

One important application of microarrays in clinical settings is for constructing a diagnosis or prognosis model. Batch effects are a well-known obstacle in this type of application. Recently, a prominent study was published on how batch effects removal techniques could potentially improve microarray prediction performance. However, the results were not very encouraging, as prediction performance did not always improve. In fact, in up to 20% of the cases, prediction accuracy was reduced. Furthermore, it was stated in the paper that the techniques studied require sufficiently large sample sizes in both batches (train and test) to be effective, which is not a realistic situation, especially in clinical settings. In this paper, we propose a different approach, which is able to overcome limitations faced by conventional methods. Our approach uses rank values of microarray data and a bagging ensemble classifier with sequential hypothesis testing to dynamically determine the number of classifiers required in the ensemble. Using similar datasets to those in the original study, we showed that in only one case (

Introduction

Noise has a negative connotation in the classical view of biology. Therefore, one often attempts to remove "noise" from data using various statistical methods before any downstream analysis. However, there are two different types of noise in biological data, experimental noise and inherent cell variation. Distinguishing experimental noise from natural fluctuation due to inherent cell variation is a daunting task, and attempts to de-noise data often remove meaningful cell variation as well. Therefore, in this work, we take a different approach of embracing noise instead.

Inherent cell variations could arise from intrinsic and extrinsic sources.

Experimental noise in gene expression measurement data mainly comprises two forms of experimental error: measurement errors and batch effects. Measurement errors in gene expression microarrays were studied by the MicroArray Quality Control (MAQC) project, a large-scale study led by FDA scientists involving 137 participants from 51 organizations, which showed that the median coefficient of variation of replicates is between 5% and 15%.

An important application of microarrays in clinical settings is to construct a predictive model for diagnosis or prognosis purposes. To do so, we need to overcome the various types of noise mentioned above, especially batch effects.

Most batch effects removal algorithms try to accurately estimate the batch effects before removing them, which is why large sample sizes are required for each batch and a balanced class ratio is often desired. In this paper, we attack the problem from a different angle. Specifically, we propose a computational approach that increases cross-batch microarray prediction accuracy by mitigating batch effects without explicitly estimating and removing them. Our proposed approach rests on two main ideas. Firstly, it is well known that while batch effects affect the absolute gene expression values measured, they often do not affect the relative ranking of genes ordered by their expression values.

Materials and methods

Data sets

Four data sets from the MAQC project are used in this paper (Table

Data sets from MAQC project used in this work.

| **Data set code** | **Data set description** | **Training set: Number of samples** | **Positives** | **Negatives** | **Validation set: Number of samples** | **Positives** | **Negatives** |
|---|---|---|---|---|---|---|---|
| A | Lung tumorigen vs. nontumorigen (Mouse) | 70 | 26 | 44 | 88 | 28 | 60 |
| D | Breast cancer pre-operative treatment response (pathologic complete response) | 130 | 33 | 97 | 100 | 15 | 85 |
| F | Multiple myeloma overall survival milestone outcome | 340 | 51 | 289 | 214 | 27 | 187 |
| I | Same as data set F but class labels are randomly assigned | 340 | 200 | 140 | 214 | 122 | 92 |

PCA plots of data sets used

**PCA plots of data sets used**. PCA plots are typically used to visualize batch effects. These data sets are chosen from the FDA-led Microarray Quality Control (MAQC) Consortium project. See

The Hamner Institutes for Health Sciences (Research Triangle Park, NC, USA) provided data set A. The objective of the study was to apply gene expression data from the lungs of mice exposed to a 13-week treatment of chemicals to predict increased lung tumor incidence in the two-year rodent cancer bioassays of the National Toxicology Program. Results of this study may be used to create a more efficient and economical approach for evaluating the carcinogenic activity of chemicals. A total of 70 mice were analyzed in the first phase and used as the training set. An additional 88 mice were later collected and analyzed, and subsequently used as the validation set.

The University of Texas M. D. Anderson Cancer Center (MDACC, Houston, TX, USA) generated data set D. Gene expression samples from 230 stage I-III breast cancers were collected from newly diagnosed patients before any therapy. Specimens were collected sequentially between 2000 and 2008 during a prospective pharmacogenomics marker discovery study. Patients received 6 months of preoperative chemotherapy followed by surgical resection of the cancer. Response to preoperative chemotherapy was categorized either as a pathological complete response (pCR), which indicates no residual invasive cancer in the breast or lymph nodes, or as residual invasive cancer (RD). Gene expression profiling was performed in multiple batches using Affymetrix U133A microarrays. The first 130 collected samples were assigned as the training set, whereas the next 100 samples were used as the validation set.

The Myeloma Institute for Research and Therapy at the University of Arkansas for Medical Sciences (UAMS, Little Rock, AR, USA) contributed data sets F and I. Highly purified bone marrow plasma cells were collected from patients with newly diagnosed multiple myeloma followed by gene expression profiling of these cells. The training set consisted of 340 cases enrolled on total therapy 2 (TT2) and the validation set comprised 214 patients enrolled in total therapy 3 (TT3). Dichotomized overall survival (OS) and event-free survival (EFS) were determined based on a two-year milestone cutoff.

As all the data sets above from the MAQC project are cancer-related, we have therefore gathered an additional non-cancer-related data set from a different source.

Proposed algorithm

As previously mentioned, we are proposing an entirely different approach towards overcoming batch effects. This computational approach is inspired by two articles in the field of biology, and is further enhanced with an idea from our previous work on sequential hypothesis testing.

First, we propose using rank values instead of absolute values of gene expression microarray data. This is inspired by the FDA-led Microarray Quality Control (MAQC) Consortium project.
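As an illustration, a within-sample rank transform can be sketched as follows. This is our own minimal sketch, not code from the study, but it shows the key property being exploited: any monotone, batch-specific distortion of the measured values (e.g., a scaling or shift) leaves the ranks unchanged.

```python
def to_ranks(values):
    """Convert a sample's expression values to within-sample ranks.

    Ties receive their average 1-based rank. Any order-preserving
    distortion of the values leaves the output unchanged.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the run of tied values starting at position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# A batch-specific monotone distortion (here 2x + 5) does not change ranks.
sample = [3.1, 0.2, 7.5, 0.2]
shifted = [2 * v + 5 for v in sample]
assert to_ranks(sample) == to_ranks(shifted)
```

Because the classifier then sees only ranks, a batch that systematically inflates or deflates measured intensities has no effect on the input representation, provided the distortion is monotone within each sample.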

Percentage of cases of AUC changes under various settings

**Percentage of cases of AUC changes under various settings**. The number of scenarios explored in each setting is 108. "A. Rank Values" uses rank values instead of absolute values of microarray data. "B. Bagging (10)" and "C. Bagging (100)" use bagging with 10 and 100 bootstrap replicates respectively, together with rank values. "D. Dynamic Bagging" uses bagging with a non-fixed number of bootstrap replicates, where the number of replicates is determined by the proposed sequential hypothesis testing algorithm (significance level 10^{-4}). AUC Change = AUCafter - AUCbefore. The base AUC (i.e., AUCbefore) is obtained using absolute gene expression values and no bagging. "Increased" and "Decreased" refer to cases where the change of AUC is

In another article

Suppose a set of m training samples contains x noisy ("bad") samples. A bootstrap bag is formed by drawing m samples with replacement, so each draw is bad with probability x/m, and the number of bad samples in a bag follows a binomial distribution:

P_B(k bad samples) = ^mC_k (x/m)^k (1 - x/m)^(m-k).

When the proportion of bad samples x/m is below 0.5, this distribution is right-skewed, so P_B(#bad < x) exceeds P_B(#bad > x); that is, a bag is more likely to contain fewer bad samples than the original training set than to contain more.

Theoretical values of P_B(#bad < x) - P_B(#bad > x)

**Theoretical values of P_B(#bad < x) - P_B(#bad > x)**. Theoretical values of the difference between the probability that a bootstrap bag contains fewer bad samples than the original training set and the probability that it contains more; positive values indicate that bags tend to be cleaner than the original training set.

Therefore, we have shown that training clones produced using the bootstrapping technique are likely to be enriched with more clean samples than the original training data.
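The binomial argument above can be checked numerically. The sketch below is our own illustration, not code from the study; it computes P_B(#bad < x) - P_B(#bad > x) for a training set of m samples containing x noisy ones, assuming the number of noisy samples in a bag follows Binomial(m, x/m).

```python
from math import comb

def bag_bad_pmf(m, x, k):
    """P that a bootstrap bag of size m, drawn with replacement from a
    training set containing x bad samples, holds exactly k bad samples."""
    p = x / m
    return comb(m, k) * p**k * (1 - p)**(m - k)

def enrichment_gap(m, x):
    """P_B(#bad < x) - P_B(#bad > x): positive means bags tend to be
    cleaner than the original training set."""
    fewer = sum(bag_bad_pmf(m, x, k) for k in range(0, x))
    more = sum(bag_bad_pmf(m, x, k) for k in range(x + 1, m + 1))
    return fewer - more

# With 20% noisy samples, bags are more likely cleaner than the original;
# with a majority of noisy samples, the advantage reverses.
assert enrichment_gap(100, 20) > 0
assert enrichment_gap(100, 60) < 0
```

This matches the claim in the text: the enrichment effect holds as long as good samples outnumber bad ones in the original training set.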

We further show that an ensemble classifier built using these training clones has better performance. Let T_1, T_2, ..., T_n be training sets of the same size, and let C(T_1) and C(T_2) be the classifiers trained on T_1 and T_2 respectively. It is reasonable to postulate (*) that C(T_1) would have a better accuracy than C(T_2) if T_1 has more "good" samples than T_2. Now let B_1, B_2, ..., B_n be bootstrap bags drawn from the original training set. By the argument above, P_B(#bad < x) > P_B(#bad > x); under postulate (*), a classifier trained on a random bag is therefore more likely to be better than the classifier trained on the original training set than to be worse.

This shows that an ensemble classifier built from this collection of bags--which is called a bagging classifier--improves prediction accuracy by embracing (i.e., reducing the influence of) noisy samples, as long as there are many more "good" samples than "bad" samples in the original training set

However, in the original flavor of bagging, the number of bootstrap replicates is fixed in advance: too few replicates give an unstable ensemble, while too many waste computation on test instances whose predictions are already settled.

Let s_i be the prediction score given to a test instance by the classifier trained on the i-th bootstrap bag, and let S_n be the mean of s_1, ..., s_n after n bags. After each additional bag, a sequential hypothesis test asks whether S_n is already significantly above or below the decision threshold of 0.5; if so, no further bootstrap replicates are generated for that test instance. The significance level used in this work is 10^{-4}.

In summary, we propose using rank values of microarray data and bagging with sequential hypothesis testing to dynamically determine the number of classifiers required. Finally, the average of these classifiers' scores is taken as the final prediction score for a particular test instance.
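A simplified sketch of dynamic bagging for a single test instance is given below. It substitutes a plain two-sided sign test on the classifiers' votes for the sequential hypothesis test actually used in this work, and the `classifiers` iterator (one trained base classifier per bootstrap bag) is a hypothetical stand-in; only the stopping significance level of 10^{-4} is taken from the text.

```python
import random
from math import comb

def binom_two_sided_p(k, n):
    """Two-sided sign-test p-value for k successes in n fair-coin trials."""
    tail = min(k, n - k)
    p = sum(comb(n, j) for j in range(0, tail + 1)) * 2 / 2**n
    return min(1.0, p)

def dynamic_bag_vote(classifiers, instance, alpha=1e-4, max_bags=1000):
    """Query base classifiers one at a time, stopping once a sequential
    sign test rejects the hypothesis that positive and negative votes
    are equally likely. Returns (predicted_label, n_classifiers_used)."""
    pos = n = 0
    for clf in classifiers:
        n += 1
        pos += clf(instance)
        # start testing once a minimally informative number of votes is in
        if n >= 10 and binom_two_sided_p(pos, n) < alpha:
            break
        if n >= max_bags:
            break
    return (1 if pos * 2 >= n else 0), n

# Toy base classifiers that vote positive 90% of the time: the sequential
# test settles on the positive label well before max_bags is reached.
random.seed(0)
clfs = (lambda x: 1 if random.random() < 0.9 else 0 for _ in range(1000))
label, used = dynamic_bag_vote(clfs, None)
assert label == 1 and used < 1000
```

Clear-cut test instances thus consume only a handful of bootstrap replicates, while ambiguous ones are allowed to accumulate evidence up to the cap.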

Evaluation of effectiveness

In this work, our main objective is to improve cross-batch prediction accuracy. Therefore, we will be using it as our performance measure. The primary performance metric used will be the area under the ROC curve (AUC), as it has the advantage of evaluating performance across the full range of sensitivity and specificity. A prediction model will be built using the training set and evaluated using the validation set (forward prediction) and vice versa (backward prediction).
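The AUC has a simple rank-based interpretation that fits the spirit of this work: it is the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one. A minimal implementation via the standard Mann-Whitney statistic (our own illustration, not code from the study):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a randomly chosen positive instance is scored
    higher than a randomly chosen negative one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Perfect separation gives 1.0; identical score distributions give 0.5.
assert auc([0.9, 0.8], [0.2, 0.1]) == 1.0
assert auc([0.6, 0.2], [0.6, 0.2]) == 0.5
```

Because this statistic depends only on the ordering of scores, it is itself invariant to monotone transformations of the prediction scores, just as rank values are invariant to monotone batch distortions.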

To demonstrate the applicability of our proposed algorithm in small-sample-size scenarios, we create two additional data sets by randomly selecting 25% or 50% of the samples while maintaining the class ratio from each of the original data sets given in Table
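A stratified subsampling step of this kind might look like the following sketch (the helper name and seed handling are our own, not from the study); it shrinks the training set while keeping the class ratio intact.

```python
import random

def stratified_subset(labels, fraction, seed=0):
    """Return indices of a random subset that keeps the class ratio of
    `labels`, mimicking the 25%/50% training subsets used in this work."""
    rng = random.Random(seed)
    chosen = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        chosen.extend(idx[: round(len(idx) * fraction)])
    return sorted(chosen)

# Data set A's training batch has 26 positives and 44 negatives (70 samples);
# a 50% subset keeps 13 positives and 22 negatives.
labels = [1] * 26 + [0] * 44
half = stratified_subset(labels, 0.5)
assert len(half) == 13 + 22
assert sum(labels[i] for i in half) == 13
```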

Using the above-mentioned data sets, feature selection algorithms and classification methods, we measure the difference in AUC before and after applying our proposed algorithm in each of the possible permutations. There are a total of 9 different data sets, 3 derived from each of data sets A, D and F. Data set I is not used to measure performance improvement since it is a negative control; it is used instead to ensure that arbitrary improvement is not seen. Together with two prediction directions (forward and backward), two feature selection algorithms (t-test and Wilcoxon rank-sum test) and three classification methods (SVM, k-NN and C4.5), there are a total of 108 (9 × 2 × 2 × 3) different possible scenarios.

Results

The main objective of this work is to improve cross-batch prediction performance. In Figure

Classifiers used by various settings

**Classifiers used by various settings**. "A. Rank Values" uses rank values instead of absolute values of microarray data. "B. Bagging (10)" and "C. Bagging (100)" use bagging with 10 and 100 bootstrap replicates respectively, together with rank values. "D. Dynamic Bagging" uses bagging with a non-fixed number of bootstrap replicates, where the number of replicates is determined by the proposed sequential hypothesis testing algorithm (significance level 10^{-4}). "MIN" is the minimum number of classifiers used in all scenarios. "MAX" is the maximum number of classifiers used in all scenarios. "AVG" is the average number of classifiers used in all scenarios. The number of scenarios explored in each setting is 108.

Another important consideration in building prediction models for clinical usage is the sample size of the training and test sets required to properly deploy them. As the MAQC project is a large-scale study, its data sets are larger than usual. To mimic the small sample sizes of clinical settings, we randomly subsampled the data during the training phase, reducing the number of available samples to as low as 25% of the original. Despite the reduction in training samples, our algorithm still maintained its improvements, with median AUC improvements well above 0.05 (Figure

Boxplot of AUC change on varying subset sizes under various scenarios (36) for data set A, D and F

**Boxplot of AUC change on varying subset sizes under various scenarios (36) for data set A, D and F**. AUC Change = AUCafter - AUCbefore. Subset size here implies using a random subset of the given data during training phase. "Dynamic Bagging.0.25", "Dynamic Bagging.0.5" and "Dynamic Bagging.1.0" are the AUC change after applying dynamic bagging and using rank values with 25%, 50% and 100% of the original given data for training respectively compared with the conventional approach, which is without bagging and using absolute values

As the PCA plots of Figure

Boxplot of AUC change on different data sets (A, D, F) under various scenarios (36)

**Boxplot of AUC change on different data sets (A, D, F) under various scenarios (36)**. AUC Change = AUCafter - AUCbefore. "Dynamic Bagging.A", "Dynamic Bagging.D" and "Dynamic Bagging.F" are the AUC change after applying dynamic bagging and using rank values on data sets A, D and F respectively compared with the conventional approach, which is without bagging and using absolute values

Finally, one critical issue highlighted by the MAQC project

Boxplot of AUC on varying subset sizes under various scenarios (36) for data set I

**Boxplot of AUC on varying subset sizes under various scenarios (36) for data set I**. Subset size here implies using a random subset of the given data during training phase. "Dynamic Bagging.0.25", "Dynamic Bagging.0.5" and "Dynamic Bagging.1.0" are the AUC achieved by applying dynamic bagging and using rank values with 25%, 50% and 100% of original given data for training.

Additional validation

In addition to the cancer-related data sets from the MAQC project, we have also obtained a Duchenne muscular dystrophy (DMD) data set from a different source.

Percentage of cases of AUC changes under various settings for DMD data set

**Percentage of cases of AUC changes under various settings for DMD data set**. The number of scenarios explored in each setting is 36. "A. Rank Values" uses rank values instead of absolute values of microarray data. "B. Bagging (10)" and "C. Bagging (100)" use bagging with 10 and 100 bootstrap replicates respectively, together with rank values. "D. Dynamic Bagging" uses bagging with a non-fixed number of bootstrap replicates, where the number of replicates is determined by the proposed sequential hypothesis testing algorithm (significance level 10^{-4}). AUC Change = AUCafter - AUCbefore. The base AUC (i.e., AUCbefore) is obtained using absolute gene expression values and no bagging. "Increased" and "Decreased" refer to cases where the change of AUC is

Discussion

Overcoming batch effects is an important step before deploying a diagnostic or prognostic model based on gene expression data in clinical settings. Numerous algorithms have been proposed in an attempt to solve this widespread and critical problem in high-throughput experiments

In this work, we approached the batch effects problem from a different angle. We proposed a computational algorithm that attempts to embrace noise instead of estimating and removing it. By simply employing rank values instead of the absolute values of the data, we were already able to show noticeable improvements. Combining this with bagging and a sequential hypothesis-testing algorithm, we were able to achieve a significant increase in cross-batch prediction performance over a wide range of training-data sample sizes and severities of batch effects. It is important to note that our approach does not face the same limitations as conventional batch effects removal methods, making it appealing for use in practical applications.

The feature selection algorithms considered in this work use only generic statistical tests that look at one gene at a time. However, more recent feature selection algorithms for gene expression data are increasingly focused on using prior biological information to group genes and perform statistical tests on these groups of genes instead of individual genes

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Both authors contributed equally to conceiving the proposed method. C.H.K. carried out the programming, prepared and ran the experiments, and drafted the manuscript. L.W. designed the experiments and supervised the study. Both authors read and approved the final manuscript.

Acknowledgements

We thank Sharene Lin for her help rendered in the preparation and proofreading of this manuscript.

This article has been published as part of