Dorothy P. and Richard P. Simmons Center for Interstitial Lung Disease, Division of Pulmonary, Allergy and Critical Care Medicine, University of Pittsburgh Medical Center, Pittsburgh, PA 15213, USA

Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA

Center for Automated Learning and Discovery and Language Technology Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

Background

The quality of microarray data can seriously affect the accuracy of downstream analyses. In order to reduce variability and enhance signal reproducibility in these data, many normalization methods have been proposed and evaluated, most of which are for data obtained from cDNA microarrays and Affymetrix GeneChips. CodeLink Bioarrays are a newly emerged, single-color oligonucleotide microarray platform. To date, there are no reported studies that evaluate normalization methods for CodeLink Bioarrays.

Results

We compared five existing normalization approaches in terms of both noise reduction and signal retention: Median (suggested by the manufacturer), CyclicLoess, Quantile, Iset, and Qspline. These methods were applied to two real datasets (a time course dataset and a lung disease-related dataset) generated by CodeLink Bioarrays and were assessed using multiple statistical significance tests. Compared to Median, CyclicLoess and Qspline show significant improvements, and the most consistent ones, in both reduction of variability and retention of signal. CyclicLoess appears to retain more signal than Qspline. Quantile reduces more variability than Median in both datasets, yet fails to consistently retain more signal in the time course dataset. Iset does not improve over Median in either noise reduction or signal enhancement in the time course dataset.

Conclusion

Median is insufficient either to reduce variability or to retain signal effectively for CodeLink Bioarray data. CyclicLoess is a more suitable approach for normalizing these data. CyclicLoess also seems to be the most effective method among the five different normalization strategies examined.

Background

DNA microarrays have made possible the expression profiling of thousands of genes in a single experiment. They have been used in a wide range of applications,

Regardless of platform, microarray data contain both genuine biological variation (signal) and noise. Signal is desirable and makes samples of different biological natures distinguishable from one another. Noise, however, has no biological relevance and can arise at any step: sample preparation, labelling, hybridization, or scanning.

A challenge inherent for the normalization of microarray data is the lack of a gold standard (

It is relatively ambiguous to evaluate normalization methods in terms of signal retention, because there is no ground truth for total signal in real biological samples. Prevailing approaches compare the ability to predict a fixed number of known differentially expressed genes using

Previously we have developed a strategy for the evaluation of normalization approaches for Affymetrix GeneChips

Results

Noise in the normalized time course dataset

Intensity-dependent differences in the normalized data

We used pairwise MA plots to examine the ability of the normalization methods to remove intensity-dependent differences between each pair of arrays in the technical replicates. Figure

**MA plots of two pairs of microarrays in the TRC1 replicate set from the time course dataset**. The plots in each row show the results from non-normalized data (row 1) and data normalized with each of the five normalization methods (rows 2–6). Columns depict two pairs of microarrays in the TRC1 replicates that exhibit obvious intensity-dependent differences between the arrays: pair 1 (array 1 vs. array 4, left) and pair 2 (array 4 vs. array 5, right). The yellow line in each plot shows the loess fit to all data points in the plot.

Variability of the normalized data

We calculated the coefficient of variation (CV) of the normalized intensity values for each transcript across all arrays in each set of the normalized technical replicates. Figure

**Comparison of variability of the normalized data in the three sets of technical replicates from the time course dataset**. The density plots show the CVs of all data points (either non-normalized or normalized with the five normalization methods) from each set of the technical replicates (TRC1, TRC2 and TRC3).

Finally, we calculated the CVs of the normalized intensity values for positive control probes across all arrays in the time course dataset. Our results show that CyclicLoess, Quantile and Qspline reduce variability in the normalized data more effectively than Nonorm, Median and Iset. The mean CVs for CyclicLoess, Quantile and Qspline are 57%, 54% and 56%, respectively, whereas the mean CVs for Nonorm, Median and Iset are 72%, 66% and 61%, respectively. CyclicLoess has a slightly higher mean CV than Quantile and Qspline, but the differences are not statistically significant (Welch's

These results suggest that, overall, CyclicLoess reduces variability more effectively and consistently than Median in both the 'corrupted' (TRC1) and good-quality (TRC2 and TRC3) data. Although Quantile and Qspline perform as well as CyclicLoess in the good-quality data, CyclicLoess outperforms them in the 'corrupted' data. Iset fails to improve over Median consistently.

Signal in the normalized time course dataset

Since the aim of an effective normalization method is to remove noise while retaining biological signal in the data, we next compared the effectiveness of the normalization approaches in signal retention. As no spike-in datasets were available for CodeLink Bioarrays, we were unable to determine total signal in the data. Instead, we estimated signal by calculating the number of differentially expressed genes in the data. We expected that the more signal retained, the more differentially expressed genes should be revealed. Although similar methodology has been shown effective in our previous work

Simulation model and data

We then used Welch's two sample

**Comparison of the numbers of differentially expressed genes estimated from the normalized time course dataset using multiple statistical tests**. The

Ranks of the normalization methods in Welch's

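| Normalization Method | Rank 1d | Rank 3d | Rank 7d | Rank 14d | Rank 30d | Mean Rank | Median Rank | Standard Deviation |
|---|---|---|---|---|---|---|---|---|
| Nonorm | 2 | 1 | 1 | 1 | 1 | 1.2 | 1 | 0.4 |
| Median | 1 | 2 | 3 | 2 | 5 | 2.6 | 2 | 1.5 |
| CyclicLoess | 5 | 3 | 5 | **6** | **6** | **5.0** | **5** | 1.2 |
| Quantile | 3 | 5 | **6** | 3 | 3 | 4.0 | 3 | 1.4 |
| Iset | **6** | **6** | 2 | 5 | 2 | 4.2 | 5 | **2.0** |
| Qspline | 4 | 4 | 4 | 4 | 4 | 4.0 | 4 | 0 |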

Ranks of the normalization methods in Wilcoxon tests (without permutation) in the time course dataset. For each time point, the normalization strategies were ranked based on the numbers of differentially expressed genes estimated from the normalized data using Wilcoxon rank sum tests without permutation. "Rank" and the "Mean", "Median" and "Standard Deviation" of the ranks are defined as described in Table 1. For each column in the table, the highest rank(s), the mean, median and standard deviation of the ranks are shown in bold.

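| Normalization Method | Rank 1d | Rank 3d | Rank 7d | Rank 14d | Rank 30d | Mean Rank | Median Rank | Standard Deviation |
|---|---|---|---|---|---|---|---|---|
| Nonorm | **3.5** | 1 | 1 | 1 | 1 | 1.5 | 1 | 1.1 |
| Median | **3.5** | **6** | 3 | 2 | 4 | 3.7 | 3.5 | 1.5 |
| CyclicLoess | **3.5** | 5 | 5 | 5 | **6** | **4.9** | **5** | 0.9 |
| Quantile | **3.5** | 3 | 4 | 3 | 3 | 3.3 | 3 | 0.4 |
| Iset | **3.5** | 2 | 2 | **6** | 2 | 3.1 | 2 | **1.7** |
| Qspline | 3.5 | 4 | 6 | 4 | 5 | 4.5 | 4 | 1 |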

Ranks of the normalization methods in Welch's

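| Normalization Method | Rank 1d | Rank 3d | Rank 7d | Rank 14d | Rank 30d | Mean Rank | Median Rank | Standard Deviation |
|---|---|---|---|---|---|---|---|---|
| Nonorm | **3.5** | 1 | 1 | 1 | 1 | 1.5 | 1 | 1.1 |
| Median | **3.5** | 2 | 4 | 2 | 5 | 3.3 | 3.5 | 1.3 |
| CyclicLoess | **3.5** | 4.5 | 3 | 5 | **6** | **4.4** | **4.5** | 1.2 |
| Quantile | **3.5** | 4.5 | 5 | 3 | 3 | 3.8 | 3.5 | 0.9 |
| Iset | **3.5** | **6** | 2 | **6** | 2 | 3.9 | 3.5 | **2.0** |
| Qspline | 3.5 | 3 | 6 | 4 | 4 | 4.1 | 4 | 1.1 |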

Ranks of the normalization methods in Wilcoxon tests (with permutation) in the time course dataset. For each time point, the normalization strategies were ranked based on the numbers of differentially expressed genes estimated from the normalized data using Wilcoxon rank sum tests with permutation. "Rank" and the "Mean", "Median" and "Standard Deviation" of the ranks are defined as described in Table 1. For each column in the table, the highest rank(s), the mean, median and standard deviation of the ranks are shown in bold.

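| Normalization Method | Rank 1d | Rank 3d | Rank 7d | Rank 14d | Rank 30d | Mean Rank | Median Rank | Standard Deviation |
|---|---|---|---|---|---|---|---|---|
| Nonorm | **3.5** | 1 | 1 | 1 | 1 | 1.5 | 1 | 1.1 |
| Median | **3.5** | **6** | 3 | 2 | 3 | 3.5 | 3 | **1.5** |
| CyclicLoess | **3.5** | 5 | 5 | **6** | **6** | **5.1** | **5** | 1.0 |
| Quantile | **3.5** | 3 | 4 | 3 | 4 | 3.5 | 3.5 | 0.5 |
| Iset | **3.5** | 2 | 2 | 5 | 2 | 2.9 | 2 | 1.3 |
| Qspline | **3.5** | 4 | **6** | 4 | 5 | 4.5 | 4 | 1.0 |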

**Ranks of the normalization methods in the normalized time course dataset**. The mean, median and standard deviation of the ranks of the normalization methods are defined in Table 1. The bar plots are a visual representation of the results shown in Tables 1–4 (the "Mean Rank", "Median Rank" and "Standard Deviation" columns). In each plot, mean ranks are shown in pink, median ranks in blue, and standard deviations of the ranks as error bars on top of the mean rank bars.
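The rank summaries reported in Tables 1–4 can be illustrated with a minimal Python sketch. It assumes that methods are ranked in ascending order of the number of differentially expressed genes (so the method revealing the most genes gets rank 6), that tied methods share the average of the ranks they span (which appears consistent with the 3.5 entries in the tables), and that the sample standard deviation is used (which appears consistent with the Median row of Table 1). The function name `average_ranks` and the gene counts are hypothetical and are not taken from the study.

```python
from statistics import mean, median, stdev


def average_ranks(counts):
    """Rank methods by gene count (1 = fewest genes); ties share the mean rank."""
    ordered = sorted(counts, key=counts.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every method tied with the one at position i.
        j = i
        while j + 1 < len(ordered) and counts[ordered[j + 1]] == counts[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2.0  # mean of the 1-based ranks i+1..j+1
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


# Hypothetical gene counts at one time point (illustration only).
counts = {"Nonorm": 0, "Median": 3, "CyclicLoess": 12,
          "Quantile": 7, "Iset": 7, "Qspline": 9}
ranks = average_ranks(counts)

# When no method detects any gene, all six methods tie.
all_tied = average_ranks({name: 0 for name in counts})

# Ranks of Median across the five time points, as listed in Table 1.
median_ranks = [1, 2, 3, 2, 5]
summary = (round(mean(median_ranks), 1),
           median(median_ranks),
           round(stdev(median_ranks), 1))
```

With these assumptions, a six-way tie yields a rank of 3.5 for every method, and the summary for the Median row comes out as a mean of 2.6, a median of 2 and a standard deviation of 1.5.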

Notably, compared to the other normalization methods, Iset has the highest standard deviations of the ranks across all time points in almost all the statistical tests (3 out of the 4 tests) (Tables

In addition to the comparisons between the control group and each test group, we used the Analysis of Variance (ANOVA) to estimate the number of differentially expressed genes whose intensity values varied with the days of the treatment. CyclicLoess reveals the largest number of differentially expressed genes (68 genes) with ANOVA. Iset, Qspline and Quantile reveal somewhat fewer differentially expressed genes (66, 56 and 54 genes, respectively), but they still clearly outperform Median and Nonorm (which reveal only 24 and 9 differentially expressed genes, respectively).

The comparison of the normalization methods in the time course dataset can be summarized as follows. For noise removal and signal retention, CyclicLoess demonstrates the greatest and most consistent improvement over Median; Qspline exhibits consistent yet moderate improvement over Median. Quantile performs consistently better than Median for variability reduction, yet does not do so for signal detection. Iset fails to improve over Median consistently for either noise reduction or signal retention.

Since CyclicLoess, Quantile and Qspline exhibit considerable improvement over Median in the time course dataset, we compared them in greater detail using another dataset, the IPF dataset (see Methods), which contains larger and more balanced numbers of arrays for both the control (control patients) and test groups (pulmonary fibrotic patients). To focus the comparison on these methods, we excluded Nonorm and Iset from further analyses.

Variability of the normalized IPF dataset

Since there were no technical replicates in the IPF dataset, we compared the four normalization methods (CyclicLoess, Quantile, Qspline, and Median) for noise removal using the positive control probes on the CodeLink Bioarrays. The CV of the normalized intensity values across all arrays was calculated for each positive control probe in the IPF dataset processed with each normalization method. Our results show that CyclicLoess, Quantile and Qspline all have significantly lower mean CVs (79%, 76% and 74%, respectively) than Median (89%). CyclicLoess has a slightly higher mean CV than Quantile and Qspline, but the difference is not statistically significant (Welch's

Signal in the normalized IPF dataset

Table

Numbers of differentially expressed genes estimated from the IPF dataset. Columns show the normalization methods used to process the IPF dataset. Rows show the statistical tests performed to detect the numbers of differentially expressed genes (adjusted

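| Statistical test | Median | CyclicLoess | Quantile | Qspline |
|---|---|---|---|---|
| Welch's *t*-test | 220 | 240 | 242 | **269** |
| Wilcoxon test | 108 | **164** | 127 | 136 |
| Welch's *t*-test (perm) | 279 | **331** | 314 | 319 |
| Wilcoxon test (perm) | 227 | **297** | 259 | 271 |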

**Comparison of the numbers of differentially expressed genes estimated from the normalized IPF dataset using multiple statistical tests**. The

Overall, comparative results of the four normalization methods in the IPF dataset agree with most of those from the time course dataset: CyclicLoess and Qspline exhibit significant and consistent improvement over Median in both noise reduction and signal retention; CyclicLoess reveals slightly more signal (differentially expressed genes) than Qspline. Quantile outperforms Median for both noise reduction and signal retention in all four statistical tests, which is in contrast to its performance in the time course dataset where it fails to reveal more signal than Median in some tests (

Discussion

CodeLink Bioarrays are recently introduced, single-color oligonucleotide microarrays, which differ from Affymetrix GeneChips in the following aspects

In this study, in order to determine the best normalization method(s) for CodeLink Bioarrays, we compared five existing approaches designed for high-density oligonucleotide microarrays. These methods have been applied previously to Affymetrix GeneChip data. Our goal is to provide a guideline for practitioners in the choice of a 'proper' normalization method that removes variability and retains signal effectively for CodeLink Bioarray data and thus to ensure the validity of downstream data analyses. Using our criteria, the Median normalization method (recommended by the manufacturer) is insufficient for noise removal in the two examined CodeLink Bioarray datasets, whereas CyclicLoess and Qspline show considerable and consistent improvement over Median for both variability reduction and signal retention. CyclicLoess performs slightly better than Qspline for signal retention. Quantile exhibits moderate improvement over Median for variability reduction and signal retention in the IPF dataset, yet it fails to do so for signal retention in the time course dataset. Iset fails either to remove noise or to retain signal more effectively and consistently than Median in the time course dataset.

A major difference between CyclicLoess, Qspline, Quantile, and Iset can be explained as follows. CyclicLoess and Qspline have more relaxed assumptions on microarray data than Quantile and Iset. The former methods require only that genes on the arrays are randomly distributed (

Besides the difference mentioned above, the two baseline-array approaches, Qspline and Iset, also differ in the following way. Although both methods use a subset of genes to estimate intensity-dependent differences between a pair of microarrays for normalization, Qspline chooses these genes evenly over the entire range of the genes on the arrays. Iset, however, uses rank-invariant genes, which are usually few in number (about 300 – 1000 genes in the examined datasets) and thus may be insufficient for estimating the intensity effect accurately in some arrays. This may account for the unstable performance of Iset in the time course dataset. Moreover, although intuitively normalization should be more effective if a 'proper' set of 'housekeeping' genes can be selected, the effectiveness of such an approach could be limited by the still unanswered question of whether these genes exist in higher organisms

Since Qspline is a baseline-array approach, a concern could be raised that its performance may depend on the choice of the baseline array. Indeed, it has been shown that the performance of Iset varied when different individual arrays were used as the baseline array

In addition to the examined normalization methods, there are other approaches that can be applied to CodeLink Bioarray data. For example, mean cyclic loess

There are two possible limitations in this study. The first is that in the time course dataset, the sample sizes of the control vs. test groups were not well balanced (

The second possible limitation of this study is that, since no spike-in datasets were available for CodeLink Bioarrays, two real CodeLink Bioarray datasets were used instead. It would be more informative if a fixed number of known differentially expressed genes were present in the data. However, this information is often unknown for real microarray datasets. Although spike-in or dilution data has been shown to be useful for evaluation of normalization methods

Methods

Datasets

Two real datasets were used in this study.

Time course dataset

These data were collected to test the difference between a control group of rats (exposed to a treatment for 0 days) and test groups of rats exposed to a treatment for either 1, 3, 7, 14 or 30 days. The control group contained 14 arrays. The 14-day treatment group contained 6 arrays, and the other test groups contained 4 arrays each. Thus, the dataset contained 36 arrays. For the control group, there were 3 sets of technical replicates, sets TRC1, TRC2 and TRC3, which contained 5, 5 and 4 arrays, respectively. The arrays used in this dataset were CodeLink UniSet Rat I Bioarrays containing pre-validated oligonucleotide probes targeting about 10K transcripts in the rat genome.

IPF dataset

This dataset was generated to compare expression profiles of control lungs vs. lungs from patients with idiopathic pulmonary fibrosis (IPF). A total of 26 microarrays were obtained from 11 controls and 15 patients. The arrays used were CodeLink UniSet Human I Bioarrays containing pre-validated oligonucleotide probes targeting about 10K transcripts in the human genome. The data are available at the Gene Expression Omnibus (GEO accession number GSE2052).

Both of the above CodeLink Bioarray platforms contain 68 bacterial control probes on each array, of which 18 are positive control probes (which can be used to monitor the quality of microarray experiments, see below) and 50 are negative control probes (which can be used to determine the low limit of signal). Each control probe is spotted 6 times on each array.

Microarray protocol

A CodeLink Bioarray experiment involves the following steps. Total RNAs are first prepared from a biological sample. Then a set of bacterial mRNAs of known concentrations (which are provided by the manufacturers and have complementary sequences to the positive control probes on the Bioarrays) are spiked in as positive controls. The mixed mRNAs are reverse transcribed into cDNAs and amplified into cRNAs, using

Since some normalization methods (**pamr/pamr.knnimpute(k = 10) **

Normalization methods

All of the examined normalization methods are available at

Global approaches

We compared three global normalization methods: Median, CyclicLoess and Quantile. For reference, we also included Nonorm, which does not perform any data transformation on the raw intensity values, $I_r = I_f - I_b$, where $I_f$ is the fluorescent signal of a spot and $I_b$ is its local background intensity. Median normalization scales the raw intensity values on an array by the median of the raw intensity values on that array, $I_n = I_r / \mathrm{median}(I_r)$, where $I_n$ is the normalized intensity value of a spot. Median is recommended by the manufacturer for normalization and serves as the baseline method in this study (note that this should not be confused with the 'baseline-array' approaches described below). The Median-normalized CodeLink Bioarray data were obtained directly from the software provided by the manufacturer.
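As an illustration of Median normalization, the following Python sketch scales the background-subtracted intensities of one array by their median. It assumes that "scaling by the median" means dividing each raw value by the array median; the function name `median_normalize` and the spot values are hypothetical, and the sketch is not the manufacturer's software.

```python
from statistics import median


def median_normalize(foreground, background):
    """Scale the background-subtracted intensities of one array by their median."""
    # Raw intensity: fluorescent signal minus local background, spot by spot.
    raw = [f - b for f, b in zip(foreground, background)]
    m = median(raw)
    if m == 0:
        raise ValueError("median raw intensity is zero; cannot scale")
    return [r / m for r in raw]


# Hypothetical spot values for a single array (not data from the study).
foreground = [120.0, 300.0, 80.0, 560.0, 210.0]
background = [20.0, 50.0, 30.0, 60.0, 10.0]
normalized = median_normalize(foreground, background)
```

After this scaling the median normalized intensity of every array is 1, so arrays are brought to a common scale, but any intensity-dependent differences between arrays are left untouched.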

CyclicLoess

CyclicLoess uses the MA plot and loess smoothing. It was implemented in the Bioconductor package/function **affy/normalize.loess(data, epsilon = 1, log.it = F, span = 0.4, maxit = 2)**. The parameter epsilon in **normalize.loess** measures intensity-dependent differences in the data and thus serves as the stopping criterion for the iterations. Our results show that when epsilon is smaller than 1, intensity-dependent differences in the data are negligible. Smaller values of epsilon do not produce better results in terms of variability reduction or signal retention (data not shown). CyclicLoess needed two iterations in the time course dataset and one iteration in the IPF dataset to satisfy the stopping criterion (epsilon < 1).
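The MA plot on which CyclicLoess operates can be illustrated with a short Python sketch. For a pair of arrays, M is the log2 ratio of the two intensities and A is their average log2 intensity; CyclicLoess then fits a loess curve of M against A and removes it, a step that is deliberately omitted here. The function name `ma_values` and the intensities are hypothetical, and this is not the `normalize.loess` implementation.

```python
import math


def ma_values(x, y):
    """Return (M, A) lists for two arrays of strictly positive intensities."""
    m_vals, a_vals = [], []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi), math.log2(yi)
        m_vals.append(lx - ly)          # M: log2 ratio between the arrays
        a_vals.append((lx + ly) / 2.0)  # A: average log2 intensity
    return m_vals, a_vals


# Hypothetical intensities for the same three probes on two arrays.
array_a = [8.0, 64.0, 256.0]
array_b = [4.0, 64.0, 1024.0]
M, A = ma_values(array_a, array_b)
```

For two well-normalized technical replicates, M should scatter around zero at every value of A; a trend of M with A is the intensity-dependent difference that the loess fit is meant to capture.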

Quantile was implemented in the Bioconductor package/function **affy/normalize.quantiles(data)**.
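The general idea of quantile normalization can be sketched in a few lines of Python: every array is forced to share the same empirical distribution, obtained by averaging the sorted intensities across arrays. This is a simplified illustration that breaks ties by original position; it is not the `normalize.quantiles` code, and the function name `quantile_normalize` and the values are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    `arrays` is a list of equal-length lists, one list per array.
    Ties are broken by original position, which is a simplification.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Sort each array, then average the values found at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays)
                 for r in range(n_probes)]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three probes on two arrays.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
result = quantile_normalize(arrays)
```

Each probe keeps its rank within its own array, but after normalization all arrays contain exactly the same set of values.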

Baseline-array approaches

We compared two baseline-array methods, Iset and Qspline. These two strategies share several similarities: 1) both need to choose a baseline-array to estimate intensity-dependent differences in target arrays; 2) both use a spline smoothing technique to remove intensity-dependent differences in target arrays; 3) both estimate smoothing curves for normalization using a subset of the genes on the arrays; and 4) both are rank-based methods. However, they differ in their choices of the subset of genes for fitting normalization curves. Iset chooses these genes by selecting a set of rank-invariant genes (or so called "housekeeping genes") in the target array with respect to the baseline array

Iset was implemented in the Bioconductor package/function **affy/normalize.invariantset ("target", "ref", prd.td = c(0.003, 0.007))**, where "target" is the data matrix from the target array and "ref" is the data matrix from the baseline array. We chose the baseline array for Iset as follows: the intensity value of each gene is the median of the intensity values of the same gene across all arrays. Qspline was implemented in the Bioconductor package/function **affy/normalize.qspline (data, fit.iters = 5, min.offset = 5, spar = 0, p.min = 0, p.max = 1.0)**. We chose the baseline array for Qspline using the default option, in which each gene in the baseline array had the intensity value equal to the mean of the intensity values of the same gene across all arrays.
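The two baseline-array choices described above (gene-wise median across arrays for Iset, gene-wise mean for Qspline) can be sketched as follows. The helper name `baseline_array` and the intensities are hypothetical; the sketch only shows how a synthetic baseline array might be assembled and does not reproduce the Bioconductor functions.

```python
from statistics import mean, median


def baseline_array(arrays, summary):
    """Build a synthetic baseline array: one summary value per gene across arrays."""
    # zip(*arrays) groups the intensities of the same gene from every array.
    return [summary(values) for values in zip(*arrays)]


# Hypothetical intensities: three arrays (rows) by three genes (columns).
arrays = [[10.0, 200.0, 35.0],
          [12.0, 180.0, 40.0],
          [20.0, 220.0, 30.0]]

iset_baseline = baseline_array(arrays, median)   # median across arrays
qspline_baseline = baseline_array(arrays, mean)  # mean across arrays
```

The two baselines agree for genes whose intensities are symmetric around their centre and differ when one array is an outlier, as for the first gene in this example.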

Detection of noise in the normalized datasets

We applied the five normalization methods individually to the time course and IPF datasets. In order to assess the effectiveness of the normalization methods in removing noise, we first used three sets of technical replicate microarrays, sets TRC1, TRC2 and TRC3, from the time course dataset. For each set of the replicates, we used pairwise MA plots to examine intensity-dependent differences in the data normalized individually with each of the normalization methods; we then calculated the coefficient of variation (CV) of the normalized intensity values for each transcript (gene) across all arrays. Specifically, if we let $\mathbf{x}_k$ denote the vector of normalized intensity values for transcript $k$ across the $M$ arrays in a replicate set, $\mathbf{x}_k = (x_{k,1}, \ldots, x_{k,m}, \ldots, x_{k,M})$, where $x_{k,m}$ is the normalized intensity value of transcript $k$ on array $m$, then the CV of $\mathbf{x}_k$ is computed as follows:

$$\mathrm{CV}(\mathbf{x}_k) = \frac{\text{standard deviation}(\mathbf{x}_k)}{\text{mean}(\mathbf{x}_k)} \times 100\%.$$

Finally, we exploited the redundancy in the positive control probes on the arrays and measured the CVs for these probes across all arrays in the time course dataset. As above, we let $\mathbf{x}_c$ denote the vector of normalized intensity values for the positive control probe $c$, $\mathbf{x}_c = (x_{c,1,1}, \ldots, x_{c,p,1}, \ldots, x_{c,6,1}, \ldots, x_{c,p,n}, \ldots, x_{c,6,N})$, where $x_{c,p,n}$ is the normalized intensity value of the $p$-th of the 6 spots of probe $c$ on array $n$ and $N$ is the number of arrays. The CV of $\mathbf{x}_c$ is computed as follows:

$$\mathrm{CV}(\mathbf{x}_c) = \frac{\text{standard deviation}(\mathbf{x}_c)}{\text{mean}(\mathbf{x}_c)} \times 100\%.$$
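The CV calculation can be illustrated with a minimal Python sketch. It assumes the sample standard deviation (the text does not say whether the sample or population form was used), and the function name `cv_percent` and the intensity values are hypothetical.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation in percent, using the sample standard deviation."""
    return stdev(values) / mean(values) * 100.0


# Hypothetical normalized intensities of two transcripts across four replicate arrays.
replicates = {
    "transcript_a": [100.0, 110.0, 90.0, 100.0],
    "transcript_b": [50.0, 150.0, 100.0, 100.0],
}
cvs = {name: cv_percent(values) for name, values in replicates.items()}
```

The same function applies to a positive control probe once the intensities of its 6 spots on every array have been pooled into a single list.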

Since the IPF dataset did not contain technical replicates, we measured and compared the CVs for each positive control probe across all arrays in the IPF data normalized with the normalization methods. The normalized intensity values used in this section for calculating CVs are on the scale of the raw intensity data (

Detection of signal in the normalized datasets

A negative control threshold, $T_{NC}$, is used (as suggested by CodeLink Bioarrays) to monitor the low limit of signal. $T_{NC}$ is defined as $T_{NC}$ = (80% trimmed mean of negative control probes) + (3 standard deviations of the 80% trimmed population of negative control probes). In order to minimize the effects of low signal probes (whose intensity values are smaller than the negative control threshold) on signal detection, we replaced the intensity values of these probes with $T_{NC}$.
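The thresholding step can be sketched in Python as follows. The sketch assumes that the "80% trimmed" population keeps the central 80% of the negative control intensities (dropping 10% at each end) and that the sample standard deviation is used; neither detail is stated in the text. The function names and the intensity values are hypothetical.

```python
from statistics import mean, stdev


def negative_control_threshold(neg_values, keep_fraction=0.8):
    """Trimmed mean plus 3 standard deviations of the trimmed negative controls.

    Assumption: the trimmed population keeps the central `keep_fraction`
    of the sorted values, dropping an equal share from each end.
    """
    ordered = sorted(neg_values)
    n_drop = int(round(len(ordered) * (1.0 - keep_fraction) / 2.0))
    trimmed = ordered[n_drop:len(ordered) - n_drop]
    return mean(trimmed) + 3.0 * stdev(trimmed)


def floor_at_threshold(intensities, threshold):
    """Replace every intensity below the threshold with the threshold itself."""
    return [max(value, threshold) for value in intensities]


# Hypothetical negative control intensities with one outlier.
negatives = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
threshold = negative_control_threshold(negatives)
floored = floor_at_threshold([3.0, 20.0, 12.0], threshold)
```

Trimming keeps the outlying value of 100 from inflating the threshold, and flooring means that probes below the detection limit all carry the same value and so contribute little to the test statistics.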

Log-transformed, normalized intensity values were used in the analyses described below.

In order to compare the effectiveness of the normalization methods in enhancing signal reproducibility, we detected signal in the normalized datasets. We assume that signal quality can be estimated by the number of differentially expressed genes detected (that is, the more signal retained in the normalized data, the more differentially expressed genes should be revealed). We first developed a simulation model and verified this intuition using simulated data (see

We then used multiple statistical significance tests, both parametric (

In addition to pair-wise comparisons between the control and test groups, we used ANOVA to detect genes whose intensity values varied with the length of the treatment. Since we believed that the relationship between the response variable (intensity values of genes) and the explanatory variable (days of the treatment) was non-linear, we used a quadratic regression model to fit these variables. For transcript $k$, let $i = 1, \ldots, n_j$ index the arrays in treatment group $j$, where $n_j$ is the number of the arrays for treatment group $j$; let $y_{k,i,j}$ denote the intensity value of transcript $k$ on array $i$ of group $j$, and let $d_{i,j}$ be the day of treatment for array $i$ of group $j$, $d_{i,j} \in \{0, 1, 3, 7, 14, 30\}$. The quadratic regression model can be written as:

$$y_{k,i,j} = \beta_0 + \beta_1 d_{i,j} + \beta_2 d_{i,j}^2 + \varepsilon_{k,i,j}.$$

ANOVA was used to estimate statistical significance of the model parameters $\beta_1$ and $\beta_2$ for all transcripts, $1 \le k \le K$. To ensure that both $\beta_1$ and $\beta_2$ are non-zero and thus minimize the risk of false positives, a stringent criterion, the adjusted

List of abbreviations

ANOVA: Analysis of Variance;

CV: coefficient of variation;

FDR: false discovery rate;

IPF: idiopathic pulmonary fibrosis;

loess: local regression estimation.

Authors' contributions

WW designed and performed computational experiments, and drafted the manuscript. ND performed microarray experiments. GCT conducted the simulation study and edited the manuscript. TR read and edited the manuscript. EPX and NK participated in experimental design and in drafting the manuscript. All authors contributed to, read and approved the final manuscript.

Acknowledgements

We would like to thank Drs. James Dauber and Kevin Gibson from the Dorothy P. and Richard P. Simmons Center for Interstitial Lung Disease, Division of Pulmonary, Allergy and Critical Care Medicine, University of Pittsburgh Medical Center, for proofreading the manuscript. NK's work is funded by NIH grants HL 073745-01 and HL079394-01. ND's work is funded by the NHLBI 1 F32 HL78164-2 grant.