Abstract
Background
Microarray data normalization is an important step for obtaining data that are reliable and usable for subsequent analysis. One of the most commonly utilized normalization techniques is the locally weighted scatterplot smoothing (LOWESS) algorithm. However, a much overlooked concern with the LOWESS normalization strategy deals with choosing the appropriate parameters. Parameters are usually chosen arbitrarily, which may reduce the efficiency of the normalization and result in nonoptimally normalized data. Thus, there is a need to explore LOWESS parameter selection in greater detail.
Results and discussion
In this work, we discuss how to choose parameters for the LOWESS method. Moreover, we present an optimization approach for obtaining the fraction of data points utilized in the local regression and analyze results for local printtip normalization. The optimization procedure determines the bandwidth parameter for the local regression by minimizing a cost function that represents the meansquared difference between the LOWESS estimates and the normalization reference level. We demonstrate the utility of the systematic parameter selection using two publicly available data sets. The first data set consists of three self versus self hybridizations, which allow for a quantitative study of the optimization method. The second data set contains a collection of DNA microarray data from a breast cancer study utilizing four breast cancer cell lines. Our results show that different parameter choices for the bandwidth window yield dramatically different calibration results in both studies.
Conclusions
Results derived from the self versus self experiment indicate that the proposed optimization approach is a plausible solution for estimating the LOWESS parameters, while results from the breast cancer experiment show that the optimization procedure is readily applicable to reallife microarray data normalization. In summary, the systematic approach to obtain critical parameters in the LOWESS technique is likely to produce data that optimally meets assumptions made in the data preprocessing step and thereby makes studies utilizing the LOWESS method unambiguous and easier to repeat.
Background
DNA microarray technology has become a standard tool in biomedical research for largescale transcriptional monitoring [1]. A growing number of microarray experiments seek to compare samples labeled with two different dyes, such as Cyanine5 (Cy5) and Cyanine3 (Cy3). However, several studies report that the dyes bind on a microarray slide differently due to the variations in their chemical characteristics [26]. In addition, the image scanner settings also affect dye intensity measurements. Should these discrepancies not be corrected, the resulting data may not be useful for analysis purposes. Thus, there is a need for dye normalization for the microarray slide prior to actual data analysis to reduce systematic variability.
Microarray data preprocessing contains three phases: quality control, withinslide normalization, and betweenslide normalization. Withinslide normalization aims to correct dye incorporation differences which affects all the genes similarly, or genes with the same intensity similarly [7]. One scatterplotbased normalization technique that is particularly suitable for balancing the intensities is called locally weighted scatterplot smoothing (LOWESS) and its original application was for smoothing scatterplots in a weighted, leastsquares fashion [8]. This technique is typically chosen to calibrate microarray data because a popular, freely available implementation is available in the statistical software package R [9] and in many commercial microarray analysis software suites such as the Agilent Feature Extraction Software. Moreover, several other freely available microarray data handling packages have incorporated this normalization technique [10,11]. It is noted that many normalization studies simply call the function without rigorous consideration for the actual algorithmic parameters [12,13]. Our analysis reports that the choices of different parameter values drastically affect the quality of the normalization results. The original work on LOWESS clearly mentions the problem of obtaining parameter values and even offers some ideas for finding suitable datadependent choices [8,14]. However, many microarray studies have omitted such rationale and made arbitrary selections for different experimental data sets [13,15,16] and some studies even failed to report their parameter assumptions in their methods [1719]. Although this practice has not lead to significant consequences for most of the parameters in LOWESS, we show that the parameter that represents the fraction f of neighboring samples to be included in the weighted polynomial fit is particularly sensitive and its variation greatly affects the normalization results. This parameter should be carefully chosen through a systematic procedure where experimental assumptions are clearly specified. Benefits in the normalization process may be considered to be small in their own right, but these improvements are extremely meaningful in the context of searching for subtle biological differences in gene expression.
We outline an optimizationbased procedure for obtaining a systematic value for f in printtip LOWESS normalization. Results are compared to common, arbitrary selections of f. The proposed procedure first examines a case study where we have utilized three quality filtered, self versus self hybridization experiments. With self versus self experiments, we are able to clearly detect normalization differences. Such analysis also verifies that the optimized method produces properly calibrated ratios. Our proposed technique is also demonstrated on a typical set of quality filtered microarray data. We utilize a set of breast cancer data that has replicated measurements for four different tumor cell lines [20]. In addition to visual comparisons, we quantitatively assess the performance of the different normalization procedures using a goodnessoffit test. Our results demonstrate that arbitrarily selecting the LOWESS bandwidth parameter produces statistically different results for certain printtips compared to the proposed optimized parameter selection formulation. Moreover, for genes that have been verified using reverse transcriptionpolymerase chain reaction (RTPCR) experiments, we show that calibrated results are substantially affected by the choice of f. Our self versus self data, including the original TIFF images, are available online [21] and the replicated breast cancer data is posted by the original authors of that study [22].
Results and discussion
Withinslide normalization
Withinslide normalization is used to correct the dye intensity errors introduced across one microarray slide. The result of this step provides the normalized, calibrated ratios. Let denote the background corrected selection for the intensity of the jth gene of the Cy3 (green) colored sample. Similarly, let denote the jth gene of the Cy5 (red) colored sample. One key issue for the dyes is that they are consistently imbalanced [12,13]. Different labelling effciency between the two fluorescent dyes exists and in some labelling schemes Cy5 is systematically less intense than Cy3. Normalization techniques are needed in order to render the gene expression levels measured by the two different dyes comparable [23,24]. Dye biases can stem from a wide variety of factors, including physical properties of the dyes, effciency of dye incorporation, and processing errors. Such errors may be introduced by slight variations in the amount of mRNA used to create the target hybridized to each microarray or in the quantity of dye used to fluorescently label each target.
For a single microarray experiment, there are n total gene expression ratios and we denote the observed vector of ratios for a single experiment as r ∈ ℝ^{n × 1}. The calibrated ratio of expression for each gene is obtained by dividing the test by the reference sample intensities with the proper normalization factor in the denominator,
for i = 1, 2..., n, where n is the total number of spots on a microarray. The normalization factor, denoted by Φ(·), is a function of datadependent variables. If the dyes are linearly dependent, it can be assumed that the normalization function is a constant, namely Φ(·) = φ. Many studies have looked at linear dependencies [25], as well as a generalized form of the normalization factor Φ(·) that is a function of an often times unknown number of experimentspecific parameters.
Many studies perform withinslide normalization in a global manner by assuming the error effects are stationary across an entire slide. This is currently true for the cases of Affymetrix GeneChip or Agilent oligonucleotide microarrays. For cDNA microarrays, however, the sources of variation typically originate in a localized or spatial manner [13], mainly from the different print tips for each subarray of the slide [26]. The process of determining the values for Φ(·) is highly dependent on the characteristics of the data for each printtip [12]. For example, some printtips have highly nonlinear effects, while other printtips in the same experiment behave quite differently and may exhibit linear trends in dye bias. Furthermore, the systematic manner in which the experiment has been conducted also influences the results of different slides, but it is our intention that such effects will be satisfactorily captured in the behavior of the printtip statistics. As a consequence, we omit global calibration considerations that neglect printtip distinction and focus solely on scatterplotbased normalization in a termed localized manner.
LOWESS method
One of the most widely used nonlinear correction techniques is the LOWESS method, which was first applied to microarray data by Yang et al. [16]. The main idea behind LOWESS is to utilize a locally weighted polynomial regression of the intensity scatterplot in order to obtain the calibration factor. Compared to other techniques, like housekeepingbased normalization or dyeswap experiments, scatterplotbased normalization is more robust in many types of scenarios where assumptions of constantly expressed genes may break down [23]. Subsequent microarray studies have also chosen this method due to the robustness of fit in the presence of a few extreme outliers. Original studies have examined the (I^{g}, I^{r})scatterplot in log_{2}space for determining the value of Φ(·). It has been suggested in separate works by Dudoit et al. [15] and Yang et al. [16] that a log_{2}based scatterplot of the average fluorescence intensity A versus the transformed ratio M should be used instead of a simple, log_{2}based intensity scatterplot. This type of scatterplot is commonly known as a BlandAltman plot in the statistics literature. The values for A and M are given as,
for i = 1, 2,..., n. Equations (2) and (3) are preferred over the original intensities because the (A, M)scatterplot may reveal artifacts that are not clearly visible in the ordinary intensity scatterplot. Such a transformation represents a scaled, 45° rotation of the (I^{g}, I^{r})coordinate system [16].
The smoothing procedure has been designed to accommodate measured scatterplot data obeying the form M_{j }= g(A_{j}) + ε_{j}, where the jth transformed ratio M_{j }is a function of the corresponding overall intensity A_{j }and a zero mean random variable ε_{j}. The smoothed point at A_{j }using LOWESS with a degree d polynomial is (A_{j}, ), where is the fitted value of the regression. The LOWESS estimate, , is a weighted linear combination of the M_{i}
where the h_{i}(A_{j}) depend on A_{i}, ∀i, but not on the M_{i}. The LOWESS algorithm contains four dataspecific parameters, namely the polynomial order d, the number of LOWESS algorithmic iterations t, the weight function w(·), and the fraction of the data points used in the local regression f. Consequently, these parameters all affect the values of the weights h_{i}(A_{j}) in Eq. (4). For a complete outline of the LOWESS algorithm, consult [8,14,27]. In practice, the polynomial order for DNA microarray data is usually selected as being either d = 0, 1, or 2, depending on the choice of (I^{g}, I^{r}) or (A, M)coordinate systems, the tricube weight function is quite standardized for all types of data [8], and the number of iterations is usually fixed at t = 3. The final parameter must be chosen where f ∈ (0, 1] and it is often times assigned an arbitrary value without any justification. However, since the choice of f ultimately determines the magnitude of calibration, it is essential to put heavy emphasis on choosing this parameter carefully. In the literature, many microarray studies neglect such concerns and arbitrarily select f for different experimental data sets [12,13,16]. Formal consideration of the parameter f is typically glossed over by simply stating that the larger the f value, the smoother the fit. Although this is a true statement, the consequences are deeper than the statement leads on. Different types of data may require smoother fits but DNA microarray data takes all shapes. Also, what defines a smoother fit is also highly subject to interpretation depending on the actual data.
The optimized approach
For a microarray experiment, there are a total of ℓ printtips used on a single slide. In order to reliably determine the value of f for each printtip group, we introduce an optimization approach based on the actual microarray data for each printtip group. We slightly modify our notation to include printtip indices as a subscript k for each transformed ratio. The goal is to select the appropriate values of f_{k }that minimizes the mean squared difference between the LOWESS fit of the ith transformed ratio in the kth printtip group, , and the corresponding normalization reference level, ψ_{k,i}(·). The value of each ψ_{k,i}(·) is a function of experimentspecific parameters such as temperature or other environment settings which may differ from sample to sample in a single experiment. Accordingly, the cost function to be minimized for the kth printtip group across all transformed ratios is
with the constraint that f_{k }∈ (0, 1]. Here, the value n_{k }is the total number of ratios for the kth printtip group. Correspondingly, for a total of ℓ printtip groups, we have . For certain experiments, like self versus self hybridizations, the true expression value is known a priori. If ψ_{k,i}(·) is unknown, reliable estimates that reflect experimentspecific assumptions may be used. Usually there are tens of thousands of genes in a microarray study and a plausible assumption is that the mean of the log_{2}transformed ratios after normalization is zero. Also, in a variety of experiments, platformdependent control transcripts that are known to have certain expression at a constant level may be utilized in the optimized approach. Furthermore, in our breast cancer case study we show how to obtain statistically reliable estimates of ψ_{k,i}(·) from replicate slides. We also show how our approach may be used if replicates are not available for typical microarray studies. Ultimately, the optimized approach requires experimenters to explicitly state their assumptions behind the study, which is systematically better than arbitrarily choosing parameter values. In addition, determining an experimentspecific f_{k }by trial and error may be time consuming and will oftentimes lead to nonoptimal results. The chosen optimization algorithm for minimizing the corresponding cost function is based on a combination of goldensection search and successive parabolic interpolation as outlined by Forsythe et al. [28]. This approach finds the best f_{k }for minimizing δ_{k}(f_{k}) for each printtip, k = 1,..., ℓ within a tolerance of ±0.01. Each printtip, resultingly, may have a different, optimal bandwidth parameter.
Normalization step
After the estimates have been obtained, calibrating the intensities for all the A_{k,i }is given as
for i = 1,..., n_{k}, and k = 1,..., ℓ. For the local LOWESS normalization within each printtip group, the issue of how the total intensities are spread about the sample mean for the group becomes a factor to consider when normalizing the data [16]. After normalization, all the log_{2}ratios from the different printtip groups are usually centered around zero. Some printtips may have larger variances compared to others and an appropriate scale adjustment is needed to account for such differences. One proposed approach is to find the maximum likelihood estimate for the scale of the variance for each printtip group [16]. This method assumes that all log_{2}ratios from the kth printtip group follow a normal distribution with mean zero and variance σ^{2}, where σ^{2 }is the variance of the true log_{2}ratios and is the estimated scale factor for the kth printtip group. However, this is only valid for certain types of data that reasonably follow a normal distribution and in our work we observe that this assumption may often times lead to undesirable results. Refer to [16] for further details.
Another approach proposed here that is able to deal with the variance scaling issue is to introduce a weighting factor in the calibration function that is of the form
for i = 1,..., n_{k}, k = 1,..., ℓ, and where the weight is given as . The biascorrected sample variance for the kth printtip is denoted by and is given as
where denotes the sample mean for printtip k. Furthermore, the minimum sample variance is given as
Compared to the maximum likelihood method outlined by [16], this method stresses higher weighting on printtip groups that exhibit less variance and lower weighting for highly variant printtips. If such a weight is not introduced, the normalization may improperly calibrate highly variant printtip groups that have extreme sample means and many genes may erroneously be considered as differentially expressed as a consequence. Other treatments, such as the one suggested by Quackenbush [12] examine the geometric mean of the tip variances as a scale factor for the normalization estimate. However, such a treatment may not always scale the tips properly since some tips may still be overly compensated. Our proposed scaling factor λ_{k }takes values over (0, 1] while other scaling methods may have larger upper limits. By calibrating data using Eq. (9), we have obtained nearly identical sample means, but less total variance for the resulting data compared to previously published techniques. The computation of λ_{k }is straightforward and easy to calculate but our novel variance stabilization procedure does not take into account any heteroscedasticity in the data, namely observed increasing ratio variance with decreasing measurement intensity A. A rigorous comparison of printtip scaling is beyond the scope of this contribution, but it is noted that the different scaling procedures affect the overall calibration scheme.
Case studies
To demonstrate the utility of our optimized LOWESS normalization procedure, we first utilized a set of three self versus self experiments [21], BT474, MCF7, and HBL100, which were obtained using the protocols delineated in the methods section. In addition, we calibrated a set of four breast cancer cell lines [22], BT474, MCF7, MDAMB436, and MDAMB361, each measured in comparison to the reference cell line HBL100, which were obtained using the protocols outlined by Järvinen et al. [20]. For each cancer cell line, three replicate slide hybridizations were available. In order to reduce the effects of spots whose intensities are not reliable due to experimental or printing errors, we used two separate quality filtering methods and normalized the intensities after discarding values that were detected unreliable. The assessment of ratio quality was performed using the method proposed by Chen et al. [29] and the evaluation of spot quality was performed using the method of Hautaniemi et al. [30]. Optimized parameter selection for f_{k }was performed and printtip LOWESS normalization results are compared to the results using arbitrary choices of the parameter f_{k}. The implementation took a few minutes to run on a standard desktop PC running MATLAB.
Self versus self experiments
Self versus self experiments provide a trivial application to test our method since the amount of mRNA in both the test and the reference samples is the same. Thus, the points of an intensity scatterplot in the log_{2 } log_{2 }space should be distributed along a straight line that intersects zero with a slope of unity. In the (A, M)coordinate system, all values of M should lie on a straight line at M = 0 for all values of A; this means that the calibrated ratios should ideally be unity for all variables. Correspondingly, the cost measure is given when ψ_{k,i}(·) = 0, (∀k, i), in Eq. (5) for the (A, M)coordinate systems. Separate trials were conducted using weighted, zerothorder (d = 0), firstorder (d = 1), and quadratic (d = 2) polynomial fits. For all trials, the number of printtip LOWESS iterations was fixed at t = 3. The weight function used is given by Cleveland [8]. For each experiment, the local printtip groups were separately normalized with their respective, optimized values of f_{k}. As a comparison to arbitrary selections of f_{k}, the printtip normalization was also carried out using f_{k }= 0.2, 0.4, 0.6, and 0.8 in separate trials. Figure 1 shows the (M^{(Arb)}, M^{(Opt)})scatterplot comparison between the calibration results with d = 1 using optimal f_{k }and arbitrary f_{k }for the BT474 self versus self experiment. The points that deviate from the blue line are the genes that are most affected by the choice of f_{k}. The M^{(Arb) }data in this figure was calibrated using f_{k }= 0.4, ∀k.
Figure 1. (M^{(Arb)}, M^{(Opt)})Scatterplot analysis of BT474 self versus self data This plot compares the calibrated ratios obtained by LOWESS (d = 1) with the arbitrary choice of f_{k }= 0.4 for all printtips compared to ratios obtained with optimized f_{k }for each printtip group. The line of unity slope that passes through the origin shows where all the points should lay if both calibration methods produced identical ratios. For this self versus self experiment, the group of points that lay under this line shows that the arbitrary f_{k }may improperly undernormalize these points.
In all three self versus self experiments, the global sample means of M were nearly the same after calibration, regardless of the choice of f_{k}. However, the calibrations that used optimized selections of f_{k }for each printtip resulted in data that contained less overall variance compared to the arbitrary selections. The ultimate goal of calibration is to adjust the dynamic range for the transformed ratios and reduce the variability within the data. By using optimized selection of f_{k}, we outperform all arbitrary formulations to achieve these goals.
Typical microarray experiments
One immediate concern for typical experimental microarray data is that many genes may be over or underexpressed and the true, transformed gene expression ratio ψ_{k,i}(·) surely will not be equal to zero for all genes. Accordingly, implementing the cost function in Eq. (5) becomes an immediate challenge since the normalization reference level of all the genes for a typical microarray experiment may be diffcult to determine with complete accuracy. We note that our cost function still may be used with the assumption that the sample mean for each print tip before log_{2}transformation is unity. In most microarray experiments, many genes may be assumed to have constant RNA concentrations while smaller numbers of genes may be over or under expressed, namely their sample mean over all the genes is zero, . Using this assumption in Eq. (5), our experiments show that by minimizing the cost function in this context, like in the self versus self case study, we are able to systematically choose f_{k }and the only consequence is that the minimum of the cost will not be as low as in the self versus self scenario where all genes should be constantly expressed. The main benefit of utilizing LOWESS for microarray normalization is that it is robust to extreme outliers and the cost function implemented in this fashion further restricts the effects of such extreme points in the regression. Ultimately, this implementation results in reliably calibrated ratios compared to the arbitrary formulation where different choices of f_{k }affect the resulting data.
Since a single microarray experiment represents an observation, multiple observations would be needed to compute a reliable estimate of the true transformed ratio values. The use of only a small number of replicate slides may be satisfactorily used to determine reliable estimates of true gene expression and one study showed that three replicates suffce for significantly reducing experimental variability [31]. With the growing number of publicly available microarray data, conducting replicate experiments is becoming a popular solution to assess experimental errors and reduce noise bias in the measurements [32]. The advantages of replicate slides also greatly help the analysis of betweenslide variability and help address formal statistical considerations when drawing biological conclusions. Here, we show that the optimized normalization approach may be directly extended in an iterative manner to use the estimates of the true ratio values for further specifying f_{k}. After an initial round of optimized LOWESS normalization for each replicate slide with ψ_{k,i}(·) = 0 in Eq. (5), the sample mean for each gene may then be calculated using the replicates. The normalization reference levels ψ_{k,i}(·) were reassigned these average gene expression values in Eq. (5). Each experiment was then separately calibrated a second and final time using the optimization approach and the final results were noticeably different compared to the normalized data using f = 0.2 that Järvinen et al. posted on their website [20]. A noteworthy consideration to address here is the overall effect of an iterative calibration process on the underlying structure of the data. Experimentally, once the optimized LOWESS regression is computed using the average value for each gene and normalization is performed, subsequent calibration attempts using the cost functionbased method do not result in drastically different data. The subsequent regressions are nearly constant lines near M = 0 in the (A, M)scatterplot if the cost function approach is used. Consequently, the calibrated data reach a stable domain with a small dynamic range. Empirically, we found that performing optimized normalization in an iterative manner will not propagate regression effects through to disrupt the underlying structure of the data.
Figure 2 shows the scatterplot comparison between the calibration results using optimal and arbitrary selections of f_{k }for the first replicate BT474 hybridization. Some genes in this plot report 4fold differences and ultimately these differences affect data analysis. Consequently, the errant choice of this parameter f_{k }may have deleterious effects on different biological studies. To illustrate the differences for one representative printtip in this breast cancer study for the first replicate of the BT474 cell line, Figure 3 plots the regressions obtained by both methods. All the data points for this hybridization are shown as a twodimensional histogram [33], while the spots given by printtip k = 16 are highlighted in black. In this plot, we show that the regression obtained by the optimized choice of f_{16 }differs from the one obtained by arbitrarily selecting f_{16 }= 0.2 and the calibration results are thus affected. Figure 4 reports arbitrary calibration results and Figure 5 shows optimized results. The data in Figure 5 has less overall variance when calibrated with the optimized choices of f_{k}.
Figure 2. (M^{(Arb)}, M^{(Opt)})Scatterplot analysis of BT474_01 data This plot compares the calibrated ratios obtained by LOWESS (d = 1) with arbitrary (f_{k }= 0.2) and optimized bandwidth windows for the first replicate hybridization of the BT474 breast cancer cell line. Again, the line of unity slope shows where all the points should lay if both calibration methods produced identically calibrated ratios. Many points deviate from the similarity line in this example and such results are commonly observed for the microarray data used in this study. Consequently, it is clear that the choice of f_{k }greatly affects how the data is calibrated. Points that are furthest away from the similarity line are highly influenced by the choice of f_{k }in LOWESS calibration.
Figure 3. Printtip LOWESS comparisons for BT474_01 data This (A, M)scatterplot shows a twodimensional histogram [33] or all the spots for the first replicate BT474 breast cancer hybridization, where the bright red color indicates a high concentration of spots. Printtip k = 16 is highlighted by black dots. The LOWESS estimates obtained by using f_{16 }= 0.2 are shown by the dark blue line and the estimates using optimal f_{16 }is shown here in light blue. This result is typical for the printtips in this study based on minimizing the cost function given in Eq. (5).
Figure 4. Arbitrary calibration results for BT474_01 data All spots are shown using a twodimensional scatterplot with the spots from printtip k = 16 are highlighted here in black. LOWESS calibration has been performed using the choice of f_{k }= 0.2 for all printtips.
Figure 5. Optimized calibration results for BT474_01 data This scatterplot shows LOWESS calibration after optimized choices of f_{k }have been obtained for all printtips. Compared to the results in Figure 4, the normalized data here has less overall variance. In addition, genes that have been verified experimentally conform in better agreement with the wellknown biology.
As further illustration of the calibration differences between the optimized and arbitrary calibration results, we employ a goodnessoffit test [34]. We wish to make a direct test of the data, independent of any underlying parent distribution of the ratios, and we use the following statistic for the kth printtip group
where M^{(Arb) }and M^{(Opt) }are the arbitrary and optimized calibration results, and the denominator within the summation is simply the variance of the difference between M^{(Arb) }and M^{(Opt)}. The null hypothesis is defined to be H_{0}: the normalized ratios using arbitrary f are comparable to ones using optimized f. We tested against p < 0.05 for the distribution and reported the alternative hypothesis for a few printtip groups on almost all the slides. In this analysis, we compared optimized choices of f for each printtip to the arbitrary choices f = 0.2, 0.4, 0.6, and 0.8. By looking across each replicate of the calibrated data for all four breast cancer cell lines, almost all slides in this study reported at least one printtip to have statistically different calibration results based on the choice of f_{k}. Often times a single slide would report two or three printtip groups that had statistically different calibration results.
In addition to statistical analysis, genes that exhibit known overexpression in the BT474 cell line data [35] were selected here for more detailed analysis. In particular, genes that were verified experimentally using reverse transcriptionpolymerase chain reaction (RTPCR) were of the highest interest. Comparing our optimized calibration results utilizing the replicate data to the normalized data by Järvinen et al. [20], our results conform strongly with most of the overexpressed genes given in a list from a parallel study [35]. Two genes in particular stand out to demonstrate the benefits of utilizing our proposed method: homeo box B7, which was validated with RTPCR [35], and verbb2, which is known to be overexpressed in the BT474 cell line [35]. The results posted by Järvinen et al. [20] for calibrating the homeo box B7 gene shows that it falls within the top 18% of overall gene expression, but by using the optimized approach we report it to be within the top 13%. For the verbb2 gene, both calibration techniques report that this gene falls within the top 1% of the genes in terms of expression. As a result, for the homeo box B7 gene, the calibration factor f_{k }is responsible for about 5% change in the reported gene expression. This is a dramatic result that may influence how the expression for this gene may be interpreted in comparison to the accepted biological knowledge of a certain experiment. As public data from microarray experiments continues to become available, the knowledge of certain genes will undoubtedly be uncovered for wellstudied cell lines and this information will help further assess normalization and microarray quality control tasks.
Conclusions
The LOWESS method has recently been applied in other applications for the biological sciences. Comparative genomic hybridization (CGH) is a molecular cytogenetic method of screening a tumor for genetic changes. The alterations are classified as DNA gains and losses and they reveal a characteristic pattern that includes mutations at chromosomal and subchromosomal levels. Our proposed optimized scheme is directly applicable to the application of calibrating CGH microarray experiments, as well as for data analysis aspects. For example, the work of Clark et al. [36] utilized the LOWESS method for identifying the regions where gene copy numbers were aberrantly high or low in prostate cancer using CGH microarray technology. The parameter f was chosen arbitrarily and its value was not reported in the study. Consequently, reproduction and verification of these results may be diffcult. For instance, some of the important biological findings, such as start and end points of amplifications and deletions, may be adversely affected by different choices of f.
In addition to CGH analysis, LOWESS has found application in casecontrol studies where logistic regression has been used to model the relationship between binary responses and continuous predictor variables [37]. In these types of studies one may use LOWESS to remove systematic trends that contaminate the laboratory measurements of predictor variables. The analysis reported by Borkowf et al. [37] clearly shows that different choices of f result in noticeably different correction effects and the optimization method proposed here may be suitable for enhancing such a study. Adaptations to the cost function may be utilized to handle this type of data. In addition, analysis of other types of scatterplot data by utilizing the LOWESS method with an arbitrary choice for the bandwidth parameter is undoubtedly susceptible to varied interpretations or errant conclusions [38,39].
Another result of this optimized calibration study is that we uncovered a better understanding of choosing the parameter d in the weighted polynomial fit. A higherorder (d > 2), weighted polynomial is rarely needed based on the argument that such an assumption is, to a certain extent, overfitting the data. From the findings of our study, we find that it is better to use a linear estimate based on minimizing the estimate errors across (A, M)scatterplots. Consequently, different choices of d resulted in different optimized values for f. The reason is that for the higherorder polynomial, it is beneficial in general to retain a larger fraction of the values of A for the weight function in computing the polynomial coeffcients. It is very important to carefully select f since ultimately, the bandwidth is a function of the polynomial order.
Here, we also reaffirmed the idea that the quality filtering of ratios and spots is a necessary step that should precede all experimental microarray data handling procedures, whether it is scatterplotbased normalization or any other normalization method, since errant ratios would surely have a deleterious affect on the calibration. For instance, in the BT474 data, the first replicate slide had poor ratio quality for a handful of genes. Calibration without considering or removing these errant spots resulted in less reliable results. This study addresses the issue of locating sources of experimental error for printtips that have high sensitivity for the parameter f . For one, printtips are physically different and they are considered to have different types of errors introduced based on these properties. In the formulation of normalization, it is imperative to address such subtle issues when choosing and implementing any algorithm.
The systematic choice of the parameters in the LOWESS algorithm has not been previously addressed in the microarray literature and the method proposed here may be utilized in different microarray platforms. Such a treatment is also important for a wide variety of applications that employ scatterplotbased regression. The findings of this study illustrate that by choosing different values of f for the LOWESS algorithm results in noticeably different normalization results. This proposed method requires the calibration step to clearly state the assumptions used for withinslide normalization. Our optimization algorithm is more systematic than simply choosing an arbitrary parameter value or through trial and error techniques since the optimized approach relies on the actual underlying structure of the data. We also stress that such an optimization algorithm may also be utilized for other studies in addition to DNA microarray normalization treatments. Proper changes need to be made to Eq. (5) to reflect the ideal model for the data captured in the function ψ_{k,i}(·), but in some studies, such a function may be satisfactorily determined or estimated from the data.
Methods
Data resources
For the self versus self hybridizations, custom cDNA microarray experiments proceed as follows. Altogether, three microarray hybridizations were performed using custom printed cDNA microarray slides from the same print batch. Labelling, hybridization and washing were done as described previously by Monni et al. [40] and Järvinen et al. [20]. Briefly, total RNA was extracted from cell lines BT474, HBL100, and MCF7 and labelled with Cy3dUTP and Cy5dUTP (Amersham Biosciences, Piscataway, NJ). The custom printed cDNA microarrays comprised of 11,520 clones from Incyte Genomics IRAL cDNA library and 1,136 clones from Research Genetics library. Microarrays were printed on polyllysine coated slides using an Omnigrid arrayer (GeneMachines) as described previously [20]. Microarrays were scanned with an Agilent laser confocal scanner (Agilent Technologies, Palo Alto, CA) and gridded using the DEARRAY software developed by Chen et al. [29]. For the four breast cancer cell lines, custom cDNA microarray experiments were provided in a separate contribution by Järvinen et al. [20] and detailed protocols are described in that work. The relevant genes in our study were verified using RTPCR in a parallel study by Hyman et al. [35].
Data quality filtering
All microarray experiments contained in this work were conducted and spotted using groups of ℓ = 32 printtips, with each tip being responsible for either 384 or 420 spots in their respective subarray. In order to reduce the effects of spots whose intensities are not reliable due to experimental or printing errors, we used two separate quality filtering methods and normalized the intensities after discarding values that were detected unreliable. The assessment of ratio quality was performed using the method proposed by Chen et al. [29] and ratios that had a quality value below the threshold 0.5 were discarded from our analysis. This quality cutoff value has, in the past, been shown to represent less reliable cDNA microarray measurements due to either low signal intensity, high local background level, uneven distribution of the target intensity, and/or small target size. The evaluation of spot quality was performed using the method of Hautaniemi et al. [30]. In this Bayesian networksbased method, we utilized the following features in determining spot quality. Bleeding, spot roundness, and spot intensity were assessed for the Cy5 channel and bleeding, spot size, spot roundness, background intensity, and fitting error were evaluated for the Cy3 channel. These features were chosen since this set was found to result in the best classification accuracies [30]. The trained Bayesian network was applied to each slide in this study and all the spots having a quality value of zero were excluded from the subsequent analysis.
Authors' contributions
JAB developed the mathematical formulation of the problem, implemented the optimized normalization algorithm in MATLAB, developed the statistical analysis, and wrote the manuscript. SH developed the LOWESS normalization software in MATLAB, coordinated spot quality filtering, and assisted in drafting the manuscript. AKJ conducted the self versus self microarray experiments and performed ratio quality filtering for data analysis. HE assisted in data preparation and in drafting the manuscript. SKM participated in the design and coordination of the study and assisted in drafting the manuscript. JA reviewed the statistical analysis and participated in the design and coordination of the study. All authors read and approved the final manuscript.
Acknowledgements
The authors thank the anonymous reviewers for their comments and contributions. This work was supported in part by a University of California MICRO grant with matching support from Philips Research Laboratories and in part by Microsoft Corporation and the Academy of Finland.
References

Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray.
Science 1995, 270:467470. PubMed Abstract

Goryachev AB, MacGregor PF, Edwards AM: Unfolding of microarray data.
Journal of Computational Biology 2001, 8:443461. PubMed Abstract  Publisher Full Text

Ideker T, Thorsson V, Siegel AF, Hood LE: Testing for differentiallyexpressed genes by maximumlikelihood analysis of microarray data.
Journal of Computational Biology 2000, 7:805817. PubMed Abstract  Publisher Full Text

Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data.
Journal of Computational Biology 2000, 7:819837. PubMed Abstract  Publisher Full Text

Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH: Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects.
Nucleic Acids Research 2001, 29:25492557. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang X, Ghosh S, Guo SW: Quantitative quality control in microarray image processing and data acquisition.
Nucleic Acids Research 2001, 19:e75. Publisher Full Text

Dobbin K, Shih JH, Simon R: Statisical design of reverse dye microarrays.
Bioinformatics 2003, 19:803810. PubMed Abstract  Publisher Full Text

Cleveland WS: Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association 1979, 74:829836.

Ihaka R, Gentleman R: R: a language for data analysis and graphics.
Journal of Computational and Graphical Statistics 1996, 5:299314.

Engelen K, Coessens B, Marchal K, Moor BD: MARAN: normalizing microarray data.
Bioinformatics 2003, 19:893894. PubMed Abstract  Publisher Full Text

Venet D: MatArray: a Matlab toolbox for microarray data.
Bioinformatics 2003, 19:659660. PubMed Abstract  Publisher Full Text

Quackenbush J: Microarray data normalization and transformation.
Nature Genetics 2002, Suppl 32:496501. Publisher Full Text

Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation.
Nucleic Acids Research 2002, 30:e15. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cleveland WS, Devlin SJ: Locally weighted regression: an approach to regression analysis by local fitting.
Journal of the American Statistical Association 1988, 83:596610.

Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments.

Yang YH, Dudoit S, Luu P, Speed TP: Normalization for cDNA microarray data. In In Microarrays: optical technologies and informatics. Volume 4266. Edited by Bittner M, Chen Y, Dorsel A, Dougherty ER. San Jose, CA, USA: SPIE; 2001::141152.

Bolstad BM, Irizarry RA, Åstrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias.
Bioinformatics 2003, 19:185193. PubMed Abstract  Publisher Full Text

Edwards D: Nonlinear normalization and background correction in onechannel cDNA microarray studies.
Bioinformatics 2003, 19:825833. PubMed Abstract  Publisher Full Text

Wilson DL, Buckley MJ, Helliwell CA, Wilson IW: New normalization methods for cDNA microarray data.
Bioinformatics 2003, 19:13251332. PubMed Abstract  Publisher Full Text

Järvinen AK, Hautaniemi S, Edgren H, Auvinen P, Saarela J, Kallioniemi OP, Monni O: Are data from different gene expression microarray platforms comparable?
Genomics 2004, 84:11641168. Publisher Full Text

Supplementary Webpage (Self vs. Self data) [http://www.ece.ucsb.edu/pubs/bmc/] webcite

Supplementary Webpage (Breast Cancer data) [http://sigwww.cs.tut.fi/TICSP/Jarvinen_et_al_2003/] webcite

Dobbin K, Shih JH, Simon R: Questions and answers on design of duallabel microarrays for identifying differentially expressed genes.
J Nat Cancer Inst 2003, 95:13621369. PubMed Abstract  Publisher Full Text

Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM: Expression profiling using cDNA microarrays.
Nature Genetics 1999, 21:1014. PubMed Abstract  Publisher Full Text

Finkelstein D, Ewing R, Gollub J, Sterky F, Cherry JM, Somerville S: Microarray data quality analysis: lessons from the AFGC project.
Plant Molecular Biology 2002, 48:119131. PubMed Abstract  Publisher Full Text

Holloway AJ, van Laar RK, Tothill RW, Bowtell DDL: Options available – from start to finish – for obtaining data from DNA microarrays II.
Nature Genetics 2002, 32:481489. PubMed Abstract  Publisher Full Text

Fan J, Gijbels I: Local Polynomial Modelling and its Applications. London: Chapman and Hall; 1996.

Forsythe GE, Malcolm MA, Moler CB: Computer Methods for Mathematical Computations. Englewood Cliffs, NJ: PrenticeHall, Inc; 1977.

Chen Y, Kamat V, Dougherty ER, Bittner ML, Meltzer PS, Trent JM: Ratio statistics of gene expression levels and applications to microarray data analysis.
Bioinformatics 2002, 18:12071215. PubMed Abstract  Publisher Full Text

Hautaniemi S, Edgren H, Vesanen P, Wolf M, Järvinen AK, YliHarja O, Astola J, Kallioniemi O, Monni O: A Novel Strategy for Microarray Quality Control Using Bayesian Networks.
Bioinformatics 2003, 19:20312038. PubMed Abstract  Publisher Full Text

Lee M, Kuo F, Whitmore G, Sklar J: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations.
Proc Natl Acad Sci USA 2000, 97:98349839. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yang YH, Speed TP: Design issues for cDNA microarray experiments.
Nature Reviews Genetics 2002, 3:579588. PubMed Abstract  Publisher Full Text

Eilers PHC, Goeman JJ: Enhancing scatterplots with smoothed densities.
Bioinformatics 2004, 20:623628. PubMed Abstract  Publisher Full Text

Bevington PR, Robinson DK: Data Reduction and Error Analysis for the Physical Sciences. 2nd edition. Boston, MA: WCB/McGrawHill; 1992.

Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringnér M, Sauter G, Monni O, Elkahloun A, Kallioniemi OP, Kallioniemi A: Impact of DNA amplification on gene expression patterns in breast cancer.
Cancer Research 2002, 62:62406245. PubMed Abstract  Publisher Full Text

Clark J, Edwards S, Feber A, Flohr P, John M, Giddings I, Crossland S, Stratton MR, Wooster R, Campbell C, Cooper CS: Genomewide screening for complete genetic loss in prostate cancer by comparative hybridization onto cDNA microarrays.
Oncogene 2003, 22:12471252. PubMed Abstract  Publisher Full Text

Borkowf CB, Albert PS, Abnet CC: Using lowess to remove systematic trends over time in predictor variables prior to logistic regression with quantile categories.
Statistics in Medicine 2003, 22:14771493. PubMed Abstract  Publisher Full Text

Mazerolle M: Detrimental effects of peat mining on amphibian abundance and species richness in bogs.
Biological Conservation 2003, 113:215223. Publisher Full Text

Hen I, Sakov A, Kafkafi N, Golani I, Benjamini Y: The dynamics of spatial behavior: how can robust smoothing techniques help?
Journal of Neuroscience Methods 2004, 133:161172. PubMed Abstract  Publisher Full Text

Monni O, Bärlund M, Mousses S, Kononen J, Sauter G, Heiskanen M, Paavola P, Avela K, Chen Y, Bittner M, Kallioniemi A: Comprehensive copy number and gene expression profiling of the 17q23 amplicon in human breast cancer.
Proc Natl Acad Sci 2001, 98:57115716. PubMed Abstract  Publisher Full Text  PubMed Central Full Text