Normalization is a basic step in microarray data analysis. A proper normalization procedure ensures that the intensity ratios provide meaningful measures of relative expression values.
We propose a robust semiparametric method in a two-way semi-linear model (TW-SLM) for normalization of cDNA microarray data. This method does not make the usual assumptions underlying some of the existing methods. For example, it does not assume that: (i) the percentage of differentially expressed genes is small; or (ii) the numbers of up- and down-regulated genes are about the same, as required in the LOWESS normalization method. We conduct simulation studies to evaluate the proposed method and use a real data set from a specially designed microarray experiment to compare the performance of the proposed method with that of the LOWESS normalization approach.
The simulation results show that the proposed method performs better than the LOWESS normalization method in terms of mean square errors for estimated gene effects. The results of analysis of the real data set also show that the proposed method yields more consistent results between the direct and the indirect comparisons and also can detect more differentially expressed genes than the LOWESS method.
Our simulation studies and the real data example indicate that the proposed robust TW-SLM method works at least as well as the LOWESS method and works better when the underlying assumptions for the LOWESS method are not satisfied. Therefore, it is a powerful alternative to the existing normalization methods.
Microarray technology has become a useful tool for quantitatively monitoring gene expression patterns and has been widely used in functional genomics [1,2]. In a cDNA microarray experiment, cDNA segments representing a collection of transcripts and Expressed Sequence Tags (ESTs) are amplified by PCR and spotted in high density on glass microscope slides using a robotic system to produce cDNA microarrays. Each microarray contains thousands of such PCR products, named cDNA probes, which serve as reporters for the expression of the respective transcripts that represent the collection of genes or ESTs. The cDNA microarrays are queried in a co-hybridization assay using two fluorescently labeled biosamples derived from RNA obtained from the cell populations of interest. One sample is labeled with fluorescent dye Cy5 (red), and another with fluorescent dye Cy3 (green). Hybridization is assayed using a confocal laser scanner to measure fluorescence intensities, allowing simultaneous determination of the relative expression levels of all the genes represented on the slide .
A basic question in analyzing cDNA microarray data is normalization, the purpose of which is to remove systematic bias in the observed expression values by establishing a normalization curve across the whole dynamic range. A proper normalization method ensures that the normalized intensity ratios provide meaningful measures of relative expression levels. Normalization is needed because many factors, including different efficiency of dye incorporation, difference in the amount of RNA labeled between the two channels, uneven hybridizations, difference in the printing pin heads, among others, may cause bias in the observed expression values. Therefore, proper normalization is a critical component in the analysis of microarray data and can have important impact on higher level analysis such as detection of differentially expression genes, classification, and cluster analysis.
Many normalization methods have been proposed in the literature. The earliest normalization method for cDNA microarray data goes back to Chen et al.  who proposed a ratio-based method. Yang et al.  summarized several normalization methods for cDNA microarray data such as global normalization, dye-swap normalization, block-wise normalization, and scale normalization. They also proposed a locally weighted scatter plot smoothing (LOWESS ) method for intensity dependent normalization. Quackenbush  and Bilban et al.  provided good reviews on normalization methods for cDNA microarray data. Tseng et al.  proposed using a rank based procedure to first select a set of invariant genes that are likely to be constantly expressed and then carrying out LOWESS normalization using this set of genes. But as pointed out by Tseng et al., selected invariant genes may not cover the whole dynamic range of the expression values, and extrapolation is needed to fill in the gaps that are not covered by the invariant genes. Kepler et al.  also first estimated a set of "constantly expressed genes" and then used the LOWESS method. Wang et al.  proposed an iterative normalization method for cDNA microarray data by estimating a normalization coefficient and identifying control genes. Workman et al.  used array signal distribution analysis for a robust non-linear method of normalization. Park et al.  compared a number of normalization methods, including global, linear and LOWESS normalization methods. Wolfinger et al.  used a mixed model for normalization. They proposed a normalization model for normalization and a gene model for inference and these two models are related by the residual terms in the normalization model. A constant normalization factor assumption is needed in this method. Fan et al.  considered a Semi-linear-In-slide Model (SLIM) method using within-array replications. The SLIM method requires replication of a subset of the genes in an array. If the number of replicated genes is small, the expression values of the replicated genes may not cover the entire dynamic range or reflect spatial variation in an array. Fan et al.  generalized the SLIM method to account for across-array information, resulting in an aggregated SLIM, so that replication within an array is no longer required. Huang et al.  proposed a two-way semi-linear model (TW-SLM) for normalization of cDNA microarray data. They used the least squares method for estimating the normalization curves based on B-splines. This method does not require the assumptions required by the LOWESS normalization method, i.e. (i) a small fraction of genes are differentially expressed or (ii) there is symmetry in the expression levels of up- and down-regulated genes.
It is well known that the least squares method is not resistant to outliers which arise often in cDNA microarray experiments because of many sources of variations. In this paper, we propose a robust method for normalization in the framework of the TW-SLM. We conduct simulation studies and use a real cDNA microarray data set to compare the proposed method with the LOWESS normalization method.
Simulation was conducted to compare the mean square errors (MSE) and biases of estimated gene expression levels between the proposed robust TW-SLM and LOWESS normalization methods, between the proposed method and the TW-SLM using OLS. The MSE for the jth gene is calculated as the following:
that is, , where N is the total number of replicates for each simulation, J is the number of unique genes, βj is the true gene expression level (base two log scale) for gene j, is the estimated value for βj, is the mean of for N replicates, j = 1, 2,..., J, where J is the total number of genes. The data simulation procedure is based on the method proposed by Balagurunathan et al. . In each simulation, we generated 10 slides with twelve blocks in each, and 500 genes in each block. We repeated 100 times for each simulation. The simulation procedure can be summarized in the following steps:
1. Simulate true signal intensity for each gene j using the exponential distribution with the mean of 3,000, i.e. Ij ~ exp(λ = 1/3000), for j = 1,..., J;
2. Simulate fluorescent intensity for the Cy5 channel and the Cy3 channel with the normal distribution, respectively. Suppose the coefficients of variation for intensity in the Cy5 channel and the Cy3 channel are αrj and αgj, respectively, then the fluorescent intensity on the two channels can be generated by the normal distribution with mean Ij and standard deviations αrjIj and αgjIj for the red channel and the green channel, respectively. Let Rj and Gj represent simulated fluorescent intensity for the Cy5 channel and the Cy3 channel for gene j, respectively;
3. Simulate differentially expressed genes. Suppose γ × 100% genes are differentially expressed in the whole simulated gene set, then the ratio of the expression level for gene j can be generated by tj = 10±b with b ~ Beta(1.7,4.8). The sign ± will determine if the gene is up- or down-regulated. The probability of the up-regulated genes within those γ × 100% differentially expressed genes is given as an input parameter. For the genes that are not differentially expressed, the b takes value zero;
4. Incorporate the tj into signal intensity of gene j. The Rj and Gj will be adjusted by adding the simulated expression ratio tj through the following formulae: for j = 1,..., J;
5. Simulate a fluorescent system with the imperfect response characteristics. Due to various reasons, such as the unequal amount of mRNA for the two channels, different labeling efficiencies, or uneven laser powers at the scanning stage , actual intensity in the two channels are not exactly the same. More over, fluorescent intensity is not necessarily linearly related to the expression levels. Balagurunathan et al. proposed the following functional family,
to distort the response characteristic functions of observed fluorescent intensity for the two channels, which are expressed as and , respectively. So four parameter values need to be determined for each channel before simulation. Different parameter values in the two channels will control the shape of the ratio vs. signal intensity plots (R-I plots);
6. Simulate background noise for each channel. The mean of background noise is determined by one input parameter: the signal to noise ratio (SNR) and the true mean of signal. The SNR is the ratio between the true mean of the signal and the true mean of background noise. The SNR controls variability of background noise. The normal distribution with a given mean value is used in simulating background noise. Variance of background noise will be controlled by the input parameters αbr and αbg for the Cy5 channel and the Cy3 channel, respectively. These two parameters are the ratios between the mean and the standard deviations of background noise for the two channels, respectively. Simulated signal intensity for the two channels, and , are adjusted by subtracting background noise in each channel. Let and still denote background adjusted signal intensity for the two channels;
7. Add noise to the signal intensity for each channel. Finally, the signal intensity of each channel is generated by
with , where α1 ~ U (a1, b1), α2 ~ U (c1, d1), α3 ~ U (a2, b2), α4 ~ U (c2, d2). The a1, b1, c1, d1, a2, b2, c2, d2 are given as input parameters to control variability of fluorescent signal intensity.
We simulated two situations, one is the no-dye bias case and another one is the shape case (dye bias exists). R-I plots of twelve blocks on one slide for two simulated cases are shown in Figures 2 and 3, respectively. We considered five different percentage levels of differentially expressed genes: 1%, 5%, 10%, 20%, and 40%. The ratio of the up-regulated genes to the down-regulated genes takes three values, i.e., 1:1, 3:1, and 9:1 at each percentage level of differentially expressed genes. In addition, based on the suggestion of a reviewer, we simulated an extreme case for the scenario in Figure 3, in which 70% genes are all up-regulated and the remaining ones are not differentially expressed.
Figure 1. Study designs for cDNA microarray experiments among the human placenta, the universal reference, and the Probe Set.
Figure 2. An example of R-l plots for twelve blocks of slide one with no-dye bias for two channels, 10% genes are differentially expressed.
Figure 3. An example of shaped R-l plots for twelve blocks of slide one with the dye bias for two channels, 10% genes are differentially expressed.
The trend of MSEs and biases of estimated gene expression levels are similar between the robust TW-SLM and the LOWESS normalization methods across different levels of the ratios between the up-regulated genes and the down-regulated genes. This trend also exists in the extreme case. We present the results of the following two scenarios: (a) a 9:1 ratio between the number of the up-regulated genes and that of the down-regulated genes and, (b) the extreme case. Tables 1 and 3 present MSEs, Tables 2 and 4 show biases of estimated gene expression levels. MSEs and biases for the extreme case (70% of the genes are up-regulated) are presented in the bottom of Tables 3 and 4, which are displayed in Figures 6 and 7, respectively. The robust TW-SLM method has smaller means of MSEs than the LOWESS normalization method and the TW-SLM using OLS, respectively. Also the ranges of MSEs for the proposed method are also smaller than those using the LOWESS method and the TW-SLM with OLS, respectively.
Table 1. The mean square errors (MSE) of estimated gene expression levels (up:down = 9:1) for simulated cDNA microarray data with the R-I plots similar to Figure 2.
Table 2. Bias of estimated gene expression levels (up:down = 9:1) for simulated cDNA microarray data with the R-I plots similar to Figure 2.
Table 3. The mean square errors (MSE) of estimated gene expression levels (up:down = 9:1) for simulated cDNA microarray data with the R-I plots similar to Figure 3.
Table 4. Bias of estimated gene expression levels (up:down = 9:1) for simulated cDNA microarray data with the R-I plots similar to Figure 3.
Figure 4. Consistency analysis based on cutoff p-value 10-5. Both x and y axes are estimated log intensity ratios. (a)-(b) between the direct design and the indirect design for the robust TW-SLM and the LOWESS normalization method, respectively; (c)-(d) between the LOWESS method and the robust TW-SLM for the direct design and the indirect design, respectively.
Figure 5. Consistency analysis based on cutoff p-value 10-3. Both x and y axes are estimated log intensity ratios. (a)-(b) between the direct design and the indirect design for the robust TW-SLM and the LOWESS normalization method, respectively; (c)-(d) between the LOWESS method and the robust TW-SLM for the direct design and the indirect design, respectively.
Figure 6. A boxplot for comparing mean square errors among normalization methods for the case that 70% genes are all up-regulated (Table 3).
Figure 7. A boxplot for comparing biases among normalization methods for the case that 70% genes are all up-regulated (Table 4).
Comparing the different robust weight functions, means of MSEs are slightly smaller using Tukey's weight function than that using Huber's weight function. These results are observed across different percentage levels of differentially expressed genes. Biases for estimated gene expression levels distributed similarly between the proposed method and the LOWESS normalization method. But the ranges of the biases for the proposed method are smaller than those of the LOWESS normalization method and the TW-SLM using OLS, respectively. These observations are true in both simulated situations.
The extreme case is an example where the proposed method does better than the LOWESS method (Tables 3 and 4, Figures 6 and 7). Estimates using the LOWESS method are downward biased in this case. This is what we would expect because the LOWESS method fits normalization curves through the majority of genes, which are mostly up-regulated here. In contrast, the TW-SLM method does not need the either of the two assumptions needed by the LOWESS method, neither of which is satisfied here.
The distributions of MSEs and biases between the TW-SLM using OLS and the LOWESS method are similar for cases where there is a relatively small percentage of differentially expressed genes. However, the TW-SLM with OLS performs better than the LOWESS when a larger proportion of genes are differentially expressed. It appears that the more deviation from the two assumptions required by the LOWESS, the better the TW-SLM performs. This trend is consistent with findings in our previous work .
In this section, a real data set was analyzed to compare consistency of the LOWESS normalization method and the proposed robust TW-SLM method. A collection of human placenta cDNAs comprising 7,042 clones was identified and used as the probe set for cDNA microarray fabrication in this study .
Three kinds of RNA samples were used which include: (i) a common reference RNA obtained by in vitro transcription from a pool of cDNAs in equal amount comprising the entire probe set (PS); (ii) the "Universal Human Reference RNA" from Stratagene, a pool of RNAs derived from 10 different cell lines; and (iii) human full-term placenta RNA. The original goal of the study was to evaluate the performance of the PS RNA as a reference RNA in comparison with that of Stratagene's universal reference RNA.
In this study, the Universal Human Reference RNA and the human placenta RNA were treated as two experimental samples. The PS RNA was used as the reference against which the two other bio-samples were compared. In the simple direct comparison, gene expression values were obtained through direct hybridizations between the human placenta RNA and the Universal Human Reference RNA. In the indirect comparison using the PS set as the common reference, hybridizations were performed between the human placenta RNA and the PS reference RNA, and between the Universal Human Reference RNA and the PS reference RNA. The design of this experiment is depicted in Figure 1.
After hybridization, slides were scanned with the Axon instruments 4000B scanner. The 633 and 532 lasers are used for excitation of the Cy5 and Cy3 fluorophores, respectively. For each of the three types of hybridizations (i.e., the human placenta vs. the universal reference, the human placenta vs. the PS reference, and the universal reference vs. the PS reference), there are four slides, including two dye-swapped slides. Each clone was printed three times on different blocks on each slide. Background adjusted medians for the Cy5 and Cy3 channels were used as expression levels. We removed negative controls including "Human Cot1", "PolyA" and "Empty" in the analysis.
To evaluate the proposed method, we compare it with the LOWESS method by examining which method produces more consistent results between the direct comparison and the indirect comparison of human placenta and universal human reference RNA samples as described above (see also Figure 1). The rationale is that the results from the direct comparison design and the indirect comparison design should be similar, because the same RNA samples are compared in both designs, albeit the indirect comparison is through a third common reference. Therefore, a better normalization method is the one that yields more consistent results between the direct and indirect comparison experiments.
The data were normalized using the LOWESS normalization method and the robust TW-SLM with Tukey's robust weight function separately. Significance analysis was carried out for the normalized data for each method by comparing gene expression levels in the human placenta tissue relative to the universal reference. One sample t-test was used for the direct comparison and two-sample t-test was used for the indirect comparison. We used 10-5 and 10-3 as cutoff points for p-values to determine if clones are statistical significant or not. Consistency of estimated relative gene expression levels was compared between the direct design and the indirect design for each method. We also compared overlap between the LOWESS normalization method and the robust TW-SLM for each design. The results are presented in Figures 4 and 5.
We used 10-5 as a cutoff point for p-values in Figure 4. Using the robust TW-SLM normalization and the t-tests, there are 2,907 genes with p-value less than 10-5 in the direct comparison and 2,791 in the indirect comparison. There are 1,713 genes common in these two sets of genes with p-value less than 10-5, which account for about 59% (1713/2907) in the direct comparison and about 61% (1713/2791) in the indirect comparison.
In comparison, using the LOWESS normalization and the t-tests, there are 1,447 genes with p-value less than 10-5 in the direct comparison and 1,045 in the indirect comparison. The number of overlapping genes with p-value less than 10-5 is 467, which is around 32% (467/1447) in the direct comparison and about 44% (467/1045) in the indirect comparison. It is clear that the proposed method performs more consistent between the direct comparison and the indirect comparison.
We also examined overlap between the LOWESS and robust TW-SLM methods for the two comparisons. In the direct comparison, about 79% (1141/1447) of the genes found to be significant based on the LOWESS method are also found to be significant based on the robust TW-SLM method. But they only account for about 40% (1141/2907) of the significant genes detected based on the robust TW-SLM method. In the indirect comparison, about 71% (738/1045) of the significant genes based on the LOWESS method are also found to be significant based on the robust TW-SLM method. But they only account for about 26% (738/2791) significant genes detected based on the robust TW-SLM method.
In our analysis, we used background adjusted intensity values. How to adjust background is an important issue in microarray data analysis. To evaluate if background affects our conclusions, we repeated the comparison analysis without adjusting background for the intensity values in both channels, the results are presented in Tables 5 and 6. We see from these tables that the overall results are similar to those using background adjusted intensity values in normalization. This is what we would expect because of low and uniform distributed background noise in all arrays in this example (data description is not shown).
Table 5. Consistency analysis with and without background adjustment under slide-wised and block-wised normalization strategies (the cutoff p-value is 10-5)
Table 6. Consistency analysis with and without background adjustment under slide-wised and block-wised normalization strategies (the cutoff p-value is 10-3)
Therefore, the robust TW-SLM method yields more consistent results between the direct comparison and the indirect comparison with the human placenta and the universal human reference RNA samples. In addition, the robust TW-SLM method detects more significant genes for a given cutoff p-value. This makes sense biologically because most of the 7,042 genes specifically discovered from human placenta are expected to have differential expressions relative to the universal reference RNAs. We would expect that the similar comparison results will be got if we compare the TW-SLM using OLS or Huber's weight function with the LOWESS method because the normalization curves for the TW-SLMs (TW-SLM:OLS, TW-SLM:Huber, TW-SLM:Tukey) are similar, but all these three curves are different from the LOWESS normalization curve (Figure 8).
Figure 8. Slide-wise normalization curves based on different methods for one slide of the human placenta vs. the probe set hybridization in the example.
We have proposed a robust TW-SLM normalization method for cDNA microarray data. It is interesting to compare the proposed normalization method with the existing methods, such as the widely used LOWESS normalization proposed by Yang et al. (2001)  and further discussed by Tseng et al. (2001) . In the LOWESS method, normalization is done separately by first fitting a separate curve for each slide through the R-I plot of log-intensity ratios versus log-intensity products. In comparison, the proposed method uses all the slides in estimating each normalization curve, using the gene effects βj as the thread linking these slides. In addition, in the proposed method, the normalization curves φi and gene effects βj are estimated simultaneously. With this approach, there is no need to assume that the percentage of genes with differential expression levels is small or the expression levels of up- and down-regulated genes are symmetric, or when one of these assumptions is not satisfied, to use dye-swap normalization, which in turn requires the assumption that the two normalization curves are symmetric. (However, we note that dye-swap as a design strategy is useful to balance the experimental conditions and reduce bias due to different dye incorporation efficiencies.) An underlying condition required for the proposed method is independence of arrays, which is satisfied in a typical microarray experiment. Further theoretical conditions for the TW-SLM can be found in the paper by Huang et al. .
We have only considered the proposed robust TW-SLM method for the simple direct comparison design described in the Methods section. We can easily extend the method to more complicated designs. For example, we can adapt the proposed robust method to the TW-SLM that accommodates the design where a gene is printed multiple times. Such a design is helpful for improving the precision and for assessing the quality of an array using the coefficient of variation (Tseng et al. 2001 ). We can also adapt the robust TW-SLM to incorporate control genes with known concentration ratios in estimating the normalization curves. Model (1) can be easily extended to block-wise normalization by treating different blocks as separate arrays and normalization can be carried out as what we did here. Block-wise normalization considers spatial variation within an array. We did block-wise normalization on the data sets in the example and compared the results with that using the LOWESS method (Tables 5 and 6). The proposed method still outperforms the LOWESS method if we use block-wise normalization in this example.
In our simulation studies, the proposed method performs better than the LOWESS normalization method in terms of MSEs of estimated gene effects in the simulation models we considered. Analysis of the probe set reference data set  shows that the proposed method yields more consistent results between the direct and indirect comparisons than the LOWESS normalization method. In addition, the proposed method is more sensitive in detecting differentially expressed genes than the LOWESS method. Therefore, we believe that the proposed robust TW-SLM method is a powerful alternative to the existing normalization methods. We have coded the proposed method in an R package which is available from the corresponding authors.
We first describe the TW-SLM. For simplicity, we focus on the case of comparing two cell populations, in which two cDNA samples from the respective cell populations are competitively hybridized on the same array. Let n be the number of slides, and J be the number of genes in the study. Let Rij represent background corrected signal intensity from the Cy5 channel and Gij the background corrected signal intensity from the Cy3 channel, and let yij = log2(Rij/Gij), xij = (1/2) log2(Rij × Gij), for gene j on slide i. We assume that there is only one spot for each gene on each slide. The TW-SLM  is
yij = φi (xij) + βj + ∈ij, i = 1,..., n, j = 1,..., J (1)
In this model, the observed log intensity ratio is decomposed into three components. The first component is φi which is the intensity dependent normalization curve for slide i, the second component is βj which represents the relative expression value of the jth gene after normalization, the last one is the residual error term. Let be a robust estimator of the ith normalization curve φi based on this model described above. The normalized data are
Huang et al. (2004)  considered the least squares method for estimating φi and βj in the TW-SLM. However, it is well known that least squares estimates are not robust against outliers which often arise in microarray experiments. Therefore, we propose to use the robust method  for estimating φi and βj. This is done by minimizing the objective function
where ρ is an appropriately chosen function for robust estimation, λ is the collection of the coefficients in the spline representations of φi described below, σ is the scale parameter, and α is a constant to be described below. We note here that estimation of φi, βj are done jointly and uses data from all the arrays. This is different from the LOWESS normalization method in which estimation of normalization curves are done array by array.
We consider two ρ functions: Huber's ρ function and Tukey's biweight function. Huber's ρ function is
Tukey's biweight function is
Two other usefull functions derived from ρ, ψ and χ, will be used repeatedly in the description of the algorithm below. They are defined as
ψ(x) = ρ'(x), χ(x) = xψ(x) - ρ(x). (4)
The expressions of these functions are given in the Appendix. We choose commonly used constants in the literature for Huber's and Tukey's functions, i.e., H = 1.345 and k = 4.685. The influence of the choice of these constants on normalization methods is beyond the scope of this study.
where b(x) = (1, b1(x),..., bK(x))' and λi = (λi0, λi1,..., λiK)'.
We estimate the parameters in model (1) by minimizing objective function (3) using an iterative procedure. Two steps, a location step and a scale step, will be used in the computation.
We use the following vector and matrix notations in describing the location step:
Bi = (b(xi1), b(xi2),..., b(xiJ))',
yi = (yi1, yi2,..., yiJ)'.
for i = 1,..., n. Given the scale parameter σ, and satisfy the equations:
where and because of identifiability requirement in the TW-SLM. We can solve these equations iteratively to obtain and . The derivations of these equations are given in the Appendix.
According to Huber's proposal , the estimation equation for σ is
where rij = yij - b'(xij) , and N is the total number of observations in the data set. In general, equation (8) does not have an explicit solution. So we use the following updating equation to compute the estimated scale parameter σ,
In order to obtain the consistent scale estimator at the normal distribution and obtain the classic estimates when using the least squares objective function, i.e., , we used the constant suggested by Huber ,
where EΦ denotes expectation with respect to the standard normal distribution function Φ.
The procedure described above is called an iterative reweighted least squares (IWLS) algorithm that is used in many non-least squares estimation problems. The implementation of the IWLS algorithm can be carried out using the following steps:
1. Initialize for j = 1,..., J and σ(0) = 1, , for i = 1,..., n, j = 1,..., J;
2. Calculate according to equation (6) given β(m-1), σ(m-1) and for i = 1,..., n, j = 1,..., J, m = 1,...;
3. Check convergence of λi, β, and σ. If the convergence criteria is met, then stop, otherwise continue;
4. Update σ(m) by equation (9) given , , σ(m-1), and , and set σ(m-1) = σ(m);
5. Calculate weight given β(m-1), and σ(m) according to equation (5), and set ;
6. Calculate β(m) given σ(m-1) and using equation (7), and set ;
7. Go to step 2 and iteratively update the estimators of parameters and the weights between steps 2 and 6 until convergence.
DW devised and implemented the procedure described in the paper, drafted and finalized the manuscript. JH helped with writing and revising the manuscript. JH and MBS supervised and provided support for this work. HX and LM conducted the experiment that generated the data set used in the example.
Derivation of and
We derive estimation equations for location parameters presented in the Methods section in this appendix. Again the notations from the Methods section:
b(x) = (1, b1(x),..., bK(x))',
λi = (λi0, λi1,..., λiK)'.
φi(xij) can be approximated by a linear combination of B-spline basis functions, i.e. b'(xij)λi, where bk(xij) is the kth B-spline basis function of xij. Let A = (a1, a2,..., an)', C = (c1, c2,..., cn)', and define
Given scale parameter σ, the first partial derivatives of S(λ, β, σ) (3) with respect to λ and β can be expressed in the matrix form as
where Bi; = (b(xi1), b(xi2),..., b(xiJ))', yi = (yi1, yi2,..., yiJ)', ψ(x) = ρ'(x). As defined in equation (5)
Wi = diag(wi1, wi2,..., wiJ),
Plugging Wi into equations (10) and (11) and setting them to zeros, and solving these two equations and yielding estimation equations for in equation (6) and in equation (7). They are
then equation (7) can also be expressed as
= (Z'WZ)-1 Z'W(y - B).
The solution of (Z'WZ)-1 can be explicitly calculates using the following matrix,
where , and for j = 1,..., J. We can get the explicit solution of after doing some linear algebra. It is
for j = 2,..., J. And because of identifiability requirement in model (1).
Derivation of scale parameter estimator
The ψ and χ functions derived from Huber's ρ(z) function are
The related weight function has the form
where H is a constant.
The constant α used in the scale step for Huber's robust estimation can be calculated as the following
where Φ is the distribution function of the standard normal distribution, N is the total number of observations in the dataset, and p is the total number of parameters in the model.
The ψ and χ functions derived from Tukey's ρ(z) function are
The associated weight function has the form
where k is a constant and the constant a in the scale step takes value
where is the Chi-square probability function with n degrees of freedom evaluated at s.
When Tukey's weight function is used, equation (8) is solved directly for the estimator of σ instead of iteratively updating equation (9) in our R program. It can be shown that equation (8) for Tukey's χ(z) function has an unique real root. This real root is just the solution for the estimator of σ. Let n* be the total number of observations that satisfy the second case of Tukey's χ function, i.e. | z | >k, let J* be the total number of clones that satisfy the first case of the χ, i.e. | z | ≤ k. Replacing z by and plugging Tukey's χ into equation (8), we get
Then equation (12) becomes
ax3 + bx2 + cx + d = 0. (13)
Let x = y - b/3a, divided by a in the both sides of equation (13), and plugs x into equation (13), then we get the Cardan's cubic equation
Let p = c/a - b2/3a2, q = d/a - bc/(3a2) + 2b3/(27a3), the above equation becomes
y3 + py + q = 0. (15)
The determinant for Cardan's equation (15) is
It can be shown that the determined function Δ must be positive. The first term in the determinant equation must be positive because of the square function and q cannot be zero. If we can show that the p is greater or equal to zero, then the Δ must be positive. Because
So the Δ is positive if only if 3ac - b2 is non-negative. We can see that
According to the Cauchy inequality , we have
Therefore, the p must be non-negative and the Δ must be positive. Thus there is only one real root for equation (15), that is
Then the solution for σ in equation (12) is
This work is supported in part by grant HL72288 from the National Heart, Lung and Blood Institute.
Science 1995, 270:467-470. PubMed Abstract
Nature Genetics 1999, 21(suppl 1):33-37. Publisher Full Text
Biotechniques 2000, 29:548-562. PubMed Abstract
Journal of Biomedical Optics 1997, 2(4):364-374. Publisher Full Text
Nat Genet 2002, 32(supplement):496-501. Publisher Full Text
Current Issues in Molecular Biology 2002, 4:57-64. PubMed Abstract
Genome Biol 2002, 3(7):1-12. BioMed Central Full Text
Workman C, Jensen LJ, Jarmer H, Berka R, Gautier L, Nielsen HB, Saxild HH, Nielsen C, Brunak S, Knudsen S: A new non-linear normalization method for reducing variability in DNA microarray experiments.
Genome Biology 2002, 3(9):1-16. BioMed Central Full Text