Abstract
Background
The usefulness of log_{2 }transformation for cDNA microarray data has led to its widespread application to Affymetrix data. For Affymetrix data, where absolute intensities are indicative of number of transcripts, there is a systematic relationship between variance and magnitude of measurements. Application of the log_{2 }transformation expands the scale of genes with low intensities while compressing the scale of genes with higher intensities thus reversing the mean by variance relationship. The usefulness of these transformations needs to be examined.
Results
Using an Affymetrix GeneChip^{® }dataset, problems associated with applying the log_{2 }transformation to absolute intensity data are demonstrated. Use of the spreadversuslevel plot to identify an appropriate variance stabilizing transformation is presented. For the data presented, the spreadversuslevel plot identified a power transformation that successfully stabilized the variance of probe set summaries.
Conclusion
The spreadversuslevel plot is helpful to identify transformations for variance stabilization. This is robust against outliers and avoids assumption of models and maximizations.
Background
Microarrays measure the abundance of thousands of mRNA transcripts in one experiment. Currently, two different microarray technologies dominate gene expression research efforts, namely, custom spotted arrays and Affymetrix GeneChips^{®}. Custom spotted arrays are characterized by long single strands of complimentary DNA (cDNA) affixed to a solid substrate in spots to which two different fluorescently labelled samples are hybridized. Affymetrix GeneChips^{® }are characterized by the use of several (11–20) short oligonucleotide (25mers) probes to interrogate for a single gene and to which one fluorescently labelled sample is hybridized. For both technologies, the fluorescence intensity for each probe/spot is assumed to be indicative of the amount of mRNA transcript in the sample. Since two samples are hybridized to custom spotted arrays, the ratios of the experimental signal relative to a control signal of the resulting intensities are analyzed. On the other hand, for the one sample hybridized to an Affymetrix GeneChip^{®}, the absolute intensities indicative of the number of transcripts are the resulting gene expression measures.
To satisfy assumptions required for statistical analyses, data from both technologies are often transformed by a suitable function [1]. Specifically, the reasons for applying a transformation to a dataset include to achieve stability in variance, or to achieve linearity, additivity and/or normality. Sometimes a transformation is applied because it facilitates interpretation and induces symmetry. Such is the case when applying the log_{2 }transformation to custom spotted microarray (cDNA) data. This transformation defines the relative abundance of a transcript in an experimental sample in comparison to the control sample as the unit of analysis. For genes where the experimental intensity is greater than the control intensity, the ratio could take values in the range (1, ∞); for genes where the experimental intensity is less than the control intensity, the ratio is compressed to the range (0, 1). The log_{2 }transformation of ratio data promotes symmetry by treating under and over expressed genes similarly. A typical example used to elucidate this concept is the one that considers two genes, one with a ratio of 2.0 representing a doubling of intensity and another with a ratio of 0.5 represents a halving of intensity; on a log_{2 }scale these values are symmetric about zero (no change) with values 1 and 1 respectively [1]. The application of the log_{2 }transformation to cDNA data thus has proven useful for achieving symmetry as well as ease of interpretation.
The usefulness of the log_{2 }transformation for cDNA microarray data has led to its widespread application to Affymetrix GeneChip^{® }data as well [24]. While this transformation is appealing to the biologists due to the reasons stated above for cDNA arrays, it neither facilitates interpretation nor does it necessarily render true the assumptions required for statistical analysis, such as equal variance, normality, etc., in Affymetrix GeneChip^{® }data. When data are amounts or counts, as with Affymetrix GeneChip^{® }data, where the intensities are assumed to represent amounts of mRNA transcripts, there is often a systematic relationship between the variance of the measurements and magnitude of the measurements. The log_{2 }transformation expands the scale of genes with low absolute intensities while compressing the scale of genes with higher intensities; it essentially reverses the direction of the relationship between the variance and the mean expression level. That is, after the transformation lowly expressed genes have a higher variance and highly expressed genes have a lower variance [5].
Recently, other variance stabilizing transformation methods for microarray gene expression data have been introduced [68]. The gene expression data have been modelled as
y = α + μe^{η }+ ε, (1)
where y represents the measured intensity, α represents the average background, μ represents the true gene expression level, with normally distributed error terms η and ε with zero mean and differing nonzero variances [6]. For this model, when μ is large, y is distributed approximately as a lognormal random variable. Therefore, a log transformation of y results in observations with constant variance when μ is large. In this case the generalized logarithm transformation is
g(y) = ln (y  α + sqrt [(y  α)^{2 }+ c]), (2)
where , stabilizes the asymptotic variance [6] for large samples. Rocke and Durbin [7] compared three logarithmicbased transformations (the generalized logarithm, the started logarithm, and the loglinear hybrid) using a simulation study and in application to an existing dataset. They found the generalized logarithm resulted in better overall performance in achieving variance stabilization.
In this paper, an Affymetrix GeneChip^{® }HGU133A dataset consisting of 16 technical replicates (QAQC Dataset), where the Microarray Suite Software (version 5.0) was used to derive the expression summaries for all probe sets, is used to demonstrate some of the problems associated with applying the log_{2 }transformation to absolute intensity data. Another approach to identify an appropriate variance stabilizing transformation using the spreadversuslevel plot [9] is proposed. The spreadversuslevel plot plots the log of the median on the xaxis () against the log of the fourthspread on the yaxis. The slope of the spreadversuslevel plot (b) can be used to suggest a power transformation that will most appropriately stabilize the variance. When using a spreadvslevel plot, the suggested power of the transformation is p = 1  b.
The effectiveness of using the spreadversuslevel plot to identify a suitable monotone transformation that appropriately stabilizes the variance for probe set expression summaries is demonstrated using the QAQC dataset.
Results
To demonstrate the dependence of the variance on level of gene expression, a plot of the mean for each probe set across the 16 technical replicates by the associated variance was constructed (Figure 1). It is apparent from the plot that the variance increases with increasing gene expression. Figure 2 represents the plot of the mean versus the corresponding variance for the log_{2 }transformation. This plot demonstrates, as previous authors have found [4,5], that the log transformation increases the variability on the lower end of the range while compressing the variability among the highly expressed genes. In other words, clearly, the log transformation has not stabilized the variance in this case.
Figure 1. Mean versus variance plot. Plot of the mean probe set MAS 5.0 signal intensities by the corresponding variance across the 16 HGU133A GeneChips^{®}.
Figure 2. Mean versus variance plot of log_{2 }transformed data. Plot of the mean log_{2 }transformed MAS 5.0 probe set signal intensities by their associated variance for the 16 HGU133A GeneChips^{®}.
In Figure 3 the spreadversuslevel plot for the same data is presented. The estimated slope of the linear regression model fit to the spreadvslevel plot was = 0.569. Therefore, the estimate of the power transformation turned out to be, = 1  0.569 = 0.431. Typically, for simplicity the parameter estimate is rounded to the nearest halfinteger, which in this case would lead to = 0.5. The family of power transformations is defined as
Figure 3. Spread versus level plot. Spreadversuslevel plot for 16 HGU133A GeneChips^{® }using MAS 5.0 probe set expression summaries; parameter estimates from least squares regression: = 0.052, = 0.57.
where, x is the observed data, p is a number to be determined and Φ is the monotonic transformation. Substituting the p obtained from the plot yields,
Φ_{0.5}(x) = (x^{0.5 } 1)/0.5 = .
Applying this transformation to the signal intensities in the QAQC dataset and plotting mean versus the variance as before (Figure 4) shows that stabilization of the variance is achieved. There are a few outlying probe sets identifiable in Figure 4. One of the advantages of using the estimated slope from the spreadversuslevel plot to identify a power transformation is that this plot uses robust measures of location (median) and spread (fourthspread). For this dataset, the "outlying" probe sets were defined as those with a variance greater than 50. The estimated slope after removing the outliers was 0.568; as expected, the slope of the linear regression fit to the spreadversuslevel plot is fairly robust against outliers. Thus, the transformation identified by the spreadversuslevel plot is not affected by these few outlying probe sets.
Figure 4. Mean versus variance plot of power transformed data. Plot of the mean of the probe set signal intensities after applying the transformation by the associated variance for the 16 HGU133A GeneChips^{®}.
Figure 4, however, is not on the same scale in comparison to Figures 1 and 2. Therefore, a plot of the rank of the median probe set signal intensities versus the associated fourthspread across the 16 HGU133A GeneChips^{® }(Figure 5) is considered [10]. This plot more clearly demonstrates the improvement in variance stabilization achieved by this transformation. The fitted lowess regression curve is overlaid, where the span is 1/3, is roughly horizontal, indicating constant variance.
Figure 5. Rank versus spread plot of power transformed data. Plot of the rank of the median probe set signal intensities after applying the transformation by the associated fourthspread across the 16 HGU133A GeneChips^{®}. The fitted lowess regression curve is overlaid where the span is 1/3.
As noted in this figure, there are still a few probe sets at the highest intensity level that suffer from increased variability. These probe sets are among those expression summary values that are greater than the expression summary values for the most concentrated hybridization spikein control (that is, the AFFXCreX and AFFXr2P1cre probe sets, with a mean of 2796 on MAS 5.0 scale, 103.7 on the transformed scale). From our quality assessment of the hybridizations, there is linearity over the range of concentrations covered by all the hybridization spikein controls. However, for signal intensities outside of this range, linearity cannot be determined. In a statistical analysis, these may be deemed outliers and may need to be examined separately from the remaining probe sets.
After applying the generalized logarithm transformation [6,8] to the QAQC Dataset using the Bioconductor vsn() library [8] and plotting the rank of the median probe set signal intensities by the associated fourthspread (Figure 6) reveals the retention of increased variability among genes with smaller signal intensities, though vastly improved in comparison to the log_{2 }transformation. The fitted lowess regression curve is overlaid where the span is 1/3. The coefficient of variation for the fitted values from the lowess regression for the transformed data was 0.15 while the coefficient of variation of the fitted values from the lowess regression for the generalized logarithm transformation was 0.22. Bootstrap percentilebased confidence intervals for the coefficient of variation were estimated by resampling the paired median and spread observations, fitting a lowess regression to the bootstrapped data, then estimating the coefficient of variation using the lowess fitted values. The 99% confidence interval for the vsn() transformed data was (0.2109, 0.2265); the 99% confidence interval for the transformed data was (0.1431, 0.1644). Thus, it may be concluded that the variance was better stabilized by the transformation.
Figure 6. Rank versus spread plot of generalized logarithm transformed data. Plot of the rank of the median probe set signal intensities after applying the generalized logarithm transformation by the associated fourthspread across the 16 HGU133A GeneChips^{®}. The fitted lowess regression curve is overlaid where the span is 1/3.
Discussion
Reasons for transforming data often include the desire to find a scaling of the data that would simplify the analysis, for example, by making the data meet assumptions such as symmetry, equal variance, straight line relationship, and/or additivity of effect. One common assumption of many traditional statistical methods is that the variance is constant over the levels of gene expression. Since the absolute intensities are positive, this assumption is generally not valid because the variance increases with increasing intensities. The log_{2 }transformation, which is routinely applied to absolute intensities that result from Affymetrix GeneChip^{® }experiments [2,3], unfortunately does not achieve the desired results. As shown (Figure 2) the transformation only reverses the meanvariance relationship. To overcome this drawback of the log_{2 }transformation, methods such as the Local Pooled Error test have been suggested [4] to more accurately estimate the variance for lowly expressed log_{2 }signal intensities. In this article, as one alternative, the spreadversuslevel plot to identify more appropriate monotonic transformations that stabilize the variance was proposed. The spreadvslevel plot proved useful for identifying a transformation useful for simplifying an analysis by meeting assumptions such as symmetry and equal variance.
For the QAQC Dataset, the spreadversuslevel plot led to a square root transformation, rather than the log_{2 }transformation. This dataset consisted of 16 technical replicates; since the spreadvslevel plot is used to empirically identify a power transformation, the most appropriate power transformation to stabilize the variance is completely data specific. Therefore, the transformation suggested should be treated as an illustration and is not necessarily the most appropriate one for all data sets. In fact, there may be situations in which our method may lead to the more commonly used log transformation itself, while for other datasets it may not lead to any power transformation that could stabilize the variance. We note, however, that in a serial dilution experiment conducted to assess linearity of signal [11], the authors found the square root transformation by trial and error to be an appropriate transformation. Thus this empirically derived transformation obtained through the use of the spreadversuslevel plot confirmed a similar finding.
Conclusions
The estimated slope from the spreadversuslevel plot could be used to identify a power transformation that will appropriately transform gene expression data to stabilize the variance. It is computationally easy to implement as it avoids the requirement of a statistical model and maximization of a likelihood. Moreover, since this plot uses robust measures of location and spread, the slope of the linear regression fit to the spreadversuslevel plot is fairly robust against outliers.
Methods
Transformation
One of the most commonly used families of variance stabilizing transformations is the BoxCox transformation [12]. This family of transformations is defined by equation (3). Box and Cox selected p as the value that maximized the likelihood under any given model. A graphical technique based on the spreadversuslevel plot may also be used to empirically estimate p [9]. This method is robust against outliers and avoids the need for maximization of the likelihood. The spreadversuslevel plot plots the log of the median (xaxis) by the log of the fourthspread (yaxis). The slope b of the spreadversuslevel plot identifies the approximate value of the exponent for the power transformation, where p = 1  b. When the slope b is close to 1, p is close to 0 and therefore the transformation would be ln(x).
Spreadversuslevel plot
To construct a spreadversuslevel plot, three order statistics, namely the median (M) the upper fourth (F_{U}) and the lower fourth (F_{L}) must be identified. These order statistics may be found using the following algorithm. First, sort the data in ascending order. On the basis of ordering, the rank of an observation can be defined as an upward rank (determined by counting up from smallest to largest) or a downward rank (determined by counting down from largest to smallest). Considering both the upward and the downward ranks together, the depth of a data value in a sample is defined as the smaller of its upward rank and its downward rank. For example, the depth of the median is (n + 1)/2. (When the number of observations n is even, the median is interpolated to lie between the two middle values.) The fourths are those values with a depth equal to ([depth of median] + 1)/2. Therefore, each fourth is halfway between the median and the corresponding extreme [13]. The fourthspread (d_{F}) is a robust measure of spread and is estimated by (upper fourth)  (lower fourth), denoted by d_{F }= F_{U } F_{L}. The interquartile range is nearly the same as the fourthspread but differs slightly when the number of observations is even.
To understand why a spreadversuslevel plot is useful in identifying an appropriate variance stabilizing transformation [9], let X be a real random variable from a population with median M_{X}, lower fourth F_{L}, upper fourth F_{U}, and fourth spread d_{F}. The spreadversuslevel plot identifies a transformation y = Φ(x) for which the fourthspread of Y is constant. Assuming the second derivative Φ''(x) and higher order derivatives exist for all x, a Taylor series expansion of Φ(x) around M_{X }is
Let λ(M_{X}) represent the distance from M_{X }to F_{U }as a proportion of the distance d_{F}, or λ(M_{X}) = (F_{U } M_{X})/d_{F}; then [1  λ(M_{X})] would represent the proportionate distance from M_{X }to F_{L}. Therefore, the median and upper and lower fourths for the data transformed by y = Φ(x) are
M_{Y }= Φ(M_{X}), (5)
(F_{U})_{Y }= Φ(F_{U}) = Φ [M_{X }+ λ(M_{X})d_{F}], (6)
(F_{L})_{Y }= Φ(F_{L}) = Φ{M_{X } [1  λ(M_{X})]d_{F}}. (7)
Substituting these expressions into equation (4) to get (F_{U})_{Y }and (F_{L})_{Y}, then taking the difference to get the fourthspread for the transformed data yields,
Recall that λ(M_{X}) represent the distance from M_{X }to F_{U }as a proportion of the distance d_{F}; thus the term [2* λ(M_{X})  1] will range from [0,1]. Moreover, if the upper and lower fourths are roughly equal distant from the median (i.e., x is roughly symmetric), this quantity will be zero. Additionally, since Φ''(M_{X}) is a measure of concavity of the tangent line at the median, it will generally be small. Therefore, the quadratic term of (d_{F})_{Y }is generally small enough to be negligible compared to the leading term, thus the approximation
(d_{F})_{Y }≈ d_{F}· Φ'(M_{X}) (9)
is reasonable. Now, if the righthand side of this equation is held constant, it leads to a transformed variable Y with constant variance. Therefore, supposing that the relationship between spread d_{F }and level X is a power transformation of the form d_{F}(x) = k·x^{b }for some constants k and b,
The spreadversuslevel plot plots the log of the median on the xaxis () against the log of the fourthspread on the yaxis. (Recall, d_{F}(x) = k· so that log(d_{F}) = log(k) + b· ). Therefore, the slope of the spreadversuslevel plot estimates b. Hence the suggested power transformation would be x^{p }= x^{1  b}.
Data
An experiment using 16 technical replicates in which data were obtained using the Affymetrix GeneChip^{® }technology is considered to illustrate the transformation method suggested in the previous section. Five identical aliquots of 40 μg of the Universal Human Reference RNA (Stratagene, La Jolla, CA), isolated from 10 human cell lines, were used as template for cDNA synthesis, cRNA in vitro transcription, and labelling reactions. The final fragmented cRNA aliquots were pooled together. The pooled Reference RNA was hybridized to 16 HGU133A GeneChips^{® }(Dumur et al, personal communication) to study GeneChip^{® }reproducibility. Hybridization was performed using the same lot number of the HGU133A arrays, which contains 22,283 probe sets. Immediately after the scanning of every chip, a visual quality control of the image was performed in order to ensure a successful hybridization. Moreover, several parameters produced by the Affymetrix Microarray Suite Software (version 5.0) were monitored for quality control purposes, including: the scaling factor, which is used to scale all probes sets to a predetermined target value (in this study the target value was set at 100); the percentage of probe sets called 'Present' using the Affymetrix Detection Call algorithm [14]; and the 3'/5' ratios of signal intensity values for two housekeeping genes, GAPDH and βactin. The Microarray Suite Software (version 5.0) was also used to derive the expression summaries for all probe sets. The present study was performed on the 22,215 probe sets designed to match human transcripts; we eliminated those probe sets that query for bacterial spikein controls from our analyses.
Authors' contributions
KA conceived the transformation study, performed the bioinformatical analysis and drafted the manuscript. CD participated in the experimental design of the microarray study and performed the GeneChip hybridizations. VR participated in manuscript preparation. All authors read and approved the final manuscript.
Acknowledgements
This research was supported by the Commonwealth Technology Research Fund (CTRF #SE2002_02). The authors would like to thank the two anonymous reviewers for their helpful insights and recommendations.
References

Quackenbush J: Microarray data normalization and transformation.
Nature Genetics 2002, 32(Supplement):496501. Publisher Full Text

Hubbell E, Liu WM, Mei R: Robust estimators for expression analysis.
Bioinformatics 2002, 18(12):15851592. Publisher Full Text

Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data.
Nucleic Acids Research 2003, 31:e15. PubMed Abstract  Publisher Full Text

Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK: Localpoolederror test for identifying differentially expressed genes with a small number of replicated microarrays.
Bioinformatics 2003, 19:19451951. PubMed Abstract  Publisher Full Text

Kohane IS, Kho AT, Butte AJ: Microarrays for an integrative genomics. Cambridge, MA: MIT Press;; 2003.

Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variancestabilizing transformation for geneexpression microarray data.
Bioinformatics 2002, 18(Suppl. 1):S105S110. PubMed Abstract  Publisher Full Text

Rocke D, Durbin B: Approximate variancestabilizing transformations for geneexpression microarray data.
Bioinformatics 2003, 19(8):966972. Publisher Full Text

Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression.
Bioinformatics 2002, 18(Suppl. 1):S96S104. PubMed Abstract  Publisher Full Text

Emerson JD: Mathematical Aspects of Transformation. In Understanding Robust and Exploratory Data Analysis. . Edited by Hoaglin DC, Mosteller F, Tukey JW. New York: John Wiley & Sons;; 1987:247282.

Geller SC, Gregg JP, Hagerman P, Rocke D: Transformation and normalization of oligonucleotide microarray data.
Bioinformatics 2003, 19(14):18171823. Publisher Full Text

Ramdas L, Coombes KR, Baggerly K, Abruzzo L, Highsmith WE, Krogmann T, Hamilton SR, Zhang W: Sources of nonlinearity in cDNA microarray expression measurements.

Box GEP, Cox DR: An analysis of transformations.
Journal of the Royal Statistical Society, Series B 1964, 26(2):211252.

Hoaglin DC: Letter Values: A Set of Selected Order Statistics. In Understanding Robust and Exploratory Data Analysis. Edited by Hoaglin DC, Mosteller F and Tukey JW. New York, John Wiley & Sons; 1987:3357.

Liu WM, Mei R, Di X, Ryder TB, Hubbell E, Dee S, Webster TA, Harrington CA, Ho MH, Baid J, Smeekens SP: Analysis of high density expression microarrays with signedrank call algorithms.
Bioinformatics 2002, 18(12):15931599. Publisher Full Text