Abstract
Background
In a high throughput setting, effective flow cytometry data analysis depends heavily on proper data preprocessing. While usual preprocessing steps of quality assessment, outlier removal, normalization, and gating have received considerable scrutiny from the community, the influence of data transformation on the output of high throughput analysis has been largely overlooked. Flow cytometry measurements can vary over several orders of magnitude, cell populations can have variances that depend on their mean fluorescence intensities, and may exhibit heavilyskewed distributions. Consequently, the choice of data transformation can influence the output of automated gating. An appropriate data transformation aids in data visualization and gating of cell populations across the range of data. Experience shows that the choice of transformation is data specific. Our goal here is to compare the performance of different transformations applied to flow cytometry data in the context of automated gating in a high throughput, fully automated setting. We examine the most common transformations used in flow cytometry, including the generalized hyperbolic arcsine, biexponential, linlog, and generalized BoxCox, all within the BioConductor flowCore framework that is widely used in high throughput, automated flow cytometry data analysis. All of these transformations have adjustable parameters whose effects upon the data are nonintuitive for most users. By making some modelling assumptions about the transformed data, we develop maximum likelihood criteria to optimize parameter choice for these different transformations.
Results
We compare the performance of parameteroptimized and defaultparameter (in flowCore) data transformations on real and simulated data by measuring the variation in the locations of cell populations across samples, discovered via automated gating in both the scatter and fluorescence channels. We find that parameteroptimized transformations improve visualization, reduce variability in the location of discovered cell populations across samples, and decrease the misclassification (misgating) of individual events when compared to defaultparameter counterparts.
Conclusions
Our results indicate that the preferred transformation for fluorescence channels is a parameter optimized biexponential or generalized BoxCox, in accordance with current best practices. Interestingly, for populations in the scatter channels, we find that the optimized hyperbolic arcsine may be a better choice in a highthroughput setting than current standard practice of no transformation. However, generally speaking, the choice of transformation remains datadependent. We have implemented our algorithm in the BioConductor package, flowTrans, which is publicly available.
Background
Flow cytometry (FCM) is increasingly moving towards automated methods to deal with the quantities of data generated by high throughput, highcontent screening [111]. An appropriate, automated data preprocessing pipeline, including automated gating and matching of corresponding cell populations across replicated or similar samples is important for the accuracy of downstream analysis. However, accurate automated gating of FCM data is complicated by asymmetric and overlapping cell populations, frequent outlier events, cell populations whose variance depend on their mean fluorescence intensity, and multiplicative errors in the fluorescence channels. All of these characteristics can influence the output of both manual and automated gating, and subsequent downstream analysis. Data transformation plays an important role in mitigating some of these effects, both in manual and automated analysis setting. In a manual analysis setting, a transformation is typically chosen to facilitate cell population visualization for the purposes of gating. Generally, a set of common transformation parameters are chosen across multiple samples to ensure that they are on a common scale and facilitate comparison. Carefully chosen data transformations and corresponding parameters have been suggested to overcome some of the problems surrounding manual FCM analysis and gating [12,13]. Data transformation plays an even more important role in an automated, high throughput setting since the scale and distribution of the transformed data can influence downstream analysis procedures. Some automated gating methods include data transformation as part of the gating algorithm [6,8,10]. Other methods function under the assumption that the data have been appropriately transformed prior to the gating step [1,2,7,9]. Methods for intersample normalization in downstream preprocessing steps can allow for different data transformations per sample. By loosening the requirements of a common transformation across samples, we can explore the optimization of data transformations for automated gating. In such a setting, the impact of data transformation has received relatively little attention [6,14,15]. Lo et al. propose the estimation of a generalized BoxCox transformation embedded within a mixture modelling framework to simultaneously gate and transform skewed cell populations [6]. Their approach works very well in practice, but FCM data are generally still subjected to a global transformation prior to automated gating. What is clear is that the choice of transformation is datadriven and involves multiple considerations. The underpinning principle is to choose a transformation that facilitates cell population gating, visualization, and intersample comparison, by obtaining a representation wherein cell populations are well resolved across the full range of the data [6,12,1517].
There are many transformations in common use for flow cytometry data, including the logarithm and related transformations such as the linearlogarithmic and hyperlog transforms, power transformations such as the generalized BoxCox, which includes the logarithm as a special case, and the biexponential and related transformations such as the logicle and generalized arcsinh [6,12,13,17]. The log transformation can often stabilize the variance of cell populations in the fluorescence channels across nearly the full range of intensities but cannot represent negative data values of unstained cell populations, leading to compression of data against the axes and poor visual representation of low intensity or unstained populations [12]. To deal with this, other transformations have been suggested, including the linearlogarithmic (linlog) transformation, the biexponential (logicle), and generalized arcsinh transformations. These all improve upon the log by allowing negative values, providing a linear representation of data around zero and a logarithmic representation of the data at higher intensity values, with a smooth transition between the two extremes. The hyperlog has been proposed specifically for compensated data and is also capable of dealing with nonpositive values [13]. The biexponential transform provides additional flexibility by allowing the linear portion of the scale to be asymmetric around zero; the logarithmic scale can similarly be tuned by adjustable parameters. The generalized BoxCox has been adapted and applied to FCM data in the context of automated gating within a multivariatet mixture modelling framework [4,6,18]. Lo et al. proposed to select the generalized BoxCox transformation parameter maximizing the likelihood of individual cell populations being generated by a mixture of multivariatet or multivariatenormal distributions on the transformed scale [6]. All of these transformations have one or more parameters (with the exception of the log) that can be adjusted in a datadependent manner to modify the representation of the data. We perform our analysis within the flowCore framework in BioConductor, which is the predominant tool set in use for automated, high throughput flow data analysis. The default parameters of the transformations within flowCore are rarely adjusted in practice, and are almost certainly not the best possible parameter choices for all data sets. Here we examine the impact of the chosen transformation and its parameters upon the accuracy of automated gating as well as the ability to match gated cell populations across samples.
We make several important comments about notation. In this paper we refer frequently to transformations and inverse transformations. Mathematically, a function and its inverse are denoted f(.) and f ^{1}(.), respectively. However, in FCM data, the biexponential transformation is actually the inverse of f(y) = a exp(b(y  w))  c exp(d(y  w)) + f, where f(y) is the biexponential function. For clarity of exposition, in this paper, we will refer to the biexponential transformation as f ^{1}(y) above, and the inversebiexponential transformation as f(y), above. Although counterintuitive to the mathematical definition of the biexponential function, this nomenclature is accepted in the FCM community [12]. The notation in Table 1 is in accordance with this nomenclature.
Table 1. Summary of transformations
Methods
Parameter Estimation
We use maximum likelihood methods to estimate the parameters of each transformation (generalized arcsinh, generalized BoxCox, linlog, and biexponential) following established methods [19]. If Y is an n × d data matrix, we represent the data as a sequence of n ddimensional vectors Y_{i }={y_{i, 1},..., y_{i, d}} and the transformed data as , where f is the transformation function, and θ is the vector of transformation parameters. For simplicity we model the transformed data by a multivariate normal distribution Y^{(θ) }~ N_{d}(μ, Σθ). The likelihood of the parameters, given the data and a fixed θ is
where is the Jacobian term that accounts for the change of scale under different transformation parameters. It follows that for a fixed θ, the mean and covariance of the transformed data can be estimated by the sample mean and sample covariance, as follows,
Now we can substitute these two expressions into (1) and maximize over θ After some simple algebra, it can be shown that it is equivalent to minimizing:
where is the geometric mean of the squared Jacobian terms. Thus explicit derivation of the functions to be minimized for each transformation, S(θ), only involve deriving the Jacobian of the transformation. These transformations, their inverses, and their Jacobians are summarized in Table 1, and presented in Additional File 1 with common parameter values. When no closed form exists, we use numerical optimization routines in R to optimize (1) over θ [20]. We note that the translation parameter is fixed at f = 0 in the biexponential transformation in order to resolve identifiability issues with the full parameterization (see Additional File 2). The decision to model the transformed data as a multivariate Gaussian distribution is motivated by the implicit assumption of a common error model across all cells when performing a global data transformation. We note that the form of this implicit error model is not known. The multivariate Gaussian assumption is strictly a computationally convenient choice that has the effect of making the marginal distribution of the data more symmetric and reducing the influence of outliers at the subsequent gating step. For the purposes of gating, however, flow cytometry data is better represented as a mixture of distributions, which is done explicitly by flowClust and flowMerge.
Additional file 1. Examples of common transformations with typical parameters. Examples of the generalized BoxCox (blue, θ = 0.158), linlog (red, θ = 56.9), biexponential (black, a = 0.49, b = 0.99, c = 1, d = 0.01, f = 0, w = 2.3), and generalized arcsinh (green, a = 1, b = 0.052) transformations using common parameter values. The chosen parameters are selected from an optimized fit of each transformation to FSC vs SSC follicular lymphoma data.
Format: EPS Size: 5.3MB Download file
Additional file 2. The biexponential transformation is weakly identifiable. The biexponential transformation with full parameterization is weakly identifiable. a) A bivariate normal distribution on the original scale. b) Original data transformed with the inversebiexponential using parameters a = 1, b = 1, c = 1, d = 1 10^{10}, f = 0, w = 0. c) Inversetransformed data transformed with the biexponential using the true parameters a = 1; b = 1, c = 1, d = 1 × 10^{10}, f = 0, w = 0. d) Inversetransformed data transformed with the biexponential using alternate parameters c' = 1 × 10^{10}, d' = 1 × 10^{10}, f' = 1. When parameters c' and d' are near zero, if f' ≈ c and d' is near zero, the two transformations are virtually indistinguishable.
Format: EPS Size: 405KB Download file
Follicular Lymphoma Data Set
We examined a subset of a clinical FCM data set derived from lymph node biopsies from 10 individuals diagnosed with follicular lymphoma. Each sample was fivedimensional, labeled with CD5, CD19, and CD3, in addition to the FSC (forward scatter) and SSC (side scatter) channels. The ten individuals sampled consisted of five males and five females, of median age 59.5 years, (range 4082). Five individuals had stage 4 disease, three had stage 3 disease, and two had stage 1 disease. Overall survival for the group ranged from 0.45 years to 14.66 years.
Evaluating the Suitability of a Data Transformation
To evaluate the performance of different transformations, we measured the variation in the location of cell populations identified by automated gating using the flowClust/flowMerge frameworks [4,6]. The variation in the position of gated cell populations was measured across biological replicates, and compared for each transformation.
The FSC and SSC channels were analyzed separately from the fluorescence channels, according to procedures established elsewhere [6]. Data was normalized on the scale on which it would be visualized, therefore the scatter channels were normalized prior to transformation, whereas the fluorescence channels were normalized after data transformation (Figure 1a and 1b) [21]. In the scatter channels, parameteroptimized transformations (generalized arcsinh, generalized BoxCox, linlog, biexponential) were compared against defaultparameter counterparts (generalized arcsinh and biexponential), as well as the standard procedure of no transformation (Figure 1a path 1 and 2). For fluorescence channels, default and optimized parameter transformation were compared against together with the log transformation (Figure 1b, path 1 and 2). Populations in the scatter and fluorescence channels were automatically gated using the flowClust and flowMerge framework [4,6]. These discovered populations were metaclustered, and intracluster variability was compared either on the transformed scale (fluorescence channels), or on the original scale (scatter channels).
Figure 1. Flowchart of the analysis pipeline. Flowchart describing our analysis pipeline. a) Procedure for analyzing FSC vs SSC channels. Standard data analysis procedures are depicted by path 1, whereas procedures applying parameteroptimized transformations are depicted by path 2. b) Procedure for analyzing fluorescence channels. Standard procedures are depicted by path 1). Procedures utilizing optimized transformations are depicted by path 2. The default transformation depicted in 1) is the generalized arcsinh with default parameters (a = 1, b = 1, c = 0), as defined in the flowCore package. Normalization follows transformation in b) to ensure that the transformed data are on a common scale.
Cell Population Identification, Metaclustering and Metrics
Automated gating was performed using the flowClust and flowMerge packages in BioConductor [4,6,22]. Lymphocyte cell populations were identified manually from among the gated populations, and corresponded to the most dense clusters of cells in the FSC vs SSC dimensions. We evaluated the performance of each transformation by examining the variation in the position of the lymphocyte cell populations across biological replicates. The variation was measured as the sum of squared deviations from the mean lymphocyte cell population position across the samples. Cell populations identified in the fluorescence channels by automated gating were matched across samples using a metaclustering approach. This involved a modified single linkage agglomerative clustering approach to group corresponding populations across samples [23]. We provide an informal description of the algorithm, followed by a formal definition. Informally, we assume that cell populations in each sample are unique, therefore no more than one population from a sample can belong to a metacluster. We choose the sample with the largest number of cell populations and assign each population to unique metacluster. We then iterate through each cell population in the remaining samples and assign it to its nearest metacluster (measured by the Mahalanobis distance), with the restriction that no more than one cell population from a sample may belong to the same metacluster.
More formally, we let S = {s_{1}, . . , s_{n}, . . , s_{N }} be the set of all samples, indexed by n. Following automated gating, a sample is clustered into some number, π_{n}, of discovered cell populations. We let be the set of discovered cell populations within the nth sample, such that the cardinality of P^{n }is P^{n} = π_{n}. On the transformed scale, each discovered cell population is summarized by a multivariatet distribution. The parameters of the multivariatet distribution representing the ith cell population from the nth sample are , the ddimensional vector of means, , the d × d covariance matrix, and , the degrees of freedom, as defined in Lo et. al. [6].
The set of metaclusters is denoted ℳ = {M_{1}, M_{2}, ..., M_{K}}. Each metacluster is itself a set of populations with the constraint that no metacluster can contain more than one population from a given sample. For each population, , we define a label which can take a value between 1 and K, where K = max_{n}(π_{n}). If = k, population belongs to metacluster k. Additionally the distance between a metacluster M_{k }and a population is denoted , which is the minimum distance between population and all populations that are already assigned to metacluster k. To enforce the one sample per metacluster constraint we set if for some m. We define the distance between population i in the nth sample and population j in the mth sample as , which is the Mahalanobis distance, assuming that the first argument () is a population assigned to metacluster k [24].
To construct the metaclusters we:
1. Set the number of metaclusters to K = max_{n}(π_{n})
2. Initialize the metaclusters with the K populations from sample where n_{0 }= arg max_{n}(π_{n}). If there is more than one such sample, pick the one with the best likelihood or best separation of clusters (entropy).
3. Let , assign to metacluster k.
4. Repeat until each population is assigned to a metacluster.
Simulation Study
To further test our algorithm, we simulated ten data sets of N = 15000 events in threedimensions from nine cell populations modeled as a multivariatet mixture distribution with four degrees of freedom, following the approach of Lo et al., and fixed proportions drawn from a Dirichlet distribution with parameter α = (1, 1, 1, 1, 1, 1, 1, 1, 10) [6]. The population proportions were,
Eight cell populations were distributed in three dimensional space at the eight corners of a cube, with one cell population located in the center of the cube (Figure 2a, top row). Real FCM data frequently contain one cell population that has higher density than the other cell populations in the mixture. We simulated this characteristic of FCM data by assigning higher weight to the central population in our simulated data and it is reflected in the parameters of the Dirichlet. Simulated cell population locations ranged from zero to seven in arbitrary units, and corresponding populations has variance σ^{2 }= 0.25 across the ten data sets. The simulated data were transformed (Figure 2a, bottom row) by the inverse biexponential using different, randomly chosen transformation parameters for each sample ({a, c} ~ U(0, 1) and {b, d} U(0, 2)), where {X, Y} ~ U(p, q) denotes that variables X and Y are independently drawn from a Uniform distribution over the interval [p, q]. We applied our algorithm to this inversetransformed data, optimizing transformation parameters for the generalized arcsinh, biexponential, generalized BoxCox, and linlog transforms in order to recover Y^{(θ)}. We then compared the output of these transformations against the defaultparameter versions of the biexponential and generalized arcsinh transforms, as well as the original untransformed data. To assess the performance of different transformations the data were normalized, gated using flowClust/flowMerge, and the discovered populations were clustered across data sets (metaclustered) [21]. We then measured the resulting intrametacluster variability as well as the misclassification rate for individual events in the discovered populations, relative to their true class membership (Figure 2bd) [4].
Figure 2. Simulation study results. Results of transformations on simulated data. a) A single simulated sample is shown as a series of bivariate dot plot projections. Data are presented on the original scale (top row), and on the scale of the inverse biexponential transform (bottom row). Points represent individual events. b) Boxplots representing the distribution of misclassification rates of flowMerge models with K = 9 components fitted to the simulated data set under different transformations. c) Intracluster variability measured as the total sum of squared deviations for metaclustered populations identified by flowMerge under different transformations. d) Example bivariate projections of metaclusters for untransformed data (top row), default biexponential (second row), optimized biexponential (third row), optimized generalized arcsinh (fourth row), and default generalized arcsinh (fifth row). Corresponding metaclusters were selected where possible. Metaclusters are labeled as +/+, /, /+ for artificial markers A and B. Ellipses represent 90th quantile contours of subpopulations.
Results
Follicular Lymphoma Data
Our approach to data analysis of fluorescence and scatter data differs slightly, in that scatter data are normalized prior to transformation, while fluorescence data are normalized posttransformation, in accordance with common practice (Figure 1ab). We examined the effects of parameteroptimized transformations compared to their defaultparameter counterparts on visualization of cell populations in the scatter and fluorescence channels (Figure 3a, b). For scatter channels, differences between parameteroptimized, default, and untransformed data are clearly visible (Figure 3a). The optimized version of the biexponential, generalized arcsinh, and generalized BoxCox, all provide improved visualization of cell populations than the defaultparameter biexponential, generalized arcsinh, or the untransformed data. For the fluorescence channels, the data are put on a common scale following transformation (see Materials and Methods), and distinct differences can be seen between the optimized and default transformations (Figure 3b). Populations are better resolved following some transformations than others. The optimized biexponential improves visualization of cell populations compared to the biexponential with default parameters, while there is no observable difference between the default and optimized generalized arcsinh transformation. Other transformation show similarly variable results. Although differences between optimized and default transformations are subtle under visual inspection, these subtleties can have significant effects on model fitting and model selection during automated gating if they lead to violations of model assumptions (Figure 4a, b). For example, small deviations from symmetry in the shape of cell subpopulations can lead to selection of models with more components, more parameters, or different degrees of freedom, leading to different final gates.
Figure 3. Visualization of FCM data under different transformations. Visualization of flow cytometry data under different transformations. a) The FSC and SSC dimensions of a representative sample transformed using the biexponential, generalized arcsinh, generalized BoxCox, and linlog transformations with optimized parameters. Untransformed and defaultparameter generalized arcsinh and biexponential transformed data are shown for comparison. Some parameteroptimized transformations (biexponential and generalized BoxCox in this example) improve visualization and resolution of the lymphocyte cell population when compared to default or untransformed data. b) Comparison of a fluorescence channel data under different parameter optimized and default transformations. In this example, the optimized biexponential and optimized linlog improve the resolution of the two populations in the CD19 vs CD5 channels, compared to the default generalized arcsinh or default biexponential. Points represent individual events, contours represent the twodimensional kernel density estimate of the data.
Figure 4. Variation in cell subpopulation locations. Variation in population location measured as the sum of squared distances between population centers for a) lymphocyte populations in the FSC vs SSC channels. b) metaclusters of cell populations in the fluorescence channels.
To obtain a quantitative measure of the effect of different transformations on the automated analysis of realworld flow cytometry data, we proceeded to perform automated gating populations in the scatter and fluorescence channels, as described in the Methods. We measured the variability between discovered populations by metaclustering them across samples, and measuring the intrametacluster variability as the sum of squared deviations (Figure 4a, b). There were minimal differences in the variability of the lymphocyte populations between different transformations applied to the scatter channels (Figure 4a). The optimized generalized arcsinh and optimized generalized BoxCox transformations had the lowest variation in metaclusters, performing better than the default generalized arcsinh, or the standard of no transformation. In contrast, larger differences in variability were observed between transformations for populations in the fluorescence channels (Figure 4b). The optimized biexponential, optimized generalized BoxCox, and default biexponential exhibited the lowest intrametacluster variation, whereas the optimized linlog, log, and default generalized arcsinh exhibited the highest intrametacluster variation. However, on the scale of the scatter data, it appears that the differences between transformations are not large.
The differences between lymphocyte metaclusters in the scatter dimensions are readily seen to be minimal when visualizing the metaclusters directly on the original scale (Figure 5, contour show's 90th percentile of the population). Each metacluster contains all ten lymphocyte populations from the ten samples in the data set. Similarly, metaclusters in the fluorescence channels are visualized directly on the transformed scale (Figure 6ag, contours show 90th percentile of the populations). The metaclusters of the primary cell populations in the samples are shown (CD3/CD19+/CD5+ and CD3+/CD19/CD5), represented as a series of bivariate projections for each transformation examined. Interestingly, with the exception of the default biexponential transform, all metaclusters contain representative cell populations from all samples in the data set. Additionally, the total number of metaclusters varies between transformations (shown in brackets), indicating that the principal source of variation in the metaclustering is due to cell populations represented as outliers in the flowClust/flowMerge gating, rather than due differences in the welldefined, primary cell populations in the samples.
Figure 5. Metaclusters in the scatter channels. Metaclusters of lymphocyte populations in the FSC and SSC dimensions for four optimized and three default transformations. We observe little difference between the transformations in the forward and side scatter dimensions for gating lymphocyte populations, suggesting that the primary benefit in the FSC vs SSC dimensions is for data visualization. Contours represent the 90th quantiles of the cell subpopulation distributions.
Figure 6. Metaclusters in the fluorescence channels. Metaclusters of cell populations defined in the fluorescence channels of the lymphoma data under different transformations. Only the primary cell populations are shown for comparison (CD19+/CD3/CD5+ and CD19/CD3+/CD5). The number of metaclusters for the transformed data is shown in brackets. The number of cell populations in a metacluster is shown above the plot. Contours represent the 90th percentiles of the cell subpopulation distributions.
Simulation
Simulated data allows access to the "true" class memberships of individual events that are not available with realworld data, and can help to better understand the effects of transformation and parameter selection on automated gating. We again briefly describe our approach here. We simulated ten data sets, transforming each with the inverse of the biexponential transform using randomly chosen parameter values, as described in the methods (Figure 2a). We then applied our algorithm to the inverse transformed data to estimate optimal parameters under different transformations and again transformed the inversetransformed data using those estimated parameters and the appropriate transformation. The data output by this process, which should reflect the distribution of the original data, was subjected to automated gating. By comparing the true class membership of events in the original data against the class membership of events gated by flowClust/flowMerge, we computed the misclassification rate of the gating model under different optimized and default transformations (Figure 2b). When the optimized transformation was of the same family as the inverse transformation (i.e. generalized arcsinh or biexponential) we observed misclassification rates comparable to the rates obtained for the untransformed data (i.e. the gold standard data transformed using the correct transformation parameters) (mean misclassification rate of 10.7%, 10.0% and 10.1% for untransformed data, optimized biexponential and optimized generalized arcsinh, respectively) (Figure 2b). In contrast, the optimized generalized BoxCox, optimized linlog, as well as the default generalized arcsinh and biexponential had considerably increased misclassification rates with higher variability across the simulated samples (mean misclassification rate of 23.3%, 11.8%, 12.7%, and 13.2% for optimized generalized BoxCox, optimized linlog and default generalized arcsinh, default biexponential, respectively) (Figure 2b). The poor performance of the optimized generalized BoxCox is not surprising in this case, since the biexponential inversetransformation applied to the data was quite different from the generalized BoxCox transform. This demonstrates that optimization of the transformation parameters together with selection of an appropriate transformation can lead to significant improvements in model fitting during automated gating that compare favourably to results obtained for the untransformed data. Furthermore, the estimated transformation parameters compare favourable with the true transformation parameters (Additional File 3), demonstrating that our assumption of a global multivariate Gaussian distribution is acceptable for the purpose of transformation.
Additional file 3. Residuals of estimated biexponential parameters. Boxplots showing the difference between true and estimated biexponential parameters on ten simulated data sets. Although the optimized estimates are variable for parameters b, d, the misclassification rates of fitted models demonstrate that this bias doesn't negatively impact the subsequent gating.
Format: EPS Size: 481KB Download file
Following extraction of discovered cell populations and metaclustering, we observed the lowest intracluster variability for the optimized generalized arcsinh and optimized biexponential transformations, followed by the baseline of no data transformation, indicating that parameter optimization can reduce intersample variability and aid population matching (Figure 2c). The largest intersample variability following metaclustering was observed for the optimized generalized BoxCox, default generalized arcsinh, and optimized linlog transformations (Figure 2c). Selected metaclusters obtained from default and parameteroptimized transformations demonstrate that parameter optimized transformations can lead to better population identification in the automated gating step, and consequently lower variability metaclusters (Figure 2d).
Software and Availability
We have implemented parameteroptimization routines for the biexponential, linlog, generalized arcsinh, and generalized BoxCox transformations in the R package, flowTrans. This package integrates with the existing FCM data analysis tools in BioConductor and uses existing data structures data manipulation paradigms from the flowCore framework [22,25]. The package is freely available at http://www.bioconductor.org/packages/release/bioc/html/flowTrans.html webcite.
Discussion
Ideally, all cell subpopulations in an FCM sample would be wellseparated to facilitate gating. In practice, this is rarely, if ever the case. Real flow data typically consists of a mixture of complicated distributions that are asymmetric, frequently overlapping, with cell populations whose variances are dependent on their mean fluorescence intensities. A variety of automated gating algorithms have been proposed to identify distinct cell subpopulations [1,610]. Most of these automated approaches function under the assumption that the data has been transformed prior to the automated gating step. Others even include data transformation as part of the gating procedure [6,8,10]. The global data transformation step treats all cells and cell subpopulations equally, and hides an implicit assumption of a common error model across all cells and cell subpopulations. The problem is that we do not know what the correct error model is for FCM data, and so it is simply ignored. In this paper, our approach to parameter optimization attempts to make this assumption more explicit. We assume that, on the transformed scale, the global data distribution can be approximated by a multivariate Gaussian distribution. Although the data are clearly not multivariate normal, as typical FCM data are multimodal and best represent by a mixture model, we transform the data towards normality to make the distribution more symmetric, mitigate the impact of outliers, and to have an objective criterion to use for estimating transformation parameters. Our simulation study shows that our assumption of a common Gaussian distribution allows us to obtain reasonable estimates of the optimal transformation parameters even when the Gaussian assumption is violated. Only subsequently do we take on the task accurately modelling the data using a modelbased automated gating algorithm. We leave the work of fitting a mixture model to resulting transformed data to the automated gating algorithm.
We have examined the impact of the choice of transformation and its corresponding parameters on the automated gating procedure. Transforming the data towards normality reduces the influence of outlier events and our simulations have shown that optimizing transformation parameters in this way can improve gating and cell subpopulation discovery when compared to applying a naive transformation with default parameter values. Optimization of the parameters leads to lower misclassification rates, improved cell population identification, lower intersample variability and fewer outliers than blindly applying default transformation parameters. In our simulations, data were transformed with the biexponential function using randomly chosen parameters. Only parameteroptimized transformations from the same family (optimized generalized arcsinh and optimized biexponential) regenerated a data distribution similar to the untransformed data, as exhibited by comparable gating misclassification rates (Figure 2b). The optimized linlog and generalized BoxCox transformations did not have the flexibility to transform the data back to the original distribution, leading to biases in the automated gating step. Although the differences in misclassification rate are only a few percent, they could introduce significant variation on downstream analysis, especially in largesample situations, where many data sets need to be analyzed in an automated manner.
Our analysis of a subset of realworld flow cytometry data set derived from lymph node biopsies of individuals with follicular lymphoma demonstrated similar results to those obtained using simulated data. Optimizing transformation parameters in order to make the transformed FCM data more normallike can improve data visualization and cell subpopulation identification in certain cases. However the improvement in performance is data dependent, and it is unclear how to determine in advance which samples benefit from such an approach as compared to applying a standard FCM transformation. Addressing this problem is the subject of future work. In the cases examined in this study, transformation of the scatter dimensions via a parameteroptimized generalized arcsinh or biexponential transformation generally improves cell subpopulation visualization (Figure 3). Therefore, in situations where scatter data are to be gated manually we argue in favour of such an approach. However, we note that under an automated gating scheme, the differences between transformations are marginal on the scale of the data, and the benefits are minimal, particularly when FSC and SSC cell populations are well defined. This is in accordance with what is typically seen in FCM data analysis [14].
Cell subpopulation identification in the fluorescence channels benefits more from parameter optimized transformation than in the scatter channels. However, again the choice of transformation is datadependent. While the greatest improvement in metacluster variability was observed for the parameteroptimized generalized arcsinh transformation in the case of scatter channels, here the greatest improvement is observed for the parameteroptimized biexponential transformation and parameteroptimized generalized BoxCox transformation. The intrametacluster variation for these transforms is lower than for the log transformation, but only marginally lower than for the default parameter biexponential transformation. Despite this, closer examination of the metaclusters generated from the default biexponential transformed data shows that one of the CD3/CD5+/CD19+ cell subpopulations was not captured by the CD3/CD5+/CD19+ metacluster. Additionally, the default biexponential transformed data leads to seven metaclusters, compared to six metaclusters obtained for the optimized biexponential transformed data. These additional metaclusters capture outlier cell populations which are not of interest in this particular experiment. In general, the metaclusters representing the primary cell subpopulations in this data all capture representatives from each of the ten samples. The predominant source of intrametacluster variability is derived from the metaclusters corresponding to outlier cell populations. Therefore, in situations where an automated gating and analysis approach is undertaken, we recommend applying a parameteroptimized data transformation such as the optimized biexponential, rather than the default biexponential, since it has the potential to improve cell subpopulation discovery and matching across multiple samples. This is likely to be a greater concern in largesample situations than when a small number of samples are to be analyzed manually.
Our metaclustering approach has been designed to work specifically for data gated using the flowClust/flowMerge algorithms. The constraint limiting each metacluster to one cell population per sample is predicated on the assumption that the gating algorithm (flowMerge in our case) represents each distinct cell population by a unique mixture component. Generally speaking, this is a safe assumption in the case of the flowMerge algorithm, which has been designed to identify and represent distinct cell populations by unique mixture components [4]. However, the metaclustering algorithm would have to be modified if the gating algorithm were changed.
We explored alternative approaches to parameter estimation that included preselection of a subpopulation of events and optimizing transformation parameters with respect to the subpopulation. However, this approach proved ineffective, since preselection effectively removed outlier events. It is these outliers that cause problems for downstream automated gating approaches, and thus should be considered in the transformation step. The decision to transform the data towards a multivariate Gaussian distribution is supported by our ability to accurately recover the true transformation parameters in our simulation study, even though the original data distribution is clearly derived from a multivariate mixture model. Our approach could be extended in a number of ways analogous to flowClust, either by embedding the transformations within the gating step, thus selecting transformation parameters maximize the likelihood of a Kcomponent mixture rather than a single component density, or by modelling the data with a more robust distribution such as the multivariatet. This approach could have the benefit of generating less variable parameter estimates, though likely at the expense of computation time.
Conclusions
Although the idea of optimizing transformations for FCM data is not new, to date, there has been no systematic comparison of FCM data transformations examining their performance in an automated data analysis setting [6]. We have developed criteria for optimizing the parameters of transformations commonly used for preprocessing and visualization of FCM data, designed to transform the data towards a more multivariate normal and symmetric global data distribution. We have shown that these parameteroptimized transformations can improve data visualization, population discovery, and metaclustering, relative to their defaultparameter counterparts in certain samplespecific cases. Parameter optimization of the generalized BoxCox transform has been previously implemented within an automated gating framework implementing a mixture modelling approach (flowClust) [6]. While it would be of interest to implement a similar strategy for the other commonly used flow cytometry data transformations, our goal here has been to examine the influence of global transformation on the automated gating step of high throughput FCM analysis. Our software allows the user to quickly examine the effects of different parameteroptimized transformations on the data, and defer the computationintensive gating step to downstream analysis. Our findings showed that the optimized generalized arcsinh transformation had the lowest intrasample variability between populations for the scatter channels. However, substantive differences in variation were subtle and rare across the different transformations in the scatter channels. In contrast, the optimized biexponential transform had the lowest variability for the fluorescence channels. This transformation is in accordance with current best practices. Therefore we would recommend, in the absence of additional knowledge to suggest a given data transformation, the parameteroptimized versions of the biexponential transform for fluorescence channel data, over the defaultparameter counterpart. In situations where many samples need to be processed in an automated manner, the parameteroptimized generalized arcsinh may be a better choice over the current standard practice of not transforming the scatter data, thus reaping the benefits of occasional improvements in the automated gating step. Due to its simplicity, our optimization algorithm could be readily implemented in other widely available tools such as FlowJo and WinList, and indeed, efforts are currently underway to tie FlowJo more closely to the R computing environment [11].
Authors' contributions
GF designed experiments, performed analysis, and contributed to manuscript preparation. RG designed experiments and contributed to manuscript preparation. JMP contributed to manuscript preparation. AW contributed the clinical FCM data.
Acknowledgements
The authors wish to acknowledge the help of Dr. Randy Gascoyne and Ryan Brinkman for providing the lymphoma data set. This work was funded by a CIHR bridge funding, NIH grant R01EB008400 and an establishment grant from the Fonds de la recherche en santé du Quebec (FRSQ). We thank Drs. Thomas Lumley and Ryan Brinkman for their feedback during manuscript preparation. We also sincerely thank the reviewers for their comments and suggestions towards improving the manuscript.
References

Walther G, Zimmerman N, Moore W, Parks D, Meehan S, Belitskaya I, Pan J, Herzenberg L: Automatic clustering of flow cytometry data with densitybased merging.
Adv Bioinformatics 2009, 686759. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, BaecherAllan C, McLachlan GJ, Tamayo P, Hafler DA, Jager PLD, Mesirov JP: Automated highdimensional flow cytometric data analysis.
Proc Natl Acad Sci USA 2009, 106(21):851924. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rogers WT, Holyst HA: FlowFP: A Bioconductor Package for Fingerprinting Flow Cytometric Data.
Adv Bioinformatics 2009, 193947. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Finak G, Bashashati A, Brinkman R, Gottardo R: Merging mixture components for cell population identification in flow cytometry.
Advances in Bioinformatics 2009, 247646. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gottardo R, Brinkman RR, Luta G, Wand MP: Recent bioinformatics advances in the analysis of high throughput flow cytometry data.
Adv Bioinformatics 2009, 461763. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lo K, Brinkman R, Gottardo R: Automated gating of flow cytometry data via robust modelbased clustering.
Cytometry Part A 2008, 73(4):321. Publisher Full Text

Boedigheimer M, Ferbas J: Mixture modeling approach to flow cytometry data.
Cytometry A 2008, 73(5):4219. PubMed Abstract  Publisher Full Text

Naumann U, Luta G, Wand MP: The curvHDR method for gating flow cytometry samples.
BMC Bioinformatics 2010, 11:44. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Rogers W, Moser A, Holyst H, Bantly A: Cytometric fingerprinting: Quantitative characterization of multivariate distributions. [http://www.ncbi.nlm.nih.gov/pubmed/18383310] webcite
Cytometry. Part A: the journal of the International Society ... 2008.

Naumann U, Wand M: Automation in highcontent flow cytometry screening.

Gosink JJ, Means GD, Rees WA, Su C, Rand HA: Bridging the Divide between Manual Gating and Bioinformatics with the Bioconductor Package flowFlowJo.
Adv Bioinformatics 2009, 809469. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Parks D, Roederer M, Moore W: A new" Logicle" display method avoids deceptive effects of logarithmic scaling for low signals and compensated data.
Cytometry Part A 2006, 69(6):541. Publisher Full Text

Bagwell C: Hyperloga flexible loglike transform for negative, zero, and positive valued data.
Cytometry A 2005, 64:3442. PubMed Abstract  Publisher Full Text

Bashashati A, Brinkman RR: A survey of flow cytometry data analysis methods.
Adv Bioinformatics 2009, 584603. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Novo D, Wood J: Flow cytometry histograms: transformations, resolution, and display.
CytometryPart A 2008, 73(8):685692. Publisher Full Text

Herzenberg L, Tung J, Moore W, Herzenberg L, Parks D: Interpreting flow cytometry data: a guide for the perplexed.
Nature Immunology 2006, 7(7):681. PubMed Abstract  Publisher Full Text

Dvorak J, Banks S: Modified boxcox transform for modulating the dynamic range of flow cytometry data.
Cytometry Part B: Clinical Cytometry 1989, 10(6):811813. Publisher Full Text

Lo K, Hahne F, Brinkman R, Gottardo R: flowClust: a Bioconductor package for automated gating of flow cytometry data.
BMC Bioinformatics 2009, 10:145. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Box G, Cox D: An analysis of transformations.
Journal of the Royal Statistical Society. Series B (Methodological) 1964, 26(2):211252.

Ihaka R, Gentleman R: R: A language for data analysis and graphics.
Journal of computational and graphical statistics 1996, 5(3):299314. Publisher Full Text

Hahne F, Khodabakhshi A, Bashashati A, Wong C, Gascoyne R, Weng A, SeyfertMargolis V, Bourcier K, Asare A, Lumley T, Gentleman R, Brinkman R: Perchannel basis normalization methods for flow cytometry data. [http:/ / www.google.ca/ search?client=safari&rls=enus&q=Pe rchannel+basis+normalization+ meth ods+for+flow+cytometry+data&ie=UTF 8&oe=UTF8&redir_esc=&ei=vYi8S4fpEY mC9ASLqaH0Bw] webcite

Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J: Bioconductor: open software development for computational biology and bioinformatics.
Genome Biol 2004, 5(10):R80. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Johnson S: Hierarchical clustering schemes.
Psychometrika 1967, 32(3):241254. PubMed Abstract  Publisher Full Text

Mahalanobis P: On the generalized distance in statistics.
Proceedings of the National Institute of Science, Calcutta 1936, 12:49.

Hahne F, LeMeur N, Brinkman R, Ellis B, Haaland P, Sarkar D, Spidlen J, Strain E, Gentleman R: flowCore: a Bioconductor package for high throughput flow cytometry.
BMC Bioinformatics 2009, 10:106. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text