Department of Statistics, University of British Columbia, 333-6356 Agricultural Road, Vancouver, BC, V6T1Z2, Canada

Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA

Terry Fox Laboratory, BC Cancer Research Center, 675 West 10th Avenue, Vancouver, BC, V5Z1L3, Canada

Institut de recherches cliniques de Montreal, 110, avenue des Pins Ouest, Montreal, QC, H2W 1R7, Canada

Département de biochimie, Université de Montreal, 2900, boul Edouard-Montpetit, Montreal, QC, H3T 1J4, Canada

Abstract

Background

As a high-throughput technology that offers rapid quantification of multidimensional characteristics for millions of cells, flow cytometry (FCM) is widely used in health research, medical diagnosis and treatment, and vaccine development. Nevertheless, there is increasing concern about the lack of appropriate software tools to provide an automated analysis platform to parallel the high-throughput data-generation platform. Currently, to a large extent, FCM data analysis relies on the manual selection of sequential regions in 2-D graphical projections to extract the cell populations of interest. This is a time-consuming task that ignores the high dimensionality of FCM data.

Results

In view of the aforementioned issues, we have developed an **R** package called **flowClust** to automate FCM analysis. **flowClust** implements a robust model-based clustering approach based on multivariate *t* mixture models with the Box-Cox transformation. **flowClust** has been adapted for the current FCM data format, and integrated with existing Bioconductor packages dedicated to FCM analysis.

Conclusion

**flowClust** addresses the issue of a dearth of software that helps automate FCM analysis with a sound theoretical foundation. It tends to give reproducible results, and helps reduce the significant subjectivity and human time cost encountered in FCM analysis. The package contributes to the cytometry community by offering an efficient, automated analysis platform which facilitates the active, ongoing technological advancement.

Background

Flow cytometry (FCM) is a high-throughput technology that offers rapid quantification of a set of physical and chemical characteristics for a large number of cells in a sample. FCM is widely used in health research and treatment for a variety of tasks, such as providing the counts of helper-T lymphocytes needed to monitor the course and treatment of HIV infection, diagnosing and monitoring leukemia and lymphoma patients, and evaluating peripheral blood hematopoietic stem cell grafts; it is also applied to many other diseases.

Currently, FCM can be applied to analyze thousands of samples per day. Nevertheless, despite its widespread use, FCM has not reached its full potential due to the lack of an automated analysis platform to parallel the high-throughput data-generation platform. In contrast to the tremendous interest in the FCM technology, there is a dearth of statistical and bioinformatics tools to manage, analyze, present, and disseminate FCM data. There is considerable demand for the development of appropriate software tools, as manual analysis of individual samples is error-prone, non-reproducible, non-standardized, not open to re-evaluation, and requires an inordinate amount of time, making it a limiting aspect of the technology.

One core component of FCM analysis involves gating, the process of identifying cell populations that share a set of common properties or display a particular biological function. Currently, to a large extent, gating relies on the sequential application of a series of manually drawn gates (i.e., data filters) that define regions in 1- or 2-D graphical projections of FCM data. This process is time-consuming and subjective, as researchers have traditionally relied on intuition rather than standardized statistical inference.

Recently, a suite of several **R** packages providing infrastructure for FCM analysis has been released through Bioconductor, including **flowCore**, which provides the basic data structures for FCM data, and **flowViz**, which provides visualization tools. **flowQ** provides quality control and quality assessment tools for FCM data. Finally, **flowUtils** provides utilities to deal with data import/export for **flowCore**. In spite of these low-level tools, there is still a dearth of software that helps automate FCM gating analysis with a sound theoretical foundation.

In view of these issues, based on a formal statistical clustering approach, we have developed the **flowClust** package (provided as an additional file). **flowClust** implements a robust model-based clustering approach. It includes options allowing for a cluster-specific estimation of the Box-Cox transformation parameter and/or the degrees of freedom parameter; the Implementation section and the Results and Discussion section provide a detailed account of these extensions.

**A copy of the flowClust package**. The zip file contains the source code of the **flowClust** package (version 2.2.0) as a gzipped tarball for direct installation into R from a command-line interface. The current release is also available from Bioconductor.

Implementation

The model

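In statistics, model-based clustering is an unsupervised approach in which each cluster is represented by one component of a finite mixture distribution. Given independent $p$-dimensional observations $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, and denoting by $\boldsymbol{\Psi}$ the collection of all unknown parameters, the likelihood for a mixture model with $G$ components is

$$L(\boldsymbol{\Psi} \mid \mathbf{y}_1, \ldots, \mathbf{y}_n) = \prod_{i=1}^{n} \sum_{g=1}^{G} w_g \, \varphi_p\!\left(\mathbf{y}_i^{(\lambda_g)} \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g\right) \left| J_p(\mathbf{y}_i; \lambda_g) \right|,$$

where $w_g$ is the probability that an observation belongs to the $g$-th component, and $\varphi_p(\cdot \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g)$ is the $p$-dimensional multivariate $t$ density with mean $\boldsymbol{\mu}_g$ ($\nu_g > 1$), covariance matrix $\nu_g (\nu_g - 2)^{-1} \boldsymbol{\Sigma}_g$ ($\nu_g > 2$) and $\nu_g$ degrees of freedom. $\mathbf{y}_i^{(\lambda_g)}$ is the Box-Cox transformation of $\mathbf{y}_i$ with the Box-Cox parameter $\lambda_g$; the transformation used is a variant of the original Box-Cox transformation which is also defined for negative-valued data. $|J_p(\mathbf{y}_i; \lambda_g)|$ is the Jacobian induced by the transformation. The unknown parameters $\boldsymbol{\Psi} = (\boldsymbol{\Psi}_1, \ldots, \boldsymbol{\Psi}_G)$, where $\boldsymbol{\Psi}_g = (w_g, \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g, \lambda_g)$, are estimated using an Expectation-Maximization (EM) algorithm.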
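To make the structure of the mixture likelihood concrete, the following is a minimal univariate sketch in Python. It is not the package's C implementation. It assumes the sign-preserving form (sgn(y)|y|^λ - 1)/λ for the Box-Cox variant, which the text does not spell out, and all function names are hypothetical.

```python
import math


def boxcox(y, lam):
    """Assumed sign-preserving Box-Cox variant: (sgn(y)|y|^lam - 1) / lam, lam > 0."""
    sign = 1.0 if y >= 0 else -1.0
    return (sign * abs(y) ** lam - 1.0) / lam


def boxcox_jacobian(y, lam):
    """Absolute derivative of boxcox with respect to y: |y|^(lam - 1).

    Raises ZeroDivisionError for y == 0 when lam < 1.
    """
    return abs(y) ** (lam - 1.0)


def t_density(x, mu, sigma2, nu):
    """Univariate Student t density with location mu, squared scale sigma2, nu d.f."""
    z2 = (x - mu) ** 2 / sigma2
    log_const = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                 - 0.5 * math.log(nu * math.pi * sigma2))
    return math.exp(log_const - (nu + 1.0) / 2.0 * math.log1p(z2 / nu))


def mixture_log_likelihood(ys, weights, mus, sigma2s, nus, lams):
    """Log-likelihood of a t mixture on Box-Cox transformed data (1-D case)."""
    total = 0.0
    for y in ys:
        # Sum over components: weight * t density of transformed y * Jacobian.
        mix = sum(
            w * t_density(boxcox(y, lam), mu, s2, nu) * boxcox_jacobian(y, lam)
            for w, mu, s2, nu, lam in zip(weights, mus, sigma2s, nus, lams)
        )
        total += math.log(mix)
    return total
```

Each component carries its own Box-Cox and degrees of freedom values here, which mirrors the cluster-specific options described in the text; passing the same value for every component corresponds to a common parameter.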

The EM algorithm needs to be initialized. By default, random partitioning is performed 10 times in parallel, and the one delivering the highest likelihood value after a few EM runs will be selected as the initial configuration for the eventual EM algorithm.
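The initialization strategy can be sketched as follows. This is an illustration only: the `score` callback is a hypothetical stand-in for "log-likelihood reached after a few EM iterations started from this partition", the function names are not part of the package, and the candidates are evaluated sequentially rather than in parallel.

```python
import random


def random_partition(n_obs, n_clusters, rng):
    """Assign each observation a cluster label drawn uniformly at random."""
    return [rng.randrange(n_clusters) for _ in range(n_obs)]


def pick_initial_partition(n_obs, n_clusters, score, n_starts=10, seed=0):
    """Try n_starts random partitions and keep the one with the highest score.

    `score` is a hypothetical callback returning the likelihood value obtained
    after a few EM iterations started from the given partition.
    """
    rng = random.Random(seed)
    best_labels, best_score = None, float("-inf")
    for _ in range(n_starts):
        labels = random_partition(n_obs, n_clusters, rng)
        value = score(labels)
        if value > best_score:
            best_labels, best_score = labels, value
    return best_labels, best_score
```

The selected partition would then seed the full EM run; fixing the seed keeps the choice reproducible across runs.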

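Note that the model implemented in **flowClust** extends the one originally proposed: it includes options for cluster-specific estimation of the Box-Cox transformation parameter and/or the degrees of freedom parameter.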

When the number of clusters is unknown, we use the Bayesian Information Criterion (BIC) to guide the choice of the number of clusters.
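As a rough illustration of how BIC trades goodness of fit against model size, the sketch below computes BIC on the "larger is better" scale, 2 log L - k log n. The sign convention and the parameter count are assumptions made for this example (the count follows from the parameter list of the mixture model, with one weight determined by the others), and the function names are hypothetical.

```python
import math


def count_parameters(n_clusters, n_dims, lambda_per_cluster=False, nu_mode="fixed"):
    """Count free parameters of a t mixture with Box-Cox transformation.

    nu_mode: "fixed" (not estimated), "common" (one value), or "per_cluster".
    """
    weights = n_clusters - 1                      # mixing proportions sum to one
    means = n_clusters * n_dims
    covariances = n_clusters * n_dims * (n_dims + 1) // 2
    lambdas = n_clusters if lambda_per_cluster else 1
    nus = {"fixed": 0, "common": 1, "per_cluster": n_clusters}[nu_mode]
    return weights + means + covariances + lambdas + nus


def bic(log_likelihood, n_params, n_obs):
    """BIC on the 'larger is better' scale: 2 log L - k log n."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)
```

Under these assumptions, a four-cluster model on two parameters has 24 free parameters with a common Box-Cox parameter and 27 with cluster-specific ones, so the latter must improve the log-likelihood enough to offset the larger penalty.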

The package

With the aforementioned theoretical basis, we have developed **flowClust**, an **R** package to conduct an automated FCM gating analysis and produce visualizations for the results. Its source code is written in C for optimal utilization of system resources and makes use of the Basic Linear Algebra Subprograms (BLAS) library, which facilitates multithreaded processes when an optimized library is provided.

**flowClust** is released through Bioconductor, along with the **R** packages mentioned in the Background section. The GNU Scientific Library (GSL) is needed for successful installation of **flowClust**. Please refer to the vignette that comes with **flowClust** for details about installation; Windows users may also consult the README file included in the package for procedures of linking GSL to **R**.

The package adopts a formal object-oriented programming discipline, making use of the S4 system.

To enhance communications with other Bioconductor packages designed for the cytometry community, **flowClust** has been built with the aim of being highly integrated with **flowCore**. Methods in **flowClust** can be applied directly to the **R** implementation of a Flow Cytometry Standard (FCS) file defined in **flowCore**; FCS is the typical storage mode for FCM data. Another step towards integration is to overload basic filtering methods defined in **flowCore** so that they also work with **flowClust**.

Results and discussion

Analysis of real FCM data

In this section, we illustrate how to use **flowClust** to conduct an automated gating analysis of real FCM data. For demonstration, we use the graft-versus-host disease (GvHD) data (provided as an additional file). The objective is to identify the CD3^{+}CD4^{+}CD8β^{+} cell population, a distinctive feature found in GvHD-positive samples. We have adopted a two-stage strategy.

**A copy of the GvHD data file used in this article**. The zip file contains the data file in FCS format used in the GvHD analysis. Interested readers may go to

At the initial stage, we extract the lymphocyte population using the forward scatter (

To estimate the number of clusters, we run

**A plot of BIC against the number of clusters for the first-stage cluster analysis**. The BIC curve remains relatively flat beyond four clusters, suggesting that the model fit using four clusters is appropriate.

The estimate of the Box-Cox parameter

Note that, by default,

The

**A graph with two BIC curves corresponding to the settings with a common λ and cluster-specific λ respectively for the first-stage cluster analysis**. Little difference in the BIC values between the two settings is observed. In accordance with the principle of parsimony which favors a simpler model, we opt for the default setting here.

Graphical functionalities are available to users for visualizing a wealth of features of the clustering results, including the cluster assignment, outliers, and the size and shape of the clusters. Figure

**A scatterplot revealing the cluster assignment in the first-stage analysis**. Clusters 1, 3 and 4 correspond to the lymphocyte population, while cluster 2 is referred to as the dead cell population. The black solid lines represent the 90% quantile regions of the clusters, which define the cluster boundaries. Points outside the boundary of the cluster to which they are assigned are called outliers and marked with "+".
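One way to express a quantile-region boundary of this kind is sketched below for the two-dimensional case. It assumes each cluster follows a bivariate *t* distribution, so that the squared Mahalanobis distance divided by 2 follows an F(2, ν) distribution, whose quantile has a closed form. Whether the package computes its boundaries exactly this way is an assumption, and the function names are hypothetical.

```python
def mahalanobis_sq_2d(point, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from a cluster center."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def quantile_threshold_2d(nu, level=0.9):
    """Squared-distance cutoff for a bivariate t with nu degrees of freedom.

    Equals 2 * F(2, nu) quantile at `level`, i.e. nu * ((1 - level)^(-2/nu) - 1).
    """
    return nu * ((1.0 - level) ** (-2.0 / nu) - 1.0)


def is_outlier(point, mean, cov, nu, level=0.9):
    """Flag a point lying outside the `level` quantile region of its cluster."""
    return mahalanobis_sq_2d(point, mean, cov) > quantile_threshold_2d(nu, level)
```

With ν = 4 and a 90% level, the cutoff on the squared distance is about 8.65 under these assumptions; raising the level enlarges the region and flags fewer points.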

See Additional file

**Result summary of the first-stage analysis with four clusters of the GvHD data**. The rule used to identify outliers is

Clusters 1, 3 and 4 in Figure

The subsetting method

In the second-stage analysis, in order to fully utilize the multidimensionality of FCM data we cluster the lymphocyte population using all the four fluorescence parameters, namely, anti-CD4 (

The BIC curve remains relatively flat beyond 11 clusters (Figure ^{+}CD4^{+}CD8^{+ }cell population. A corresponding image plot is given by Figure

**Code to produce the plots in this article**. R code to produce the plots in the GvHD analysis.

**A plot of BIC against the number of clusters for the second-stage cluster analysis**. The BIC curve remains relatively flat beyond 11 clusters, suggesting that the model fit using 11 clusters is appropriate.

**A contour plot superimposed on a scatterplot of CD8β against CD4 for the CD3^{+} population**. The red and purple clusters at the upper right correspond to the CD3

**An image plot of CD8β against CD4 for the CD3^{+} population**. The five clusters corresponding to the CD3

The example above shows how an FCM analysis is conducted with the aid of **flowClust**. When the number of cell populations is not known in advance, and the BIC values are relatively close over a range of possible numbers of clusters, the researcher may be presented with a set of possible solutions instead of a single clear-cut one. In such a case, the level of automation may be undermined, as the researcher may need to select the best solution based on their expertise. We acknowledge that more effort is needed to extend our proposed methodology towards a higher level of automation. Currently, we are working on an approach which successively merges the clusters in the solution suggested by the BIC, using an entropy criterion, to give a more reasonable estimate of the number of clusters.
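The situation in which several cluster counts give similar BIC values can be made explicit with a small helper such as the one below. It is a sketch, not part of the package: the tolerance is a user-chosen, hypothetical threshold, BIC is assumed to be on the "larger is better" scale, and the numbers in the usage example are made up for illustration rather than taken from the GvHD analysis.

```python
def candidate_cluster_counts(bic_by_k, tolerance):
    """Return the cluster counts whose BIC lies within `tolerance` of the best.

    bic_by_k maps a number of clusters K to its BIC (larger is better).
    """
    best = max(bic_by_k.values())
    return sorted(k for k, value in bic_by_k.items() if best - value <= tolerance)


def smallest_adequate_k(bic_by_k, tolerance):
    """Pick the most parsimonious candidate, i.e. the smallest qualifying K."""
    return candidate_cluster_counts(bic_by_k, tolerance)[0]


# Made-up BIC values that flatten out beyond four clusters.
example = {1: -5000.0, 2: -4200.0, 3: -4050.0, 4: -3990.0, 5: -3985.0, 6: -3980.0}
candidates = candidate_cluster_counts(example, tolerance=15.0)
chosen = smallest_adequate_k(example, tolerance=15.0)
```

Returning the whole candidate set rather than a single value keeps the final decision with the researcher, while choosing the smallest candidate follows the principle of parsimony.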

Integration with flowCore

As introduced in the Background section, **flowClust** has been built in a way such that it is highly integrated with the **flowCore** package. The core function **flowCore** (e.g.,

used in the first-stage analysis of the GvHD data may be replaced by:

The use of a dedicated **flowCore**. Users may apply various subsetting operations defined for the

outputs a

We realize that occasionally a researcher may opt to combine the use of **flowClust** with filtering operations in **flowCore** to define the whole sequence of an FCM gating analysis. To enable the exchange of results between the two packages, filters created by **flowCore**; users of **flowCore** will find that filter operators, namely, **flowClust** package. For instance, suppose the researcher is interested in clustering the CD3^{+} cell population, defined by constructing an interval gate with the lower end-point at 270 on the CD3 parameter. The following code may be used to perform the analysis:

The constructors

Conclusion

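**flowClust** is an **R** package dedicated to FCM gating analysis, addressing the increasing demand for software capable of processing and analyzing the voluminous amount of FCM data efficiently via an objective, reproducible and automated means. The package implements a statistical clustering approach using multivariate *t* mixture models with the Box-Cox transformation. The model implemented in **flowClust** extends the one originally proposed by allowing cluster-specific estimation of the Box-Cox transformation parameter and/or the degrees of freedom parameter.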

Availability and requirements

Project name: flowClust

Project homepage:

Operating systems: Platform independent

Programming language: C, R

Other requirements: GSL, R, Bioconductor

License: Artistic 2.0

Any restrictions to use by non-academics: **flowClust** depends on the **mclust** software, the use of which needs to abide by the terms of the **mclust** license.

Authors' contributions

KL and RG developed the methodology and software, and performed the analyses. FH participated in the development of the software. RRB and RG conceived of the study, and participated in its design and coordination. FH, RRB and RG helped KL draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Martin Morgan, Patrick Aboyoun and Marc Carlson for their advice on the technical issues of building the **flowClust** package, and the two reviewers for suggestions that improved an earlier draft of the article. This work was supported by the NIH grants EB005034 and EB008400, and by the Michael Smith Foundation for Health Research.