Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 USA

Genomics Division West, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 USA

Department of Basic Medical Science, Purdue University, West Lafayette, IN 47907 USA

Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, VA 20147 USA

Abstract

Background

The distribution of chromatin-associated proteins plays a key role in directing nuclear function. Previously, we developed an image-based method to quantify the nuclear distributions of proteins and showed that these distributions depended on the phenotype of human mammary epithelial cells. Here we describe a method that creates a hierarchical tree of the given cell phenotypes and calculates the statistical significance between them, based on the clustering analysis of nuclear protein distributions.

Results

Nuclear distributions of nuclear mitotic apparatus protein were previously obtained for non-neoplastic S1 and malignant T4-2 human mammary epithelial cells cultured for up to 12 days. Cell phenotype was defined by cell type (S1 or T4-2) and the number of days in culture. A probabilistic ensemble approach was used to define a set of consensus clusters from the results of multiple traditional cluster analysis techniques applied to the nuclear distribution data. Cluster histograms were constructed to show how cells in any one phenotype were distributed across the consensus clusters. Grouping various phenotypes allowed us to build phenotype trees and calculate the statistical difference between each group. The results showed that non-neoplastic S1 cells could be distinguished from malignant T4-2 cells with 94.19% accuracy; that proliferating S1 cells could be distinguished from differentiated S1 cells with 92.86% accuracy; and that there was no significant difference between the various phenotypes of T4-2 cells corresponding to increasing tumor sizes.

Conclusion

This work presents a cluster analysis method that can identify significant cell phenotypes, based on the nuclear distribution of specific proteins, with high accuracy.

Background

Histological classification of biopsied breast tissue plays a key role in mammary cancer detection and in determining patient treatment. Current methods rely on gross signatures of cellular and tissue organization, including tubular formation, nuclear pleomorphism and mitotic activity. To aid the early detection and diagnosis of mammary tumors, quantitative techniques are needed that could not only help automate the classification process but also provide subcellular information that could reveal new subclasses of tumor within each pathological grade.

Increasing evidence has shown that chromatin-associated proteins are important in directing nuclear functions involved in the control of cell proliferation and differentiation

Based on these findings, Knowles et al then developed an image-based technique, called local bright feature (LBF) analysis

Here we report a cluster analysis approach, based on the distribution of nuclear proteins, that robustly calculates the statistical significance between cell phenotypes, which are defined by the behavior of the cells in 3D culture. The method first groups LBF distributions into clusters using multiple traditional clustering methods. The results are then combined by a probabilistic ensemble approach into a set of consensus clusters that can be used to reliably define all possible LBF distributions that exist within a data set. Cluster histograms can then be computed to show how the LBF distributions of individual cells in a group are distributed over the consensus clusters. These cluster histograms represent a new way of linking the phenotype of groups of phenotypically similar cells, defined by their behavior in 3D culture, with their LBF distributions, quantified microscopically. Further, by grouping the LBF cluster histograms in multiple ways, the method is able to build a phenotype tree and to calculate the statistical significance between each grouping. Each level of the tree corresponds to a different phenotype division of the cells and provides a way to predict which of the cell phenotypes, or groupings of cell phenotypes, are significantly different from each other. These methods were then applied to the LBF distributions of NuMA in S1 and T4-2 cells, previously reported in Knowles et al

Results

Dataset

As described in

We used three image datasets to test our phenotype clustering approach. The first dataset contains 2673 non-neoplastic S1 cells taken from 77 confocal images. Images 1–25, 26–45, 46–61, and 62–77 are S1 cells cultured for 12 days, 10 days, 5 days, and 3 days, respectively. The second dataset contains 3535 malignant T4-2 cells taken from 44 images. Images 1–14, 15–26, 27–36, and 37–44 are T4-2 cells cultured for 5 days, 10 days, 11 days, and 4 days, respectively. The third, dependent dataset contains both malignant T4-2 and non-neoplastic S1 cells and is the direct combination of all 121 images. The time points were selected to span the growth progression of the non-neoplastic cultured cells. Optical sections from 3D images of individual nuclei, showing representative NuMA staining for each of the phenotypes, are displayed in the Methods section.

Clustering LBF distributions using traditional approaches

Using an automated image analysis method developed earlier

Using traditional approaches of fuzzy C-means clustering, Gaussian mixture model clustering (with a spherical kernel), K-means, hierarchical clustering (with a complete link scheme), and spectral clustering

**Clustering 2673 non-neoplastic S1 cells into 8 clusters according to the similarities of their LBF distributions**. Rows from top to bottom show the results of Gaussian mixture model clustering with a spherical kernel (GM), fuzzy C-means clustering (Fuzzy), hierarchical clustering with complete link (Hier), K-means, and spectral clustering (Spectral), respectively. Each cluster is represented by the centroid (curve) and the standard deviation (small vertical bar) of the LBF distributions in the cluster. The horizontal axis of each of the 5 × 8 panels is the normalized distance from the nucleus perimeter, the range being [0,1]. The vertical axis is the normalized bright feature density, the range being [0,2]. Also see Methods for the description of the LBF analysis.

Table

Pair-wise

| | GM | Fuzzy | Hier | Kmeans | Spectral |
|---|---|---|---|---|---|
| GM | 1.0000 | 0.8837 | 0.5205 | 0.6296 | 0.6286 |
| Fuzzy | 0.8837 | 1.0000 | 0.5270 | 0.6932 | 0.6177 |
| Hier | 0.5205 | 0.5270 | 1.0000 | 0.4543 | 0.5365 |
| Kmeans | 0.6296 | 0.6932 | 0.4543 | 1.0000 | 0.6253 |
| Spectral | 0.6286 | 0.6177 | 0.5365 | 0.6253 | 1.0000 |

Finding consensus LBF clusters using probabilistic ensemble clustering

As shown in Table

Several different ensemble-clustering methods have become available. In

In this work, we used a probabilistic ensemble approach based on Bayesian latent variable induction

Using the probabilistic ensemble clustering approach (see Methods for detail), we derived the statistically optimal consensus from different data partition results generated by the five traditional clustering methods mentioned above. Figure

**Consensus clusters of the five clustering results in Figure 1, generated by probabilistic ensemble clustering approach**. The number of clusters, i.e., 16, is automatically determined by the algorithm. As in Figure 1, each curve represents the centroid of the cluster. The vertical bar represents the standard deviation in the corresponding bin. The horizontal axis of each panel is the normalized distance from the nucleus perimeter, the range being [0,1], and the vertical axis is the normalized bright feature density, the range being [0,2].

Table

Number of clusters (the second row) predefined in the individual clustering methods (i.e., Gaussian mixture model, fuzzy C-means, hierarchical clustering, K-means and spectral clustering) and those automatically determined by the probabilistic ensemble clustering method for both S1 and T4-2 cells (the third row).

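| Methods | Number of clusters | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Traditional methods | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 | 26 |
| Probabilistic ensemble clustering | 19 | 18 | 18 | 16 | 19 | 20 | 19 | 20 | 22 | 22 | 23 | 25 |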

Computing cluster histograms

With clusters reliably determined, we then calculated the number of LBF distributions falling into each cluster for each of the 8 populations of cells, i.e., non-neoplastic S1 cells cultured for 3 days, 5 days, 10 days, and 12 days, as well as malignant T4-2 cells cultured for 4 days, 5 days, 10 days, and 11 days. By doing so, we obtained a cluster histogram for each of the 8 populations of cells. Figure ^{th }to the 20^{th }clusters the peak location shifts from the left to the right. Figure

**LBF distribution clusters and cluster histograms for 6208 S1 and T4-2 cells cultured for different numbers of days**. (a) Twenty LBF distribution clusters automatically determined by probabilistic ensemble clustering of the results generated by Gaussian mixture model, fuzzy C-means, hierarchical clustering, K-means, and spectral clustering. The number of clusters predefined for these baseline methods is 14. The clusters are ordered from left to right and top to bottom according to their peak locations. (b) From left to right and top to bottom: cluster histograms of non-neoplastic S1 cells cultured for 3 days, 5 days, 10 days, and 12 days, and of malignant T4-2 cells cultured for 4 days, 5 days, 10 days, and 11 days.

Constructing phenotype trees

Using the approach introduced in the Methods section, we have constructed phenotype trees to show how the phenotypes, defined by the behavior of the cells in 3D culture, can be hierarchically grouped and the statistical significance of each grouping calculated. Figure

**Phenotype trees constructed for (a) non-neoplastic S1 cells, (b) malignant T4-2 cells, and (c) both S1 and T4-2 cells cultured for a different number of days**. The certainty of hierarchically grouping the cells of the predefined phenotypes (indicated by the leaf nodes in the highest level of the tree) into statistically more significant groups of the phenotypes is indicated by the

Using the same approach, we constructed the phenotype trees for malignant T4-2 cells and for the combination of S1 and T4-2 cells, as shown in Figure

Discussion and conclusions

We have developed a cluster analysis approach that can robustly link any given set of multivariate features measured on a per-cell basis to the phenotype of the cells as defined by their macroscopic biology. The technique uses a probabilistic ensemble approach to group the measured multivariate features into a set of consensus clusters. This method provides a novel way of linking the phenotypes of groups of cells to cluster histograms that describe the distribution of the measured features across the consensus clusters. Then, by forming various groupings of the cluster histograms, the technique permits the formation of a phenotype tree and the calculation of the statistical significance between each of the groups. If two groups of cells are found to be significantly different, one can conclude that the features measured in the cells can distinguish the groups and that the groups are indeed different. If the two groups are not significantly different, one can only conclude that the measured feature does not change between these groups. It does not imply that the groups are necessarily identical.

The phenotype tree is a hierarchical representation of the possible grouping of the defined cell phenotypes. As such, a node in the tree at level

**Illustration of the inconsistent phenotype grouping between successive levels**. Each solid rectangle represents a phenotype node. A dashed line indicates a combination operation. Phenotype groupings at levels l and l+1 are inconsistent because the node BC at level l+1 is formed by breaking node AB and node CD at level l into two parts and combining one part of each node. In this case, the hierarchical structure cannot be represented as a tree.

We have shown how the cluster analysis technique can be applied to the radial LBF distributions of a chromatin-associated protein, NuMA

Collectively our data demonstrate the quantitative ability of clustering-based analysis to link microscopically measurable features with the behavior of the cells. The methods described demonstrate that it is possible to distinguish populations of cells based on the nuclear organization of a chromatin-associated protein, NuMA. This work paves the way for our longer term goal of producing a method capable of turning high resolution fluorescence images of human mammary epithelial tissue into tissue-maps that report the probable non-neoplastic, premalignant and malignant phenotype at cellular resolution.

Methods

Our phenotype clustering approach contains four steps (Figure

**Diagram of the phenotype clustering algorithm**. Details of the image acquisition and the extraction of the LBF for each nucleus are described in [5].

Extracting LBF distributions from nuclei

Using a Zeiss 410 confocal laser-scanning microscope with a planapochromatic 63×, 1.4 numerical aperture lens, we acquired hundreds of 3D images of non-neoplastic S1 and malignant T4-2 cells cultured for up to 12 days. Figure

**Fluorescence micrographs showing representative NuMA staining patterns in individual nuclei for eight different phenotypes**. In previous work [5] the radial nuclear distribution of NuMA was analyzed from 3D multichannel fluorescence images of thousands of individual nuclei. The human mammary epithelial cells were either non-neoplastic (top row) or malignant (bottom row) and were cultured in Matrigel™ (3D culture) for up to 12 days. Optical sections from 3D images, taken through the approximate midplane of individual nuclei are displayed. The optical sections were chosen to show representative features of the NuMA staining pattern. Panels a, b, c and d, show NuMA staining from non-neoplastic cells cultured for 3, 5, 10 and 12 days, representing cells present in incremental differentiation steps, respectively. Panels e, f, g, and h, show NuMA staining from malignant cells cultured for 4, 5, 10 and 11 days, representing cells present in tumors of increasing sizes, respectively. Notice that the nuclei of malignant cells are consistently larger than the nuclei of non-neoplastic cells. The bar represents 5 microns.

In an earlier study, an image analysis method was developed to extract the local bright staining features of NuMA protein and quantify their radial distribution in each individual nucleus (

**LBF analysis of the distribution of NuMA from 3D images**. (a) Fluorescence micrograph of Texas red-immunolabeled NuMA from a single optical section, in differentiated non-neoplastic S1 cells. (b) The corresponding processed image section showing a composite view of the detected local bright features (light gray) of NuMA, extracted by the local bright feature analysis overlaid on the nuclear segmentation mask (dark gray). (c) Concentric terraces resulting from the application of the distance transform on the segmentation mask, which allows the radial distribution of NuMA to be calculated. (d) A set of LBF distribution profiles of NuMA calculated from differentiated non-neoplastic S1 cells. The relative density of NuMA bright features (ordinate) is plotted as a function of the relative distance from the perimeter (0.0) to the center (1.0) of the nuclei (abscissa).
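The radial profile described in panels (c) and (d) can be illustrated with a small Python sketch. It is not the published LBF implementation: it assumes a single 2D optical section, a city-block (4-neighbour) distance transform computed by breadth-first search, and a profile normalized so that a uniform distribution of bright features gives a density of 1 in every bin. The function name `radial_profile` and its parameters are hypothetical, and the mask is assumed to contain at least one nuclear pixel.

```python
from collections import deque

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-neighbourhood (assumption)


def radial_profile(mask, bright, n_bins=10):
    """Relative bright-feature density from perimeter (bin 0) to centre (last bin).

    mask and bright are 2D lists of 0/1 with identical shape.
    """
    rows, cols = len(mask), len(mask[0])

    def inside(r, c):
        return 0 <= r < rows and 0 <= c < cols and bool(mask[r][c])

    # Distance transform: nuclear pixels touching the outside form terrace 1.
    depth = [[None] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if inside(r, c) and any(not inside(r + dr, c + dc) for dr, dc in STEPS):
                depth[r][c] = 1
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in STEPS:
            rr, cc = r + dr, c + dc
            if inside(rr, cc) and depth[rr][cc] is None:
                depth[rr][cc] = depth[r][c] + 1
                queue.append((rr, cc))

    max_depth = max(d for row in depth for d in row if d is not None)
    area = [0] * n_bins   # nuclear pixels per radial bin
    hits = [0] * n_bins   # bright-feature pixels per radial bin
    for r in range(rows):
        for c in range(cols):
            d = depth[r][c]
            if d is None:
                continue
            rel = (d - 1) / (max_depth - 1) if max_depth > 1 else 0.0
            b = min(int(rel * n_bins), n_bins - 1)
            area[b] += 1
            hits[b] += 1 if bright[r][c] else 0

    mean_density = sum(hits) / sum(area)
    density = [hits[b] / area[b] if area[b] else 0.0 for b in range(n_bins)]
    return [d / mean_density if mean_density else 0.0 for d in density]
```

A profile that rises toward the last bins indicates bright features concentrated near the centre of the nucleus; a flat profile of ones indicates a uniform distribution.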

Clustering LBF distributions using traditional approaches

Our phenotype clustering algorithm is based on the radial distribution of LBFs. To group the LBF distribution of thousands of nuclei into clusters of similar patterns, we first tested traditional clustering approaches, including the most widely used K-means, fuzzy C-means clustering, Gaussian mixture model (with a spherical kernel), hierarchical clustering (with the complete link scheme), and the spectral clustering methods
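As an illustration of one of the baseline methods, the following Python sketch clusters LBF profiles with plain K-means. It is a minimal stand-in rather than the code used in this study: the farthest-first initialization is an assumption chosen to make the result deterministic, and the function name `kmeans` and the toy profiles are hypothetical.

```python
def kmeans(profiles, k, n_iter=100):
    """Cluster equal-length profiles into k groups; returns (labels, centroids)."""

    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Deterministic farthest-first initialization (an assumption, not from the paper).
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        far = max(profiles, key=lambda p: min(sqdist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sqdist(p, centroids[j]))
                      for p in profiles]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy profiles: two peak near the perimeter, two peak near the centre.
toy = [[2.0, 1.0, 0.0], [1.9, 1.1, 0.0], [0.0, 1.0, 2.0], [0.1, 0.9, 2.0]]
toy_labels, toy_centroids = kmeans(toy, k=2)
```

Each centroid plays the role of the curve drawn for a cluster in the figures, and the spread of the member profiles around it corresponds to the vertical bars.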

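Since different clustering methods generate different clusters, we computed the pair-wise F-measure between their results. Let $C_i$ denote the $i$-th cluster produced by one method and $C'_j$ the $j$-th cluster produced by another. The proportion of data in $C_i$ that is also in $C'_j$ is $|C_i \cap C'_j|/|C_i|$, and the proportion of data in $C'_j$ that is also in $C_i$ is $|C_i \cap C'_j|/|C'_j|$; $F(C_i, C'_j)$ denotes the F-measure combining these two proportions. Define $F_0 = \left[\sum_i |C_i| \max_j F(C_i, C'_j)\right] / \left[\sum_i |C_i|\right]$, where $|C_i|$ is the number of data points in $C_i$. To make it symmetrical, the final F-measure is $F = (F_0 + F_0')/2$, where $F_0'$ denotes the transpose of $F_0$.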
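The pair-wise comparison of two clustering results can be sketched in Python as follows. The sketch rests on two assumptions that the text above does not spell out: the per-cluster score is the harmonic mean of the two overlap proportions (the conventional F-measure), and each cluster is scored against its best-matching cluster in the other result. The function names are hypothetical.

```python
from collections import Counter


def directed_f_measure(labels_a, labels_b):
    """Size-weighted best-match F-measure of clustering A against clustering B."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same data points")
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    overlap = Counter(zip(labels_a, labels_b))  # cluster intersection sizes

    total = 0.0
    for a, n_a in size_a.items():
        best = 0.0
        for b, n_b in size_b.items():
            n_ab = overlap.get((a, b), 0)
            if n_ab == 0:
                continue
            p = n_ab / n_a  # share of cluster a that is also in cluster b
            q = n_ab / n_b  # share of cluster b that is also in cluster a
            best = max(best, 2 * p * q / (p + q))  # harmonic mean (assumption)
        total += n_a * best
    return total / len(labels_a)


def pairwise_f_measure(labels_a, labels_b):
    """Symmetrized score: average of the two directed scores."""
    return 0.5 * (directed_f_measure(labels_a, labels_b)
                  + directed_f_measure(labels_b, labels_a))
```

Two results that agree up to a renaming of the cluster labels score 1, which matches the diagonal of the pair-wise table; lower values indicate less consistent partitions.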

Probabilistic ensemble clustering

The probabilistic ensemble clustering approach we used to derive the consensus clusters from multiple clustering results is based on general Bayesian latent variable induction. Let $A_i$ ($i = 1, \ldots, M$) denote the clustering results produced by the $M$ individual methods. One simple yet reasonable assumption is that $A_1, \ldots, A_M$ can be treated as independent samples drawn from the same underlying distribution; in other words, $A_1, \ldots, A_M$ are conditionally independent of each other given the latent variable, which represents the consensus clustering.

Let us suppose the _{i }clusters, then each _{i }has _{i }states (categorical labels), i.e., 1,..., _{i}. Initially the consensus _{i}, it takes a specific state value on _{i}. Denote _{1 }= _{1}, _{2 }= _{2},...., _{M }= _{M}), where _{i }(_{i}.

Upon initialization of the latent variable _{i }by the _{i}, we derive its probability of taking state label

where _{i }= _{i}|_{i }by the clustering method _{i}, given the data is assigned the state label

We observe that when the data samples (LBFs) are independent of each other, the likelihood of the latent variable

It is apparent that we can maximize the likelihood in Eq. (2) to find the best
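To make the idea of a latent consensus variable concrete, the following Python sketch fits a simple latent-class model by expectation-maximization, treating the labels from the individual methods as conditionally independent given the latent state. It is only an illustration under stated assumptions, not the algorithm used in this work: the number of latent states is fixed to the number of clusters of the first method (the published method determines it automatically), the latent variable is initialized from the first method's labels, and `alpha` is a small smoothing constant introduced here to avoid zero probabilities. The function name `consensus_clusters` is hypothetical.

```python
import math


def consensus_clusters(labelings, n_iter=50, alpha=1e-3):
    """labelings[m][n] is the label method m assigned to data point n."""
    n_methods, n_points = len(labelings), len(labelings[0])
    states = [sorted(set(lab)) for lab in labelings]

    # Initialize the latent variable from the first method (assumption).
    index = {s: k for k, s in enumerate(states[0])}
    n_latent = len(states[0])
    resp = [[1.0 if index[labelings[0][n]] == k else 0.0 for k in range(n_latent)]
            for n in range(n_points)]

    for _ in range(n_iter):
        # M-step: latent-state priors and per-method conditional label tables.
        weight = [sum(resp[n][k] for n in range(n_points)) for k in range(n_latent)]
        prior = [(w + alpha) / (n_points + alpha * n_latent) for w in weight]
        cond = []
        for m in range(n_methods):
            denom = [weight[k] + alpha * len(states[m]) for k in range(n_latent)]
            table = {(k, s): alpha / denom[k]
                     for k in range(n_latent) for s in states[m]}
            for n in range(n_points):
                s = labelings[m][n]
                for k in range(n_latent):
                    table[(k, s)] += resp[n][k] / denom[k]
            cond.append(table)

        # E-step: posterior of the latent state given all methods' labels.
        for n in range(n_points):
            logp = [math.log(prior[k])
                    + sum(math.log(cond[m][(k, labelings[m][n])])
                          for m in range(n_methods))
                    for k in range(n_latent)]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            norm = sum(unnorm)
            resp[n] = [u / norm for u in unnorm]

    # Hard consensus label: the most probable latent state for each point.
    return [max(range(n_latent), key=lambda k: resp[n][k]) for n in range(n_points)]
```

When the individual methods agree up to a renaming of their labels, the consensus reproduces the shared partition; an isolated disagreement in one method is outweighed by the others.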

Computing cluster histograms for cells of different phenotypes

Once we obtained reliable clusters of LBF distributions of individual nuclei, we analyzed how the cells belonging to different phenotypes, defined by the behavior of the cells (i.e., S1 and T4-2 cells cultured for different numbers of days), were distributed across the various LBF clusters. For this purpose, we counted the number of nuclei whose LBF distribution fell into each cluster for each phenotype, i.e., S1 cells cultured for 3, 5, 10, and 12 days, and T4-2 cells cultured for 4, 5, 10, and 11 days. By doing so, we obtained the cluster histogram of each phenotype, represented by the percentage of nuclei as a function of cluster. The cluster histograms not only link directly to predefined phenotypes (as shown in Figure
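The counting step can be sketched in a few lines of Python. The sketch assumes that each nucleus carries a phenotype tag and a consensus cluster index, and that the histogram is expressed as the percentage of a phenotype's nuclei in each cluster; the function name and the example tags are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_histograms(phenotypes, cluster_ids, n_clusters):
    """Percentage of nuclei of each phenotype that fall into each cluster."""
    counts = defaultdict(Counter)
    for phenotype, cluster in zip(phenotypes, cluster_ids):
        counts[phenotype][cluster] += 1

    histograms = {}
    for phenotype, counter in counts.items():
        total = sum(counter.values())
        histograms[phenotype] = [100.0 * counter[c] / total
                                 for c in range(n_clusters)]
    return histograms


# Hypothetical toy input: four S1 nuclei (3 days) and two T4-2 nuclei (4 days).
tags = ["S1-3d", "S1-3d", "S1-3d", "S1-3d", "T4-2-4d", "T4-2-4d"]
clusters = [0, 0, 1, 2, 2, 2]
example = cluster_histograms(tags, clusters, n_clusters=3)
```

Normalizing by the number of nuclei per phenotype keeps histograms comparable between phenotypes that contain different numbers of cells.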

Constructing the phenotype tree

Taking the non-neoplastic S1 cells cultured for different days as an example, our method in constructing the tree is as follows. For all the ^{th }bin corresponds to dividing the cells into 4 groups.

**An illustration of phenotype tree construction process**. (a) Images 1–25, 26–45, 46–61, and 62–77 correspond to non-neoplastic S1 cells cultured for 12 days, 10 days, 5 days, and 3 days, respectively. There are 7 possible ways of grouping the phenotypes. Each row corresponds to one possible way. Different colors represent different phenotype groups. The first 3 rows correspond to grouping the 4 predefined phenotypes into 2 groups. The next 3 rows correspond to grouping the phenotypes into 3 groups, and the last row corresponds to 4 groups. (b) Taking the 4 phenotype group case (last row in (a)) as an example, we used traditional clustering methods to divide the cluster histograms of the images (one cluster histogram per image) into the same number of clusters (i.e., 4 in this example). Each row corresponds to the clustering result of one method. (c) The
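The count of 7 candidate groupings for 4 phenotypes (3 with two groups, 3 with three groups, 1 with four groups) is what one obtains if groups are restricted to contiguous runs of the phenotypes in culture-time order. That restriction is an inference from the counts, not something stated explicitly, and the Python sketch below enumerates the groupings under it; the function name is hypothetical.

```python
from itertools import combinations


def contiguous_groupings(phenotypes):
    """All ways to split an ordered list into 2 or more contiguous groups."""
    n = len(phenotypes)
    groupings = []
    for n_groups in range(2, n + 1):
        # Choose the positions at which the ordered list is cut.
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0,) + cuts + (n,)
            groupings.append([phenotypes[a:b]
                              for a, b in zip(bounds, bounds[1:])])
    return groupings


s1_groupings = contiguous_groupings(["3d", "5d", "10d", "12d"])
```

For the four S1 time points this yields exactly seven groupings, ordered from the two-group splits to the single four-group split.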

Our next step is to determine the likelihood of these potential groupings. Assume we want to divide the predefined phenotypes into

To further test the sensitivity of this method to the number of clusters predefined when generating the clusters of LBF distributions using the five traditional clustering approaches, we repeated the process for different numbers of clusters predefined for the traditional methods and obtained a set of confidence values for each phenotype grouping case as indicated by the colored dots in each bin of Figure

Given ^{th }and 7^{th }bin in Figure ^{th }and 7^{th }row in Figure

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work was supported by the Department of Defense-Breast Cancer Research Program/DOD-BCRP (DAMD-170210440 to D.W.K.), the National Institutes of Health, National Cancer Institute (1 R33 CA118479-01 to D.W.K.), and a grant from the "Friends For An Earlier Breast Cancer Test" Foundation to S.A.L.

This article has been published as part of