<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2121-8-S1-S3</ui>
	<ji>1471-2121</ji>
	<fm>
		<dochead>Research</dochead>
		<bibl>
			<title>
				<p>Phenotype clustering of breast epithelial cells in confocal images based on nuclear protein distribution analysis</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Long</snm>
					<fnm>Fuhui</fnm>
					<insr iid="I1"/>
					<insr iid="I4"/>
					<email>longf@janelia.hhmi.org</email>
				</au>
				<au id="A2">
					<snm>Peng</snm>
					<fnm>Hanchuan</fnm>
					<insr iid="I2"/>
					<insr iid="I4"/>
					<email>pengh@janelia.hhmi.org</email>
				</au>
				<au id="A3">
					<snm>Sudar</snm>
					<fnm>Damir</fnm>
					<insr iid="I1"/>
					<email>dsudar@lbl.gov</email>
				</au>
				<au id="A4">
					<snm>Leli&#232;vre</snm>
					<mi>A</mi>
					<fnm>Sophie</fnm>
					<insr iid="I3"/>
					<email>lelievre@purdue.edu</email>
				</au>
				<au id="A5" ca="yes">
					<snm>Knowles</snm>
					<mi>W</mi>
					<fnm>David</fnm>
					<insr iid="I1"/>
					<email>dwknowles@lbl.gov</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 USA</p>
				</ins>
				<ins id="I2">
					<p>Genomics Division West, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 USA</p>
				</ins>
				<ins id="I3">
					<p>Department of Basic Medical Science, Purdue University, West Lafayette, IN 47907 USA</p>
				</ins>
				<ins id="I4">
					<p>Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, VA 20147 USA</p>
				</ins>
			</insg>
			<source>BMC Cell Biology</source>
			<supplement>
				<title>
					<p>2006 International Workshop on Multiscale Biological Imaging, Data Mining and Informatics</p>
				</title>
				<editor>Manfred Auer, Hanchuan Peng and Ambuj Singh</editor>
				<note>Research</note>
			</supplement>
			<conference>
				<title>
					<p>2006 International Workshop on Multiscale Biological Imaging, Data Mining and Informatics</p>
				</title>
				<location>Santa Barbara, CA, USA</location>
				<date-range>7&#8211;8 September 2006</date-range>
				<url>http://www.bioimageinformatics.org/2006</url>
			</conference>
			<issn>1471-2121</issn>
			<pubdate>2007</pubdate>
			<volume>8</volume>
			<issue>Suppl 1</issue>
			<fpage>S3</fpage>
			<url>http://www.biomedcentral.com/1471-2121/8/S1/S3</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">17634093</pubid><pubid idtype="doi">10.1186/1471-2121-8-S1-S3</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>10</day>
					<month>7</month>
					<year>2007</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2007</year>
			<collab>Long et al; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>The distribution of chromatin-associated proteins plays a key role in directing nuclear function. Previously, we developed an image-based method to quantify the nuclear distributions of proteins and showed that these distributions depended on the phenotype of human mammary epithelial cells. Here we describe a method that creates a hierarchical tree of the given cell phenotypes and calculates the statistical significance between them, based on the clustering analysis of nuclear protein distributions.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>Nuclear distributions of nuclear mitotic apparatus protein were previously obtained for non-neoplastic S1 and malignant T4-2 human mammary epithelial cells cultured for up to 12 days. Cell phenotype was defined as S1 or T4-2 and the number of days in culture. A probabilistic ensemble approach was used to define a set of consensus clusters from the results of multiple traditional cluster analysis techniques applied to the nuclear distribution data. Cluster histograms were constructed to show how cells in any one phenotype were distributed across the consensus clusters. Grouping various phenotypes allowed us to build phenotype trees and calculate the statistical difference between each group. The results showed that non-neoplastic S1 cells could be distinguished from malignant T4-2 cells with 94.19% accuracy; that proliferating S1 cells could be distinguished from differentiated S1 cells with 92.86% accuracy; and that there was no significant difference between the various phenotypes of T4-2 cells corresponding to increasing tumor sizes.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusion</p>
					</st>
					<p>This work presents a cluster analysis method that can identify significant cell phenotypes, based on the nuclear distribution of specific proteins, with high accuracy.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Histological classification of biopsied breast tissue plays a key role in mammary cancer detection and in determining patient treatment. Current methods rely on gross signatures of cellular and tissue organization including tubular formation, nuclear pleomorphism and mitotic activity. To aid the early detection and diagnosis of mammary tumors, there is a strong need for quantitative techniques that could not only help automate the classification process but also provide subcellular information that could be used to reveal new subclasses of tumor within each pathological grade.</p>
			<p>Increasing evidence has shown that chromatin-associated proteins are important in directing nuclear functions involved in the control of cell proliferation and differentiation <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. Using tissue models, formed by culturing human mammary epithelial cells (HMECs) from the HMT-3522 cancer progression series in Matrigel&#8482; (3D culture), earlier studies showed that the distribution of Nuclear Mitotic Apparatus (NuMA) protein was remarkably different in non-neoplastic cells that were proliferating compared to those that had completed acinar morphogenesis by forming polarized glandular tissue structures <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. For instance, during the 10-day in vitro morphogenesis process, NuMA staining was reported as diffusely distributed within the nuclei of proliferating cells, and had aggregated into foci of increasing size as cells arrested proliferation and completed acinar morphogenesis <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>.</p>
			<p>Based on these findings, Knowles et al then developed an image-based technique, called local bright feature (LBF) analysis <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The technique uses fluorescence images of total DNA and specifically stained nuclear proteins and calculates the radial distribution of the density of bright immunostained features as a function of the distance from the perimeter of the nucleus to its center. The LBF analysis was used to quantify the distribution of fluorescently stained NuMA from confocal images of non-neoplastic (S1) and malignant (T4-2) HMT-3522 HMECs, cultured in 3D for up to 12 days <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. By averaging the LBF distributions over populations of cells with the same phenotype, the study showed that the LBF analysis reproducibly captured changes in NuMA distribution along the morphogenic process in non-neoplastic S1 cells. It also revealed that the NuMA distribution in malignant T4-2 cells was diffuse and independent of the number of days the cells were in culture <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
			<p>Here we report a cluster analysis approach, based on the distribution of nuclear proteins, that robustly calculates the statistical significance between cell phenotypes, which are defined by the behavior of the cells in 3D culture. The method first groups LBF distributions into clusters using multiple traditional clustering methods. The results are then combined by a probabilistic ensemble approach into a set of consensus clusters that can be used to reliably define all possible LBF distributions that exist within a data set. This then allows cluster histograms to be computed which show how the LBF distributions in individual cells from a group are distributed over the consensus clusters. These cluster histograms represent a new way of linking the phenotype of groups of phenotypically similar cells, defined by their behavior in 3D culture, with their LBF distributions, quantified microscopically. Further, by grouping the LBF cluster histograms in multiple ways, the method is then able to build a phenotype tree and to calculate the statistical significance between each grouping. Each level of the tree corresponds to a different phenotype division of the cells and provides a way to predict which of the cell phenotypes, or groupings of cell phenotypes, are significantly different from each other. These methods were then applied to the LBF distributions of NuMA in S1 and T4-2 cells, previously reported in Knowles et al <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The resulting cluster histograms clearly showed that the distribution of NuMA changes during the morphogenic process as non-neoplastic S1 cells undergo growth arrest and differentiate. The resulting phenotype tree showed that non-neoplastic S1 cells could be distinguished from malignant T4-2 cells with 94.19% accuracy; that proliferating S1 cells could be distinguished from differentiated S1 cells with 92.86% accuracy; and clearly indicated that NuMA distribution was unchanged in the various phenotypes of malignant T4-2 cells.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<sec>
				<st>
					<p>Dataset</p>
				</st>
				<p>As described in <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, non-neoplastic HMT-3522 S1 cells were cultured in 3D in the presence of Matrigel&#8482; for up to 12 days to induce acinar morphogenesis. Malignant HMT-3522 T4-2 cells were cultured under similar conditions for a maximum of 11 days to avoid the overgrowth of tumor nodules. DNA was stained with DAPI to visualize the limits of the nuclear volume and NuMA proteins were labeled with Texas red. Three-dimensional images were acquired using a Zeiss 410 confocal laser-scanning microscope with a planapochromatic 63&#215;, 1.4 numerical aperture lens. The resulting voxel dimensions of the 3D images were 0.08 &#215; 0.08 &#956;m in the plane of the slide and 0.5 &#956;m along the optical direction.</p>
				<p>We used three image datasets to test our phenotype clustering approach. The first dataset contains 2673 non-neoplastic S1 cells taken from 77 confocal images. Images 1&#8211;25, 26&#8211;45, 46&#8211;61, and 62&#8211;77 are S1 cells cultured for 12 days, 10 days, 5 days, and 3 days respectively. The second dataset contains 3535 malignant T4-2 cells taken from 44 images. Images 1&#8211;14, 15&#8211;26, 27&#8211;36, and 37&#8211;44 are T4-2 cells cultured for 5 days, 10 days, 11 days, and 4 days respectively. The third dataset is not independent of the first two: it contains both malignant T4-2 and non-neoplastic S1 cells taken from the direct combination of all 121 images. The time points were selected to span the growth progression of the non-neoplastic cultured cells. Optical sections from 3D images of individual nuclei, showing representative NuMA staining for each of the phenotypes, are displayed in the Methods section.</p>
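				<p>The correspondence between image numbers and days in culture described above can be written down as a small lookup, shown here as a minimal sketch. The names S1_IMAGE_RANGES, T4_2_IMAGE_RANGES and days_in_culture are hypothetical and are introduced only for illustration; they are not part of the original analysis software.</p>
```python
# Image-number ranges (inclusive) mapped to days in culture, as listed in the text.
# All names in this block are illustrative assumptions.
S1_IMAGE_RANGES = {(1, 25): 12, (26, 45): 10, (46, 61): 5, (62, 77): 3}
T4_2_IMAGE_RANGES = {(1, 14): 5, (15, 26): 10, (27, 36): 11, (37, 44): 4}


def days_in_culture(image_number, ranges):
    """Return the days in culture for a 1-based image number."""
    for (first, last), days in ranges.items():
        if image_number in range(first, last + 1):
            return days
    raise ValueError(f"image {image_number} is outside the listed ranges")


def phenotype(cell_line, image_number):
    """Build a phenotype label such as 'S1 day 12' from cell line and image number."""
    ranges = S1_IMAGE_RANGES if cell_line == "S1" else T4_2_IMAGE_RANGES
    return f"{cell_line} day {days_in_culture(image_number, ranges)}"
```
				<p>Under these assumptions, a phenotype is simply the pair of cell line and days in culture, which gives the eight cell populations analyzed in the rest of this section.</p>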
			</sec>
			<sec>
				<st>
					<p>Clustering LBF distributions using traditional approaches</p>
				</st>
				<p>Using an automated image analysis method developed earlier <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, we extracted the local bright staining features of NuMA protein and quantified their radial distribution in each nucleus in all 121 S1 and T4-2 images. In this way, we obtained 2673 and 3535 LBF distributions for S1 and T4-2 cells respectively. Each distribution is represented by the normalized density of bright NuMA protein features as a function of the normalized distance from the perimeter of the nucleus to its center (see Methods for further details).</p>
				<p>Using traditional approaches of fuzzy C-means clustering, Gaussian mixture model clustering (with a spherical kernel), K-means, hierarchical clustering (with a complete link scheme), and spectral clustering <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>, we divided the dataset into a number of clusters according to the similarities of their LBF distributions. Figure <figr fid="F1">1</figr> shows the results for each of these traditional approaches when the dataset of 2673 non-neoplastic S1 cells is divided into 8 clusters. The final result, as we show below, is not dependent on the number of clusters. Each cluster is represented by the centroid (curve) and standard deviation (small vertical bar) of the LBF distributions in the cluster. Clearly, the different methods cluster the data in different ways.</p>
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>Clustering 2673 non-neoplastic S1 cells into 8 clusters according to the similarities of their LBF distributions</p>
					</caption>
					<text>
						<p><b>Clustering 2673 non-neoplastic S1 cells into 8 clusters according to the similarities of their LBF distributions</b>. Rows from the top to the bottom are the results of Gaussian mixture model clustering with spherical kernel (GM), fuzzy C-means clustering (Fuzzy), hierarchical clustering with complete link (Hier), K-means, and spectral clustering (Spectral), respectively. Each cluster is represented by the centroid (curve) and the standard deviation (small vertical bar) of the LBF distributions in the cluster. The horizontal axis of each of the 5 &#215; 8 panels is the normalized distance from the nucleus perimeter, the range being [0,1]. The vertical axis is the normalized bright feature density, the range being [0,2]. Also see Methods for the description of the LBF analysis.</p>
					</text>
					<graphic file="1471-2121-8-S1-S3-1"/>
				</fig>
				<p>Table <tblr tid="T1">1</tblr> shows the consistencies between these clustering results evaluated by pair-wise <it>F</it>-measure (see Methods). The results show that, quantitatively, the consistencies between the clusters produced by the different approaches are unsatisfactory. For instance, the <it>F</it>-measures between the hierarchical clustering and the Gaussian mixture model, fuzzy C-means, K-means, and spectral clustering are 0.5205, 0.5270, 0.4543, and 0.5365 respectively (the fourth row in Table <tblr tid="T1">1</tblr>). The <it>F</it>-measures between the spectral clustering and the Gaussian mixture model, fuzzy C-means, hierarchical clustering, and K-means are 0.6286, 0.6177, 0.5365, and 0.6253 respectively (the sixth row in Table <tblr tid="T1">1</tblr>).</p>
				<tbl id="T1">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Pair-wise <it>F</it>-measures for the clustering results generated by the five traditional clustering approaches, as shown in Figure 1.</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>GM</p>
							</c>
							<c ca="center">
								<p>Fuzzy</p>
							</c>
							<c ca="center">
								<p>Hier</p>
							</c>
							<c ca="center">
								<p>Kmeans</p>
							</c>
							<c ca="center">
								<p>Spectral</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>GM</p>
							</c>
							<c ca="center">
								<p>1.0000</p>
							</c>
							<c ca="center">
								<p>0.8837</p>
							</c>
							<c ca="center">
								<p>0.5205</p>
							</c>
							<c ca="center">
								<p>0.6296</p>
							</c>
							<c ca="center">
								<p>0.6286</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>Fuzzy</p>
							</c>
							<c ca="center">
								<p>0.8837</p>
							</c>
							<c ca="center">
								<p>1.0000</p>
							</c>
							<c ca="center">
								<p>0.5270</p>
							</c>
							<c ca="center">
								<p>0.6932</p>
							</c>
							<c ca="center">
								<p>0.6177</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>Hier</p>
							</c>
							<c ca="center">
								<p>0.5205</p>
							</c>
							<c ca="center">
								<p>0.5270</p>
							</c>
							<c ca="center">
								<p>1.0000</p>
							</c>
							<c ca="center">
								<p>0.4543</p>
							</c>
							<c ca="center">
								<p>0.5365</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>Kmeans</p>
							</c>
							<c ca="center">
								<p>0.6296</p>
							</c>
							<c ca="center">
								<p>0.6932</p>
							</c>
							<c ca="center">
								<p>0.4543</p>
							</c>
							<c ca="center">
								<p>1.0000</p>
							</c>
							<c ca="center">
								<p>0.6253</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>Spectral</p>
							</c>
							<c ca="center">
								<p>0.6286</p>
							</c>
							<c ca="center">
								<p>0.6177</p>
							</c>
							<c ca="center">
								<p>0.5365</p>
							</c>
							<c ca="center">
								<p>0.6253</p>
							</c>
							<c ca="center">
								<p>1.0000</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
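				<p>To make the notion of a pair-wise agreement score between two clusterings concrete, the following is a minimal sketch of one symmetric, pair-counting form of the F-measure. The exact definition used for Table 1 is the one given in the Methods section and may differ from this variant; the function name pair_counting_f_measure is hypothetical.</p>
```python
from collections import Counter
from math import comb


def pair_counting_f_measure(labels_a, labels_b):
    """Symmetric pair-counting F-measure between two flat clusterings.

    Assumption: precision and recall are computed over pairs of items that
    share a cluster, which is one common definition and not necessarily the
    one used in the paper.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    # Pairs placed in the same cluster by each clustering on its own.
    pairs_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    pairs_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    # Pairs placed in the same cluster by both clusterings.
    pairs_both = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    if pairs_a + pairs_b == 0:
        return 1.0  # every item is a singleton in both clusterings
    # F = 2PR / (P + R) with P = pairs_both / pairs_a and R = pairs_both / pairs_b.
    return 2.0 * pairs_both / (pairs_a + pairs_b)
```
				<p>Like the values in Table 1, this score is symmetric in its two arguments and equals 1 when the two clusterings agree up to a renaming of the cluster labels.</p>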
			</sec>
			<sec>
				<st>
					<p>Finding consensus LBF clusters using probabilistic ensemble clustering</p>
				</st>
				<p>As shown in Table <tblr tid="T1">1</tblr>, different clustering methods may generate different results for the same dataset and the agreement between them can be low. This is because each clustering method assumes certain data distributions and cluster characteristics. For instance, the Gaussian mixture model assumes that clusters follow a Gaussian distribution. K-means works well for clusters of convex shapes. Thus, some algorithms might perform well for specific datasets and not for others. In general, no single clustering method can successfully handle different types of cluster structure. In addition, even different initializations and parameter settings of the same method, for instance, K-means and Gaussian mixture model, may generate different clustering results. As a result, selecting an optimal clustering method is non-trivial or even impossible in many cases. A reasonable way to get a reliable partition of a dataset is to derive a consensus from multiple clustering results, the assumption being that the judgment made by a committee is more robust and unbiased than those made by individuals. This idea, called ensemble clustering, has been investigated in the literature and several major benefits have been identified <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>. First, ensemble clustering can improve the robustness of clustering. The clusters generated tend to be less sensitive to noise, outliers, initialization, or sampling variations compared to individual clustering methods. Second, ensemble clustering does not need <it>a priori </it>information about the number of clusters, but can effectively determine the most probable number of clusters. Third, ensemble clustering can detect outliers. This ability is closely associated with the ability to determine the number of clusters.</p>
				<p>Several different ensemble-clustering methods have become available. In <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, a voting algorithm based on hierarchical clustering of the co-association matrix (which represents how often each pair of data appears in the same cluster) is used to derive the consensus clusters. In <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, Strehl and Ghosh developed an evidence accumulation and a hypergraph representation ensemble clustering method. In <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, Topchy et al proposed a mutual-information-based method. In <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, Fischer and Buhmann developed a bootstrap algorithm by first relabeling the data in each clustering result to find the correspondence and then using a voting scheme to find consensus.</p>
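				<p>As an illustration of the co-association idea used in <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, the sketch below builds a co-association matrix from several partitions of the same items and links items that share a cluster in more than half of the partitions. This is a simplified stand-in (connected components at a fixed threshold rather than a full hierarchical clustering), it is not the probabilistic ensemble approach used in this work, and the function names are hypothetical.</p>
```python
def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n_items = len(partitions[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_by_majority(partitions, threshold=0.5):
    """Label items so that pairs with co-association above `threshold` are linked.

    Consensus clusters are the connected components of the thresholded
    co-association graph, so their number is not fixed in advance.
    """
    matrix = co_association(partitions)
    n_items = len(matrix)
    consensus = [None] * n_items
    next_label = 0
    for start in range(n_items):
        if consensus[start] is not None:
            continue  # already assigned to an earlier component
        consensus[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n_items):
                if consensus[j] is None and matrix[i][j] > threshold:
                    consensus[j] = next_label
                    stack.append(j)
        next_label += 1
    return consensus
```
				<p>In this sketch the number of consensus clusters follows from the threshold rather than being chosen beforehand, which mirrors, in a much cruder way, the second benefit of ensemble clustering noted above.</p>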
				<p>In this work, we used a probabilistic ensemble approach based on Bayesian latent variable induction <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp> (see Methods). Assuming that the clustering results generated by individual methods, i.e., Gaussian mixture model, fuzzy C-means, K-means, hierarchical clustering, and spectral clustering, are independent of each other, the Bayesian latent variable induction method is able to obtain the statistically optimal combination of individual clustering results as shown by Chickering and Heckerman in <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. A similar probabilistic ensemble approach has also been adopted by Topchy in <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> where accurate consensus was obtained from unreliable individual clustering results.</p>
				<p>Using the probabilistic ensemble clustering approach (see Methods for details), we derived the statistically optimal consensus from the data partitions generated by the five traditional clustering methods mentioned above. Figure <figr fid="F2">2</figr> shows the result of using the probabilistic ensemble approach to combine the clusters generated by the five traditional approaches shown in Figure <figr fid="F1">1</figr>. The number of clusters, 16, is automatically determined as a result of finding the consensus.</p>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>Consensus clusters of the five clustering results in Figure 1, generated by probabilistic ensemble clustering approach</p>
					</caption>
					<text>
						<p><b>Consensus clusters of the five clustering results in Figure 1, generated by probabilistic ensemble clustering approach</b>. The number of clusters, i.e., 16, is automatically determined by the algorithm. As in Figure 1, each curve represents the centroid of the cluster. The vertical bar represents the standard deviation in the corresponding bin. The horizontal axis of each panel is the normalized distance from nucleus perimeter, the range being [0,1], and the vertical axis is the normalized bright feature density with the range being [0,2].</p>
					</text>
					<graphic file="1471-2121-8-S1-S3-2"/>
				</fig>
				<p>Table <tblr tid="T2">2</tblr> further compares our method with traditional methods in terms of the number of clusters predefined in individual clustering methods (the second row) and the number automatically determined by the probabilistic ensemble clustering approach (the third row) for the dataset containing both S1 and T4-2 cells. Clearly, the number of clusters automatically determined by the probabilistic ensemble approach does not vary significantly with the number of clusters predefined for individual clustering methods. When the number of predefined clusters changes from 4 to 26, the number of clusters identified by the probabilistic ensemble clustering approach is much more stable, ranging from 16 to 25.</p>
				<tbl id="T2">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>Number of clusters (the second row) predefined in the individual clustering methods (i.e., Gaussian mixture model, fuzzy C-means, hierarchical clustering, K-means and spectral clustering) and those automatically determined by the probabilistic ensemble clustering method for both S1 and T4-2 cells (the third row).</p>
					</caption>
					<tblbdy cols="13">
						<r>
							<c ca="left">
								<p>Methods</p>
							</c>
							<c cspan="12" ca="center">
								<p>Number of Clusters</p>
							</c>
						</r>
						<r>
							<c cspan="13">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Traditional methods</p>
							</c>
							<c ca="left">
								<p>4</p>
							</c>
							<c ca="left">
								<p>6</p>
							</c>
							<c ca="left">
								<p>8</p>
							</c>
							<c ca="left">
								<p>10</p>
							</c>
							<c ca="left">
								<p>12</p>
							</c>
							<c ca="left">
								<p>14</p>
							</c>
							<c ca="left">
								<p>16</p>
							</c>
							<c ca="left">
								<p>18</p>
							</c>
							<c ca="left">
								<p>20</p>
							</c>
							<c ca="left">
								<p>22</p>
							</c>
							<c ca="left">
								<p>24</p>
							</c>
							<c ca="left">
								<p>26</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Probabilistic ensemble-clustering</p>
							</c>
							<c ca="left">
								<p>19</p>
							</c>
							<c ca="left">
								<p>18</p>
							</c>
							<c ca="left">
								<p>18</p>
							</c>
							<c ca="left">
								<p>16</p>
							</c>
							<c ca="left">
								<p>19</p>
							</c>
							<c ca="left">
								<p>20</p>
							</c>
							<c ca="left">
								<p>19</p>
							</c>
							<c ca="left">
								<p>20</p>
							</c>
							<c ca="left">
								<p>22</p>
							</c>
							<c ca="left">
								<p>22</p>
							</c>
							<c ca="left">
								<p>23</p>
							</c>
							<c ca="left">
								<p>25</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Computing cluster histograms</p>
				</st>
				<p>With clusters reliably determined, we then calculated the number of LBF distributions falling into each cluster for each of the 8 populations of cells, i.e., non-neoplastic S1 cells cultured for 3 days, 5 days, 10 days, and 12 days, as well as malignant T4-2 cells cultured for 4 days, 5 days, 10 days, and 11 days. By doing so, we obtained a cluster histogram for each of the 8 populations of cells. Figure <figr fid="F3">3a</figr> shows the 20 clusters automatically determined by combining the clustering results of Gaussian mixture model, fuzzy C-means, hierarchical clustering, K-means, and spectral clustering using the probabilistic ensemble clustering for the dataset containing 2673 non-neoplastic S1 cells and 3535 malignant T4-2 cells. The number of clusters predefined for these baseline methods is 14 (as shown in Table <tblr tid="T2">2</tblr>). In fact, the cluster histograms and the phenotype trees built in a later step are insensitive to the number of clusters predefined for traditional clustering methods, as will be shown in the Methods section. The 20 clusters in Figure <figr fid="F3">3a</figr> are ordered from the left to the right and the top to the bottom according to their peak locations. The first 8 clusters are approximately flat. In the 9<sup>th</sup> to the 20<sup>th</sup> clusters the peak location shifts from the left to the right. Figure <figr fid="F3">3b</figr> shows the cluster histograms for the 8 populations of cells. For S1 cells, the cluster histograms (the top row in Figure <figr fid="F3">3b</figr>) are remarkably different between the early stage (e.g., S1 Day 3) and the completion of acinar morphogenesis (e.g., S1 Day 12). The peak of the histogram gradually shifts from the left to the right as the number of days in culture increases, indicating a gradual change in NuMA distribution during the 12-day <it>in vitro </it>morphogenesis process. This is consistent with the fact that NuMA staining is diffusely distributed within the nuclei of proliferating cells, but aggregates into foci of increasing size as cells arrest proliferation and complete acinar morphogenesis. Therefore, the cluster histograms statistically reflect the phenotype of non-neoplastic S1 cells. Moreover, the peak of the histogram profile does not change significantly for malignant T4-2 cells cultured for different numbers of days (bottom row in Figure <figr fid="F3">3b</figr>). This is also consistent with the fact that NuMA staining is diffusely distributed within T4-2 nuclei regardless of the number of days in culture. Interestingly, the cluster histograms of malignant T4-2 cells differ significantly from those of non-neoplastic S1 cells. The consistency of cluster histograms and cell types indicates that it is meaningful to develop a method to predict cell phenotypes and their sub-categories based on cluster histograms.</p>
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>LBF distribution clusters and cluster histograms for 6208 S1 and T4-2 cells cultured for different numbers of days</p>
					</caption>
					<text>
						<p><b>LBF distribution clusters and cluster histograms for 6208 S1 and T4-2 cells cultured for different numbers of days</b>. (a) Twenty LBF distribution clusters automatically determined by probabilistic ensemble clustering of the results generated by Gaussian mixture model, fuzzy C-means, hierarchical clustering, K-means, and spectral clustering. The number of clusters predefined for these baseline methods is 14. The clusters are ordered from the left to the right and the top to the bottom according to their peak locations. (b) From the left to the right and the top to the bottom: cluster histograms of non-neoplastic S1 cells cultured for 3 days, 5 days, 10 days, and 12 days, and of malignant T4-2 cells cultured for 4 days, 5 days, 10 days, and 11 days.</p>
					</text>
					<graphic file="1471-2121-8-S1-S3-3"/>
				</fig>
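				<p>The cluster histogram computation described in this section amounts to counting, for each population of cells, how many of its LBF distributions fall into each consensus cluster. A minimal sketch follows; the function name cluster_histograms, its arguments, and the optional normalization to fractions are assumptions made for illustration.</p>
```python
from collections import Counter


def cluster_histograms(phenotypes, cluster_ids, n_clusters, normalize=False):
    """Return {phenotype: histogram over consensus clusters 0..n_clusters-1}.

    `phenotypes[i]` is the population label of cell i and `cluster_ids[i]` is
    the consensus cluster its LBF distribution was assigned to. With
    normalize=True the counts are divided by the population size (an optional
    step assumed here so that populations of different size can be compared).
    """
    counts = {}
    for phenotype, cluster in zip(phenotypes, cluster_ids):
        counts.setdefault(phenotype, Counter())[cluster] += 1
    histograms = {}
    for phenotype, counter in counts.items():
        row = [counter[k] for k in range(n_clusters)]
        if normalize:
            total = sum(row)
            row = [value / total for value in row]
        histograms[phenotype] = row
    return histograms
```
				<p>For the data set described here, such a function would be called with the 6208 per-cell phenotype labels, the corresponding consensus cluster assignments, and 20 clusters, yielding one histogram for each of the 8 populations.</p>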
			</sec>
			<sec>
				<st>
					<p>Constructing phenotype trees</p>
				</st>
				<p>Using the approach introduced in the Methods section, we have constructed phenotype trees to show how the phenotypes, defined by the behavior of the cells in 3D culture, can be hierarchically grouped and the statistical significance of each grouping calculated. Figure <figr fid="F4">4a</figr> shows the phenotype tree built for non-neoplastic S1 cells. At the first level in this figure, the four phenotypes of S1 cells were divided into two groups. Of the multiple ways to create two groups from four phenotypes, our method found that having S1 cells at day 12 and day 10 in one group and S1 cells at day 3 and day 5 in the other gave the highest confidence value, 0.9286 (Figure <figr fid="F4">4a</figr>). In the second level of the tree, our method divided S1 cells into three phenotype groups. The results showed that having S1 cells at day 12 and day 10 as one group, S1 cells at day 5 as the second group, and S1 cells at day 3 as the third provided the highest confidence value of 0.8511. This was lower than the confidence of dividing S1 cells into two groups. Finally, the method divided S1 cells into four groups, which resulted in a confidence value of 0.6822 (Figure <figr fid="F4">4a</figr>). This phenotype tree indicates that we can distinguish S1 cells at days 3 and 5 from those at days 10 and 12 with high confidence.</p>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>Phenotype trees constructed for (a) non-neoplastic S1 cells, (b) malignant T4-2 cells, and (c) both S1 and T4-2 cells cultured for a different number of days</p>
					</caption>
					<text>
						<p><b>Phenotype trees constructed for (a) non-neoplastic S1 cells, (b) malignant T4-2 cells, and (c) both S1 and T4-2 cells cultured for a different number of days</b>. The certainty of hierarchically grouping the cells of the predefined phenotypes (indicated by the leaf nodes in the highest level of the tree) into statistically more significant groups of the phenotypes is indicated by the <it>confidence </it>values at each level of the tree.</p>
					</text>
					<graphic file="1471-2121-8-S1-S3-4"/>
				</fig>
				<p>Using the same approach, we constructed the phenotype trees for malignant T4-2 cells and for the combination of S1 and T4-2 cells, as shown in Figure <figr fid="F4">4b</figr> and Figure <figr fid="F4">4c</figr>, respectively. Figure <figr fid="F4">4b</figr> shows that we can distinguish T4-2 cells at day 4, day 5, and day 10 from those at day 11 with relatively high confidence (0.8591; the first level of Figure <figr fid="F4">4b</figr>). However, if we want to distinguish all four T4-2 phenotypes from one another, the confidence drops to 0.5748. Figure <figr fid="F4">4c</figr> shows that we can distinguish S1 and T4-2 cells with very high confidence (0.9419; see the first level of Figure <figr fid="F4">4c</figr>). However, the confidence drops as the level increases. The certainty in distinguishing all 8 phenotypes drops to 0.5508 at the highest level of the tree. In general, the phenotype trees provide a way to evaluate how the phenotypes, defined by the behavior of the cells in 3D culture, can be hierarchically grouped and the statistical significance of each grouping calculated.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Discussion and conclusions</p>
			</st>
			<p>We have developed a cluster analysis approach that can robustly link any given set of multivariate features measured on a per-cell basis to the phenotype of the cells as defined by their macroscopic biology. The technique uses a probabilistic ensemble approach to group the measured multivariate features into a set of consensus clusters. This method provides a novel way of linking the phenotypes of groups of cells to cluster histograms that describe the distribution of the measured features across the consensus clusters. Then, by forming various groupings of the cluster histograms, the technique permits the formation of a phenotype tree and calculation of the statistical significance of each grouping. If two groups of cells are found to be significantly different, one can conclude that the features measured in the cells can distinguish the groups and that the groups are indeed different. If the two groups are not significantly different, one can only conclude that the measured features do not differ between these groups. This does not imply that the groups are necessarily identical.</p>
			<p>The phenotype tree is a hierarchical representation of the possible groupings of the defined cell phenotypes. As such, a node in the tree at level <it>l </it>can be split into at most two nodes at level <it>l</it>+1. However, the method used in building the tree does not prevent inconsistent group divisions between level <it>l </it>and <it>l</it>+1. Thus a node at level <it>l</it>+1 can be a combination of two partial nodes at level <it>l</it>, as shown in Figure <figr fid="F5">5</figr>. In that case, the hierarchical structure cannot be represented as a tree. To solve the problem, we can add a consistency constraint to make the phenotype groups coherent between different tree levels. Alternatively, we can use directed acyclic graphs (DAGs) to represent the hierarchical structure of cell phenotypes without adding any consistency constraint.</p>
			<fig id="F5">
				<title>
					<p>Figure 5</p>
				</title>
				<caption>
					<p>Illustration of the inconsistent phenotype grouping between successive levels</p>
				</caption>
				<text>
					<p><b>Illustration of the inconsistent phenotype grouping between successive levels</b>. Each solid rectangle represents a phenotype node. A dashed line indicates a combination operation. Phenotype groupings at level l and l+1 are inconsistent because the node BC at level l+1 is formed by breaking node AB and node CD at level l into two parts each and combining one part of each node. In this case, the hierarchical structure cannot be represented as a tree.</p>
				</text>
				<graphic file="1471-2121-8-S1-S3-5"/>
			</fig>
			<p>We have shown how the cluster analysis technique can be applied to the radial LBF distributions of a chromatin-associated protein, NuMA <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, measured on a per-cell basis from non-neoplastic S1 and malignant T4-2 HMECs cultured in a 3D environment for up to 12 days. The results showed that, for this measured feature, the method can distinguish non-neoplastic S1 cells from malignant T4-2 cells with 94.19% accuracy, and proliferating S1 cells from S1 cells differentiated into acinar structures with 92.86% accuracy. The phenotype tree also shows that the method distinguishes the four phenotypes of S1 cells with only 68.22% accuracy. However, when the two phenotypes S1-day 10 and S1-day 12 are considered as one group, the ability to distinguish that group from S1-day 5 and S1-day 3 jumps to 85.11%. This result demonstrates the power of the phenotype tree, which in this case shows that the distribution of NuMA changes moderately between the phenotypes S1-day 3 and S1-day 5 and markedly between the phenotypes S1-day 5 and S1-day 10, but does not change significantly between 10 and 12 days in culture. These results correlate with the behavior of cultured S1 cells and clearly show that the reorganization of NuMA that occurs during the morphogenic process of these cells is almost complete at 10 days of culture. In other words, S1-day 10 and S1-day 12 are not significantly different phenotypes, based on NuMA distribution. These results are echoed by the cluster histograms for the S1 cells: marked differences are seen between the cluster histograms of the phenotypes S1-day 5 and S1-day 10, but not between those of S1-day 10 and S1-day 12. Further, the method distinguishes the four phenotypes of T4-2 cells with only 57.48% accuracy. This result also correlates with the behavior of these malignant cells, which continue to proliferate throughout the 12-day culture period. This result simply demonstrates that, based on NuMA distribution, the phenotypes T4-2-day 4, T4-2-day 5, T4-2-day 10 and T4-2-day 11 are not significantly different. It does not rule out the possibility that introducing other measured features could reveal differences between such phenotypes.</p>
			<p>Collectively, our data demonstrate the quantitative ability of clustering-based analysis to link microscopically measurable features with the behavior of the cells. The methods described demonstrate that it is possible to distinguish populations of cells based on the nuclear organization of a chromatin-associated protein, NuMA. This work paves the way for our longer-term goal of producing a method capable of turning high-resolution fluorescence images of human mammary epithelial tissue into tissue maps that report the probable non-neoplastic, premalignant and malignant phenotype at cellular resolution.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<p>Our phenotype clustering approach contains four steps (Figure <figr fid="F6">6</figr>). First, we used a previously developed image analysis method <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> to analyze each fluorescence image acquired by the Zeiss 410 3D confocal microscope, and obtained LBF distributions for all nuclei within many images. Second, we grouped thousands of nuclei into clusters based on the similarities between their LBF distributions. For this purpose, we tested K-means clustering, fuzzy C-means clustering, Gaussian mixture model, spectral clustering, and hierarchical clustering methods <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp> and found that the consistency between the different clustering results, evaluated by an <it>F</it>-measure, was relatively low. Because it is difficult to choose the best approach, we developed a probabilistic ensemble approach based on Bayesian latent variable induction to combine the different clustering results into a set of consensus clusters of LBF distributions. Third, we analyzed how nuclei were distributed across the consensus clusters, and obtained a cluster histogram for cells of each defined phenotype. Finally, we constructed hierarchical phenotype trees to show how the predefined phenotypes could be hierarchically grouped and the statistical significance of each grouping calculated. The trees were structured so that nodes at lower levels correspond to phenotype groups with larger statistical differences.</p>
			<fig id="F6">
				<title>
					<p>Figure 6</p>
				</title>
				<caption>
					<p>Diagram of the phenotype clustering algorithm</p>
				</caption>
				<text>
					<p><b>Diagram of the phenotype clustering algorithm</b>. Details of the image acquisition and the extraction of the LBF for each nucleus are described in [5].</p>
				</text>
				<graphic file="1471-2121-8-S1-S3-6"/>
			</fig>
			<sec>
				<st>
					<p>Extracting LBF distributions from nuclei</p>
				</st>
				<p>Using a Zeiss 410 confocal laser-scanning microscope with a planapochromatic 63&#215;, 1.4 numerical aperture lens, we acquired hundreds of 3D images of non-neoplastic S1 and malignant T4-2 cells cultured for up to 12 days. Figure <figr fid="F7">7</figr> shows optical sections from the middle of 3D images of individual nuclei, with representative NuMA staining for each of the phenotypes described in this work.</p>
				<fig id="F7">
					<title>
						<p>Figure 7</p>
					</title>
					<caption>
						<p>Fluorescence micrographs showing representative NuMA staining patterns in individual nuclei for eight different phenotypes</p>
					</caption>
					<text>
						<p><b>Fluorescence micrographs showing representative NuMA staining patterns in individual nuclei for eight different phenotypes</b>. In previous work [5] the radial nuclear distribution of NuMA was analyzed from 3D multichannel fluorescence images of thousands of individual nuclei. The human mammary epithelial cells were either non-neoplastic (top row) or malignant (bottom row) and were cultured in Matrigel&#8482; (3D culture) for up to 12 days. Optical sections from 3D images, taken through the approximate midplane of individual nuclei are displayed. The optical sections were chosen to show representative features of the NuMA staining pattern. Panels a, b, c and d, show NuMA staining from non-neoplastic cells cultured for 3, 5, 10 and 12 days, representing cells present in incremental differentiation steps, respectively. Panels e, f, g, and h, show NuMA staining from malignant cells cultured for 4, 5, 10 and 11 days, representing cells present in tumors of increasing sizes, respectively. Notice that the nuclei of malignant cells are consistently larger than the nuclei of non-neoplastic cells. The bar represents 5 microns.</p>
					</text>
					<graphic file="1471-2121-8-S1-S3-7"/>
				</fig>
				<p>In an earlier study, an image analysis method was developed to extract the local bright staining features of NuMA protein and quantify their radial distribution in each individual nucleus (<abbrgrp><abbr bid="B5">5</abbr></abbrgrp>; see also Figure <figr fid="F8">8</figr>). The technique first used a model-based method to automatically segment individual nuclei in the DAPI-stained channel of the confocal images. It then divided the brightness at each point within a nucleus by the local average brightness in a region surrounding that point in the NuMA-stained channel, thus isolating the local bright features (LBF) of each nucleus. Then, the radial distribution of these bright features was computed using a distance transform. The transform calculates the shortest distance from each point within a nucleus to the nuclear boundary and, in doing so, divides each nucleus into a set of concentric terraces of equal thickness. In each terrace, the density of local bright features was calculated as the number of bright pixels divided by the total number of pixels. To account for variations in the number of terraces per nucleus due to variations in nucleus size and shape, the density per terrace was normalized so that the average density of bright features was 1 for each nucleus, and the distances from the nuclear perimeter were normalized to the range [0, 1.0]. Through the above process, a radial distribution of LBF was derived for each nucleus, represented by the normalized density of bright features as a function of the normalized distance from the perimeter of the nucleus to its center.</p>
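				<p>As a rough illustration of the normalization described above, the following Python sketch turns per-terrace pixel counts into a normalized radial profile. It is not the implementation of [5]: the function name radial_profile, the use of terrace midpoints as normalized distances, and the choice to normalize by the unweighted mean of the terrace densities are all assumptions made for this example.</p>

```python
def radial_profile(bright_per_terrace, pixels_per_terrace):
    """Return (normalized distances, normalized densities) for one nucleus.

    Terraces are ordered from the nuclear perimeter (first) to the center (last).
    """
    # Density of bright features in each terrace: bright pixels / all pixels.
    densities = [b / t for b, t in zip(bright_per_terrace, pixels_per_terrace)]
    mean_density = sum(densities) / len(densities)
    if mean_density == 0:
        raise ValueError("nucleus contains no bright features")
    n = len(densities)
    # Assumption: each terrace is placed at its midpoint on a 0..1 axis.
    distances = [(i + 0.5) / n for i in range(n)]
    # Rescale so that the mean terrace density of this nucleus equals 1.
    normalized = [d / mean_density for d in densities]
    return distances, normalized
```

				<p>Because both axes are normalized per nucleus, profiles from nuclei with different numbers of terraces can be compared once they are resampled onto a common set of distances.</p>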
				<fig id="F8">
					<title>
						<p>Figure 8</p>
					</title>
					<caption>
						<p>LBF analysis of the distribution of NuMA from 3D images</p>
					</caption>
					<text>
						<p><b>LBF analysis of the distribution of NuMA from 3D images</b>. (a) Fluorescence micrograph of Texas red-immunolabeled NuMA from a single optical section, in differentiated non-neoplastic S1 cells. (b) The corresponding processed image section showing a composite view of the detected local bright features (light gray) of NuMA, extracted by the local bright feature analysis overlaid on the nuclear segmentation mask (dark gray). (c) Concentric terraces resulting from the application of the distance transform on the segmentation mask, which allows the radial distribution of NuMA to be calculated. (d) A set of LBF distribution profiles of NuMA calculated from differentiated non-neoplastic S1 cells. The relative density of NuMA bright features (ordinate) is plotted as a function of the relative distance from the perimeter (0.0) to the center (1.0) of the nuclei (abscissa).</p>
					</text>
					<graphic file="1471-2121-8-S1-S3-8"/>
				</fig>
			</sec>
			<sec>
				<st>
					<p>Clustering LBF distributions using traditional approaches</p>
				</st>
				<p>Our phenotype clustering algorithm is based on the radial distribution of LBFs. To group the LBF distributions of thousands of nuclei into clusters of similar patterns, we first tested traditional clustering approaches, including the widely used K-means, fuzzy C-means clustering, Gaussian mixture model (with a spherical kernel), hierarchical clustering (with the complete link scheme), and spectral clustering methods <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>.</p>
				<p>Since different clustering methods generate different clusters, we computed the pair-wise <it>F</it>-measure score to evaluate the consistency between different clustering results. The <it>F</it>-measure is defined as follows. For any two data partitions <it>U </it>and <it>V</it>, denote the <it>i</it>th cluster in partition <it>U </it>as <it>u</it><sub><it>i</it></sub>, and the <it>j</it>th cluster in partition <it>V </it>as <it>v</it><sub><it>j</it></sub>. The proportion of data in <it>u</it><sub><it>i </it></sub>that is also in <it>v</it><sub><it>j </it></sub>is <it>R </it>= |<it>u</it><sub><it>i </it></sub>&#8745; <it>v</it><sub><it>j</it></sub>|/|<it>u</it><sub><it>i</it></sub>|, and the proportion of data in <it>v</it><sub><it>j </it></sub>that is also in <it>u</it><sub><it>i </it></sub>is <it>P </it>= |<it>u</it><sub><it>i </it></sub>&#8745; <it>v</it><sub><it>j</it></sub>|/|<it>v</it><sub><it>j</it></sub>|. Define <it>F</it>(<it>i</it>, <it>j</it>) = 2<it>PR</it>/(<it>P</it>+<it>R</it>). The score that measures the consistency of partition <it>V </it>with partition <it>U </it>is <it>F</it><sub>0 </sub>= [&#931;<sub><it>i</it></sub>|<it>u</it><sub><it>i</it></sub>|<it>max</it><sub><it>j</it></sub><it>F</it>(<it>i</it>, <it>j</it>)]/[&#931;<sub><it>i</it></sub>|<it>u</it><sub><it>i</it></sub>|], where |<it>u</it><sub><it>i</it></sub>| is the number of data points in <it>u</it><sub><it>i</it></sub>. To make the score symmetric, the final <it>F</it>-measure is defined as <it>F </it>= (<it>F</it><sub>0</sub>+<it>F</it><sub>0</sub>')/2, where <it>F</it><sub>0</sub>' denotes the same score computed with the roles of <it>U </it>and <it>V </it>exchanged.</p>
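				<p>The pair-wise score defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this work; the function names directed_f and f_measure are hypothetical, and each partition is assumed to be given as a list holding one cluster label per data point.</p>

```python
from collections import Counter


def directed_f(u_labels, v_labels):
    """F0: size-weighted best-match F score of partition V against partition U."""
    u_sizes = Counter(u_labels)
    v_sizes = Counter(v_labels)
    # overlap[(i, j)] is the number of points shared by cluster u_i and cluster v_j.
    overlap = Counter(zip(u_labels, v_labels))
    total = 0.0
    for i, u_size in u_sizes.items():
        best = 0.0
        for j, v_size in v_sizes.items():
            shared = overlap[(i, j)]
            if shared == 0:
                continue  # F(i, j) is zero when the clusters do not intersect
            recall = shared / u_size
            precision = shared / v_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += u_size * best
    return total / len(u_labels)


def f_measure(u_labels, v_labels):
    """Symmetric F-measure: mean of the two directed scores."""
    return 0.5 * (directed_f(u_labels, v_labels) + directed_f(v_labels, u_labels))
```

				<p>Two partitions that differ only in how their clusters are numbered score 1.0, because the measure depends on the overlap between clusters and not on the label values themselves.</p>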
			</sec>
			<sec>
				<st>
					<p>Probabilistic ensemble clustering</p>
				</st>
				<p>The probabilistic ensemble clustering approach we used to derive the consensus clusters from multiple clustering results is based on general Bayesian latent variable induction <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>. Let us suppose we have <it>M </it>different clustering approaches, generating <it>M </it>data partitions <it>C</it><sub><it>i </it></sub>(<it>i </it>= 1,..., <it>M</it>) of the same dataset <it>D </it>containing <it>N </it>data points. Our purpose is to infer the optimal consensus data partition <it>L </it>from the multiple partitions <it>C</it><sub><it>i</it></sub>. One simple yet reasonable assumption is that we can treat all the <it>M </it>clustering results <it>C</it><sub>1</sub>,..., <it>C</it><sub><it>M </it></sub>as independent samples drawn from the same underlying distribution <it>L</it>. In other words, we can assume that the distributions of <it>C</it><sub>1</sub>,..., <it>C</it><sub><it>M </it></sub>are conditionally independent of each other given the latent variable <it>L</it>. This assumption allows us to consider the following Bayesian latent variable induction model.</p>
				<p>Let us suppose the <it>i</it>th clustering approach divides the dataset into <it>r</it><sub><it>i </it></sub>clusters; then each <it>C</it><sub><it>i </it></sub>has <it>r</it><sub><it>i </it></sub>states (categorical labels), i.e., 1,..., <it>r</it><sub><it>i</it></sub>. Initially the consensus <it>L </it>may divide the dataset into <it>k </it>clusters (the final value <it>k* </it>is automatically determined; see below); then <it>L </it>has <it>k </it>states, i.e., 1,..., <it>k</it>. Since each LBF distribution vector in the dataset is assigned a cluster label by <it>C</it><sub><it>i</it></sub>, it takes a specific state value on <it>C</it><sub><it>i</it></sub>. Denote <it>s </it>= (<it>C</it><sub>1 </sub>= <it>c</it><sub>1</sub>, <it>C</it><sub>2 </sub>= <it>c</it><sub>2</sub>,..., <it>C</it><sub><it>M </it></sub>= <it>c</it><sub><it>M</it></sub>), where <it>c</it><sub><it>i </it></sub>(<it>i </it>&#8712; [1, <it>M</it>]) takes one state in 1,..., <it>r</it><sub><it>i</it></sub>.</p>
				<p>Upon initialization of the latent variable <it>L</it>, we randomly assign each of the <it>N </it>data points one of the <it>k </it>states. Given a data point <it>s </it>which is assigned state label <it>c</it><sub><it>i </it></sub>by the <it>i</it>th clustering method <it>C</it><sub><it>i</it></sub>, we derive its probability of taking state label <it>l </it>(where <it>l </it>&#8712; [1, <it>k</it>]) in consensus <it>L</it>, i.e., <it>P</it>(<it>L </it>= <it>l</it>|<it>s</it>). Based on the conditional independence assumption, we have</p>
				<p>
					<display-formula id="M1">
						<m:math name="1471-2121-8-S1-S3-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>P</m:mi>
									<m:mo stretchy="false">(</m:mo>
									<m:msup>
										<m:mi>L</m:mi>
										<m:mrow>
											<m:mo stretchy="false">(</m:mo>
											<m:mi>j</m:mi>
											<m:mo stretchy="false">)</m:mo>
										</m:mrow>
									</m:msup>
									<m:mo>=</m:mo>
									<m:mi>l</m:mi>
									<m:mo>|</m:mo>
									<m:msup>
										<m:mi>s</m:mi>
										<m:mrow>
											<m:mo stretchy="false">(</m:mo>
											<m:mi>j</m:mi>
											<m:mo stretchy="false">)</m:mo>
										</m:mrow>
									</m:msup>
									<m:mo stretchy="false">)</m:mo>
									<m:mo>&#8733;</m:mo>
									<m:mi>P</m:mi>
									<m:mo stretchy="false">(</m:mo>
									<m:msup>
										<m:mi>s</m:mi>
										<m:mrow>
											<m:mo stretchy="false">(</m:mo>
											<m:mi>j</m:mi>
											<m:mo stretchy="false">)</m:mo>
										</m:mrow>
									</m:msup>
									<m:mo>|</m:mo>
									<m:msup>
										<m:mi>L</m:mi>
										<m:mrow>
											<m:mo stretchy="false">(</m:mo>
											<m:mi>j</m:mi>
											<m:mo stretchy="false">)</m:mo>
										</m:mrow>
									</m:msup>
									<m:mo>=</m:mo>
									<m:mi>l</m:mi>
									<m:mo stretchy="false">)</m:mo>
									<m:mo>=</m:mo>
									<m:mstyle displaystyle="true">
										<m:munderover>
											<m:mo>&#8719;</m:mo>
											<m:mrow>
												<m:mi>i</m:mi>
												<m:mo>=</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
											<m:mi>M</m:mi>
										</m:munderover>
										<m:mrow>
											<m:mi>P</m:mi>
											<m:mo stretchy="false">(</m:mo>
											<m:msubsup>
												<m:mi>C</m:mi>
												<m:mi>i</m:mi>
												<m:mrow>
													<m:mo stretchy="false">(</m:mo>
													<m:mi>j</m:mi>
													<m:mo stretchy="false">)</m:mo>
												</m:mrow>
											</m:msubsup>
											<m:mo>=</m:mo>
											<m:msub>
												<m:mi>c</m:mi>
												<m:mi>i</m:mi>
											</m:msub>
											<m:mo>|</m:mo>
											<m:msup>
												<m:mi>L</m:mi>
												<m:mrow>
													<m:mo stretchy="false">(</m:mo>
													<m:mi>j</m:mi>
													<m:mo stretchy="false">)</m:mo>
												</m:mrow>
											</m:msup>
											<m:mo>=</m:mo>
											<m:mi>l</m:mi>
											<m:mo stretchy="false">)</m:mo>
										</m:mrow>
									</m:mstyle>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGmbatdaahaaWcbeqaaiabcIcaOiabdQgaQjabcMcaPaaakiabg2da9iabdYgaSjabcYha8jabdohaZnaaCaaaleqabaGaeiikaGIaemOAaOMaeiykaKcaaOGaeiykaKIaeyyhIuRaemiuaaLaeiikaGIaem4Cam3aaWbaaSqabeaacqGGOaakcqWGQbGAcqGGPaqkaaGccqGG8baFcqWGmbatdaahaaWcbeqaaiabcIcaOiabdQgaQjabcMcaPaaakiabg2da9iabdYgaSjabcMcaPiabg2da9maarahabaGaemiuaaLaeiikaGIaem4qam0aa0baaSqaaiabdMgaPbqaaiabcIcaOiabdQgaQjabcMcaPaaakiabg2da9iabdogaJnaaBaaaleaacqWGPbqAaeqaaOGaeiiFaWNaemitaW0aaWbaaSqabeaacqGGOaakcqWGQbGAcqGGPaqkaaGccqGH9aqpcqWGSbaBcqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd2eanbqdcqGHpis1aaaa@6A52@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>j </it>denotes the <it>j</it>th data point in the dataset <it>D</it>. The term <it>P</it>(<it>C</it><sub><it>i </it></sub>= <it>c</it><sub><it>i</it></sub>|<it>L </it>= <it>l</it>) (<it>i </it>&#8712; [1, <it>M</it>]) can be easily obtained by counting and normalizing the occurrence frequency of data points that are assigned the state label <it>c</it><sub><it>i </it></sub>by the clustering method <it>C</it><sub><it>i</it></sub>, given that they are assigned the state label <it>l </it>in <it>L</it>. Once <it>P</it>(<it>L </it>= <it>l</it>|<it>s</it>) is available, we use it to resample and update the state label of each data point in <it>L</it>. The above process repeats until no data point changes state. This leads to the estimation of an optimal consensus function <it>L </it>for a specified number of clusters, <it>k</it>.</p>
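				<p>The resampling loop just described can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the implementation used in this work: the function name consensus_labels, the iteration cap max_iter, and the fixed random seed are all hypothetical additions, and the conditional probabilities are plain normalized counts without smoothing.</p>

```python
import random
from collections import Counter


def consensus_labels(partitions, k, max_iter=200, seed=0):
    """Resample consensus labels from M base partitions (lists of N labels each)."""
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    n = len(partitions[0])
    labels = [rng.randrange(k) for _ in range(n)]  # random initial states of L
    for _ in range(max_iter):
        sizes = Counter(labels)
        # joint[i][(c, state)] counts points with C_i = c and L = state.
        joint = [Counter(zip(part, labels)) for part in partitions]
        updated = []
        for j in range(n):
            weights = []
            for state in range(k):
                if sizes[state] == 0:
                    weights.append(0.0)  # an empty state has no support
                    continue
                w = 1.0
                for i, part in enumerate(partitions):
                    # P(C_i = c_i | L = state), estimated by counting.
                    w *= joint[i][(part[j], state)] / sizes[state]
                weights.append(w)
            # Draw the new state in proportion to the unnormalized posterior.
            updated.append(rng.choices(range(k), weights=weights)[0])
        if updated == labels:
            break  # no data point changed state
        labels = updated
    return labels
```

				<p>In this sketch a state that becomes empty can never be re-entered, so the number of distinct labels in the returned list, len(set(labels)), can only shrink from the initial <it>k</it>.</p>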
				<p>We observe that when the data samples (LBFs) are independent of each other, the likelihood of the latent variable <it>L </it>which has <it>k </it>states can be estimated as</p>
				<p>
					<display-formula id="M2">
						<m:math name="1471-2121-8-S1-S3-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>P</m:mi>
									<m:mo stretchy="false">(</m:mo>
									<m:mi>k</m:mi>
									<m:mo>|</m:mo>
									<m:mi>D</m:mi>
									<m:mo stretchy="false">)</m:mo>
									<m:mo>=</m:mo>
									<m:mstyle displaystyle="true">
										<m:munderover>
											<m:mo>&#8719;</m:mo>
											<m:mrow>
												<m:mi>j</m:mi>
												<m:mo>=</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
											<m:mi>N</m:mi>
										</m:munderover>
										<m:mrow>
											<m:mi>P</m:mi>
											<m:mo stretchy="false">(</m:mo>
											<m:mi>k</m:mi>
											<m:mo>|</m:mo>
											<m:msup>
												<m:mi>s</m:mi>
												<m:mrow>
													<m:mo stretchy="false">(</m:mo>
													<m:mi>j</m:mi>
													<m:mo stretchy="false">)</m:mo>
												</m:mrow>
											</m:msup>
											<m:mo stretchy="false">)</m:mo>
										</m:mrow>
									</m:mstyle>
									<m:mo>=</m:mo>
									<m:mstyle displaystyle="true">
										<m:munderover>
											<m:mo>&#8719;</m:mo>
											<m:mrow>
												<m:mi>j</m:mi>
												<m:mo>=</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
											<m:mi>N</m:mi>
										</m:munderover>
										<m:mrow>
											<m:mstyle displaystyle="true">
												<m:munderover>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>l</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mi>k</m:mi>
												</m:munderover>
												<m:mrow>
													<m:mi>P</m:mi>
													<m:mo stretchy="false">(</m:mo>
													<m:mi>k</m:mi>
													<m:mo>,</m:mo>
													<m:msup>
														<m:mi>L</m:mi>
														<m:mrow>
															<m:mo stretchy="false">(</m:mo>
															<m:mi>j</m:mi>
															<m:mo stretchy="false">)</m:mo>
														</m:mrow>
													</m:msup>
													<m:mo>=</m:mo>
													<m:mi>l</m:mi>
													<m:mo>|</m:mo>
													<m:msup>
														<m:mi>s</m:mi>
														<m:mrow>
															<m:mo stretchy="false">(</m:mo>
															<m:mi>j</m:mi>
															<m:mo stretchy="false">)</m:mo>
														</m:mrow>
													</m:msup>
													<m:mo stretchy="false">)</m:mo>
												</m:mrow>
											</m:mstyle>
										</m:mrow>
									</m:mstyle>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGRbWAcqGG8baFcqWGebarcqGGPaqkcqGH9aqpdaqeWbqaaiabdcfaqjabcIcaOiabdUgaRjabcYha8jabdohaZnaaCaaaleqabaGaeiikaGIaemOAaOMaeiykaKcaaOGaeiykaKcaleaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGobGta0Gaey4dIunakiabg2da9maarahabaWaaabCaeaacqWGqbaucqGGOaakcqWGRbWAcqGGSaalcqWGmbatdaahaaWcbeqaaiabcIcaOiabdQgaQjabcMcaPaaakiabg2da9iabdYgaSjabcYha8jabdohaZnaaCaaaleqabaGaeiikaGIaemOAaOMaeiykaKcaaOGaeiykaKcaleaacqWGSbaBcqGH9aqpcqaIXaqmaeaacqWGRbWAa0GaeyyeIuoaaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6eaobqdcqGHpis1aaaa@6663@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>It is apparent that we can maximize the likelihood in Eq. (2) to find the best <it>k </it>over a specified range. In practice, we can often avoid iterating over <it>k </it>in Eq. (2) by directly assigning a large <it>k</it>. After convergence in solving Eq. (1), there are <it>k</it>* (<it>k </it>&#8805; <it>k</it>*) states in <it>L </it>that contain a non-zero number of data points. This <it>k* </it>value is the statistically optimal <it>k </it>value, determined automatically.</p>
			</sec>
			<sec>
				<st>
					<p>Computing cluster histograms for cells of different phenotypes</p>
				</st>
				<p>Once we obtained reliable clusters of LBF distributions of individual nuclei, we analyzed how the cells belonging to different phenotypes, defined by the behavior of the cells (i.e., S1 and T4-2 cells cultured for different numbers of days), were distributed across the various LBF clusters. For this purpose, we counted the number of nuclei whose LBF distribution fell into each cluster for each phenotype, i.e., S1 cells cultured for 3, 5, 10, and 12 days, and T4-2 cells cultured for 4, 5, 10, and 11 days. By doing so, we obtained the cluster histogram of each phenotype, represented by the percentage of nuclei as a function of cluster. The cluster histograms not only link directly to the predefined phenotypes (as shown in Figure <figr fid="F3">3</figr>) but also provide more detailed information than cell malignancy and days in culture alone.</p>
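				<p>The counting step above amounts to a per-phenotype tally over consensus cluster labels. A minimal Python sketch follows, assuming each nucleus carries a phenotype name and an integer cluster index; the function name cluster_histograms and the phenotype strings in the test data are hypothetical.</p>

```python
from collections import Counter


def cluster_histograms(phenotypes, clusters, n_clusters):
    """Return {phenotype: [percentage of its nuclei in cluster 0..n_clusters-1]}."""
    counts = {}
    for phenotype, cluster in zip(phenotypes, clusters):
        # One counter per phenotype, keyed by consensus cluster index.
        counts.setdefault(phenotype, Counter())[cluster] += 1
    histograms = {}
    for phenotype, counter in counts.items():
        total = sum(counter.values())
        histograms[phenotype] = [
            100.0 * counter[c] / total for c in range(n_clusters)
        ]
    return histograms
```

				<p>Expressing each histogram as percentages rather than raw counts keeps phenotypes with different numbers of imaged nuclei comparable.</p>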
			</sec>
			<sec>
				<st>
					<p>Constructing the phenotype tree</p>
				</st>
				<p>Taking the non-neoplastic S1 cells cultured for different numbers of days as an example, our method for constructing the tree is as follows. For all the <it>N </it>images of S1 cells, we assume that images of the same day are of the same phenotype and that morphogenesis progresses monotonically, as defined by biologists. This allowed us to group the images sequentially, leading to <inline-formula><m:math name="1471-2121-8-S1-S3-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mrow><m:mi>P</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn></m:mrow></m:msubsup><m:mrow><m:msubsup><m:mi>C</m:mi><m:mrow><m:mi>P</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn></m:mrow><m:mi>i</m:mi></m:msubsup></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaaeWaqaaiabdoeadnaaDaaaleaacqWGqbaucqGHsislcqaIXaqmaeaacqWGPbqAaaaabaGaemyAaKMaeyypa0JaeGymaedabaGaemiuaaLaeyOeI0IaeGymaedaniabggHiLdaaaa@3A97@</m:annotation></m:semantics></m:math></inline-formula> possible ways of grouping the different phenotypes, where <it>C </it>denotes the combination operation and <it>P </it>is the number of defined cell phenotypes. For instance, if <it>P </it>= 4, then the total number of possible ways of grouping phenotypes is 7 (i.e., <inline-formula><m:math name="1471-2121-8-S1-S3-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mn>3</m:mn></m:msubsup><m:mrow><m:msubsup><m:mi>C</m:mi><m:mn>3</m:mn><m:mi>i</m:mi></m:msubsup></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaaeWaqaaiabdoeadnaaDaaaleaacqaIZaWmaeaacqWGPbqAaaaabaGaemyAaKMaeyypa0JaeGymaedabaGaeG4mamdaniabggHiLdaaaa@3673@</m:annotation></m:semantics></m:math></inline-formula>). Among these 7 cases, 3 cases (i.e., <inline-formula><m:math name="1471-2121-8-S1-S3-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mn>3</m:mn><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaeG4mamdabaGaeGymaedaaaaa@2FCC@</m:annotation></m:semantics></m:math></inline-formula>) correspond to grouping the four macroscopically defined phenotypes into 2 groups, 3 cases (i.e., <inline-formula><m:math name="1471-2121-8-S1-S3-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mn>3</m:mn><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaeG4mamdabaGaeGOmaidaaaaa@2FCE@</m:annotation></m:semantics></m:math></inline-formula>) correspond to grouping them into 3 groups, and 1 case (i.e., <inline-formula><m:math name="1471-2121-8-S1-S3-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mn>3</m:mn><m:mn>3</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaeG4mamdabaGaeG4mamdaaaaa@2FD0@</m:annotation></m:semantics></m:math></inline-formula>) corresponds to grouping them into 4 groups. These 7 cases are shown in Figure <figr fid="F9">9a</figr>. Different colors in each row represent different groups. The first three bins correspond to dividing the S1 cells cultured for 3 days, 5 days, 10 days and 12 days into 2 groups, the next three bins correspond to dividing the cells into 3 groups, and the 7<sup>th </sup>bin corresponds to dividing the cells into 4 groups.</p>
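				<p>As a minimal illustrative sketch (not the authors' code), the count above can be reproduced under the assumption that the <it>P </it>phenotypes are kept in culture-time order and that a grouping is formed by placing cuts in some of the <it>P </it>- 1 gaps between neighbouring phenotypes, which is what the combination terms suggest. The function name enumerate_groupings and the labels "3d" to "12d" are hypothetical.</p>

```python
from itertools import combinations


def enumerate_groupings(phenotypes):
    """Return every split of an ordered phenotype list into 2..P contiguous groups.

    Assumption: groups are contiguous runs of the time-ordered phenotypes,
    so a grouping is defined by which of the P - 1 gaps are cut.
    """
    n = len(phenotypes)
    groupings = []
    for n_cuts in range(1, n):  # 1 cut gives 2 groups, ..., P - 1 cuts give P groups
        for cuts in combinations(range(1, n), n_cuts):
            bounds = (0,) + cuts + (n,)
            groups = [phenotypes[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
            groupings.append(groups)
    return groupings


# Hypothetical labels for S1 cells cultured for 3, 5, 10 and 12 days.
groupings = enumerate_groupings(["3d", "5d", "10d", "12d"])
for index, groups in enumerate(groupings, start=1):
    print(index, groups)
```

				<p>For the four S1 phenotypes this yields 7 groupings: 3 with two groups, 3 with three groups, and 1 with four groups, matching the counts stated above. The ordering of the printed groupings need not match the row order of the figure.</p>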
				<fig id="F9">
					<title>
						<p>Figure 9</p>
					</title>
					<caption>
						<p>An illustration of the phenotype tree construction process</p>
					</caption>
					<text>
						<p><b>An illustration of the phenotype tree construction process</b>. (a) Images 1&#8211;25, 26&#8211;45, 46&#8211;61, and 62&#8211;77 correspond to non-neoplastic S1 cells cultured for 12 days, 10 days, 5 days, and 3 days, respectively. There are 7 possible ways of grouping the phenotypes. Each row corresponds to one possible way. Different colors represent different phenotype groups. The first 3 rows correspond to grouping the 4 predefined phenotypes into 2 groups. The next 3 rows correspond to grouping the phenotypes into 3 groups, and the last row corresponds to 4 groups. (b) Taking the 4 phenotype group case (last row in (a)) as an example, we used traditional clustering methods to divide the cluster histograms of the images (one cluster histogram per image) into the same number of clusters (i.e., 4 in this example). Each row corresponds to the clustering result of one method. (c) The <it>F</it>-measures computed by pairing the phenotype grouping in the last row of (a) with each clustering result in (b). The maximum <it>F</it>-score, which in this case is achieved by the Gaussian Mixture Model approach (GM), is selected as the <it>confidence </it>of the corresponding cell phenotype grouping. (d) Confidence values as a function of the phenotype grouping case. We tested the confidence values under different numbers of clusters predefined for clustering LBF distributions using the five traditional methods (i.e., the second step of our algorithm, see Figure 6), as shown by dots of different colors. The numbers of clusters we tested were 4 to 26 in steps of 2. The consistent distribution of the dots indicates that our phenotype tree construction method is insensitive to the number of clusters we selected for clustering LBF distributions.</p>
					</text>
					<graphic file="1471-2121-8-S1-S3-9"/>
				</fig>
				<p>Our next step was to determine the likelihood of each potential grouping. Suppose we want to divide the predefined phenotypes into <it>p </it>groups (where <it>p </it>= 2, 3, 4 in the above example). We clustered the cluster histograms of the 77 S1 cell images into the same number, <it>p</it>, of clusters. To improve reliability, we again used multiple clustering algorithms (K-means, fuzzy C-means clustering, hierarchical clustering, Gaussian mixture model, and spectral clustering), the same five used to generate the LBF clusters (see Figure <figr fid="F9">9b</figr>). We paired each clustering result with the phenotype grouping under consideration and calculated the degree of agreement between them using the <it>F</it>-measure. The maximum <it>F</it>-score across the five methods was taken as the <it>confidence </it>of the corresponding cell phenotype grouping (see Figure <figr fid="F9">9c</figr>). Repeating the process for each potential phenotype grouping gave the confidence as a function of the phenotype grouping case.</p>
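				<p>The confidence computation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses one common definition of the clustering <it>F</it>-measure (for each phenotype group, the best harmonic mean of precision and recall over all clusters, weighted by group size), which may differ in detail from the definition given earlier in this article. The function names, the label vectors and the dictionary keys "kmeans" and "gmm" are hypothetical.</p>

```python
from collections import Counter


def f_measure(group_labels, cluster_labels):
    """Size-weighted best-match F-measure between a grouping and a clustering.

    Assumption: a common clustering F-measure, used here for illustration only.
    """
    n = len(group_labels)
    group_sizes = Counter(group_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(group_labels, cluster_labels))
    total = 0.0
    for group, n_group in group_sizes.items():
        best = 0.0
        for cluster, n_cluster in cluster_sizes.items():
            n_both = overlap[(group, cluster)]
            if n_both == 0:
                continue
            precision = n_both / n_cluster
            recall = n_both / n_group
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_group / n) * best
    return total


def grouping_confidence(group_labels, clusterings):
    """Confidence of a phenotype grouping: the maximum F-score over all methods."""
    return max(f_measure(group_labels, labels) for labels in clusterings.values())


# Made-up example: 8 images, a 2-group phenotype grouping, two clustering results.
grouping = [0, 0, 0, 0, 1, 1, 1, 1]
clusterings = {
    "kmeans": [0, 0, 0, 1, 1, 1, 1, 1],  # one image assigned to the other cluster
    "gmm": [1, 1, 1, 1, 0, 0, 0, 0],     # same partition, different label names
}
confidence = grouping_confidence(grouping, clusterings)
print(confidence)
```

				<p>Because the score is computed from overlaps between groups and clusters, it does not depend on how the clusters happen to be numbered, so the second clustering in the example agrees fully with the grouping even though its labels are swapped.</p>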
				<p>To test the sensitivity of this method to the number of clusters predefined when clustering the LBF distributions with the five traditional approaches, we repeated the process for different numbers of predefined clusters and obtained a set of confidence values for each phenotype grouping case, shown as colored dots in each bin of Figure <figr fid="F9">9d</figr>. Within each bin the values exhibit a central tendency, indicating that the method is insensitive to the number of clusters predefined in clustering the LBF distributions. We then took the median of the confidence values obtained under the different numbers of clusters in each bin as the overall confidence value of the corresponding phenotype grouping.</p>
				<p>Given <it>p</it>, the number of groups into which the predefined phenotypes should be divided, we selected, from all phenotype grouping cases with that number of groups, the one with the maximum confidence value as the most likely phenotype grouping for the given <it>p</it>. For instance, if we want to group the predefined phenotypes into 2 groups, i.e., <it>p </it>= 2, there are three phenotype grouping cases, corresponding to the first three bins in Figure <figr fid="F9">9d</figr> and the first three rows in Figure <figr fid="F9">9a</figr>. The second case has the maximum confidence value (indicated by the left-most dashed ellipse in Figure <figr fid="F9">9d</figr>, which corresponds to the second row of Figure <figr fid="F9">9a</figr>) and is thus taken as the most likely way of grouping the predefined phenotypes into 2 groups. This means that S1 cells cultured for 10 and 12 days (i.e., images 1&#8211;45) belong to one group, and those cultured for 3 and 5 days (i.e., images 46&#8211;77) belong to another. Using this approach, we determined the most likely phenotype groupings for <it>p </it>= 3 and <it>p </it>= 4, which correspond to the 6<sup>th </sup>and 7<sup>th </sup>bins in Figure <figr fid="F9">9d</figr> and the 6<sup>th </sup>and 7<sup>th </sup>rows in Figure <figr fid="F9">9a</figr>, respectively. These three phenotype groupings constitute the first to the third levels of the phenotype tree as shown in Figure <figr fid="F4">4a</figr>.</p>
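				<p>The last two steps (taking the median confidence in each bin, then choosing the best grouping for each <it>p</it>) can be sketched as below. This is a hedged illustration only: the function name select_groupings is hypothetical and the confidence values are made up for the example, not taken from the figure.</p>

```python
from statistics import median


def select_groupings(confidences, group_counts):
    """Pick, for each number of groups p, the grouping with the highest median confidence.

    confidences:  {grouping_id: [confidence for each tested number of LBF clusters]}
    group_counts: {grouping_id: p, the number of groups in that grouping}
    """
    overall = {g: median(values) for g, values in confidences.items()}
    best = {}
    for g, score in overall.items():
        p = group_counts[g]
        if p not in best or score > overall[best[p]]:
            best[p] = g
    return overall, best


# Made-up confidence values for the 7 grouping cases (three settings each).
confidences = {
    1: [0.60, 0.62, 0.61],
    2: [0.80, 0.78, 0.82],
    3: [0.70, 0.69, 0.71],
    4: [0.50, 0.52, 0.51],
    5: [0.55, 0.54, 0.56],
    6: [0.65, 0.66, 0.64],
    7: [0.60, 0.60, 0.60],
}
group_counts = {1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3, 7: 4}

overall, best = select_groupings(confidences, group_counts)
print(best)
```

				<p>The median is used here, as in the text, so that a single unusual setting of the number of LBF clusters has little effect on the overall confidence of a grouping.</p>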
			</sec>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>This work was supported by the Department of Defense-Breast Cancer Research Program/DOD-BCRP (DAMD-170210440 to D.W.K.), the National Institutes of Health, National Cancer Institute (1 R33 CA118479-01 to D.W.K.), and a grant from the "Friends For An Earlier Breast Cancer Test" Foundation to S.A.L.</p>
				<p>This article has been published as part of <it>BMC Cell Biology</it> Volume 8 Supplement 1, 2007: 2006 International Workshop on Multiscale Biological Imaging, Data Mining and Informatics. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2121/8?issue=S1</url></p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Nuclear structure in cancer cells</p>
				</title>
				<aug>
					<au>
						<snm>Zink</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Fischer</snm>
						<fnm>AH</fnm>
					</au>
					<au>
						<snm>Nickerson</snm>
						<fnm>JA</fnm>
					</au>
				</aug>
				<source>Nat Rev Cancer</source>
				<pubdate>2004</pubdate>
				<volume>4</volume>
				<fpage>677</fpage>
				<lpage>687</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/nrc1430</pubid>
						<pubid idtype="pmpid" link="fulltext">15343274</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>Cell nucleus in context</p>
				</title>
				<aug>
					<au>
						<snm>Leli&#232;vre</snm>
						<fnm>SA</fnm>
					</au>
					<au>
						<snm>Bissell</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Pujuguet</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Crit Rev Eukaryot Gene Expr</source>
				<pubdate>2000</pubdate>
				<volume>10</volume>
				<fpage>13</fpage>
				<lpage>20</lpage>
				<xrefbib>
					<pubid idtype="pmpid">10813390</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Unravelling heterochromatin: competition between positive and negative factors regulates accessibility</p>
				</title>
				<aug>
					<au>
						<snm>Dillon</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Festenstein</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Trends Genet</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<fpage>252</fpage>
				<lpage>258</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0168-9525(02)02648-3</pubid>
						<pubid idtype="pmpid" link="fulltext">12047950</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Tissue phenotype depends on reciprocal interactions between the extracellular matrix and the structural organization of the nucleus</p>
				</title>
				<aug>
					<au>
						<snm>Leli&#232;vre</snm>
						<fnm>SA</fnm>
					</au>
					<au>
						<snm>Weaver</snm>
						<fnm>VM</fnm>
					</au>
					<au>
						<snm>Nickerson</snm>
						<fnm>JA</fnm>
					</au>
					<au>
						<snm>Larabell</snm>
						<fnm>CA</fnm>
					</au>
					<au>
						<snm>Bhaumik</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Petersen</snm>
						<fnm>OW</fnm>
					</au>
					<au>
						<snm>Bissell</snm>
						<fnm>MJ</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1998</pubdate>
				<volume>95</volume>
				<fpage>14711</fpage>
				<lpage>14716</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">24514</pubid>
						<pubid idtype="pmpid" link="fulltext">9843954</pubid>
						<pubid idtype="doi">10.1073/pnas.95.25.14711</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Automated local bright feature image analysis of nuclear protein distribution identifies changes in tissue phenotype</p>
				</title>
				<aug>
					<au>
						<snm>Knowles</snm>
						<fnm>DW</fnm>
					</au>
					<au>
						<snm>Sudar</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Bator-Kelly</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Bissell</snm>
						<fnm>MJ</fnm>
					</au>
					<au>
						<snm>Leli&#232;vre</snm>
						<fnm>SA</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2006</pubdate>
				<volume>103</volume>
				<fpage>4445</fpage>
				<lpage>4450</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1450191</pubid>
						<pubid idtype="pmpid" link="fulltext">16537359</pubid>
						<pubid idtype="doi">10.1073/pnas.0509944102</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters</p>
				</title>
				<aug>
					<au>
						<snm>Dunn</snm>
						<fnm>JC</fnm>
					</au>
				</aug>
				<source>Journal of Cybernetics</source>
				<pubdate>1973</pubdate>
				<volume>3</volume>
				<fpage>32</fpage>
				<lpage>57</lpage>
			</bibl>
			<bibl id="B7">
				<aug>
					<au>
						<snm>Bezdek</snm>
						<fnm>JC</fnm>
					</au>
				</aug>
				<source>Pattern Recognition with Fuzzy Objective Function Algorithms</source>
				<publisher>Plenum Press, New York</publisher>
				<pubdate>1981</pubdate>
			</bibl>
			<bibl id="B8">
				<aug>
					<au>
						<snm>McLachlan</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Basford</snm>
						<fnm>K</fnm>
					</au>
				</aug>
				<source>Mixture models: inference and application to clustering</source>
				<publisher>Marcel Dekker, New York</publisher>
				<pubdate>1988</pubdate>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Bayesian approaches to Gaussian mixture modeling</p>
				</title>
				<aug>
					<au>
						<snm>Roberts</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Husmeier</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Rezek</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Penny</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>IEEE Trans Pattern Analysis and Machine Intelligence</source>
				<pubdate>1998</pubdate>
				<volume>20</volume>
				<issue>11</issue>
				<fpage>1133</fpage>
				<lpage>1142</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1109/34.730550</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Data clustering: a review</p>
				</title>
				<aug>
					<au>
						<snm>Jain</snm>
						<fnm>AK</fnm>
					</au>
					<au>
						<snm>Murty</snm>
						<fnm>MN</fnm>
					</au>
					<au>
						<snm>Flynn</snm>
						<fnm>PJ</fnm>
					</au>
				</aug>
				<source>ACM Computing Surveys</source>
				<pubdate>1999</pubdate>
				<volume>31</volume>
				<issue>3</issue>
				<fpage>264</fpage>
				<lpage>323</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1145/331499.331504</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<aug>
					<au>
						<snm>Hartigan</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Clustering Algorithms</source>
				<publisher>John Wiley &amp; Sons, NY</publisher>
				<pubdate>1975</pubdate>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Finding groups in data: an introduction to cluster analysis</p>
				</title>
				<aug>
					<au>
						<snm>Kaufman</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Rousseeuw</snm>
						<fnm>PJ</fnm>
					</au>
				</aug>
				<publisher>John Wiley and Sons, NY</publisher>
				<pubdate>1990</pubdate>
			</bibl>
			<bibl id="B13">
				<title>
					<p>An optimal graph theoretic approach to data clustering: theory and its application to image segmentation</p>
				</title>
				<aug>
					<au>
						<snm>Wu</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Leahy</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>IEEE Trans Pattern Analysis and Machine Intelligence</source>
				<pubdate>1993</pubdate>
				<volume>15</volume>
				<issue>11</issue>
				<fpage>1101</fpage>
				<lpage>1113</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1109/34.244673</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Automatic content extraction of filled form images based on clustering component block projection vectors</p>
				</title>
				<aug>
					<au>
						<snm>Peng</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>He</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Long</snm>
						<fnm>F</fnm>
					</au>
				</aug>
				<source>Proc IS&amp;T/SPIE 16th Annual Symp of Electronic Imaging, Conf on Document Recognition and Retrieval XI, San Jose, CA, USA</source>
				<pubdate>2004</pubdate>
				<fpage>204</fpage>
				<lpage>212</lpage>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Evidence accumulation clustering based on the K-means algorithm</p>
				</title>
				<aug>
					<au>
						<snm>Fred</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Jain</snm>
						<fnm>AK</fnm>
					</au>
				</aug>
				<source>Proc of the 16th International Conference on Pattern Recognition, Quebec City</source>
				<pubdate>2002</pubdate>
				<fpage>276</fpage>
				<lpage>280</lpage>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Cluster ensembles &#8211; a knowledge reuse framework for combining multiple partitions</p>
				</title>
				<aug>
					<au>
						<snm>Strehl</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Ghosh</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Journal of Machine Learning Research</source>
				<pubdate>2002</pubdate>
				<volume>3</volume>
				<fpage>583</fpage>
				<lpage>617</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1162/153244303321897735</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Combining multiple weak clusterings</p>
				</title>
				<aug>
					<au>
						<snm>Topchy</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Jain</snm>
						<fnm>AK</fnm>
					</au>
					<au>
						<snm>Punch</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Proc IEEE Intl Conf on Data Mining, Melbourne, FL</source>
				<pubdate>2003</pubdate>
				<fpage>331</fpage>
				<lpage>338</lpage>
			</bibl>
			<bibl id="B18">
				<title>
					<p>A mixture model for clustering ensembles</p>
				</title>
				<aug>
					<au>
						<snm>Topchy</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Jain</snm>
						<fnm>AK</fnm>
					</au>
					<au>
						<snm>Punch</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Proc SIAM Intl Conf on Data Mining, SDM</source>
				<pubdate>2004</pubdate>
				<fpage>379</fpage>
				<lpage>390</lpage>
			</bibl>
			<bibl id="B19">
				<title>
					<p>Bagging for path-based clustering</p>
				</title>
				<aug>
					<au>
						<snm>Fischer</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Buhmann</snm>
						<fnm>JM</fnm>
					</au>
				</aug>
				<source>IEEE Trans Pattern Analysis and Machine Intelligence</source>
				<pubdate>2003</pubdate>
				<volume>25</volume>
				<issue>11</issue>
				<fpage>1411</fpage>
				<lpage>1415</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1109/TPAMI.2003.1240115</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Bagging to improve the accuracy of a clustering procedure</p>
				</title>
				<aug>
					<au>
						<snm>Dudoit</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Fridlyand</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<issue>9</issue>
				<fpage>1090</fpage>
				<lpage>1099</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg038</pubid>
						<pubid idtype="pmpid" link="fulltext">12801869</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables</p>
				</title>
				<aug>
					<au>
						<snm>Chickering</snm>
						<fnm>DM</fnm>
					</au>
					<au>
						<snm>Heckerman</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Machine Learning</source>
				<pubdate>1997</pubdate>
				<volume>29</volume>
				<fpage>181</fpage>
				<lpage>212</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1023/A:1007469629108</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Bayesian clustering methods for morphological analysis of MR images</p>
				</title>
				<aug>
					<au>
						<snm>Peng</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Herskovits</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Davatzikos</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Int Symp on Biomedical Imaging: from Nano to Macro, Washington, D.C</source>
				<pubdate>2002</pubdate>
				<fpage>485</fpage>
				<lpage>488</lpage>
			</bibl>
			<bibl id="B23">
				<title>
					<p>A Bayesian morphometry algorithm</p>
				</title>
				<aug>
					<au>
						<snm>Herskovits</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Peng</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Davatzikos</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>IEEE Transactions on Medical Imaging</source>
				<pubdate>2004</pubdate>
				<volume>23</volume>
				<issue>6</issue>
				<fpage>723</fpage>
				<lpage>737</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1109/TMI.2004.826949</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>NuMA Influences Higher Order Chromatin Organization in Human Mammary Epithelium</p>
				</title>
				<aug>
					<au>
						<snm>Abad</snm>
						<fnm>PC</fnm>
					</au>
					<au>
						<snm>Lewis</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Mian</snm>
						<fnm>IS</fnm>
					</au>
					<au>
						<snm>Knowles</snm>
						<fnm>DW</fnm>
					</au>
					<au>
						<snm>Sturgis</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Badve</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Xie</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Leli&#232;vre</snm>
						<fnm>SA</fnm>
					</au>
				</aug>
				<source>Mol Biol Cell</source>
				<pubdate>2007</pubdate>
				<volume>18</volume>
				<fpage>348</fpage>
				<lpage>361</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1783787</pubid>
						<pubid idtype="pmpid" link="fulltext">17108325</pubid>
						<pubid idtype="doi">10.1091/mbc.E06-06-0551</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
