MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China

Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 100190, China

Beijing Institute of Genomics, Chinese Academy of Sciences, 7 Beitucheng West Road, Beijing 100029, China

Abstract

Background

Gene expression profiling technologies have gradually become a community standard tool for clinical applications. For example, gene expression data has been analyzed to reveal novel disease subtypes (class discovery) and assign particular samples to well-defined classes (class prediction). In the past decade, many effective methods have been proposed for individual applications. However, there is still a pressing need for a unified framework that can reveal the complicated relationships between samples.

Results

We propose a novel convex optimization model to perform class discovery and class prediction in a unified framework. An efficient algorithm is designed and software named OTCC (Optimization Tool for Clustering and Classification) is developed. Comparison in a simulated dataset shows that our method outperforms the existing methods. We then applied OTCC to acute leukemia and breast cancer datasets. The results demonstrate that our method not only can reveal the subtle structures underlying those cancer gene expression data but also can accurately predict the class labels of unknown cancer samples. Therefore, our method holds the promise to identify novel cancer subtypes and improve diagnosis.

Conclusions

We propose a unified computational framework for class discovery and class prediction to facilitate the discovery and prediction of subtle subtypes of cancers. Our method can be generally applied to multiple types of measurements, e.g., gene expression profiling, proteomic measuring, and recent next-generation sequencing, since it only requires the similarities among samples as input.

Background

Accurate diagnosis is a great challenge for clinical therapies. In particular, the current diagnosis based on only a few genes, proteins or metabolites are very limited when it comes to tackling the intrinsic complexity of many diseases, e.g., cancers. Fortunately with the rapid development of high-throughput technologies, gene expression profiling techniques have been widely applied in clinical research. The big advantage is to simultaneously measure the expressions of thousands of genes

In the machine learning framework, class discovery is an unsupervised task. Many methods related to clustering have been proposed and applied to identify new disease subtypes. Several well-known methods, e.g., hierarchical clustering (HC), self-organizing maps (SOM), and non-negative matrix factorization (NMF) have been successfully used

In this paper, we propose a semi-supervised solution to formulate class discovery and class prediction into a unified framework. We term it OTCC (Optimization Tool for Clustering and Classification). The underlying principle is to seek an optimal sample labeling scheme to ensure that similar samples can be assigned with similar labels. This assumption is straightforward and can be easily understood by clinicians. OTCC has several prominent features: 1) The global optimal solution is guaranteed because it is based on convex quadratic programming; 2) It implements class discovery and class prediction in one computational framework; 3) It does not require many samples; 4) It can be applied to both small and large datasets due to a customized iterative algorithm. Experiments on acute leukemia and breast cancer datasets suggest the validity and advantages of OTCC in mining the clinical significance of patient gene expression data.

Methods

Overview of the optimization model

For simplicity, we consider two classes to illustrate the optimization model. We note that both class discovery and class prediction for the two classes can be transformed into a sample labeling problem. In this section, the optimization model is formulated to find the best way to assign labels to the samples. The labeling problem for multi-class cases for class discovery and class prediction will be discussed in the next sections.

For two-class cases, we denote one class by zero and the other class by one. Assume all the sample labels are continuous variables between zero and one. The objective of the optimization model is to assign similar labels to similar samples as much as possible. The formulations are given as follows:

Subject to

where _{ij} is the similarity score of samples _{i} and _{j,} which is calculated from the gene expression profiles; and _{i} is the unknown variable to be determined and represents the label of sample _{i}.

The objective function (1) can be rewritten in vector form as _{i,} is the label of Sample _{ij}, the similarity score of samples _{ij} are all non-negative,

Due to the form of the objective function, our optimization model is tightly related to spectral clustering and semi-supervised learning

The sample similarity matrix

Usually the gene expression profile for _{ij} represents the expression level of gene _{i} is an m-dimensional vector denoting the expression value of gene _{ij}, a linear transformation can be adopted to map [−1, 1] to [0, 1]. Because the Pearson correlation coefficients based on the gene expression profiles are calculated pairwisely between every two samples, it does not consider the similarities among samples globally. To provide a global similarity measure, a second-order correlation similarity matrix can be constructed by exploiting the deduced sample correlation features (i.e., calculating the Pearson correlation coefficients of the sample correlation vectors). In this study we used second-order correlation similarity matrices to identify the underlying structures of cancer gene expression data.

Setting for class discovery

Given the similarity matrix _{ij}. The trivial solution indicates that all the samples belong to one class, which is meaningless. To obtain a meaningful solution, _{ij} for _{ab}. Let Sample _{a} be labeled with zero and _{b} be labeled with one, or vice versa. If there is more than one minimal value in S, the sample pair with minimal values in S^{n} (the power of similarity matrix S, where n > 1 is a positive integer) is also a candidate to determine set A and B. Model (1–2) is then well constructed and optimal labeling can be uniquely determined by solving the model.

Setting for class prediction

Class prediction tries to assign a set of particular samples to known classes. In this setting, gold-standard data are generally available and some gene expression profiles for samples are labeled with known classes. That is,

A fast algorithm for large-scale problems

Model (1–2) can be considered convex quadratic programming if all values of _{ij} are positive. It can be solved efficiently by the general solvers such as quadprog in Matlab and the sequential minimal optimization (SMO) algorithm which has been applied successfully to solve the optimization problems in support vector machine applications. Here, a simple customized algorithm is proposed to solve Model (1–2) quickly, even for very large-scale problems by fully considering its particular characteristics.

The Lagrange function of optimization model (1–2) is:

Then the Karush-Kuhn-Tucker (KKT) conditions are:

These conditions can be reduced as:

We design the following algorithm to quickly find the solution:

**Algorithm 1.**

· **Step 1**: Let

· **Step 2**: Calculate

· **Step 3**: Let **Step 2** and **Step 3**.

Next, we prove the above algorithm is correct and convergent.

**Theroem 1:** Suppose **Algorithm 1** gives rise to the sequence,

Firstly, we prove that **Algorithm 1** is convergent. The Lagrangian function of our optimization model (1–2) is as follows,

Then an auxiliary function

where

where

Recalling the KKT condition and our iterative **Step 2** can be reformulated as,

By the property of the auxiliary function, we have

Secondly we show **Algorithm 1** is correct. At convergence, the solution is

One advantage of our algorithm is that the computational complexity is low and it requires only a small amount of computer memory. So our algorithm can be applied to very large data sets.

Post-processing the solutions

Each sample gets a continuous label between zero and one after the optimization model (1)-(2) is solved. We can easily obtain the binary labels by applying a pre-defined threshold. If a training data set is available, this threshold can be learned from the training data by cross-validation. Otherwise, the median of zero and one, 0.5, is a natural cutoff to convert the continuous labels into binary labels. If label

Multiple-class cases

In practice, the samples may belong to more than two classes. For class discovery cases, the class labels can be obtained by recursively applying our model to classify samples into two groups on each step until some stopping criterion is satisfied. Here we propose an intuitive criterion and name it as the minimum similarity score criterion. Formally, the procedure for class discovery with multiple classes is described as follows:

· **Step 1**: Classify samples into two classes by OTCC.

· **Step 2**: Calculate the inner minimum similarity score for each class. If the minimum similarity score of some class is less than a predefined threshold, then repeat **Step 1** to classify the samples of this class into two sub-classes.

· **Step 3**: repeat **Step 2** until all the inner minimum similarity scores of the classes are above the threshold.

The procedure does not require the number of clusters but instead relies on the least tolerant similarity score within classes. Compared to the number of clusters which is generally required by many existing class discovery methods, our similarity score is tightly related to the expert’s knowledge and is expected to be defined by clinicians and biologists based on their knowledge. Alternatively, without pre-defining a stopping criterion, OTCC can be applied recursively until each sample is a single class. This outputs a binary tree in which all samples are leaves and the relationships among them are fully depicted. This property allows OTCC to reveal the fine structure of patient samples.

For class prediction cases, the relationship between multiple classes can be organized as a binary tree and then the model can be applied recursively according to the binary tree to obtain the labels of all samples. The binary tree should reflect the relationship of the classes. Otherwise wrong prior information will be introduced and mislead the class prediction results. When the class relationships are not available or all the classes are independent of each other, an arbitrary binary tree can be used. One-vs-one or one-vs-all strategies can also be adopted to extend OTCC to multi-class cases.

Results and discussion

Performance of OTCC on simulated data sets

We first evaluated OTCC on a simulated dataset and compared the results with those that can be obtained using the existing method. Two types of datasets were simulated. The first dataset consisted of two classes. One class had five samples and the other had n-fold samples relative to the first class. We directly simulated the similarity matrix of the samples. The similarity scores of the two samples from the same class were set to be one and the similarity scores of two samples from different classes were set to be zero. Then noise subjected to a normal distribution with mean zero and standard variation “Sigma” was added. Each setting (noise and ratio of class sizes) was repeated 1000 times. With various levels of noise and ratio of class sizes, the performance of OTCC was noted, and is shown in Figure

Clustering accuracy of OTCC (A) and Affinity Propagation (B) on simulated data sets with various levels of noise and ratios of class sizes.

**Clustering accuracy of OTCC (A) and Affinity Propagation (B) on simulated data sets with various levels of noise and ratios of class sizes**. “Sigma” is the standard variation of noise distribution.

The second simulation dataset consisted of multiple classes and was generated using a similar procedure. For multiple classes, we applied OTCC recursively to construct a binary tree to reveal the multiple classes. If the real relationship among multiple classes is indeed a binary tree, it is reasonable to expect OTCC to succeed. Here we consider an extreme example to show that OTCC can also successfully deal with cases in which the relationship among multiple classes is inherently not a binary tree.

In Figure

A, a simple simulated data set with three classes; B, performance of OTCC on multiple classes with unbalanced classes and various levels of noise.

**A, a simple simulated data set with three classes; B, performance of OTCC on multiple classes with unbalanced classes and various levels of noise.**

The success of OTCC for resolving the above multi-cluster structure lies in its ability to form pseudo-clusters when clustering. There are two globally optimum solutions in this case (Nodes 11 to 15 have the same labels as Nodes 1 to 5 or Nodes 6 to 10). OTCC assigns Nodes 11 to 15 to the same labels as Nodes 1 to 5, generating a degenerative pseudo-cluster whereas Nodes 6 to 10 are classified correctly first. We recursively applying OTCC to pseudo-clusters until the consistence criterion applies to each cluster. In this way it resolves the multi-cluster structure irrespective of whether the relationship among the multiple classes is inherently a binary tree or not.

In Figure

Experiments on cancer gene expression data sets

Next we use two real data sets to demonstrate the effectiveness and advantages of our models in both class discovery and class prediction settings. One data set is the gene expression profiling of seventy-two acute leukemia patients

Leukemia data

The raw microarray data contain much noise, so we perform data preprocessing before we construct the similarity matrix and do class discovery and class prediction. We first set a ceiling (16,000) and a floor (100) for the intensities and then filter those genes with

**Methods**

**AML vs ALLs**

**AMLs vs B cell ALLs vs T cell ALLs**

*k-means was run 1000 times and the accuracy was calculated based on running with the minimal objective function; ^, if affinity propagation generated more than predefined clusters, similar clusters would be merged to calculate the accuracy.

OTCC

98%

96%

98%

71%

Spectral clustering in jClust

97%

85%

Affinity propagation in jClust^

97%

94%

Hierarchical clustering

98%

76%

We first applied

Clustering accuracy of 1000

**Clustering accuracy of 1000** **-means runs on the AML and ALL data vs the corresponding objective functions.** The minimal sum of deviation from the class centers (the objective function of

To highlight the pattern underlying the AML and ALL samples, we construct a similarity matrix by first calculating the Pearson correlation coefficients of the gene expression profiles and then calculating the Pearson correlation coefficients of the similarity vectors of each sample. That is, the similarity vectors of each sample (the similarity relationships to other samples) are treated as new features. Then we apply our model (1)-(2) recursively to explore the groups underlying the samples. The result is shown as a rooted tree (Figure

The classes underlying the seventy-two AML and ALL samples in the leukemia data set revealed by OTCC with the class discovery setting.

**The classes underlying the seventy-two AML and ALL samples in the leukemia data set revealed by OTCC with the class discovery setting.** Samples 1, · · ·, 25 are AMLs. Samples 26, · · ·, 62 are B cell ALLs. Samples 63, · · ·, 72 are T cell ALLs.

Applying the spectral clustering to the same similarity matrix (implemented in jClust

We also evaluated the affinity propagation algorithm

Agglomerative hierarchical clustering is another popular tool for analyzing the subtle structure underlying the gene expression profiles of cancer samples. Applying agglomerative hierarchical clustering with Euclidean distance to the AMLs and ALLs dataset, it can identify AMLs from ALLs except sample 25. But it failed to discriminate B cell ALLs from T cell ALLs (accuracy: 31/47 = 66%). The T cell ALLs and a set of sixteen B cell ALLs form one cluster whereas other B cell ALLs form the other cluster. The failure of the agglomerative hierarchical clustering for discriminating T cell ALLs from B cell ALLs can be attributed to the fact that the bottom-up cluster merge strategy is a greedy one and cannot find global optimum.

Given the known labels of some samples, our model can also carry out the class prediction task. Using the same data set, we evaluate the performance of our model under different conditions in which a fraction of sample labels are known. Given the numbers of each type of samples whose labels are known, we randomly select the same numbers of samples as the prior knowledge and then apply our model to predict the labels of the remaining samples. Repeating one thousand times, we calculate the mean accuracy. The result is shown in Figure

Mean accuracy heatmap by applying our model to predict the labels of samples in the leukemia data set given labels of certain samples. Each condition was repeated one thousand times.

**Mean accuracy heatmap by applying our model to predict the labels of samples in the leukemia data set given labels of certain samples.** Each condition was repeated one thousand times.

Breast cancer data

The leukemia data set is assumed to be easy because there are many informative genes which indicate the underlying cluster structure. We repeat the evaluation on another breast cancer dataset to illustrate the advantages of our model on noisier data sets. Since the data set is generated by profiling the gene expressions of stromal and epithelial cells of five normal and twenty-eight breast cancer patients, the samples belong to four classes: normal stromal cells (ns), normal epithelial cells (ne), cancer stromal cells (cs), and cancer epithelial cells (ce)

The three major classes underlying the fifty-six breast cancer samples and ten normal samples.

**The three major classes underlying the fifty-six breast cancer samples and ten normal samples.**

Given some prior information about the labels of the samples, we applied our model to this data set in the class prediction setting. We obtained similar observations to the leukemia dataset (Figure

Mean accuracy heatmap by applying our model to predict the labels of samples in the breast cancer data set given labels of certain samples. Each condition was repeated one thousand times.

**Mean accuracy heatmap by applying our model to predict the labels of samples in the breast cancer data set given labels of certain samples.** Each condition was repeated one thousand times.

Property summary of OTCC compared to other methods

Gene expression profiling technologies, e.g. microarrays and deep sequencing, have become more and more important for clinical practices, such as diagnosis and prognosis. Class discovery and class prediction are two typical tasks to utilize gene expression profiling technologies to leverage the quality and efficiency of diagnosis and prognosis. In this study, we propose a novel optimization model and integrate two tasks in one framework by treating class discovery and class prediction as a process of labeling. By seeking an optimal labeling scheme that fits best to the gene expression profiling of samples, a convex quadratic programming model is established. It can be solved efficiently and the global optimum solution is guaranteed. It does not need manual intervention to set a cutoff and can detect outliers to improve the statistical signal in the data. It does not use directly the clinical measurement but rather uses a similarity matrix as its input. The biomarker identification process is thus separated from class discovery and class prediction, facilitating clinicians to integrate prior knowledge with the measurements. It can also be applied to multiple types of measurements, e.g. gene expression profiling, proteomic analysis, and next-generation sequencing. Because the similarity matrix is the only input, the output is sensitive to biomarker selection and similarity measures choices. Proper biomarkers and similarity measures will generate reasonable accuracy and greatly accelerate understanding of the nature of diseases. Numerical experiments on leukemia and breast cancer data sets suggest that it is very effective for revealing and predicting the subtle subtypes of cancers based on the gene expression data of patients.

Because the objective function of our model is a quadratic form of the Laplacian matrix, it is closely related to spectral clustering and semi-supervised learning methods. Spectral clustering can be generally solved by seeking the Fiedler vector of the Laplacian matrix

Conclusions

Class discovery and class prediction are two tasks linked to each other inherently in clinical research. Previous studies proposed methods for these two tasks separately. And thus ignored the linkage between these two tasks. In this study, we model class discovery and class prediction in one framework and facilitate the discovery and prediction of subtle subtypes of cancers. Because of its flexibility, our method can be applied to multiple types of measurements, e.g. gene expression profiling, proteomic analysis, and next-generation sequencing and allows the integration of extensive prior information.

Abbreviations

HC, hierarchical clustering; SOM, self-organizing maps; NMF, non-negative matrix factorization; OTCC, an Optimization Tool for Clustering and Classification; SMO, sequential minimal optimization algorithm; AML, acute myeloid leukemia; ALL, acute lymphoblastic leukemia.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XR proposed the model. YW proposed the fast algorithm. XSZ checked the theoretical properties of the model and the algorithm. XR and JW completed the computation on the datasets. XR, YW, JW and XSZ wrote and approved the manuscript. All authors read and approved the final manuscript.

Acknowledgments

The authors thank the members of the ZHANGroup of Academy of Mathematics and Systems Science, Chinese Academy of Science for their valuable discussion and comments. This work is partially supported by the Grant No. 61171007 and No. 11131009 from the National Natural Science Foundation of China.