Abstract
Background
Although microscopic diagnosis has been playing the decisive role in cancer diagnostics, there have been cases in which it does not satisfy the clinical need. Differential diagnosis of malignant and benign thyroid tissues is one such case, and supplementary diagnosis such as that by gene expression profile is expected.
Results
With four thyroid tissue types, i.e., papillary carcinoma, follicular carcinoma, follicular adenoma, and normal thyroid, we performed gene expression profiling with adaptortagged competitive PCR, a highthroughput RTPCR technique. For differential diagnosis, we applied a novel multiclass predictor, introducing probabilistic outputs. Multiclass predictors were constructed using various combinations of binary classifiers. The learning set included 119 samples, and the predictors were evaluated by strict leaveoneout cross validation. Trials included classical combinations, i.e., onetoone, onetotherest, but the predictor using more combination exhibited the better prediction accuracy. This characteristic was consistent with other gene expression data sets. The performance of the selected predictor was then tested with an independent set consisting of 49 samples. The resulting test prediction accuracy was 85.7%.
Conclusion
Molecular diagnosis of thyroid tissues is feasible by gene expression profiling, and the current level is promising towards the automatic diagnostic tool to complement the present medical procedures. A multiclass predictor with an exhaustive combination of binary classifiers could achieve a higher prediction accuracy than those with classical combinations and other predictors such as multiclass SVM. The probabilistic outputs of the predictor offer more detailed information for each sample, which enables visualization of each sample in lowdimensional classification spaces. These new concepts should help to improve the multiclass classification including that of cancer tissues.
Background
Histopathological analysis is the traditional mainstay of cancer diagnostics, and is relied on heavily to discriminate between malignant and benign tissues. However, in spite of its long and successful history as a routine medical technique, it is plagued by a number of wellknown problems. For example, histological diagnosis often depends upon judgment calls by individual pathologists, leading to variations in diagnosis [1,2]. In addition, it may be difficult in some cases to differentiate between malignant and benign tissues. In such cases, microscopic observations have often been complemented by staining for molecular markers, such as expressed genes. This led directly to today's emerging approach, gene expression profiling, by which the expression levels of thousands of genes throughout the genome are measured by DNA microarrays or alternative techniques. Using this technique, diagnostic systems using multiple genes have been constructed [3]. Clinical status parameters such as prognosis or drug resistance are the popular targets, and the supervised learning theory is often applied.
In most examples of cancer classification, binary classifiers have been used, with multiclass predictors only rarely applied. However, because more than three tissue types are often differentiated during pathological diagnosis, a stable or optimized multiclass predictor is needed.
In this study, we developed a new multiclass predictor based on a probabilistic model. We examined various combinations of binary classifiers to perform optimized multiclass prediction, and applied our system to gene expression data from four tissue types of human thyroid origin. Thyroid cancer is a relatively common cancer, accounting for roughly 1% of total cancer incidence. There are two main types of thyroid cancer, papillary carcinoma (PC) and follicular carcinoma (FC). In addition to these malignant types, a benign tumor, follicular adenoma (FA), is also prevalent. The intriguing problem in thyroid cancer therapy is the preoperative differential diagnosis of follicular carcinoma vs. adenoma. The main diagnostic procedure is fine needle aspiration, but because the tissue structure is disrupted during the sampling process, the differential diagnosis is extremely difficult [4]. Papillary carcinoma is relatively easier to distinguish, but good molecular diagnostics would still be of potential clinical value. Thus, thyroid cancer represents a good example of a tumor type for which conventional histopathologic diagnosis faces limitations.
We performed gene expression profiling of the four thyroid tissues by adaptortagged competitive polymerase chain reaction (ATACPCR) [5,6], a highthroughput RTPCR technique. We first analyzed the global differences in gene expression among the four thyroid tissues, and confirmed that the data matrix contained information for class separation. Then, we applied our new multiclass predictor for differential diagnosis of the four tissue types. The multiclass predictor outperformed previous methods used in the field.
Results
Difference in gene expression among the four thyroid tissues
We first examined whether there were significant differences in gene expression between the four thyroid tissue types by performing spectrum analysis based on correlation ratio (CR) [7]. Correlation ratio is an indicator for correlation of gene expression with two classes. High CR values indicate strong gene expression differences. Genes were first sorted by CR value order, and then the CRs of the original total data set were compared with those of permuted data. Figure 1a shows the results of comparison between FA and normal thyroid tissue (N). The CRs of the original data set are consistently higher than those of the permuted data. Because those of the original data were consistently higher throughout the full range of CRs, the correlation was not restricted to a small number of genes, but was a global character. With five other comparisons, we achieved similar results (data not shown), which suggested that there were significant differences in gene expression among the four thyroid tissues. The maximal (i.e. the most pessimistic) permutation pvalues of the top 250 genes are 0.0064 for FA vs. FC, 0.0046 for FA vs. N, 0.0001 FC vs. N, and <10^{7 }for the other four comparisons.
Figure 1. Correlation of global gene expression profiles with differences in thyroid tissue type. a) correlation ratio between follicular adenoma (FA) and normal thyroid (N). The red line and blue lines represent the correlation ratio of the original data and those of the permuted data, respectively. Blue lines are the results of twelve trials of permutation. b) correlation ratios of various combination of the four thyroid tissues.
We then compared the CR spectra of all pairwise comparisons. The CR of each pair was calculated; genes were sorted by CR value order, and plotted (Figure 1b). Figure 1b shows the CR values of the top 250 genes: this figure represents the degree of gene expression difference of pairwise comparison, not indicating significance of gene expression difference (see below). Throughout this region, FA vs. N and FA vs. FC exhibited the smallest and the second smallest CR values. This result may reflect the difficulty in diagnosis: FA and N maintain microscopic follicular structures; FA and FC are the most difficult tissues to be differentially diagnosed. The other four comparisons, i.e., FC vs. N, FC vs. PC, FA vs. PC, N vs. PC, exhibited no substantial difference in CR in the high CR region (top 50 genes). In the middle CR region (the top 50–250 genes), the CRs of N vs. PC are close to those of FA vs. FC.
The CR spectra directly represent global differences in gene expression. We also evaluated these results by conventional measures, using pvalues and qvalues [8] of the standard ttest. When the cutoff pvalue (significance level) is set at 0.0001, the numbers of genes and corresponding qvalues are as follows: FA vs. N, 8 genes (qvalue, 0.018); FA vs. FC, 7 (0.017); N vs. PC, 44 (0.0038); FC vs. N, 25 (0.0048); PC vs. FA, 66 (0.002); PC vs. FC, 41 (0.0026). These results parallel those from the CR spectrum analysis. Comparisons such as FA vs. N and FA vs. FC exhibited small numbers of significant genes and high qvalues, and those such as PC vs. FA, and PC vs. FC exhibited large numbers of significant genes and low qvalues. FC vs. N and N vs. PC are the middle of the above two.
Differential diagnosis by binary classifier
Before performing multiclass prediction, we examined the performance of binary classifiers in the differential diagnosis of each pair of the four thyroid tissues. We constructed each diagnostic system based on the weightedvoting algorithm [3], which has excellent predictive ability for gene expression data. As shown in Figure 2, classifiers for PC and one of the other tissues are stable, and exhibited good classification. In contrast, those for FA vs. FC and FA vs. N were not optimal, exhibiting unstable accuracy curves. Thus, the results using binary classifiers paralleled those of statistical analyses of individual genes by CR and false discovery ratio (FDR).
Figure 2. Accuracy curves of binary classifiers differentiating two of the four thyroid tissues. Vertical axis, accuracy; horizontal axis, selection criteria of diagnostic genes as pvalue (tstatistics).
Strategy for construction of a multiclass predictor
In this study, we constructed a multiclass prediction system based on a combination of binary classifiers. Each unit binary classifier was constructed by the weightedvoting algorithm, where the diagnostic genes were selected by the criterion p ≤ 0.0001. In this criterion, all of the corresponding qvalues were less than 0.05, maintaining the least inclusion of false positive genes. It should be noted that application of the following strategy is not limited to weightedvoting. We examined the following four types of construction methods using different numbers of binary classifiers.
1R: constructed using classifiers discriminating between one class and all of the other classes. 1R is conventionally referred to as "onetotherest" [9,10].
11: constructed using classifiers discriminating a pair of classes. 11 is conventionally referred to as "onetoone" [1113].
1A: constructed using classifiers discriminating between one class and any subset of other classes.
AA: constructed using classifiers discriminating between any pair of disjoint subsets of classes.
For this study of the four thyroid tissues, the numbers of classifiers were four (1R), six (11), twentytwo (1A), and twentyfive (AA). 1A includes all of the binary classifiers of 1R and 11, and AA includes all of the binary classifiers of 1R, 11 and 1A. The combinations in this study are shown in Figure 3. 1R and 11 have been used in previous studies [913], but 1A and AA are novel.
Figure 3. Combinations of unit binary classifiers used for construction of a multiclass predictor in this study. [lm] represents a unit binary classifier separating class(es) l and class(es) m.
The conventional construction rule for multiclass prediction from outputs of each binary classifier was designed for 11 and 1R. Binary output functions are usually used (see the materials and methods section), and prediction is performed by simple summation of the outputs. It should be noted that outputs are null when a group of classes, i.e., ''rest'' of ''onetotherest'', wins. This rule, however, is not applicable to AA, because AA includes classification between groups of classes.
To incorporate general binary classifiers, we invented a method to integrate probabilistic outputs from the classifiers (Figure 4a). First, for each pair of class subsets, l or m (for example l = {c_{1}} and m = {c_{2},c_{3}}), the corresponding binary classifier was assumed to produce the probabilistic outputs q_{[lm] }and 1  q_{[lm]}. When weighted voting is used as a classifier, the probabilistic output is calculated by onedimensional logistic regression from the prediction strength of weighted voting. Thus, the probabilistic output (termed "class probability") of each of the multiple classes was calculated by integrating the set of probabilistic outputs calculated above. Finally, we obtained a single discrete class prediction as the class, which takes the maximum class probability. Using a simple integration method called simple summation (SIS), we calculated the sum of the probabilistic output q_{[lm] }to obtain the probability of each class, when the selected class subset l included the class. SIS is a probabilistic version of conventional voting [13,14]. In addition, we propose a method called shared summation (SHS), by which the probabilistic output of a multiclass group is shared equally by each class (Figure 4b). This method is based on a probabilistic decision process using probabilistic outputs from binary classifiers.
Figure 4. a) The probabilistic decision process of the multiclass predictor, for an example classification problem of four classes: 1, 2, 3 and 4. Firstly, one of the binary classifiers in a set B is selected with uniform probability 1#B, where #B is the number of binary targets in B. Secondly, class subset 1 or 2,3 is selected with probability q_{[123]}(x) or 1  q_{[123]}(x), respectively. Thirdly, class 2 or 3 is selected with a probability of 1/2. Accordingly, one of the classes is selected with a certain probability. b) Calculation of class probability by SHS. For a binary classifier [lm] in B and an input x which is a member of classes in l or m, we define q_{[lm]}(x) as an estimated probability where x is a member of class(es) in l, and the complement probability 1  q_{[lm]}(x) where x is a member of class(es) in m. For example, q_{[123]}(x) and 1  q_{[123]}(x) indicate the probability that x belongs to the class 1, and that x belongs to the class 2 or 3 provided that x belongs to the class 1, 2 or 3. In the SHS procedure, the probabilistic outputs by the multiple classifiers are shared and integrated by multiple classes, leading to the estimated class membership probabilities: p_{1},p_{2},p_{3 }and p_{4}. When l and/or m are set of multiple classes, the corresponding probabilistic outputs are shared equally to each of the members. For example, q_{[123]}(x) is added to p_{1}, 1  q_{[123]}(x) is shared equally and added to p_{2 }and p_{3}, q_{[1234]}(x) is added to p_{1}, 1  q_{[1234]}(x) is shared equally and added to p_{2}, p_{3 }and p_{4}, and so on for all members of B. Consequently, we obtain an estimation of multiple class probabilities p_{1},p_{2}, p_{3 }and p_{4 }by normalizing them so that the summation p_{1}, p_{2}, p_{3 }and p_{4 }would be one.
From four patterns of classifier combination (1R, 11, 1A, AA) and two output summation methods (SIS and SHS), we generated seven types of multiclass predictors (AA with SIS is not applicable).
Classification of thyroid tissues
Using 119 samples from the learning set, we evaluated the seven multiclass predictors described above by leaveoneout crossvalidation (LOO). For 119 iterations, one sample was left out, unit binary classifiers of an appropriate combination were constructed using data from the remaining 118 samples, each binary classifier was run, and the binary outputs were summated and classified. As shown in Table 1, prediction accuracies of 1A and AA were higher than those of conventional methods, i.e., 11 and 1R. The prediction accuracy was greatest (79.8%) for 1ASIS and 1ASHS, and the next for AASHS (79.0%).
Table 1. Prediction accuracies (%) of the learning sample set, evaluated by leaveoneout crossvalidation.
The details of the predictor 1ASHS were analyzed by confusion matrix (Figure 5a). There were some misclassifications with FC as FA and FA as N. In contrast, there was very little misclassification of PC.
Figure 5. a) Confusion matrix of the results of the learning set (1ASHS, evaluated by leaveoneout crossvalidation). Each cell shows the number of samples along with true and predicted labels. True and predicted class labels are aligned vertically, and horizontally. b) Confusion matrix of the results of the test set.
Our multiclass predictor was validated with an independent test set consisting of 49 samples. A multiclass predictor based on 1ASIS, 1ASHS, and AASHS was constructed using gene expression data from the learning set (119 samples), and was applied to the independent test set. The overall prediction accuracy was 85.7% with all of the three predictors. The confusion matrix of 1ASHS is shown in Figure 5b.
Comparison with other methods
We compared our method with other multiclass predictors popular in the field of cancer classification using the same learning data set (119 samples) as well as the other three data sets. The additional data sets are as follows. All data sets and algorithms were evaluated by LOO. With any algorithms, limited numbers of trials in classifier conditions (e.g., number of selected genes) were performed to avoid excessive overlearning.
The esophageal cancer data set
This data set is also composed of original gene expression profiles obtained from esophageal cancer of Japanese patients by ATACPCR [15]. The task here is differential diagnosis of three histological types: poorly differentiated, 14; moderately differentiated, 97; well differentiated, 30.
The GCM data set
The original global cancer map data set consists of 14 cancer types [14]. Because the number of combinations of binary classifiers increases exponentially as the number of classes increases, we selected five cancer types: breast, 11 samples; prostate, 10; leukemia, 30; renal, 11; mesothelioma, 11.
The SRBCT data set
Gene expression profiles about small round blue cell tumors (SRBCTs) of childhood, which contain 83 samples and 2308 genes measured by cDNA microarrays [16]. SRBCTs, which include the Ewing family of tumors (EWS), rhabdomyosarcoma (RMS), Burkitt lymphoma (BL), and neuroblastoma (NB) [17]. The composition of the samples are 29 (EWS), 25 (RMS), 11 (BL), and 18 (NB).
Firstly, we tested our predictors based on the weightedvoting algorithm with various diagnostic gene sets. We selected diagnostic genes with seven threshold pvalues, i.e., 10^{6}, 10^{5}, 10^{4}, 10^{3}, 10^{2}, 10^{1}, 1. Due to unstable results with a small number of genes, we used at least 10 genes in cases where the number of selected genes was less than 10. In Table 2, we presented only best accuracies with respective threshold pvalues. With the thyroid cancer data set, 1ASHS/SIS exhibited best results. 11SHS exhibited the accuracy as high as those by 1ASHS/SIS, but threshold pvalue was their 10fold. With the GCM set, AASHS exhibited the highest accuracy. With the esophageal cancer data set, the highest accuracy was observed with 1A/AASHS. As a whole, these results suggest 1A and AA are superior to conventional combinations of the classifiers, i.e., 11 and 1R.
Table 2. Comparison of prediction accuracies of various algorithms. Each figure represents the best accuracy obtained by the geneselection condition shown as the value in the parenthesis. The greatest accuracy of each data set is shown as bold letters. Values in parentheses are numbers of diagnostic genes selected by recursive feature elimination for MCSVM, shrinkage parameters for SC, and threshold pvalue for others.
Our predictors were compared with other multiclass prediction methods reported in the field. One is the combination of support vector machines (multiclass SVM, MCSVM) described by Ramaswamy et al. [14], and the other is the shrunken centroid method (SC) [18]. The first method corresponds to 1RSIS without probabilistic outputs by our definition, and linearkernel SVM is used as a unit binary classifier instead of a weightedvoting algorithm. To select diagnostic genes, we performed recursive feature elimination [19]: prediction accuracies were determined with various numbers of genes, i.e., all genes, 2000, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1. The greatest accuracies with respective gene number are shown in Table 2: the accuracies were less than those obtained by our method.
The shrunken centroid is an improved method of the simple nearest prototype (centroid) classifier. We evaluated this method by LOO with various shrinkage parameters from 0 to 6 with 0.5 as an interval, i.e., a total of 13 parameters. With the all four data sets, our method yielded results superior to the shrunken centroid.
Tissue sample display by class probability
The outputs of our multiclass predictors include not only the label, but also the class probability of each class. This information can be used to differentiate samples under the same label, and further characterize differences among them. In Figure 6, the learning set samples are displayed in the threedimensional space constructed from the class probabilities of FA, FC, and N, and indeed all four tissue types could be resolved into separate locations. It should be noted that several samples are located apart from their respective clusters, partly due to an error in labeling in the histopathological diagnosis. Otherwise, these could represent samples with molecular features distinct from the other members of their assigned clusters.
Figure 6. Visualization of the learning samples by class probabilities.
Discussion
In spite of their integral role in cancer diagnosis, many histopathological diagnostic protocols are still dependent on judgment calls made by individual pathologists, and as a result are significantly prone to error. More objective measures are needed, such as computational interpretation of microscopic images or complementary diagnosis by gene expression. For diagnosis by gene expression profile, conventional binary classifiers are not sufficient because the number of tissue types to be differentially diagnosed is usually greater than two. In this study, we have described the differential diagnosis of four thyroid tissues by use of a novel multiclass predictor with a test prediction accuracy of as high as 79~85%. As a result, our diagnostic system represents a promising first step towards an automated pathological tool to complement current diagnostic procedures.
Our approach to multiclass prediction is a novel one characterized by increasing combination of binary classifiers and a probabilistic approach for outputs. Traditionally, "onetoone" or "onetotherest" approaches have been applied to integrate combinations of unit binary classifiers. As shown above, 1A and AA exhibited prediction accuracy better than conventional approaches. Our results suggest that conventional methods such as "onetoone" and "onetotherest" are not necessarily optimal, recommending use of more combinations of binary classifiers. The main disadvantage of AA and 1A is that the number of classifiers increases exponentially as the number of classes increases. However, this is not a significant disadvantage when the method is applied to cancer classification problems, because the number of classes for separation is nearly always less than seven.
Another characteristic of our predictor is the introduction of a probabilistic model to the outputs. Probabilistic output of the multiclass predictor, here designated as class probability, offers detailed information about each sample, which could not be obtained by binary outputs. In particular, it is useful to visualize the relationships between the classes using a two or threedimensional plot of class probabilities. This visualization may in fact be more effective than unsupervised approaches such as principal component analysis and multidimensional scaling [20], because the class probabilities were calculated from weighted sums of selected diagnostic genes. In particular, it is useful for gaining a deeper understanding of individual samples. For example, among the FC samples, some appear rather similar to FA, and a few seem separate from all of the classes. Detailed analysis of class probability may lead to discoveries of new tumor classes. It is also possible that some of the samples may have been initially misdiagnosed by the pathologist. Reexamination of these tissue sections may lead to more accurate diagnosis.
One may argue that our assumption that the choice of unit binary classifier is an independent process (see Materials and Methods) is too simplified. For example, because one class appears in different classifiers in different combinations, these classifiers may not be independent. A simple solution for this problem is introduction of prior probability in the choice of unit binary classifier. For example, our predictor is only modestly effective at separating FC from FA, but this shortcoming could be improved by the introduction of appropriate prior probabilities.
Another possible critique is our equal sharing of outputs of unit binary classifiers. The optimal ratio of output sharing can be obtained by taking account of correlations between similar targets, such as [c_{1}c_{2}] and [c_{1},c_{2}c_{3},c_{4}]. Our preliminary results using a method similar to that previously used by Hastie et al. in the pairwise case [12] suggested that the prediction accuracy was equal to that of SHS. We suspect that SHS is a good estimate for the sharing method deduced by this more detailed model. In any case, the optimum model of outputs should not decrease accuracy as increase in number of constituent binary classifiers. However, with the thyroid data set, the accuracy of 1A was higher than that of AA, suggesting that the current probabilistic model is incomplete. Our complete version of the output model, requiring a larger computer resource, will be described elsewhere (Yukinawa et al., submitted).
Multiclass classification techniques can be roughly divided into two types. One is the decomposition of multiclass problems into binary ones. ''onetotherest'' methods [9,10], pairwise comparisons, i.e., ''onetoone'' methods [1113], errorcorrecting output coding [2123] belong to this type. There have been comprehensive studies centered on SVM [24,25], but there have been no definitive methods in this type. Recent developments also include logitBoost [26] or genetic algorithm [27] as unit classifiers. The other type is binary classification algorithms that can be naturally extended to handle multiclass problems directly. SVM with multiclass objective functions [28,29], discriminant analysis [30] and regression and decision trees are of this type. Recent developments include shrunken centroid [18], an advanced version of nearest centroid [31], and total principal component regression [32]. Our method is an extension of one of the former type approaches. For comparison with previous methods, we chose a ''onetotherest'' method using a support vector machine from the first type [14], and the shrunken centroid method [18] from the latter type. With the four data sets, including two in the public domain, we established that our method was superior to these two methods.
Ramaswamy et al. [14] pointed out that the weightedvoting was generally not wellperformed as MCSVM. This may be because they selected genes in the order of signaltonoise ratio, and used the gene number as the threshold. In contrast, we used the pvalue as the threshold: the pvalue threshold would be essential to obtain multiclass predictor with the weighedvoting algorithm, because using the number of genes instead may incorporate unnecessary genes or ignore necessary genes in some classifiers.
The shrunken centroid [18] is a simple modification of the nearest centroid method. The prediction accuracy obtained was higher than that of the combination of SVM, but still fell significantly short of our algorithm. Moreover, because the best results were obtained by comparison of different shrinkage parameters, the predictor may be overfitted. The discriminant hyperplanes resulting from the shrunken centroid were rather simple, and a more complex classification such as that for thyroid tissue may require more complicated discriminant surfaces, such those obtained by our method. One of the advantages of the shrunken centroid method is that each sample can be described by class probabilities calculated from discriminant scores. This feature is similar to our class probability, and enables visualization of samples as in our method.
The main shortcoming common to previous multiclass studies of cancer classification is the use of sample data matrices, which are easily diagnosed by conventional techniques. The solid tumors or small round blue cell tumors used have been of different tissues of origin, and could be differentiated easily by microscopic observations as well as by gene expression analysis. Because solving such problems is of little clinical value, such data matrices are not adequate for the benchmark test of predictors. Use of data matrices which model realistic problems is of pivotal importance for evaluation of techniques.
Conclusion
Among tasks in histopathology, differential diagnosis of thyroid tissues is a difficult one. We have developed a new multiclass predictor based on a probabilistic model using gene expression data. Tests on four types of thyroid tissues revealed its excellent performance in prediction. The novel approaches introduced in this study show promise as a means to differentiate between similar tumor types from the same tissue of origin. However, before we can draw a conclusion on their efficacy, a number of confirmatory experiments, including analysis of artificial data sets, are necessary. Nevertheless, we believe these algorithms will contribute to significant advances in the pathological diagnosis of cancer in the near future.
Methods
Patients and tumor samples
The learning set consisted of 41 FA, 20 FC, 30 PC and 28 N samples, and the test set consisted of 17 FA, 8 FC, 12 PC and 12 N samples. For the test set, we simply chose samples with the most recent surgery dates. The tumor samples were obtained from patients who underwent surgery (hemithyroidectomy or total thyroidectomy) during the period from February 1998 to December 2002. Histological diagnosis was performed by an experienced pathologist. Tumor samples obtained during surgery were snap frozen in liquid nitrogen and stored at 80°C until use. The study protocol was approved by the Institutional Review Board of Osaka University Medical School, and written informed consent was obtained from each patient.
ATACPCR assay and data processing
To select genes for ATACPCR, we first constructed four cDNA libraries; one from a mixture of five normal thyroid tissues, one from a mixture of five PC samples, one from a mixture of five FA samples and one from a mixture of four FC samples as described [33]. We then performed single pass sequencing of 6154 clones (1424 from follicular carcinoma, 1592 from papillary carcinoma, 1575 from follicular adenoma, and 1563 from normal thyroid). A total of 2383 genes were selected from this EST collection, prioritizing abundant genes. Then, 133 genes were added by experts in thyroid cancer, based on literature information. We designed PCR primers for ATACPCR reactions for these 2516 genes. The specificity of this gene selection provides an advantage over more universal gene sets, such as those selected from the UniGene database, which include genes not expressed in thyroid tissue. Total RNA was purified from thyroid tumors using Trizol reagent (GibcoBRL), and was subjected to geneexpression profiling by ATACPCR. The ATACPCR experimental procedure was performed as previously described [34]. The measurements of 2516 genes were repeated twice with ten tumor samples, and correlation coefficient of these duplicates was calculated. The correlation coefficient and corresponding pvalue was 0.92 and 0.000163, showing the sufficient reproducibility of the measurement. The raw data describing gene expression levels were divided by the median of each sample. Because the median of each sample is expected to reflect the overall mRNA level, normalization by this value corrects for variation in mRNA level from sample to sample. Values less than 0.05 and more than 20 were converted to 0.05 and 20, respectively, and subsequently the entire data matrix was converted to a logarithmic scale. The detailed protocols for the ATACPCR experimental procedure are available on our web site [35].
For the analysis, missing values were estimated and filled in by a method based on Bayesian inference [36] (see the additional files 1 and 2). In the ATACPCR reaction, the fluorescence intensity of the products is correlated with the data quality. Genes were sorted by the order of the average of the fluorescence intensity, and we selected the top 2000 genes for the following analysis.
Additional File 1. contains all gene expression data.
Format: ZIP Size: 1.8MB Download file
Additional File 2. contains explanation of the data file 1.
Format: PDF Size: 27KB Download file
This file can be viewed with: Adobe Acrobat Reader
Statistical analysis
The correlation ratio R of gene i is defined by the following equation.
where n_{c }is the number of samples in a particular class c ∈ C; x_{ij }is the expression level of gene i in sample j; _{ci }is the average expression level of gene i for samples in a particular class c ∈ C; and _{i }is the average expression level of gene i for all samples. Permutation pvalue of a correlation ratio is defined by the proportion of trials where R of randomly permutated data exceeded that of original data, among 10,000 permutation trials.
pvalues were calculated using tstatistics. qvalues were calculated using software by Dr Storey [37].
Unit binary classifiers
The data set used in this study is defined as X = {x^{(n)},t^{(n)}}_{n = 1:N}; here, x^{(n) }∈ ℜ^{D }and t^{(n) }∈ C ≡ {c_{1},...,c_{M}} denote gene expression pattern and class label, respectively, of the nth sample; D is the number of genes.
C is the set of whole classes, and 2^{c }is the labels' power set, as follows.
2^{c }≡ {{c_{1}}, {c_{2}},..., {c_{1}, c_{2}},..., {c_{1}, c_{2}, c_{3}},..., C, φ}.
We designated a pair of subsets to be classified as a "target". Consider a binary classifier trained on a binary target [lm] ∈ B where indices l, m ∈ 2^{C } {C, φ} denote two disjoint (l ∩ m = φ) subsets of class labels and B is a certain set of targets. In this study, we used some selections of B such as B_{1R}, B_{11}, B_{1A }and B_{AA }(Figure 3).
We employed the weightedvoting algorithm for the binary classification of each target. [lm]. The discriminant function f_{[lm] }(x) ∈ ℜ was defined using selected diagnostic genes with the criterion p ≤ 0.0001 for each genewise ttest.
Output function of the binary classifier
When a discriminant function f_{[lm] }(x) is determined from a learning data set X_{[lm] }= {(x^{(n)}, t^{(n)}) ∈ Xt^{(n) }∈ l ∪ m}, the binary output function of the classifier is described as follows.
In this study, we took a probabilistic approach, where q_{[lm]}(·) may be a real value of 0 ≤ q_{[lm]}(·) ≤ 1 which corresponds to a posterior probability, Pr(t ∈ lx, t ∈ l ∪ m, X_{[lm]}). The following logistic function r(x; a_{[lm]}, b_{[lm]}) can represent a mapping from the discriminant function f_{[lm]}(x) to the probability. The mapping is based on onedimensional logistic regression,
where a_{[lm] }and b_{[lm] }are determined by maximumlikelihood estimation so that r(x^{(n)}; a_{[lm]}, b_{[lm]}) for all learning data x^{(n) }∈ X[_{[lm] }best fits the corresponding labels t^{(n)}. The probabilistic output for the gene expression vector of a test sample x^{NEW},q_{[lm]}(x^{NEW}), is determined as follows.
q_{[lm]}(x^{NEW}) = r(x^{NEW}; a_{[lm]}, b_{[lm]}).
Multiclass prediction scheme
For each expression vector x, we considered a probabilistic process of selecting a single class c ∈ C as a prediction of its label.
1. Random selection of a binary target [lm] in a set B, and calculation of q_{[lm]}(x).
2. Stochastic selection of a class subset l or m, with probability q_{[lm]}(x) for l and 1  q_{[lm]}(x) for m.
3. Random selection of a single class prediction c in the above selected subset l or m. According to this process, the probability of selecting a class c ∈ C, referred to as "class probability", becomes
Here I(A) is equal to 1 if condition A holds, and zero otherwise; #l and #m denote the numbers of classes in the class subset l and m; Z is a normalization constant which guarantees . The above equation takes the summation of q_{[lm]}(x) for all targets [lm] where subset l or m contains class c, and where probability scores are shared equally to member classes of the subsets. Therefore, this is called shared summation (SHS) of probabilistic output. Subsequently, we can obtain a Bayesoptimal class prediction as .
In simple summation (SIS), we determined class probability p_{c}(x) for each class c ∈ C as a summation of q_{[lm]}(x) for all targets [lm] such that l or m is {c} itself,
where Z is a normalization constant. Because of the condition l = {c} or m = {c}, classifiers whose output has more than one label were not included in the above summation. As a result, AA and 1A are equivalent in the SIS case. Although SIS is a straightforward extension of conventional voting criterion to handle probabilistic guesses, it lacks a consistent model of decision process, such as that presents in the background of SHS.
Matlab programs to perform 1A/AASIS/SHS using the weightedvoting algorithm is available from NY.
Authors' contributions
NY and SO implemented their algorithm, and NY carried out application experiments to gene expression data sets. SO and SI developed the mathematical formulation of the multiclass classifiers proposed in this paper. KK initiated and supervised the whole project. KT and KIK did experimental parts of the thyroid cancer study. YT and SN collected thyroid tumor and normal tissues, and are responsible for clinical aspects of the study. All authors read and approved the final manuscript.
Acknowledgements
We thank Ms Noriko Ueno, Satoko MakiKinjo, Keiko MiyaokaIkegami and Mihoko Yoshino for their technical assistance. This work was supported by the Knowledge Cluster Initiative (the Keihanna Science City area) of the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References

Fassina AS, Montesco MC, Ninfo V, Denti P, Masarotto G: Histological evaluation of thyroid carcinomas: reproducibility of the "WHO" classification.
Tumori 1993, 79:314320. PubMed Abstract

Saxen E, Franssila K, Bjarnason O, Normann T, Ringertz N: Observer variation in histologic classification of thyroid cancer.

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science 1999, 286:531537. PubMed Abstract  Publisher Full Text

Baloch ZW, Fleisher S, LiVolsi VA, Gupta PK: Diagnosis of "follicular neoplasm": a gray zone in thyroid fineneedle aspiration cytology.
Diagn Cytopathol 2002, 26:4144. PubMed Abstract  Publisher Full Text

Kato K: Adaptortagged competitive PCR: a novel method for measuring relative gene expression.
Nucleic Acids Res 1997, 25:46944696. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

KitaMatsuo H, Yukinawa N, Matoba R, Oba S, Saito S, Ishii S, Kato K: Adaptortagged competitive PCR: Amplification bias and quantified gene expression levels.
Anal Biochem 2005, 339:1528. PubMed Abstract  Publisher Full Text

Muro S, Takemasa I, Oba S, Matoba R, Ueno N, Maruyama C, Yamashita R, Sekimoto M, Yamamoto H, Nakamori S, et al.: Identification of expressed genes linked to malignancy of human colorectal carcinoma by parametric clustering of quantitative expression data.
Genome Biol 2003, 4:R21. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Storey JD: A direct approach to false discovery rates.
Journal of the Royal Statistical Society, Series B 2002, 64:479498. Publisher Full Text

Schölkopf B, Smola AJ: Learning With Kernels: Support Vector Machines, Regularization, Optimization and Beyond (Adaptive Computation and Machine Learning Series). MIT Press, Cambridge, MA; 2002.

Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Jackel LD, Le Cun Y, Muller UA, Säckinger E, Simard P, Vapnik VN: Comparison of Classifier Methods: A Case Study in Handwritten Digit Recognition.
Proceedings of the 13th International Conference on Pattern Recognition 1994.

Kreeel UH: Pairwise classification and support vectormachines.
Advances in Kernel Methods – Support Vector Learning 1999, 255268.

Hastie T, Tibshirani R: Classification by Pairwise Coupling.
Advances in Neural Information Processing Systems 1998, 10:507513.

Friedman J: Another approach to polychotomous classification. In Technical report. Department of Statistics, Stanford Palo Alto, CA; 1996.

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, et al.: Multiclass cancer diagnosis using tumor gene expression signatures.
Proc Natl Acad Sci USA 2001, 98:1514915154. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kato K, Yamashita R, Matoba R, Monden M, Noguchi S, Takagi T, Nakai K: Cancer Gene Expression Database (CGED): a database for gene expression profiling and accompanying clinical information of human cancer tissues.
Nucleic Acids Res 2005, 33:D533D536. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

The SRBST data set [http://research.nhgri.nih.gov/microarray/Supplement/] webcite

Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westemann F, Berthold F, Schwab M, Antonescu CR, Oetersib C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks.
Nature Medicine 2001, 7:673679. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression.
Proc Natl Acad Sci USA 2002, 99:65676572. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for cancer classification using support vector machines.
Machine Learning 2002, 46:389422. Publisher Full Text

Kruskal JB: Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis.
Psychometrika 1964, 29:127. Publisher Full Text

Dietterich TG, Bakiri G: Errorcorrecting output codes: A general method for improving multiclass inductive learning programs.
Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI91) 1991, 572577.

Dietterich TG, Bakiri G: Solving Multiclass Learning Problems via ErrorCorrecting Output Codes.

Allwein EL, Schapire RE, Singer Y: Reducing multiclass to binary: a unifying approach for margin classifiers.
J Machine Learning Res 2001, 1:113141. Publisher Full Text

Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression.
Bioinformatics 2004, 20:24292437. PubMed Abstract  Publisher Full Text

Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis.
Bioinformatics 2005, 21:631643. PubMed Abstract  Publisher Full Text

Dettling M, Buhlmann P: Boosting for tumor classification with gene expression data.
Bioinformatics 2003, 19:10611069. PubMed Abstract  Publisher Full Text

Liu JJ, Cutler G, Li W, Pan Z, Peng S, Hoey T, Chen L, Ling XB: Multiclass cancer classification and biomarker discovery using GAbased algorithms.
Bioinformatics 2005, 21:26912697. PubMed Abstract  Publisher Full Text

Weston J, Watkins C: Multiclass support vector machines. In Technical Report. Department of Computer Science Holloway, University of London, Egham, UK; 1998.

Lee Y, Lee CK: Classification of multiple cancer types by multicategory support vector machines using gene expression data.
Bioinformatics 2003, 19:11321139. PubMed Abstract  Publisher Full Text

Hastie T, Tibshirani R, Friedman J:
The Elements of Statistical Learning: Data Mining, Inference, Prediction, Springer. 2001.

Dabney AR: Classification of microarrays to nearest centroids.
Bioinformatics 2005, 21:41484154. PubMed Abstract  Publisher Full Text

Tan Y, Shi L, Tong W, Wang C: Multiclass cancer classification by total principal component regression (TPCR) using microarray gene expression data.
Nucleic Acids Res 2005, 33:5665. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Matoba R, Kato K, Saito S, Kurooka C, Maruyama C, Sakakibara Y, Matsubara K: Gene expression in mouse cerebellum during its development.
Gene 2000, 241:125131. PubMed Abstract  Publisher Full Text

IwaoKoizumi K, Matoba R, Ueno N, Kim SJ, Ando A, Miyoshi Y, Maeda E, Noguchi S, Kato K: Prediction of docetaxel response in human breast cancer by gene expression profiling.
J Clin Oncol 2005, 23:422431. PubMed Abstract  Publisher Full Text

Authors' web site [http://genome.mc.pref.osaka.jp] webcite

Oba S, Sato MA, Takemasa I, Monden M, Matsubara K, Ishii S: A Bayesian missing value estimation method for gene expression profile data.
Bioinformatics 2003, 19:20882096. PubMed Abstract  Publisher Full Text

q value [http://faculty.washington.edu/~jstorey/qvalue/] webcite