Abstract
Motivation
Complex diseases induce perturbations to the interaction and regulation networks of living systems, resulting in dynamic equilibrium states that differ between diseases and from the normal state. Thus, identifying gene expression patterns corresponding to different equilibrium states is of great benefit to the diagnosis and treatment of complex diseases. However, it remains a major challenge to deal with the high dimensionality and small size of the complex disease gene expression datasets currently available for discovering gene expression patterns.
Results
Here we present a phase-only correlation (POC) based classification method for recognizing the type of a complex disease. First, a virtual sample template is constructed for each subclass by averaging all samples of that subclass in a training dataset. Then the label of a test sample is determined by measuring the similarity between the test sample and each template. This novel method can detect the similarity of the overall patterns that emerge from the differentially expressed genes or proteins while ignoring small mismatches.
Conclusions
The experimental results obtained on seven publicly available complex disease datasets, including microarray and protein array data, demonstrate that the proposed POC-based disease classification method is effective and robust for diagnosing complex diseases with regard to the number of initially selected features, and its recognition accuracy is better than or comparable to that of other state-of-the-art machine learning methods. In addition, the proposed method requires neither parameter tuning nor data scaling, which effectively reduces the occurrence of overfitting and bias.
Introduction
Classification and diagnostic prediction of complex diseases such as cancers and neurodegenerative diseases using genomic or proteomic data can improve the quality of pathological diagnosis and help develop personalized treatment of these diseases [1]. Although great efforts have been exerted in this field, making an early and precise diagnosis of a complex disease, followed by effective treatment, remains a great challenge. For example, histological methods cannot precisely distinguish between the subtypes of some cancers [2], even though the development of effective therapies depends on this distinction. The molecular mechanisms of many neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD) are not fully understood, and the diagnosis of these diseases relies on medical history evaluation combined with physical and neurological assessments [3,4], often after irreversible brain damage or mental decline has already occurred.
The rationale of classification and diagnostic prediction of complex diseases using genomic or proteomic data is based on the assumption that complex diseases induce perturbations to the interaction and regulation networks of living systems, resulting in dynamic equilibrium states that differ between diseases and from the normal state. Thus, identifying gene expression patterns corresponding to different equilibrium states is a key task for the success of these approaches. Many pattern recognition methods based on machine learning, such as k-nearest neighbor (KNN), support vector machines (SVM) [5-7], probabilistic neural networks (PNN) [8-10], the naive Bayes model (NBM) [11] and random forest (RF) [4,12], have been extensively explored for the classification and diagnostic prediction of complex diseases [13]. These supervised learning methods are usually called model-based methods because a classification model needs to be constructed on a training set before it can be used to predict the label of a test sample. For model-based methods, feature extraction and feature selection techniques play a vital role in improving the performance of complex disease classification because of the high dimensionality and small sample size of gene expression profile (GEP) datasets.
An example of feature extraction methods is the use of independent component analysis (ICA) to extract independent components from GEPs and thereby reduce sample dimensionality [7,14,15]. Other feature extraction methods such as principal component analysis (PCA) [6], linear discriminant analysis (LDA) [16] and locally linear discriminant embedding (LLDE) [17] have also been extensively applied to the dimensionality reduction of GEPs. Although such methods can achieve satisfactory classification performance, their biomedical interpretability and significance are weak. An example of gene selection methods is the Classification to Nearest Centroids (ClaNC) method for class-specific gene selection, which determines a gene subset of given size that maximizes the classification accuracy [18]. Although such methods have biomedical meaning, a great number of gene subsets share the same predictive performance, which can make the choice among candidate gene subsets arbitrary. In fact, each method has its drawbacks, and many factors, such as normalization, small sample size, noisy data, improper evaluation methods, and too many model parameters, can lead to overfitting of the constructed model, biased results and false discovery [19-22]. Even so, "microarrays remain a useful technology to address a wide array of biological problems and the optimal analysis of these data to extract meaningful results still pose many bioinformatics challenges" [23]. Therefore, with the increasing accumulation of GEP and protein microarray data, it is still necessary to design more effective and more biologically interpretable methods to recognize complex disease types, which is also required for clinical application.
For potential clinical applications, a candidate classification model should be evaluated on three aspects: accuracy, interpretability and practicality [18]. Accordingly, a novel method should be measured against three criteria. 1) A good model should be simple and have no or few parameters to be tuned; if parameters are necessary, the model should be robust with regard to their variation. 2) The model should achieve the best or near-optimal disease classification performance compared with the relevant state-of-the-art methods, because no classification method outperforms all others in all circumstances [23,24]. 3) The model should be clearly interpretable from a biomedical perspective, which requires that the intrinsic signatures of the sample set be used when designing the classification model.
Previous studies suggest that each complex disease type or subtype corresponds to a dynamic equilibrium state of the disease-induced genomic interaction and regulation network, and that different samples at the same state have similar gene expression profiles [25]. Thus, analyzing the similarity of gene expression profiles can in principle be used to distinguish different disease types or subtypes. A gene expression profile, which comprises the expression levels of numerous genes, can be likened to a digital image consisting of the luminance values of pixels. In fact, both microarray and protein array data originate from digital images. We therefore suggest that it is reasonable to apply image processing methods to the analysis of genomic and proteomic data. Based on this idea, we recently proposed two correlation filter-based tumor classification methods, namely minimum average correlation energy (MACE) and optimal trade-off synthetic discriminant function (OTSDF), to identify the overall pattern of differentially expressed genes (DEGs) corresponding to tumor subtypes [26]. Although the two methods perform well in classifying tumor subtypes, they have some drawbacks: 1) both methods are sensitive to the data scaling method used to standardize the data; 2) although the template synthesized for each subtype in the frequency domain can be used to characterize the corresponding subtype, the biomedical significance of the synthesized template itself is not obvious. Thus it is highly desirable to explore other correlation methods that can recognize disease types well but without the weaknesses of the MACE- and OTSDF-based disease classification methods.
Our further experiments indicate that phase-only correlation (POC) [27] may be such a method. Like the MACE and OTSDF filters, POC utilizes a fast frequency domain approach to estimate the degree of similarity between two samples. In recent years POC has been extensively applied to image recognition [28,29] and the identification of seismic events [30]. In this study, we present a novel POC-based method for complex disease classification based on virtual sample templates using genomic or proteomic data. First, we construct one template for each subclass on a training set. Sample matching is then performed by cross-correlating a test sample with each template using POC and analyzing the resulting correlation output. By comparing the peaks of the correlation outputs, the test sample can be easily assigned to the class represented by the template with the highest similarity to the test sample.
Methods
Complex disease datasets
Seven publicly available complex disease datasets are used to evaluate the proposed method in our experiments: the Leukemia1 [31], GSE29676 (http://www.ncbi.nlm.nih.gov) [3,4], Leukemia2 [32], small round blue cell tumor (SRBCT) [33], GSE5281 (http://www.ncbi.nlm.nih.gov) [34], colon tumor (Colon) [35], and GCM [36] datasets. The Leukemia1 dataset contains 72 samples from three subtypes or subclasses, i.e., MLL, AML and ALL. The GSE29676 dataset includes 50 Alzheimer's disease and 29 Parkinson's disease samples as well as 40 non-demented control samples. The Leukemia2 dataset contains 72 samples and 7,129 genes from three subclasses, i.e., AML, ALL-T and ALL-B. The GSE5281 dataset includes 71 normal samples and 87 Alzheimer's disease samples. The SRBCT dataset consists of four subclasses, i.e., Ewing's sarcoma (EWS), Burkitt's lymphoma (BL), neuroblastoma (NB), and rhabdomyosarcoma (RMS). The GCM dataset consists of fourteen different tumor types. These datasets are summarized in Table 1.
Table 1. The summary of the seven complex disease datasets.
Both protein and DNA microarray data can be represented as matrices, so we use DNA microarray data as an example to describe the design of our method. Let $G = \{g_1, g_2, \ldots, g_N\}$ denote a set of $N$ genes, and let $S = \{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_M\}$ denote a set containing $M$ samples, where $\mathbf{s}_i \in \mathbb{R}^N$ denotes the gene expression column vector of the $i$th sample on all $N$ features. Each sample $\mathbf{s}_i$ is assigned a label $c_k$ denoting the $k$th subclass set $S_k$, $k = 1, 2, \ldots, K$, where $K$ is the total number of subclasses, $k$ is the index of the subclass with label $c_k$, and $n_k$ represents the number of samples with the same label $c_k$.
Flowchart of analysis
POC allows us to evaluate the similarity of disease samples in the frequency domain based on their GEPs. Figure 1 shows the flowchart of the proposed method for predicting the type of a disease sample. This method is essentially a special case of the 1-NN classification method with just one virtual sample per subclass in the training set. The procedure involves the following steps:
Figure 1. The flowchart of applying POC analysis to identify disease types. 1D DFT: one-dimensional discrete Fourier transform; 1D IDFT: one-dimensional inverse discrete Fourier transform; Template k denotes the kth template constructed using the training set.
1) The entire sample set is randomly split into two disjoint parts: a training set and a test set. We then select a certain number of DEGs or differentially expressed proteins (DEPs) using the Kruskal-Wallis rank sum test (KWRST) method [37].
2) A virtual sample template for each subclass of the training set is constructed by averaging all samples in the subclass. The $j$th component of the virtual sample template $\mathbf{t}_k$ for subclass $S_k$ is $t_k(j) = \frac{1}{n_k} \sum_{\mathbf{s}_i \in S_k} s_i(j)$, the mean expression of the $k$th subclass in the training set for feature $g_j$. Thus the concept of the virtual sample template is the same as that of the centroid proposed in [38].
3) The POC function is calculated between each virtual sample template and a test sample, both of which are first transformed using the one-dimensional discrete Fourier transform (1D DFT). The similarity between each virtual sample template and a given test sample is evaluated using the peak value of the POC output. The formalized representation of a test sample $\mathbf{s}$ matching with all templates is denoted by

$r_k(n) = \mathcal{F}^{-1}\left\{ P\left[ \mathcal{F}(\mathbf{s}) \, \overline{\mathcal{F}(\mathbf{t}_k)} \right] \right\}, \quad k = 1, 2, \ldots, K,$

where $\mathcal{F}$, $r_k$, $P[\cdot]$, and $\mathcal{F}^{-1}$ denote the discrete Fourier transform (DFT), the POC function, the operator taking only the phase, and the inverse discrete Fourier transform (IDFT), respectively. Thus the peak vector $\mathbf{p} = (p_1, p_2, \ldots, p_K)$ of the test sample matching with all templates can be calculated by

$p_k = \max_{n} r_k(n), \quad k = 1, 2, \ldots, K.$
4) The highest POC peak is used to determine the label of the test sample $\mathbf{s}$; that is, the label of the test sample is assigned by

$c^{*} = \arg\max_{k} p_k.$
If we adopt a square matrix to represent a sample instead of its vector form, we can analyze a sample set using two-dimensional POC (2D POC) to identify disease types. The flowchart of the 2D POC analysis method is very similar to that of 1D POC shown in Figure 1; the only difference is that the 1D DFT and 1D IDFT in Figure 1 are replaced with the 2D DFT and 2D IDFT, respectively. In fact, a sample vector can easily be converted into a square matrix, assuming that its length is a square number.
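The steps above can be sketched in a few lines of numpy. This is a minimal illustration on synthetic data, not the authors' code: the toy dataset and function names are our own constructions, and the KWRST feature-selection step is omitted for brevity.

```python
import numpy as np

def build_templates(X, y):
    """Step 2: one virtual sample template (centroid) per subclass,
    obtained by averaging all training samples in that subclass."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def poc_peak(f, g):
    """Step 3: peak value of the 1D phase-only correlation between a
    test sample f and a template g."""
    cross = np.fft.fft(f) * np.conj(np.fft.fft(g))
    r = np.fft.ifft(cross / np.maximum(np.abs(cross), 1e-12))
    return float(np.max(np.real(r)))

def poc_classify(x, classes, templates):
    """Step 4: assign the label whose template yields the highest peak."""
    peaks = [poc_peak(x, t) for t in templates]
    return int(classes[int(np.argmax(peaks))])

# toy data: two subclasses, each a noisy copy of its own base pattern
rng = np.random.default_rng(0)
base = rng.normal(size=(2, 64))
X = np.vstack([base[c] + 0.1 * rng.normal(size=64)
               for c in (0, 0, 0, 0, 0, 1, 1, 1, 1, 1)])
y = np.array([0] * 5 + [1] * 5)
classes, templates = build_templates(X, y)
test = base[1] + 0.1 * rng.normal(size=64)
print(poc_classify(test, classes, templates))  # prints 1
```

Because each template is just the subclass mean, a noisy test sample shares its phase spectrum with the correct template far more closely than with any other, so the correct peak dominates.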
Phase-only correlation
We adopt both the 1D POC and 2D POC methods to analyze disease samples. Here we only give the mathematical description of 1D POC; the principle of 2D POC can be found in the literature [28]. Given two samples $f(n)$ and $g(n)$, we assume that their index range is $n = -M, \ldots, M$ for mathematical simplicity, where $M$ is an integer and the sample length is $L = 2M + 1$. Let $F(u)$ and $G(u)$ denote the one-dimensional discrete Fourier transforms (1D DFTs) of the two samples $f(n)$ and $g(n)$, respectively. They are given by

$F(u) = \sum_{n=-M}^{M} f(n) \, e^{-j 2\pi n u / L} = A_F(u) \, e^{j\theta_F(u)},$

$G(u) = \sum_{n=-M}^{M} g(n) \, e^{-j 2\pi n u / L} = A_G(u) \, e^{j\theta_G(u)},$

where $u = -M, \ldots, M$ and $j = \sqrt{-1}$. $A_F(u)$ and $A_G(u)$ are amplitude components, and $\theta_F(u)$ and $\theta_G(u)$ are phase components. The cross-phase spectrum $R_{FG}(u)$ is defined as

$R_{FG}(u) = \frac{F(u) \, \overline{G(u)}}{\left| F(u) \, \overline{G(u)} \right|} = e^{j\theta(u)},$

where $\overline{G(u)}$ denotes the complex conjugate of $G(u)$ and $\theta(u) = \theta_F(u) - \theta_G(u)$ denotes the phase difference. Only the phase information is utilized while the amplitude is discarded, because phase information is significantly more important than amplitude information in preserving the properties of the intrinsic pattern [27]. Thus the 1D inverse DFT (1D IDFT) of $R_{FG}(u)$ is given by

$r_{fg}(n) = \frac{1}{L} \sum_{u=-M}^{M} R_{FG}(u) \, e^{j 2\pi n u / L},$

where $r_{fg}(n)$ is the 1D POC function between $f(n)$ and $g(n)$, and its value ranges from 0 to 1. The correlation peak value of $r_{fg}(n)$ provides a measure of the similarity between the two samples: the larger the peak value, the more similar the two samples, and vice versa. The peak value decreases as the noise in a test sample and the constructed templates increases [28]. Thus high-level noise in samples may degrade the accuracy of prediction.
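These formulas transcribe almost directly into numpy. Note that numpy's FFT indexes from 0 rather than from -M to M; the circular-shift demonstration below is our own illustration of POC's delta-like output, not an example from the paper.

```python
import numpy as np

def poc_1d(f, g, eps=1e-12):
    """1D phase-only correlation r_fg(n): keep only the phase difference
    theta_F(u) - theta_G(u) of the cross spectrum, discard amplitudes."""
    R = np.fft.fft(f) * np.conj(np.fft.fft(g))
    R = R / np.maximum(np.abs(R), eps)  # cross-phase spectrum e^{j theta(u)}
    return np.real(np.fft.ifft(R))      # r_fg(n)

# identical samples give a delta-like output with peak value ~1;
# a circular shift moves the peak's position but keeps its height
rng = np.random.default_rng(0)
x = rng.normal(size=128)
r = poc_1d(np.roll(x, 5), x)
print(int(np.argmax(r)))  # prints 5: the peak position recovers the shift
```

The sharp, translation-revealing peak is exactly why POC is popular in image registration; for expression profiles only the peak height (similarity) is used.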
For comparison with the template-based POC method, we also design a POC-1D-KNN method that utilizes 1D POC to measure the similarity of two samples and applies the 5-nearest-neighbor (5-NN) rule to predict the label of a test sample.
Experimental methods
Although the proposed method has no tunable parameters, the number of preselected features and the division of the data into training and test sets can still affect classification performance. To obtain objective results, the Balance Division Method (BDM) is used to divide each original dataset into balanced training and test sets [26]. In the BDM, $n_{tr}$ samples from each subclass of the original dataset are randomly selected and used as the training set, while the remaining samples are used as the test set. For example, if we set $n_{tr}$ to 5 for the SRBCT dataset, then 5 samples per subclass are randomly selected, that is, $5 \times 4 = 20$ samples are used as the training set and the remaining samples form the test set. Considering that 2D POC requires the number of selected features to be a square number, we select a square number of features using KWRST to evaluate the performance of the POC method, because the number of genes or proteins related to a complex disease is unknown and likely differs from one disease to another.
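Under the assumptions stated above, the BDM split can be sketched as follows (a minimal implementation of our own, not the authors' code; the toy dataset is illustrative):

```python
import numpy as np

def balanced_division(X, y, n_tr, rng):
    """Balance Division Method (BDM): draw n_tr samples at random from
    every subclass as the training set; the rest form the test set."""
    train = []
    for c in np.unique(y):
        train.extend(rng.choice(np.flatnonzero(y == c), n_tr, replace=False))
    mask = np.zeros(len(y), dtype=bool)
    mask[np.array(train)] = True
    return X[mask], y[mask], X[~mask], y[~mask]

rng = np.random.default_rng(0)
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.repeat([0, 1, 2, 3], 5)          # 4 subclasses, 5 samples each
Xtr, ytr, Xte, yte = balanced_division(X, y, 3, rng)
print(len(ytr), len(yte))  # prints: 12 8
```

Drawing the same number of training samples from every subclass keeps the per-class templates comparable, which matters for the small, imbalanced datasets typical of this field.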
Results
Visualization of experimental results
The results of 1D POC and 2D POC can be represented visually. Taking the SRBCT dataset as an example, the 1D cross-correlation coefficients calculated by matching a test sample (belonging to the EWS subtype) with the four templates corresponding to the four subtypes of the SRBCT dataset are shown in Figure 2(a)-(d), respectively. Their matching peak values are 0.7213, 0.2153, 0.2154, and 0.1889, respectively, so the test sample can be correctly assigned to the EWS subtype based on these values. Figure 3 shows the 2D cross-correlation coefficients calculated for a test sample (belonging to the EWS subtype) compared with the four templates of the SRBCT dataset. Their matching peak values are 0.7118, 0.1971, 0.1487, and 0.1471, respectively. Thus this test sample can also be easily assigned to the EWS subtype.
Figure 2. 1D cross-correlation coefficients calculated by matching a test sample (belonging to EWS) with the four templates of the SRBCT dataset. (a) EWS subtype. (b) BL. (c) NB. (d) RMS.
Figure 3. 2D cross-correlation coefficients calculated by matching a test sample (belonging to the EWS subtype) with the four templates of SRBCT. (a) EWS subtype. (b) BL. (c) NB. (d) RMS.
The separability of all test samples (belonging to the same subclass) matched against each template can be visualized in plots, from which we can determine visually which test samples are correctly or incorrectly classified. For example, Figure 4, obtained using POC on a training set selected randomly with 5 samples per subclass from the SRBCT dataset, shows the separability of the four subtypes of the SRBCT dataset in four subplots. In each subplot, the horizontal axis denotes the sequence number of the test samples within the same subclass, and the vertical axis denotes the similarity degree (peak value) obtained by matching the test samples with each template. For clarity, in each subplot we connect the points belonging to the same subclass to illustrate the separability of the different subclasses. Figure 4 clearly shows that all test samples in two subtypes (BL and NB) are correctly classified, but the classification of the EWS subtype is not perfect.
Figure 4. The separability of all test samples in the SRBCT dataset. (a) The separability of all test samples belonging to EWS subtype. (b) BL. (c) NB. (d) RMS.
Comparison with other methods
Comparison with MACE method
Because OTSDF has one tunable parameter and the performance of OTSDF and MACE is almost equivalent, for a fairer comparison we compare the performance of POC only with that of MACE. Like POC, MACE is a nonparametric method that has shown good performance in recognizing tumor subtypes [26]. However, the performance of MACE is sensitive to the data scaling method used to standardize the data; POC does not require data scaling and thus avoids this problem. We fix the number of selected genes and assess how classification performance varies with the sample size of the training set. Figure 5 compares the performance of POC and MACE on six disease datasets, where each original dataset is divided into a balanced training set and a test set using the BDM, with the number of training samples per subclass varying from 5 up to a dataset-dependent maximum of at most 25. The comparison shows that POC outperforms MACE in predictive accuracy on all six datasets, with two exceptions: on the GSE29676 dataset, MACE is clearly superior to POC when the number of training samples exceeds 12, and on the SRBCT dataset, MACE is slightly superior to POC when the number of training samples is less than 7.
Figure 5. The comparison of performance between the POC and MACE methods on the six complex disease datasets under the condition of no normalization. (a) The performance comparison of POC and MACE on the GSE29676, Leukemia2, and SRBCT datasets; (b) The performance comparison of POC and MACE on the Leukemia1, GSE5281, and Colon datasets.
Comparison with other model-based methods
Since the template-based POC method can be used to build classifiers, we compare it with other state-of-the-art model-based classification algorithms, including NBM, KNN, PNN, and SVM. For KNN, we set its $k$ to 5 and adopt the correlation distance (one minus the sample correlation between points) as the measure between two samples, where the correlation distance $d$ between samples $\mathbf{x}$ and $\mathbf{y}$ is computed by the following formula:

$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{(\mathbf{x} - \bar{x}\mathbf{1})^{\top} (\mathbf{y} - \bar{y}\mathbf{1})}{\left\| \mathbf{x} - \bar{x}\mathbf{1} \right\|_2 \, \left\| \mathbf{y} - \bar{y}\mathbf{1} \right\|_2},$

where $\bar{x}$ and $\bar{y}$ are the means of the components of $\mathbf{x}$ and $\mathbf{y}$, and $\mathbf{1}$ is the all-ones vector.
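The correlation distance can be computed with a small helper of our own construction (not the authors' implementation):

```python
import numpy as np

def correlation_distance(x, y):
    """d(x, y) = 1 - Pearson sample correlation of the two vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return 1.0 - float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

a = np.array([1.0, 2.0, 3.0, 4.0])
print(correlation_distance(a, 2 * a + 1))  # ~0.0: perfectly correlated
print(correlation_distance(a, -a))         # ~2.0: perfectly anti-correlated
```

Because the distance is invariant to shifting and positive rescaling of either vector, KNN with this measure is insensitive to per-sample offsets in expression level, which is one reason it is a reasonable baseline here.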
For PNN, there is a smoothing parameter $\sigma$ to be tuned. To determine the optimal value, 5-fold cross-validation (5-fold CV) is performed by taking $\sigma$ from 0 to 1 in steps of 0.1 on each training set divided randomly from the original dataset using the BDM; the optimal $\sigma$ is the one with the best 5-fold CV performance. For SVM, the radial basis function (RBF) kernel is used as the kernel function, which has two parameters, $C$ and $\gamma$, to be tuned. We use 5-fold CV on the training set to determine the optimal combination of $C$ and $\gamma$ by grid search over candidate values of both parameters. Because SVM requires data scaling, each dataset is standardized to zero mean and unit variance; therefore, to obtain a fairer comparison, data scaling is performed before classification.
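The tuning described above amounts to a grid search wrapped in 5-fold CV. Below is a generic sketch of our own; `make_centroid_clf` is a hypothetical stand-in for an actual RBF-SVM (it ignores C and gamma), used only to keep the example dependency-free.

```python
import numpy as np
from itertools import product

def five_fold_cv_score(X, y, fit_predict, rng):
    """Mean accuracy of a classifier over a random 5-fold split."""
    folds = np.array_split(rng.permutation(len(y)), 5)
    accs = []
    for k in range(5):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(5) if j != k])
        accs.append(np.mean(fit_predict(X[tr], y[tr], X[te]) == y[te]))
    return float(np.mean(accs))

def grid_search(X, y, Cs, gammas, make_classifier, rng):
    """Pick the (C, gamma) pair with the best 5-fold CV accuracy."""
    best = None
    for C, gamma in product(Cs, gammas):
        score = five_fold_cv_score(X, y, make_classifier(C, gamma), rng)
        if best is None or score > best[0]:
            best = (score, C, gamma)
    return best

# hypothetical stand-in classifier: nearest class centroid
def make_centroid_clf(C, gamma):
    def fit_predict(Xtr, ytr, Xte):
        cls = np.unique(ytr)
        cents = np.array([Xtr[ytr == c].mean(axis=0) for c in cls])
        d = ((Xte[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        return cls[np.argmin(d, axis=1)]
    return fit_predict

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (15, 4)), rng.normal(4, 1, (15, 4))])
y = np.array([0] * 15 + [1] * 15)
best = grid_search(X, y, [1, 10, 100], [0.01, 0.1], make_centroid_clf, rng)
print(best)  # (cv_accuracy, best_C, best_gamma); accuracy near 1.0 here
```

Note the contrast with POC itself: the grid search multiplies the training cost by the number of parameter combinations and is itself a potential source of overfitting on small training sets, which is the paper's argument for a parameter-free method.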
Because the performance of a model is sensitive to the division of data into training and test sets, we repeat the procedure 200 times using randomly divided training and test sets and report the mean of the 200 predictive accuracies for each method (Figure 6). Figure 6 shows the performance of the seven methods with regard to different numbers of training samples per subclass. Both POC-1D and POC-2D perform well and are slightly superior to POC-1D-KNN except on the GSE5281 dataset. Overall, our methods achieve optimal or near-optimal performance.
Figure 6. The performance of eight methods varying with the number of training samples per subclass on the six datasets.
We then fix the number of training samples per subclass to 8, use different numbers of selected features, and study the performance of the models on each dataset (Figure 7). The results show that the performance of our method is very robust with regard to the number of features. KNN slightly outperforms our methods only on the Leukemia2 and GSE5281 datasets, but it is clearly inferior to our methods on the GSE29676 and SRBCT datasets. We have also studied the effect of using other feature filter methods, such as the t-test instead of KWRST, and the experimental results indicate that the choice of feature filter has little effect on the performance of the POC-based method.
Figure 7. The performance of eight methods varying with the number of features on the six datasets.
Comparison with feature extraction-based methods
Because of the high dimensionality of the datasets, feature extraction is often used to reduce dimensionality before classification, and it plays a crucial role in simplifying the classification model and improving classification performance. Here we compare our method with five dimensionality reduction methods, i.e., PCA, LDA, ICA, LLDE, and LPP, which are extensively applied to the classification of complex diseases. Our previous study suggests that when the number of extracted features is small enough, the prediction accuracy depends little on the classification method [39]. Thus we adopt the simplest classification method, k-nearest neighbor (KNN) with correlation distance, to classify disease samples, and fix its $k$ at 5.
To avoid overfitting, before classification we extract only 5 features using these feature extraction methods, except for LDA, for which the number of extracted features is limited by the number of subclasses (at most $K-1$). Because these feature extraction methods require data normalization, each dataset is sample-wise normalized to zero mean and unit variance after feature selection. We refer to these combined methods as PCA-KNN, LDA-KNN, ICA-KNN, LLDE-KNN, and LPP-KNN, respectively. To further validate the effectiveness of our method on a multiclass dataset, we select the GCM dataset with 14 different tumor types to evaluate its performance. Figure 8 shows the performance of eight methods with regard to different numbers of training samples per subclass. The results indicate that the performance of POC-1D and POC-2D is almost the same and slightly superior to that of POC-1D-KNN. Although LDA-KNN outperforms POC on the GCM dataset, POC outperforms LDA-KNN on the Colon dataset. Compared with the five feature extraction-based methods, our method achieves optimal or near-optimal performance and has clear biomedical meaning. Furthermore, for each dataset we also fix the number of training samples per subclass to 8 and study how performance varies with the number of selected features, as shown in Figure 9. The results indicate that the performance of these methods is robust with regard to the number of genes, and that our method achieves the best or near-optimal performance except against LDA-KNN on the GCM dataset. In conclusion, our novel method is very effective and obtains the best or near-optimal performance.
Figure 8. The performance of eight methods varying with the number of training samples per subclass on the six datasets.
Figure 9. The performance of eight methods varying with the number of features on the six datasets.
Permutation assessment
To further assess the reliability of the proposed method, we calculate the label permutation-based p-values [40] on the six datasets. For each dataset we fix the number of training samples in each subclass to 8 and fix the number of initially selected features. First we perform $R$ randomizations of training and test sets; then for each randomization we randomly permute the labels of the test set $B$ times while keeping the labels of the training set unchanged. Therefore $R \times B$ predictive accuracies are obtained, denoted as a matrix $P \in \mathbb{R}^{R \times B}$. For each randomization $i$, the $B$ predictive accuracies are averaged, denoted as $\bar{p}_i$. The final mean can be calculated by $\bar{p} = \frac{1}{R} \sum_{i=1}^{R} \bar{p}_i$. The p-value can be calculated by

$p = \frac{\left| \left\{ D' \in \hat{D} : \mathrm{acc}(D') \geq \mathrm{acc}(D) \right\} \right| + 1}{\left| \hat{D} \right| + 1},$

where $\hat{D}$ denotes the set of label-permuted versions of the original dataset $D$ and $\mathrm{acc}(D')$ denotes the predictive accuracy obtained using POC on dataset $D'$; $\mathrm{acc}(D)$ denotes the mean of the 200 predictive accuracies obtained using POC on the 200 randomizations of training and test sets of the original dataset. Table 2 shows the results of the permutation tests using the template-based POC-1D and POC-2D methods on the six datasets. The obtained classification performance is clearly reliable because the p-values are very small. The mean predictive accuracy under label permutation for each dataset is close to $1/K$, where $K$ is the number of subclasses in the dataset, except for the Leukemia2, SRBCT and GCM datasets (for the Leukemia2 and SRBCT datasets some subclasses have only one sample in the test set, and for the GCM dataset many subclasses have only a few test samples), indicating that no bias occurred in the obtained results [22].
Table 2. Permutation tests with POC1D and POC2D on the six datasets.
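Once the permuted accuracies are collected, the p-value itself is a one-liner. The sketch below uses synthetic chance-level accuracies of our own, not the paper's actual values; the add-one smoothing follows the standard empirical permutation-test convention.

```python
import numpy as np

def permutation_p_value(acc_observed, perm_accs):
    """Empirical p-value: the fraction of label-permuted accuracies that
    reach or exceed the observed accuracy, with add-one smoothing."""
    perm_accs = np.asarray(perm_accs)
    return (np.sum(perm_accs >= acc_observed) + 1) / (len(perm_accs) + 1)

# toy illustration: 99 chance-level permuted accuracies vs observed 0.95
rng = np.random.default_rng(0)
perm = rng.uniform(0.2, 0.4, size=99)
print(permutation_p_value(0.95, perm))  # prints 0.01
```

The add-one term guarantees the p-value is never exactly zero, reflecting that only a finite number of permutations were drawn.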
Discussion
Data scaling or normalization is a very important preprocessing step for many machine learning algorithms that are sensitive to the numeric ranges of attributes. Several scaling methods are widely used, such as the Z-score method, which transforms data to zero mean and unit variance, and the 0-1 (min-max) scaling method, which transforms data into the range between 0 and 1. Currently it is difficult to predict the best data scaling method for a given dataset [41], and there is no clear standard criterion for evaluating the various scaling methods [42]. Besides, information such as dynamic range might be lost during data scaling. The proposed method is therefore advantageous over methods that demand a scaling step, because it does not require data scaling.
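As a concrete illustration of the two scaling methods named above, the following minimal numpy sketch (ours, not from the paper) shows which properties each transform guarantees:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

z = (x - x.mean()) / x.std()              # Z-score: zero mean, unit variance
m = (x - x.min()) / (x.max() - x.min())   # 0-1 scaling: values in [0, 1]

print(float(m.min()), float(m.max()))  # prints: 0.0 1.0
```

Both transforms discard the original dynamic range (the raw minimum, maximum, and spread), which is exactly the information loss noted above; POC sidesteps the choice entirely because only the phase spectrum of a sample is used.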
In the present study, we construct the template of each subtype using the means of the data points in the training dataset. The results demonstrate that this approach is reasonable and achieves good performance. Nevertheless, there are certainly other ways to construct templates. For example, medians, instead of means, are another possible choice that might be more suitable for data that are not normally distributed. In the present study we tested medians but found no significant difference from means (data not shown), so only the results using means are reported.
Conclusions
A POC-based method is reported as a new technique for identifying similar gene expression signatures among the differentially expressed genes or proteins. By measuring the similarity between a test sample and the virtual sample template constructed on the training set for each subclass, the label of the test sample can be easily determined. Applying the POC-based classification method to six complex disease datasets shows that this novel method is feasible, efficient and robust. Compared with five state-of-the-art classification algorithms and five feature extraction-based methods, the proposed method achieves optimal or near-optimal classification accuracy.
Our methods can detect the similarity of the overall pattern while ignoring small mismatches between a given test sample and the templates, because correlation filters are based on an integration operation. Compared with the MACE and OTSDF methods, POC is not sensitive to data scaling methods; the experimental results show that the POC-based method can achieve satisfactory results even without scaling the data. Moreover, there is no parameter to be tuned in POC, so this method can easily avoid the overfitting problem as well as the effects of the curse of dimensionality. One possible drawback of this novel method is that high-level noise in the templates can suppress the output peak. Our future work will focus on exploring novel methods for constructing more representative templates to further improve predictive accuracy.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors contributed to the design of the project, the interpretation of the results, and the drafting and production of the manuscript.
Acknowledgements
We sincerely thank Dr. YiHai Zhu (University of Rhode Island) for the discussion on the application of phaseonly correlation method. This work was supported in part by the National Institutes of Health (NIH) Grant P01 AG12993 (PI: E. Michaelis) and the National Science Foundation of China (grant nos. 60973153, 61133010, 31071168, 60873012).
Declarations
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 8, 2013: Proceedings of the 2012 International Conference on Intelligent Computing (ICIC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S8.
References
1. Karley D, Gupta D, Tiwari A: Biomarkers: the future of medical science to detect cancer. Molecular Biomarkers & Diagnosis 2011, 2(5):118.
2. Chan WC, Armitage JO, Gascoyne R, Connors J, Close P, Jacobs P, Norton A, Lister TA, Pedrinis E, Cavalli F, et al.: A clinical evaluation of the International Lymphoma Study Group classification of non-Hodgkin's lymphoma. Blood 1997, 89(11):3909-3918.
3. Han M, Nagele E, DeMarshall C, Acharya N, Nagele R: Diagnosis of Parkinson's Disease Based on Disease-Specific Autoantibody Profiles in Human Sera. PLoS One 2012, 7(2).
4. Nagele E, Han M, DeMarshall C, Belinka B, Nagele R: Diagnosis of Alzheimer's Disease Based on Disease-Specific Autoantibody Profiles in Human Sera. PLoS One 2011, 6(8).
5. Wang L, Zhu J, Zou H: Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 2008, 24(3):412-419.
6. Wang SL, Wang J, Chen HW, Zhang BY: SVM-based tumor classification with gene expression data. Advanced Data Mining and Applications, Proceedings 2006, 4093:864-870.
7. Huang DS, Zheng CH: Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics 2006, 22(15):1855-1862.
8. Wang SL, Li XL, Zhang SW, Gui J, Huang DS: Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction. Comput Biol Med 2010, 40(2):179-189.
9. Huang DS: A constructive approach for finding arbitrary roots of polynomials by neural networks. IEEE Trans Neural Netw 2004, 15(2):477-491.
10. Huang DS: Radial basis probabilistic neural networks: model and application. International Journal of Pattern Recognition and Artificial Intelligence 1999, 13(7):1083-1101.
11. Demichelis F, Magni P, Piergiorgi P, Rubin MA, Bellazzi R: A hierarchical Naive Bayes model for handling sample heterogeneity in classification problems: an application to tissue microarrays. BMC Bioinformatics 2006, 7.
12. Boulesteix AL, Porzelius C, Daumer M: Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics 2008, 24(15):1698-1706.
13. Zheng CH, Zhang L, Ng VT, Shiu SC, Huang DS: Molecular pattern discovery based on penalized matrix decomposition. IEEE/ACM Trans Comput Biol Bioinform 2011, 8(6):1592-1603.
14. Zheng CH, Chen Y, Li XX, Li YX, Zhu YP: Tumor classification based on independent component analysis. International Journal of Pattern Recognition and Artificial Intelligence 2006, 20(2):297-310.
15. Huang DS, Mi JX: A new constrained independent component analysis method. IEEE Trans Neural Netw 2007, 18(5):1532-1535.
16. Sharma A, Paliwal KK: Cancer classification by gradient LDA technique using microarray gene expression data. Data Knowl Eng 2008, 66(2):338-347.

Li B, Zheng CH, Huang DS, Zhang L, Han K: Gene expression data classification using locally linear discriminant embedding.
Computers in Biology and Medicine 2010, 40(10):802810. PubMed Abstract  Publisher Full Text

Dabney AR: Classification of microarrays to nearest centroids.
Bioinformatics 2005, 21(22):41484154. PubMed Abstract  Publisher Full Text

Ransohoff DF: Rules of evidence for cancer molecularmarker discovery and validation.
Nature Reviews Cancer 2004, 4(4):309314. PubMed Abstract  Publisher Full Text

Ransohoff DF: Bias as a threat to the validity of cancer molecularmarker research.
Nature Reviews Cancer 2005, 5(2):142149. PubMed Abstract  Publisher Full Text

Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification.
Journal of the National Cancer Institute 2003, 95(1):1418. PubMed Abstract  Publisher Full Text

Wood IA, Visscher PM, Mengersen KL: Classification based upon gene expression data: bias and precision of error rates.
Bioinformatics 2007, 23(11):13631370. PubMed Abstract  Publisher Full Text

Rocke DM, Ideker T, Troyanskaya O, Quackenbush J, Dopazo J: Papers on normalization, variable selection, classification or clustering of microarray data.
Bioinformatics 2009, 25(6):701702. Publisher Full Text

Wolpert DH, Macready WG: Coevolutionary free lunches.
Ieee T Evolut Comput 2005, 9(6):721735. Publisher Full Text

Chen LN, Liu R, Liu ZP, Li MY, Aihara K: Detecting earlywarning signals for sudden deterioration of complex diseases by dynamical network biomarkers.
Sci RepUk 2012., 2 PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang SL, Zhu YH, Jia W, Huang DS: Robust Classification Method of Tumor Subtype by Using Correlation Filters.
IEEEAcm Transactions on Computational Biology and Bioinformatics 2012, 9(2):580591. PubMed Abstract  Publisher Full Text

Horner JL, Gianino PD: PhaseOnly Matched Filtering.
Applied Optics 1984, 23(6):812816. PubMed Abstract  Publisher Full Text

Ito K, Nakajima H, Kobayashi K, Aoki T, Higuchi T: A fingerprint matching algorithm using phaseonly correlation.
Ieice Transactions on Fundamentals of Electronics Communications and Computer Sciences 2004, E87A(3):682691.

Shibaharaa T, Aoki T, Nakajima H, Kobayashi K: A highaccuracy stereo correspondence technique using 1D bandlimited phaseonly correlation.
Ieice Electron Expr 2008, 5(4):125130. Publisher Full Text

Moriya H: Phaseonly correlation of timevarying spectral representations of microseismic data for identification of similar seismic events.
Geophysics 2011, 76(6):Wc37Wc45. Publisher Full Text

Armstrong SA, Staunton JE, Silverman LB, Pieters R, de Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia.
Nat Genet 2002, 30(1):4147. PubMed Abstract  Publisher Full Text

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring.
Science 1999, 286(5439):531537. PubMed Abstract  Publisher Full Text

Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, et al.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks.
Nature Medicine 2001, 7(6):673679. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liang WS, Reiman EM, Valla J, Dunckley T, Beach TG, Grover A, Niedzielko TL, Schneider LE, Mastroeni D, Caselli R, et al.: Alzheimer's disease is associated with reduced expression of energy in posterior cingulate metabolism genes neurons.
Proceedings of the National Academy of Sciences of the United States of America 2008, 105(11):44414446. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
Proceedings of the National Academy of Sciences of the United States of America 1999, 96(12):67456750. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, et al.: Multiclass cancer diagnosis using tumor gene expression signatures.
Proceedings of the National Academy of Sciences of the United States of America 2001, 98(26):1514915154. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Deng L, Ma JW, Pei J: Rank sum method for related gene selection and its application to tumor diagnosis.

Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression.
Proceedings of the National Academy of Sciences of the United States of America 2002, 99(10):65676572. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang SL, You HZ, Lei YK, Li XL: Performance Comparison of Tumor Classification Based on Linear and Nonlinear Dimensionality Reduction Methods.
Advanced Intelligent Computing Theories and Applications 2010, 6215:291300. Publisher Full Text

Ojala M, Garriga GC: Permutation Tests for Studying Classifier Performance.

Chua SW, Vijayakumar P, Nissom PM, Yam CY, Wong VVT, Yang H: A novel normalization method for effective removal of systematic variation in microarray data.
Nucleic Acids Research 2006., 34(5) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gold DL, Wang J, Coombes KR: Intergene correlation on oligonucleotide arrays  How much does normalization matter?
Am J Pharmacogenomic 2005, 5(4):271279. PubMed Abstract