This article is part of the supplement: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2008
A kernel-based approach for detecting outliers of high-dimensional biological data
Department of Computer Science and Engineering, The University of Texas, Arlington, Texas, USA
BMC Bioinformatics 2009, 10(Suppl 4):S7 doi:10.1186/1471-2105-10-S4-S7
Published: 29 April 2009
Biomedical data sets often contain outliers that make reliable knowledge discovery difficult. Analyzing data without first removing outliers can lead to wrong results and misleading conclusions.
We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. KL divergence was originally designed as a measure of the distance between two distributions. We extend it to outlier detection in biological samples by forming sample sets composed of nearest neighbors: the KL divergence is computed between two sample sets, one with and one without the test sample. To handle non-linearity in the sample distribution, the original data are mapped into a higher-dimensional feature space. We address the singularity problem that arises from the small sample size during KL divergence calculation. Kernel functions are applied to avoid direct use of the mapping functions. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set from a liver cancer study. Comparative studies with a Mahalanobis distance-based method and a one-class support vector machine (SVM) are reported, showing that the proposed method performs better in finding outliers.
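The core idea of scoring a sample by the KL divergence between its neighbor set with and without that sample can be illustrated with a minimal sketch. This is not the implementation described in the paper: it works directly in the input space rather than a kernel-induced feature space, it assumes each sample set is summarized by a Gaussian, it takes the divergence in the direction "with" relative to "without", and it uses a simple ridge term as a stand-in remedy for singular covariance matrices. The function names and the parameters `k` and `ridge` are hypothetical.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for multivariate Gaussians."""
    d = mu0.shape[0]
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    maha_term = diff @ np.linalg.solve(cov1, diff)
    return 0.5 * (trace_term + maha_term - d + logdet1 - logdet0)

def kl_outlier_scores(X, k=10, ridge=1e-3):
    """Score each row of X by the KL divergence between its k nearest
    neighbors with and without the row itself (assumed Gaussian model)."""
    n, d = X.shape
    # Pairwise squared Euclidean distances between all samples.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

    def fit(S):
        # Mean and ridge-regularized covariance; the ridge keeps the
        # matrix invertible when there are fewer samples than features.
        mu = S.mean(axis=0)
        cov = np.cov(S, rowvar=False, bias=True) + ridge * np.eye(d)
        return mu, cov

    scores = np.empty(n)
    for i in range(n):
        order = np.argsort(sq_dist[i])
        neighbors = order[order != i][:k]           # k nearest, excluding i
        without = X[neighbors]                      # neighbor set only
        with_i = np.vstack([without, X[i]])         # neighbor set plus test sample
        mu_w, cov_w = fit(with_i)
        mu_wo, cov_wo = fit(without)
        scores[i] = gaussian_kl(mu_w, cov_w, mu_wo, cov_wo)
    return scores
```

Under these assumptions, a sample that lies far from its nearest neighbors shifts the mean and inflates the covariance of the "with" set, so it receives a large score, whereas a sample embedded in its neighborhood changes the estimated distribution only slightly.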
Our idea was derived from the Markov blanket algorithm, a feature selection method based on KL divergence. Whereas the Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared to other algorithms, our proposed method shows better or comparable performance on small-sample, high-dimensional biological data. This indicates that the proposed method can be used to detect outliers in biological data sets.