
A kernel-based approach for detecting outliers of high-dimensional biological data

Abstract

Background

In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information.

Results

We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. The original concept of KL divergence was designed as a measure of distance between two distributions. Stemming from that, we extend it to biological sample outlier detection by forming sample sets composed of nearest neighbors: KL divergence is defined between the two sample sets with and without the test sample. To handle the non-linearity of the sample distribution, the original data are mapped into a higher-dimensional feature space, and kernel functions are applied to avoid direct use of the mapping functions. We also address the singularity problem, due to the small sample size, that arises during the KL divergence calculation. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set for liver cancer study. Comparative studies with a Mahalanobis distance based method and one-class support vector machine (SVM) show that the proposed method performs better in finding outliers.

Conclusion

Our idea was derived from the Markov blanket algorithm, a feature selection method based on KL divergence: while the Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared with other algorithms, the proposed method shows better or comparable performance for small-sample, high-dimensional biological data, indicating that it can be used to detect outliers in biological data sets.

Background

Outlier detection is an active research area with many applications such as network intrusion detection [1], fraud detection [2], and biomedical data analysis [3]. In particular, outliers caused by instrument error or human error can severely degrade biomedical data analyses such as biomarker selection and disease diagnosis. Therefore, it is imperative to remove outliers during preprocessing, prior to the analysis, to prevent wrong results. Data mining techniques are widely used to separate such anomalous observations from normal ones.

Outlier detection has been studied by researchers using a diversity of approaches. Statistical methods often view objects that are located relatively far from the center of the data distribution as outliers. Several distance measures were implemented. The Mahalanobis distance is the most commonly used multivariate outlier criterion. Based on Akaike's Information Criterion (AIC), Kadota et al. developed a method for detecting outliers, which is free from a significance level [4]. Knorr and Ng introduced a distance-based approach in which outliers are those objects for which there are less than k points within a given threshold in the input data set [5, 6]. Angiulli et al. proposed a distance-based outlier detection method which finds the top outliers and provides a subset of the data set, called outlier detection solving set, that can be used to predict if new unseen objects are outliers [7]. Distance-based strategies are advantageous since model learning is not required. As an alternative, clustering algorithms can be used for outlier detection in which objects that do not belong to any cluster are regarded as outliers. Wang and Chiang proposed an effective cluster validity measure with outlier detection and cluster merging strategies for support vector clustering (SVC) [8]. The validity measure is capable of finding suitable values for the kernel parameter and soft margin constant. Based on these parameters, SVC algorithm can identify the ideal cluster number and increase robustness to outliers and noises. Schölkopf proposed a method of adapting support vector machine (SVM) to one-class classification problems [9]. Manevitz and Yousef presented two versions using the one-class SVM, both of which can identify outliers: Schölkopf's method and their proposed suggestion [10]. In such methods, after mapping the original samples into a feature space using an appropriate kernel function, the origin is referred to as the second class. In the feature space, samples close to the origin or lying on the standard subspace such as axes are regarded as outliers. Bandyopadhyay and Santra applied a genetic algorithm to the outlier detection problem in a lower dimensional space of a given data set, dividing these spaces into grids and efficiently computing the sparsity factor of the grid [11]. Aggarwal and Yu studied the problem of outlier detection for high-dimensional data, which works by finding lower dimensional projections [12]. Malossini et al. proposed two methods for detecting potential labeling errors: Classification-stability algorithm (CL-stability) and Leave-One-Out-Error-sensitivity algorithm (LOOE-sensitivity) [13]. In CL-stability, the stability of the classification of a sample is evaluated with a small perturbation of the other samples. LOOE-sensitivity was derived from the fact that if a sample is mislabeled, flipping the label of the sample should improve the prediction power.

In this paper, we propose a new outlier detection method based on KL divergence [14]. Due to the possible non-linearity of data structure, we deal with this problem in a higher feature space rather than the original space. Several issues arise after data mapping such as singularity because of small sample size versus high feature dimension. We address the computational issues and show the effectiveness of the proposed approach, KL divergence for outlier detection (KLOD).

Methods

Markov blanket

The Markov blanket algorithm proposed by Koller and Sahami is a cross-entropy based technique for identifying redundant and irrelevant features [15]. Let F be the full set of features and M ⊆ F be a subset of features that does not contain feature F_i. Then M is called a Markov blanket for F_i if F_i is conditionally independent of F − M − {F_i} given M. In practice, the Markov blanket M_i of F_i is defined as a subset consisting of the features that have the highest Pearson correlation with F_i. To evaluate the closeness between F_i and its Markov blanket M_i, the following expected cross-entropy Δ is estimated:

$$\Delta(F_i \mid \mathbf{M}_i) = \sum_{\mathbf{f}_{\mathbf{M}_i},\, f_i} P(\mathbf{M}_i = \mathbf{f}_{\mathbf{M}_i}, F_i = f_i)\, D\!\left(P(c \mid \mathbf{M}_i = \mathbf{f}_{\mathbf{M}_i}, F_i = f_i)\,\|\,P(c \mid \mathbf{M}_i = \mathbf{f}_{\mathbf{M}_i})\right), \qquad (1)$$

where $\mathbf{f}_{\mathbf{M}_i}$ and $f_i$ are value assignments of $\mathbf{M}_i$ and $F_i$, respectively, c is the class label, and D(·||·) denotes the cross-entropy (i.e., the Kullback-Leibler divergence). The Δ value is computed for each feature, and the feature with the smallest Δ value is eliminated from the feature set. The procedure is repeated with the remaining features until a predefined number of features remains.
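
For concreteness, the following minimal sketch estimates Δ of Eq. (1) empirically for binary features from raw counts. It is only an illustration of the definition; the random data, feature index, and blanket members are hypothetical and not from the paper.

```python
import numpy as np
from itertools import product

def kl_discrete(p, q, eps=1e-12):
    """KL divergence between two discrete distributions given as arrays."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def expected_cross_entropy(X, y, i, blanket):
    """Empirical estimate of Eq. (1), Delta(F_i | M_i), for binary features."""
    n = X.shape[0]
    classes = np.unique(y)
    delta = 0.0
    # Enumerate joint assignments (f_Mi, f_i) of the blanket features and F_i.
    for f_m in product([0, 1], repeat=len(blanket)):
        mask_m = np.all(X[:, blanket] == f_m, axis=1)
        for f_i in (0, 1):
            mask_mi = mask_m & (X[:, i] == f_i)
            if not mask_mi.any():
                continue
            p_joint = mask_mi.sum() / n
            p_c_given_mi = [np.mean(y[mask_mi] == c) for c in classes]  # P(c | M_i, F_i)
            p_c_given_m = [np.mean(y[mask_m] == c) for c in classes]    # P(c | M_i)
            delta += p_joint * kl_discrete(p_c_given_mi, p_c_given_m)
    return delta

# Toy usage with random binary data (hypothetical).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = rng.integers(0, 2, size=200)
print(expected_cross_entropy(X, y, i=0, blanket=[1, 2]))
```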

Kullback-Leibler (KL) divergence

KL divergence, widely used in information theory, is the core component of the Markov blanket algorithm. As seen above, KL divergence is a measure of the distance between two probability distributions [16]; for two probability densities p(x) and q(x), it is defined as

$$D_{\mathrm{KL}}(p \,\|\, q) = \int_{\mathbf{x}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})}\, d\mathbf{x}. \qquad (2)$$

Suppose that $\mathcal{N}(\mu, \Sigma)$ is a multivariate Gaussian distribution defined as

$$\mathcal{N}(\mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^m |\Sigma|}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \mu)^{\mathrm{T}} \Sigma^{-1} (\mathbf{x} - \mu)\right), \qquad (3)$$

where $\mathbf{x} \in \mathbb{R}^m$ and $|\Sigma|$ is the determinant of the covariance matrix Σ. Given two different probability density functions, $p(\mathbf{x}) = \mathcal{N}_1(\mu_1, \Sigma_1)$ and $q(\mathbf{x}) = \mathcal{N}_2(\mu_2, \Sigma_2)$, the KL divergence is

$$D_{\mathrm{KL}}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = \frac{1}{2}\left\{(\mu_1 - \mu_2)^{\mathrm{T}} \Sigma_2^{-1} (\mu_1 - \mu_2) + \log\frac{|\Sigma_2|}{|\Sigma_1|} + \mathrm{tr}\!\left[\Sigma_1 \Sigma_2^{-1} - I_m\right]\right\}. \qquad (4)$$
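
As a quick numerical illustration of Eq. (4), the following minimal sketch evaluates the closed form with NumPy from given means and covariances; the test values are arbitrary.

```python
import numpy as np

def kl_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form KL divergence between N(mu1, cov1) and N(mu2, cov2), Eq. (4)."""
    m = mu1.shape[0]
    cov2_inv = np.linalg.inv(cov2)
    diff = mu1 - mu2
    quad = diff @ cov2_inv @ diff               # (mu1 - mu2)^T S2^{-1} (mu1 - mu2)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    trace = np.trace(cov1 @ cov2_inv)           # tr[S1 S2^{-1}]
    return 0.5 * (quad + logdet2 - logdet1 + trace - m)

mu1, cov1 = np.zeros(3), np.eye(3)
mu2, cov2 = np.full(3, 0.5), 2.0 * np.eye(3)
print(kl_gaussian(mu1, cov1, mu2, cov2))        # 0 only when the two Gaussians coincide
```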

Concept of KL divergence for outlier detection (KLOD)

In the Markov blanket algorithm, after the Δ value of Eq. (1) is calculated for each feature, the feature with the lowest Δ value is considered the most redundant. Our new KL divergence based outlier detection method, called KLOD, employs a similar strategy: while the Markov blanket algorithm removes redundant and irrelevant features, our method identifies outliers. In KLOD, each sample $\mathbf{x}_i$ has a sample set that consists of the t samples closest to $\mathbf{x}_i$, where closeness is measured with the Euclidean metric. More specifically, we define two sample sets $S_1$ and $S_2$: $S_2$ is the set of samples closest to $\mathbf{x}_i$ in Euclidean distance, and $S_1$ consists of $\mathbf{x}_i$ together with all samples in $S_2$. The similarity $D_{\mathbf{x}_i}(S_1 \| S_2)$ between $S_1$ and $S_2$ is measured for each sample using KL divergence, where 1 ≤ i ≤ n and n is the total number of samples in the data set. Intuitively, the sample $\mathbf{x}_i$ with the largest D is regarded as an outlier:

$$o = \arg\max_{1 \le i \le n} D_{\mathbf{x}_i}. \qquad (5)$$
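
The following sketch illustrates the scoring loop of Eq. (5) in the original space: build S_2 from the t nearest neighbors of each sample, form S_1 = {x_i} ∪ S_2, and rank samples by the Gaussian KL divergence between the two sets. Modeling the neighbor sets with a small ridge added to the covariances is our simplification for this sketch; the paper instead moves to a kernel feature space and handles singularity there.

```python
import numpy as np

def kl_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form Gaussian KL divergence (see Eq. (4))."""
    m = mu1.shape[0]
    inv2 = np.linalg.inv(cov2)
    diff = mu1 - mu2
    _, ld1 = np.linalg.slogdet(cov1)
    _, ld2 = np.linalg.slogdet(cov2)
    return 0.5 * (diff @ inv2 @ diff + ld2 - ld1 + np.trace(cov1 @ inv2) - m)

def klod_scores(X, t=10, ridge=1.0):
    """Outlier score D_{x_i}(S1 || S2) for every sample (rows of X), in the original space."""
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.empty(n)
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:t + 1]       # the t samples closest to x_i
        S2 = X[neighbors]
        S1 = np.vstack([X[i:i + 1], S2])               # S1 = {x_i} union S2
        mu1, mu2 = S1.mean(axis=0), S2.mean(axis=0)
        c1 = np.cov(S1, rowvar=False) + ridge * np.eye(d)
        c2 = np.cov(S2, rowvar=False) + ridge * np.eye(d)
        scores[i] = kl_gaussian(mu1, c1, mu2, c2)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
X[7] += 3.0                                            # plant an obvious outlier
print(np.argmax(klod_scores(X, t=10)))                 # likely reports index 7
```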

Given a data set with a nonlinear structure, modeling it linearly would cause this strategy to fail, so we focus on modeling the nonlinearity. Accordingly, a mapping function ϕ maps the original space into a higher-dimensional feature space. Let $S_1^{\Phi}$ and $S_2^{\Phi}$ denote the two sample sets in the feature space, in which we compute the similarity $D(S_1^{\Phi} \,\|\, S_2^{\Phi})$ between $S_1^{\Phi}$ and $S_2^{\Phi}$. For each sample, $D(S_1^{\Phi} \,\|\, S_2^{\Phi})$ is calculated, and the sample with the largest value is referred to as an outlier.

An example is shown in Figure 1. The calculation, however, raises several issues that must be addressed: the kernel trick, the singularity problem, and the calculation of KL divergence in the feature space. These are described in the following sections.

Figure 1. Outlier detection in a high feature space. Suppose that the red dot is a real outlier, i.e., the farthest sample from the majority of the data. (a) In the original space, x1 is regarded as an outlier. (b) In the higher feature space, x2 is correctly detected as an outlier.

Kernel function

Suppose that $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ are the given samples in the original space. After mapping the samples into a higher feature space by a nonlinear mapping function ϕ, the samples in the feature space are represented as $\Phi_{m \times n} = [\phi(\mathbf{x}_1), \phi(\mathbf{x}_2), \ldots, \phi(\mathbf{x}_n)]$, where m is the dimensionality of the feature space. Denote K as follows:

$$K = \Phi^{\mathrm{T}} \Phi. \qquad (6)$$

The calculation can be performed using the kernel trick, i.e., the ij-th element $\phi(\mathbf{x}_i)^{\mathrm{T}}\phi(\mathbf{x}_j)$ of the matrix K can be computed as a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$. In the literature, the polynomial kernel and the Gaussian kernel are the most widely used kernel functions. In this study, the Gaussian kernel function is used:

$$k(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right), \qquad (7)$$

where σ controls the kernel width. Similar to Eq. (6), we define $K_{ij}$ as follows:

$$K_{ij} = \Phi_i^{\mathrm{T}} \Phi_j, \qquad (8)$$

where, if i ≠ j, $\Phi_i$ and $\Phi_j$ are different sample sets in the feature space; if i = j, $K_{ij}$ is equivalent to the definition of K. Indeed, the feature space and the mapping function may not be explicitly known. However, once the kernel function is known, we can easily deal with the nonlinear mapping problem by replacing the mapping functions with kernel functions.
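
A minimal sketch of computing K (Eq. (6)) and the cross-set matrices K_ij (Eq. (8)) with the Gaussian kernel of Eq. (7); the bandwidth σ and the toy data are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(X1, X2, sigma=1.0):
    """K_ij with entries k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); rows are samples."""
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
A, B = rng.normal(size=(6, 4)), rng.normal(size=(5, 4))
K = gaussian_kernel_matrix(A, A)        # i = j: equivalent to K = Phi^T Phi of Eq. (6)
K_AB = gaussian_kernel_matrix(A, B)     # i != j: the cross-set matrix of Eq. (8)
print(K.shape, K_AB.shape)              # (6, 6) (6, 5)
```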

The KL divergence expression is composed of mean and covariance components. The mean and the covariance matrix in the feature space are estimated as

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} \phi(\mathbf{x}_i) = \Phi \mathbf{s}, \qquad (9)$$
$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} \left(\phi(\mathbf{x}_i) - \mu\right)\left(\phi(\mathbf{x}_i) - \mu\right)^{\mathrm{T}} = \Phi J J^{\mathrm{T}} \Phi^{\mathrm{T}}, \qquad (10)$$

where $\mathbf{s}_{n \times 1} = \frac{1}{n}\mathbf{1}$, $J = \frac{1}{\sqrt{n}}\left(I_n - \mathbf{s}\mathbf{1}^{\mathrm{T}}\right)$, and $\mathbf{1} = [1, 1, \ldots, 1]^{\mathrm{T}}$. Then, an m × n matrix W is denoted as

$$W = \Phi J = \frac{1}{\sqrt{n}}\left[\left(\phi(\mathbf{x}_1) - \mu\right), \ldots, \left(\phi(\mathbf{x}_n) - \mu\right)\right]. \qquad (11)$$

Singularity problem

The covariance matrix in Eq. (10) is rank-deficient because the number of samples is small relative to the number of features. This problem, called the singularity problem, makes it impossible to compute the inverse of the covariance matrix. Several methods have been proposed to overcome it. In this study, we use a simple regularized approximation in which a positive constant is added to the diagonal elements of the covariance matrix [17]. The modified covariance matrix is then of full rank and hence nonsingular. Let C denote

$$C = \Phi J J^{\mathrm{T}} \Phi^{\mathrm{T}} + \rho I_m = W W^{\mathrm{T}} + \rho I_m = \Phi R \Phi^{\mathrm{T}} + \rho I_m, \qquad (12)$$

where $R = J J^{\mathrm{T}}$, ρ > 0, and $I_m$ is an identity matrix. In this study, ρ = 1 is used. Then, the inverse of the matrix C can be computed using the Woodbury formula:

$$\begin{aligned}
C^{-1} &= \left(\rho I_m + \Phi J J^{\mathrm{T}} \Phi^{\mathrm{T}}\right)^{-1} = \left(\rho I_m + W W^{\mathrm{T}}\right)^{-1} \\
&= \rho^{-1}\left(I_m - \rho^{-1} W \left(I_n + \rho^{-1} W^{\mathrm{T}} W\right)^{-1} W^{\mathrm{T}}\right) \\
&= \rho^{-1}\left(I_m - W \left(\rho I_n + W^{\mathrm{T}} W\right)^{-1} W^{\mathrm{T}}\right) \\
&= \rho^{-1}\left(I_m - \Phi J M^{-1} J^{\mathrm{T}} \Phi^{\mathrm{T}}\right) = \rho^{-1}\left(I_m - \Phi B \Phi^{\mathrm{T}}\right), \qquad (13)
\end{aligned}$$

where $B = J M^{-1} J^{\mathrm{T}}$ and $M = \rho I_n + W^{\mathrm{T}} W = \rho I_n + J^{\mathrm{T}} \Phi^{\mathrm{T}} \Phi J = \rho I_n + J^{\mathrm{T}} K J$.

Definition (Woodbury formula): Let A be a square r × r invertible matrix, and let U and V be two r × k matrices with k ≤ r. Assume that the k × k matrix $\Sigma = I_k + \beta V^{\mathrm{T}} A^{-1} U$, where $I_k$ denotes the k × k identity matrix and β is an arbitrary scalar, is invertible. Then

$$\left(A + \beta U V^{\mathrm{T}}\right)^{-1} = A^{-1} - \beta A^{-1} U \Sigma^{-1} V^{\mathrm{T}} A^{-1}.$$
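
The identity is easy to check numerically; the sketch below compares the Woodbury expression against a direct inverse for random matrices of arbitrary size.

```python
import numpy as np

rng = np.random.default_rng(0)
r, k, beta = 6, 3, 0.7
A = np.diag(rng.uniform(1.0, 2.0, size=r))        # an invertible r x r matrix
U, V = rng.normal(size=(r, k)), rng.normal(size=(r, k))

direct = np.linalg.inv(A + beta * U @ V.T)

A_inv = np.linalg.inv(A)
Sigma = np.eye(k) + beta * V.T @ A_inv @ U
woodbury = A_inv - beta * A_inv @ U @ np.linalg.inv(Sigma) @ V.T @ A_inv

print(np.allclose(direct, woodbury))               # True
```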

Calculation of KL divergence

Suppose that $S_1^{\Phi}$ and $S_2^{\Phi}$ are the two sample sets in the feature space introduced above. We know that the covariance matrices of both sets are singular. Let $C_1$ and $C_2$ denote the approximated (regularized) covariance matrices of $S_1^{\Phi}$ and $S_2^{\Phi}$, respectively, where the size of $S_1^{\Phi}$ is one larger than that of $S_2^{\Phi}$. Also, let $\mu_1$ and $\mu_2$ be the mean vectors of $S_1^{\Phi}$ and $S_2^{\Phi}$, respectively. The KL divergence between $S_1^{\Phi}$ and $S_2^{\Phi}$ is then expressed as follows:

$$2 D_{\mathrm{KL}}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = (\mu_1 - \mu_2)^{\mathrm{T}} C_2^{-1} (\mu_1 - \mu_2) + \log\frac{|C_2|}{|C_1|} + \mathrm{tr}\!\left[C_1 C_2^{-1}\right] - m. \qquad (14)$$

The KL divergence above is composed of three terms, i.e.,

$$\begin{cases}
(\mu_1 - \mu_2)^{\mathrm{T}} C_2^{-1} (\mu_1 - \mu_2) \\[4pt]
\log\dfrac{|C_2|}{|C_1|} \\[4pt]
\mathrm{tr}\!\left[C_1 C_2^{-1}\right].
\end{cases} \qquad (15)$$

It should be noted that, as shown in Eq. (9), Eq. (12), and Eq. (13), $\mu_i$, $C_i$, and $C_i^{-1}$ (i = 1 or 2) are expressed in terms of mapping functions rather than kernel functions.

Here, we will show how each term can be expressed by kernel functions instead of mapping functions. The first term consists of four sub-terms,

$$(\mu_1 - \mu_2)^{\mathrm{T}} C_2^{-1} (\mu_1 - \mu_2) = \mu_1^{\mathrm{T}} C_2^{-1} \mu_1 + \mu_2^{\mathrm{T}} C_2^{-1} \mu_2 - \mu_1^{\mathrm{T}} C_2^{-1} \mu_2 - \mu_2^{\mathrm{T}} C_2^{-1} \mu_1. \qquad (16)$$

Substituting Eq. (9) and Eq. (13) into each sub-term $\mu_i^{\mathrm{T}} C_j^{-1} \mu_k$, we have

$$\begin{aligned}
\mu_i^{\mathrm{T}} C_j^{-1} \mu_k &= \mathbf{s}_i^{\mathrm{T}} \Phi_i^{\mathrm{T}}\, \rho^{-1}\!\left(I_m - \Phi_j B_j \Phi_j^{\mathrm{T}}\right) \Phi_k \mathbf{s}_k \\
&= \rho^{-1}\left(\mathbf{s}_i^{\mathrm{T}} K_{ik} \mathbf{s}_k - \mathbf{s}_i^{\mathrm{T}} K_{ij} B_j K_{jk} \mathbf{s}_k\right) \\
&= \rho^{-1} \theta_{ijk}. \qquad (17)
\end{aligned}$$

As a result, all mapping functions in the first term are replaced with kernel functions. Before dealing with the second term, we introduce three properties of the determinant that are essential in its calculation.

Properties of determinant

(a) If A is an r × r matrix, det|dA| = det|d I_r A| = d^r det|A|.

(b) If A and B are k × r matrices, det|I_k + A B^T| = det|I_r + B^T A|.

(c) If A is invertible, det|A^{-1}| = 1/det|A|.

In the second term, we need to compute the determinant of C ($C_1$ or $C_2$). Instead of calculating the determinant of C directly, we obtain it through the determinant of $C^{-1}$. That is,

$$\begin{aligned}
|C^{-1}| &= \left|\rho^{-1}\left(I_m - \Phi B \Phi^{\mathrm{T}}\right)\right| \\
&= \rho^{-m}\left|I_m - \Phi B \Phi^{\mathrm{T}}\right| \quad \text{by property (a)} \\
&= \rho^{-m}\left|I_m - Q \Phi^{\mathrm{T}}\right| \\
&= \rho^{-m}\left|I_n - \Phi^{\mathrm{T}} Q\right| \quad \text{by property (b)} \\
&= \rho^{-m}\left|I_n - \Phi^{\mathrm{T}} \Phi B\right| \\
&= \rho^{-m}\left|I_n - K B\right|, \qquad (18)
\end{aligned}$$

where Q = ΦB. Here, by property (c), we can easily calculate |C|, i.e.,

$$|C| = \frac{1}{|C^{-1}|} = \frac{\rho^{m}}{|I_n - K B|}. \qquad (19)$$

By taking the logarithm of |C|, we have

$$\log|C| = \log\frac{\rho^{m}}{|I_n - K B|} = m\log\rho - \log|I_n - K B|. \qquad (20)$$

Note that the size of $S_1^{\Phi}$ is one larger than that of $S_2^{\Phi}$: if the size of $S_2^{\Phi}$ is k, the size of $S_1^{\Phi}$ is k + 1.

Now we have the second term composed of kernel functions:

$$\begin{aligned}
\log\frac{|C_2|}{|C_1|} &= \log|C_2| - \log|C_1| \\
&= \log\left|I_{k+1} - K_{11} B_1\right| - \log\left|I_k - K_{22} B_2\right|. \qquad (21)
\end{aligned}$$

The third term can be expressed with kernel functions using properties of the trace:

$$\begin{aligned}
\mathrm{tr}\!\left[C_1 C_2^{-1}\right] &= \mathrm{tr}\!\left[\left(\Phi_1 R_1 \Phi_1^{\mathrm{T}} + \rho I_m\right) \rho^{-1}\!\left(I_m - \Phi_2 B_2 \Phi_2^{\mathrm{T}}\right)\right] \\
&= \rho^{-1}\,\mathrm{tr}\!\left[\Phi_1 R_1 \Phi_1^{\mathrm{T}}\right] - \rho^{-1}\,\mathrm{tr}\!\left[\Phi_1 R_1 \Phi_1^{\mathrm{T}} \Phi_2 B_2 \Phi_2^{\mathrm{T}}\right] + m - \mathrm{tr}\!\left[\Phi_2 B_2 \Phi_2^{\mathrm{T}}\right] \\
&= \rho^{-1}\,\mathrm{tr}\!\left[R_1 K_{11}\right] - \rho^{-1}\,\mathrm{tr}\!\left[R_1 K_{12} B_2 K_{21}\right] + m - \mathrm{tr}\!\left[B_2 K_{22}\right]. \qquad (22)
\end{aligned}$$

We have thus replaced all mapping functions in the three terms of the KL divergence with kernel functions, so that the KL divergence between two sample sets can be calculated in the feature space.
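
Putting the pieces together, the sketch below computes 2·D_KL(S_1^Φ || S_2^Φ) purely from kernel matrices, following Eqs. (14), (17), (21), and (22). Note that the unknown feature-space dimension m cancels: the +m inside the trace term of Eq. (22) offsets the −m in Eq. (14). This is our reading of the derivation with illustrative variable names, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def set_quantities(K, rho):
    """s, B = J M^{-1} J^T and R = J J^T for one sample set, from its kernel matrix."""
    n = K.shape[0]
    s = np.full(n, 1.0 / n)
    J = (np.eye(n) - np.outer(s, np.ones(n))) / np.sqrt(n)
    M = rho * np.eye(n) + J.T @ K @ J
    B = J @ np.linalg.inv(M) @ J.T
    R = J @ J.T
    return s, B, R

def kernel_kl(S1, S2, sigma=1.0, rho=1.0):
    """2 * D_KL between the regularized Gaussians fitted to S1 and S2 in feature space."""
    K11 = gaussian_kernel(S1, S1, sigma)
    K22 = gaussian_kernel(S2, S2, sigma)
    K12 = gaussian_kernel(S1, S2, sigma)
    K21 = K12.T
    s1, B1, R1 = set_quantities(K11, rho)
    s2, B2, _ = set_quantities(K22, rho)

    # First term of Eq. (14), expanded into four sub-terms as in Eqs. (16)-(17).
    def sub(sa, Kab, Ka2, K2b, sb):
        return (sa @ Kab @ sb - sa @ Ka2 @ B2 @ K2b @ sb) / rho
    term1 = (sub(s1, K11, K12, K21, s1) + sub(s2, K22, K22, K22, s2)
             - sub(s1, K12, K12, K22, s2) - sub(s2, K21, K22, K21, s1))

    # Second term: log|C2| - log|C1| via Eq. (21); the m*log(rho) parts cancel.
    _, ld1 = np.linalg.slogdet(np.eye(K11.shape[0]) - K11 @ B1)
    _, ld2 = np.linalg.slogdet(np.eye(K22.shape[0]) - K22 @ B2)
    term2 = ld1 - ld2

    # Third term minus m, via Eq. (22); the +m inside the trace cancels the -m of Eq. (14).
    term3 = (np.trace(R1 @ K11) / rho
             - np.trace(R1 @ K12 @ B2 @ K21) / rho
             - np.trace(B2 @ K22))

    return term1 + term2 + term3

rng = np.random.default_rng(0)
S2 = rng.normal(size=(10, 50))                                  # neighbor set
S1 = np.vstack([rng.normal(size=(1, 50)) + 2.0, S2])            # test sample plus neighbors
print(kernel_kl(S1, S2, sigma=5.0))
```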

Results and discussion

To evaluate the performance of the KLOD method, we performed several experiments using a synthetic data set, two gene expression data sets, and a high-resolution mass spectrometry data set. To obtain unbiased results, all experiments were repeated 30 times with 10-fold cross validation (CV) and the performance was averaged. The performance of KLOD was compared with one-class SVM and Mahalanobis distance based outlier detection methods. Given n samples, the Mahalanobis distance for each multivariate sample $\mathbf{x}_i$ is

$$D_i = \sqrt{\left(\mathbf{x}_i - \mu\right)^{\mathrm{T}} \Sigma^{-1} \left(\mathbf{x}_i - \mu\right)}, \qquad (23)$$

where Σ and μ are the sample covariance matrix and sample mean vector, respectively. Samples with a large Mahalanobis distance are regarded as outliers.
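A small sketch of the Mahalanobis score of Eq. (23) follows. Since the sample covariance is singular when features outnumber samples, the sketch falls back to the pseudo-inverse, an implementation detail that is our assumption rather than the paper's.

```python
import numpy as np

def mahalanobis_scores(X):
    """Mahalanobis distance of every row of X from the sample mean, Eq. (23)."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)            # pseudo-inverse in case cov is singular
    diff = X - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
X[3] += 4.0                                  # plant an outlier
print(np.argmax(mahalanobis_scores(X)))      # likely reports index 3
```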

Results on synthetic data

First, using a synthetic data set, we evaluated the ability of KLOD to detect outliers. The synthetic data consist of 100 samples, denoted N, each of which has 100 features generated from a Gaussian distribution $\mathcal{N}(0, I)$. In addition, two sample sets, a quasi-outlier set Q and a perfect outlier set P, were produced; each contains 10 samples with 100 features, generated from $\mathcal{N}(0, I)$ and $\mathcal{N}(2, I)$, respectively. Note that Q was created from the same distribution as N. We then corrupted Q by changing the values of some features: for each sample in P, some features were randomly selected, and their values replaced those of features randomly selected from the corresponding sample in Q. Finally, N and Q were merged and used as the synthetic data set. Figure 2 illustrates the generation of the synthetic data. In this experiment, we tested KLOD while varying the number of corrupted features from 10 to 30 in steps of 2 and the size t of the set of close samples for each sample from 5 to 20 in steps of 5. With the synthetic data, we measured the accuracy of our method in identifying outliers by counting the number of real outliers among the first 10 samples detected by KLOD.
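
The generation scheme can be sketched as follows; the helper name, seed, and defaults are ours, and n_corrupt plays the role of the number of corrupted features varied from 10 to 30 in the experiment.

```python
import numpy as np

def make_synthetic(n_normal=100, n_outlier=10, n_feat=100, n_corrupt=20, seed=0):
    """Normal set N plus a quasi-outlier set Q corrupted with feature values from P ~ N(2, I)."""
    rng = np.random.default_rng(seed)
    N = rng.normal(0.0, 1.0, size=(n_normal, n_feat))
    Q = rng.normal(0.0, 1.0, size=(n_outlier, n_feat))            # same distribution as N
    P = rng.normal(2.0, 1.0, size=(n_outlier, n_feat))             # perfect outlier set
    for i in range(n_outlier):
        src = rng.choice(n_feat, size=n_corrupt, replace=False)    # features taken from P
        dst = rng.choice(n_feat, size=n_corrupt, replace=False)    # features replaced in Q
        Q[i, dst] = P[i, src]
    data = np.vstack([N, Q])
    labels = np.r_[np.zeros(n_normal, dtype=int), np.ones(n_outlier, dtype=int)]
    return data, labels

X, y = make_synthetic(n_corrupt=20)
print(X.shape, int(y.sum()))   # (110, 100) 10
```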

Figure 2. Generation of a synthetic data set. This example shows the procedure used in this study to generate the synthetic data.

Figure 3 shows the experimental results. When the number of noisy features increases, the accuracy shows a tendency to increase as well. It should be noted that for all set sizes, when the number of noisy features is 18, an accuracy of over 90% was obtained. Particularly, for t = 10, 15 and 20, when the number of noisy features is 30, an accuracy of 100% was achieved.

Figure 3. Accuracy of detecting outliers on the synthetic data. The data consist of 100 normal samples and 10 outliers, each having 100 features.

Performance evaluation after outlier removal

Before describing outlier removal for real biomedical data, we first introduce the performance evaluation procedure used in this study: PCA (principal component analysis) followed by LDA (linear discriminant analysis). LDA maps the data into a space of very low dimensionality, c − 1, where c is the number of classes; in the reduced space, a simple matching procedure is used for classification. However, to guarantee a non-degenerate result from LDA, the dimensionality of the data must first be reduced to at most n − c, where n is the number of samples. PCA is often used in the analysis of high-dimensional data sets: it transforms the original space into a lower-dimensional space with little or no information loss while maximally preserving variance.

Lilien et al. used the PCA+LDA method in the analysis of mass spectrometry data sets [18]. In this framework, the PCA dimensionality-reduced samples are projected by LDA onto a hyperplane so as to maximize the between-class variance and minimize the within-class variance of the projected samples. To evaluate the performance after outlier removal in our experiments, we employed the PCA+LDA strategy.
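
A minimal sketch of the PCA+LDA evaluation using scikit-learn, with repeated stratified 10-fold CV as described earlier; the conservative cap on the number of PCA components (to keep LDA non-degenerate on each training fold) is our choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

def pca_lda_accuracy(X, y, n_repeats=30, seed=0):
    """Average accuracy of PCA followed by LDA under repeated stratified 10-fold CV."""
    n, c = X.shape[0], len(np.unique(y))
    # Conservative cap: stay below both the training-fold size and n - c
    # so that LDA remains non-degenerate on every fold.
    n_comp = max(1, min(X.shape[1], n - c, int(0.8 * n)))
    clf = make_pipeline(PCA(n_components=n_comp), LinearDiscriminantAnalysis())
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=n_repeats, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv).mean()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 200)), rng.normal(0.5, 1.0, (30, 200))])
y = np.r_[np.zeros(30, dtype=int), np.ones(30, dtype=int)]
print(pca_lda_accuracy(X, y, n_repeats=2))
```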

Results on gene expression data sets

In this study, two public microarray data sets were used: the leukemia data set [19] and the colon cancer data set [20].

In the experiments with the two microarray data sets, specificity, sensitivity, and accuracy were measured with the PCA+LDA classification strategy after removing the outliers detected by KLOD (with t = 10), the Mahalanobis distance based method, and one-class SVM. We define specificity as the ratio of correctly classified negatives to the actual number of negatives; for the leukemia and colon microarray data sets, the negatives are the ALL and normal samples, respectively. For KLOD and the Mahalanobis distance based method, the performance was measured after removing, at each iteration, the sample with the largest distance in each class. If the prediction rate (specificity or sensitivity) decreases by more than a threshold γ compared with the prediction rate before the outlier removal, we stop the outlier detection in the corresponding class; in this study, γ = 0.5% was used. In contrast, for one-class SVM, the performance was assessed after excluding all samples regarded as outliers in each class.
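
The per-class removal loop with the stopping threshold γ can be sketched as follows; score_outliers and evaluate are placeholders for an outlier-scoring method (KLOD or Mahalanobis) and the PCA+LDA prediction rate, and the exact interpretation of the stopping rule is our reading of the text rather than the authors' code.

```python
def remove_outliers(samples, score_outliers, evaluate, gamma=0.005):
    """Iteratively drop the highest-scoring sample of one class until the
    prediction rate falls by more than gamma relative to the rate measured
    before that removal."""
    kept = list(samples)
    removed = []
    baseline = evaluate(kept)
    while len(kept) > 1:
        scores = score_outliers(kept)                  # e.g. KLOD or Mahalanobis scores
        worst = max(range(len(kept)), key=lambda i: scores[i])
        candidate = kept[:worst] + kept[worst + 1:]
        rate = evaluate(candidate)
        if rate < baseline - gamma:                    # degradation beyond gamma: stop
            break
        removed.append(kept[worst])
        kept = candidate
        baseline = rate
    return kept, removed
```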

Table 1 shows the experimental results obtained using leukemia and colon microarray data sets. For the leukemia data set, KLOD achieved the best accuracy with 9 outliers (2 ALL and 7 AML samples).

Table 1 Performance after outlier detection in leukemia and colon data sets.

Mahalanobis distance based method and one-class SVM found 14 and 12 outliers, respectively. For the colon data set, KLOD found 6 outliers (1 normal and 5 tumor samples) with 84.95% specificity, 94.43% sensitivity, and 91.25% accuracy. It should be noted that the performance of Mahalanobis distance based method was degraded in terms of sensitivity and accuracy compared to the performance obtained using all samples without outlier removal, suggesting that outliers detected by Mahalanobis distance based method are unlikely to be real ones.

Results on mass spectrometry data

To evaluate the effectiveness of KLOD, we also used a public mass spectrometry data set for liver cancer study that consists of 201 spectra: hepatocellular carcinoma (HCC) (78), cirrhosis (51), and healthy (72) [3]. From http://microarray.georgetown.edu/ressomlab/, we downloaded the binned spectra, which have 23,846 peaks per spectrum. To test the outlier detection methods, only cirrhosis and HCC spectra were used, as in [3]. Using a t-test with a significance level of 0.05 on the cirrhosis and HCC spectra, we selected 10,682 peaks; that is, the top 10,682 peaks selected by the t-test were used in the outlier detection methods. The same procedure as for the microarray data sets was employed, with cirrhosis samples regarded as negatives. As shown in Table 2, KLOD obtained slightly higher performance with the smallest number of outliers compared with the Mahalanobis distance based method and one-class SVM. From the results of the experiments with the mass spectrometry and microarray data sets, one-class SVM appears to detect more outliers than KLOD and the Mahalanobis distance based method.

Table 2 Performance after outlier detection in liver cancer mass spectrometry data.

Conclusion

We proposed a new outlier detection method based on KL divergence, called KLOD. Our idea was derived from the Markov blanket algorithm, in which redundant and irrelevant features are removed based on KL divergence. We tackled the outlier detection problem in a higher feature space after mapping the original data. The mapping raises several issues; in particular, we showed how to calculate KL divergence in the higher feature space by using properties of the determinant and trace of a matrix. To assess the usefulness of KLOD, we used a synthetic data set and real-life data sets. Compared with the Mahalanobis distance based method and one-class SVM, KLOD achieved higher or comparable performance.

References

1. Lee W, Stolfo S, Mok K: Mining audit data to build intrusion detection models. Proc Int Conf Knowledge Discovery and Data Mining (KDD 1998). 1998, 66-72.

2. Fawcett T, Provost F: Adaptive fraud detection. Data Mining and Knowledge Discovery. 1997, 1: 291-316.

3. Ressom H, Varghese R, Drake S, Hortin G, Abdel-Hamid M: Peak selection from MALDI-TOF mass spectra using ant colony optimization. Bioinformatics. 2007, 23: 619-626.

4. Kadota K, Tominaga D, Akiyama Y, Takahashi K: Detecting outlying samples in microarray data: A critical assessment of the effect of outliers on sample classification. Chem-Bio Informatics Journal. 2003, 3: 30-45.

5. Knorr E, Ng R: Algorithms for mining distance-based outliers in large datasets. Proc Int Conf Very Large Databases (VLDB 1998). 1998, 392-403.

6. Knorr E, Ng R, Tucakov V: Distance-based outliers: algorithms and applications. Proc Int Conf Very Large Databases (VLDB 2000). 2000, 237-253.

7. Angiulli F, Basta S, Pizzuti C: Distance-based detection and prediction of outliers. IEEE Trans on Knowledge and Data Engineering. 2006, 18: 145-160.

8. Wang JS, Chiang JC: A cluster validity measure with outlier detection for support vector clustering. IEEE Trans on Systems, Man, and Cybernetics, Part B. 2008, 38: 78-89.

9. Schölkopf B, Platt J, Shawe-Taylor J, Smola A, Williamson R: Estimating the support of a high-dimensional distribution. Neural Computation. 2001, 13: 1443-1471.

10. Manevitz L, Yousef M: One-class SVMs for document classification. Journal of Machine Learning Research. 2001, 2: 139-154.

11. Bandyopadhyay S, Santra S: A genetic approach for efficient outlier detection in projected space. Pattern Recognition. 2008, 41: 1338-1349.

12. Aggarwal C, Yu P: Outlier detection for high dimensional data. Proc ACM SIGMOD. 2001, 37-46.

13. Malossini A, Blanzieri E, Ng R: Detecting potential labeling errors in microarrays by data perturbation. Bioinformatics. 2006, 22: 2114-2121.

14. Oh J, Gao J, Rosenblatt K: Biological data outlier detection based on Kullback-Leibler divergence. Proc IEEE Int Conf on Bioinformatics and Biomedicine (BIBM 2008). 2008, 249-254.

15. Koller D, Sahami M: Toward optimal feature selection. Proc Int Conf on Machine Learning. 1996.

16. Tumminello M, Lillo F, Mantegna R: Kullback-Leibler distance as a measure of the information filtered from multivariate data. Physical Review E. 2007, 76: 256-67.

17. Zhou S, Chellappa R: From sample similarity to ensemble similarity: probabilistic distance measures in reproducing kernel Hilbert space. IEEE Trans on Pattern Analysis and Machine Intelligence. 2006, 28: 917-929.

18. Lilien R, Farid H, Donald B: Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum. Journal of Computational Biology. 2003, 10: 925-946.

19. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537.

20. Alon U, Barkai N, Notterman D, Gish K, Ybarra S: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 1999, 96: 6745-6750.


Acknowledgements

This work was supported in part by NSF under grants IIS-0612152 and IIS-0612214.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 4, 2009: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2008. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S4.

Author information


Corresponding author

Correspondence to Jean Gao.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JHO performed data analysis and wrote the manuscript. JG supervised the project and edited the paper.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Oh, J.H., Gao, J. A kernel-based approach for detecting outliers of high-dimensional biological data. BMC Bioinformatics 10 (Suppl 4), S7 (2009). https://doi.org/10.1186/1471-2105-10-S4-S7
