Abstract
Background
Support Vector Machines (SVMs) provide a powerful method for classification (supervised learning). Use of SVMs for clustering (unsupervised learning) is now being considered in a number of different ways.
Results
An SVMbased clustering algorithm is introduced that clusters data with no a priori knowledge of input classes. The algorithm initializes by first running a binary SVM classifier against a data set with each vector in the set randomly labelled, this is repeated until an initial convergence occurs. Once this initialization step is complete, the SVM confidence parameters for classification on each of the training instances can be accessed. The lowest confidence data (e.g., the worst of the mislabelled data) then has its' labels switched to the other class label. The SVM is then rerun on the data set (with partly relabelled data) and is guaranteed to converge in this situation since it converged previously, and now it has fewer data points to carry with mislabelling penalties. This approach appears to limit exposure to the local minima traps that can occur with other approaches. Thus, the algorithm then improves on its weakly convergent result by SVM retraining after each relabeling on the worst of the misclassified vectors – i.e., those feature vectors with confidence factor values beyond some threshold. The repetition of the above process improves the accuracy, here a measure of separability, until there are no misclassifications. Variations on this type of clustering approach are shown.
Conclusion
Nonparametric SVMbased clustering methods may allow for much improved performance over parametric approaches, particularly if they can be designed to inherit the strengths of their supervised SVM counterparts.
Introduction
Support Vector Machine
Support Vector Machines (SVMs) provide a very efficient mechanism to construct a separating hyperplane (see Fig. 1), surrounded by the thickest margin, using a set of training data. Prediction is made according to some measure of the distance between the test data and the hyperplane. Cover's theorem states that complex patternclassification problems can be transformed into a new feature space where the patterns are more likely to be linearly separable, provided that the transformation itself is nonlinear and that the feature space is in a high enough dimension [1]. The SVM method achieves such a transformation by choosing a qualified kernel.
Figure 1. Support Vector Machine uses Structural Risk Minimization to compare various separation models and to eventually choose the model with the largest margin of separation.
The SVM is briefly reviewed here using the notation of [2], more discussion is available there on implementation issues. Feature vectors are denoted by x_{ik}, where index i labels the M feature vectors (1 ≤ i ≤ M) and index k labels the N feature vector components (1 ≤ i ≤ N). For the binary SVM, labeling of training data is done using label variable y_{i }= ± 1 (with sign according to whether the training instance was from the positive or negative class). For hyperplane separability, elements of the training set must satisfy the following conditions: w_{β}x_{iβ}b ≥ +1 for i such that y_{i }= +1, and w_{β}x_{iβ}b ≤ 1 for y_{i }= 1, for some values of the coefficients w_{1},..., w_{N}, and b (using the convention of implied sum on repeated Greek indices). This can be written more concisely as: y_{i}(w_{β}x_{iβ}b) 1 ≥ 0. Data points that satisfy the equality in the above are known as "support vectors" (or "active constraints").
Once training is complete, discrimination is based solely on position relative to the discriminating hyperplane: w_{β}x_{iβ } b = 0. The boundary hyperplanes on the two classes of data are separated by a distance 2/w, known as the "margin," where w^{2 }= w_{β}w_{β}. By increasing the margin between the separated data as much as possible the SVM's optimal separating hyperplane is obtained. In the usual SVM formulation, the goal to maximize w^{1 }is restated as the goal to minimize w^{2}. The Lagrangian variational formulation then selects an optimum defined at a saddle point of L(w, b; α) = (w_{β}w_{β})/2  α_{γ }y_{γ }(w_{β}x_{γβ}b)  α, where α = ∑_{γ}α_{γ}, α_{γ }≥ 0 (1 ≤ γ ≤ M). The saddle point is obtained by minimizing with respect to {w_{1},..., w_{N}, b} and maximizing with respect to {α_{1},..., α_{M}}. Further details are left to [2]. A Wolfe transformation is performed on the Lagrangian [2], and it is found that the training data (support vectors in particular, KKT class (ii) above) enter into the Lagrangian solely via the inner product x_{iβ}x_{jβ}. Likewise, the discriminator f_{i}, and KKT relations, are also dependent on the data solely via the x_{iβ}x_{jβ }inner product. Generalization of the SVM formulation to datadependent inner products other than x_{iβ}x_{jβ }are possible and are usually formulated in terms of the family of symmetric positive definite functions (reproducing kernels) satisfying Mercer's conditions [3,4].
SVMinternal clustering
Clustering, the problem of grouping objects based on their known similarities is studied in various publications [2,5,7]. SVMInternal Clustering [2,7] (our terminology, usually referred to as a oneclass SVM) uses internal aspects of Support Vector Machine formulation to find the smallest enclosing sphere. Let {x_{i}} be a data set of N points in R^{d }(Input Space.) Similar to the nonlinear SVM formulation, using a nonlinear transformation φ, we transform x to a highdimensional space – Kernel space – and look for the smallest enclosing sphere of radius R. Hence we have:
where a is the center of the sphere. Soft constraints are incorporated by adding slack variables ζ_{j}:
We formulate the Lagrangian as:
where C is the cost for outliers and therefore C∑_{j}ζ_{j }is the penalty term. Taking the derivative of L w.r.t. R, a and ζ and setting them to zero we have:
Substituting the above equations back into the Lagrangian, we have the following dual formalism:
By KKT relations we have:
In the feature space, β_{j }= C only if ζ_{j }> 0; hence it lies outside of the sphere i.e. R^{2 }< φ(x_{j})  a ^{2}. This point becomes a bounded support vector or BSV. Similarly if ζ_{j }= 0, and 0 <β_{j }< C, then it lies on the surface of the sphere i.e. R^{2 }= φ(x_{j})  a ^{2}. This point becomes a support vector or SV. If ζ_{j }= 0, and β_{j }= 0, then R^{2 }> φ(x_{j})  a ^{2 }and hence this point is enclosed within the sphere.
SVMexternal clustering
Although the internal approach to SVM clustering is only weakly biased towards the shape of the clusters in feature space (the bias is for spherical clusters in the kernel space), it still lacks robustness. In the case of most realworld problems and strongly overlapping clusters, the SVMInternal Clustering algorithm above can only delineate the relatively small cluster cores. Additionally, the implementation of the formulation is tightly coupled with the initial choice of kernel; hence the static nature of the formulation and implementation does not accommodate numerous kernel tests. To remedy this excessive geometric constraint, an externalSVM clustering algorithm is introduced in [2] that clusters data vectors with no a priori knowledge of each vector's class.
The algorithm works by first running a Binary SVM against a data set, with each vector in the set randomly labeled, until the SVM converges. In order to obtain convergence, an acceptable number of KKT violators must be found. This is done through running the SVM on the randomly labeled data with different numbers of allowed violators until the number of violators allowed is near the lower bound of violators needed for the SVM to converge on the particular data set. Choice of an appropriate kernel and an acceptable sigma value also will affect convergence. After the initial convergence is achieved, the (sensitivity + specificity) will be low, likely near 1. The algorithm now improves this result by iteratively relabeling only the worst misclassified vectors, which have confidence factor values beyond some threshold, followed by rerunning the SVM on the newly relabeled data set. This continues until no more progress can be made. Progress is determined by an increasing value of (sensitivity+specificity), hopefully nearly reaching 2. This method provides a way to cluster data sets without prior knowledge of the data's clustering characteristics, or the number of clusters. In practice, the initialization step, that arrives at the first SVM convergence, typically takes longer than all subsequent partial relabeling and SVM rerunning steps.
The SVMExternal clustering approach is not biased towards the shape of the clusters, and unlike the internal approach the formulation is not fixed to a single kernel class. Nevertheless, there are robustness and consistency issues that arise in SVMExternal clustering approache. To rectify this issue, an external approach to SVM clustering is prescribed herein, that takes into account the robustness required in realistic applications.
Supervised cluster evaluation
Externally derived class labels require external categorization that assume "correct" labels for each category. Unlike classification, clustering algorithms do not have access to the same level of fundamental truth. Thus, the performance of unsupervised algorithms, such as clustering, can't be measured with the same certitude as for the classification problems. In this paper the result of the clustering is measured using the externally derived class labels for the patterns. Subsequently, we can use some of the classificationoriented measures to evaluate our results. These measures evaluate the extent to which a cluster contains patterns of a single class (see [4]).
The measures of prediction accuracy used here, Sensitivity, SN, and Specificity, SP, etc. are defined as follows:
The two measures used in this paper are Entropy and Purity, and are based on the pair {SP, nSP}, see Discussion for further details.
Cluster entropy and purity
Let p_{ij }be the probability that an object in cluster i belongs to class j. Then entropy for the cluster i, e_{i}, can be written as:
where J is the number of classes. Similarly, the purity for the cluster i, p_{i}, can be expressed as,
Note that the probability that an object in cluster i belongs to class j can be written as the number of objects of class j in cluster i, n_{ij}, divided by the total number of objects in cluster i, n_{i}, (i.e., p_{ij }= n_{ij}/n_{i}) Using this notation the overall validity of a cluster i using the measure f_{i }(for either entropy or purity) is the weighted sum of that measure over all clusters. Hence,
where, K is number of clusters and N is the total number of patterns. For our 2class clustering problem:
and:
Although similar, entropy is a more comprehensive measure than purity. Rather than considering either the frequency of patterns that are within a class or the frequency of patterns that are outside of a class, entropy, takes into account the entire distribution.
Unsupervised cluster evaluation
Unsupervised evaluation techniques do not depend on external class information. These measures are often optimization functions in many clustering algorithms. SumofSquaredError (SSE) measures the compactness of a single cluster and other measures evaluate the isolation of a cluster from other clusters.
SumofSquaredError. SSE, in input space, can be written as:
where for any similarity function s(x, x')
Due the arising mathematical complexity, it is often convenient to use Euclidean distance as the measure of similarity. Hence,
Let φ: X → F and k(x, y) = {φ (x), φ (y)}, then J_{e }can be rewritten (this time feature space) as:
for
Note that SSE, like any other unsupervised criterion, may not reveal the true underlying clusters, since the Euclidean distance simplification favors spherically shaped clusters. However, this geometry is imposed after the data was mapped to the feature space.
Results
Relabeler
The geometry of the hyperplane depends on the kernel and the kernel parameters (see Figure 2). In Fig. 3, the decision hyperplane is linear in the feature space, while in Fig. 4 the decision hyperplane is circular in the feature space. The clustering kernel used in Fig. 3 was the linear kernel. The clustering kernel used in Fig. 4 was the polynomial kernel (the linear kernel failed in this case).
Figure 2. The choice of kernel along with a genuine set of kernel parameters is important as the above table summarizes some of the most popular kernels used for classification and clustering.
Figure 3. The results of SVMRelabeler algorithm using the linear kernel.
Figure 4. The results of SVMRelabeler algorithm using a third degree polynomial kernel.
In the following experiments clustering is performed on a data set consisting of 8GC and 9GC DNA hairpin data (part of data sets used in [2]). The set consists of 400 elements. Half of the elements belong to each class. Although convergence is always achieved, convergence to a global optimum is not guaranteed. Figures 5a and 5b demonstrate the boost in Purity and Entropy (with the RBF kernel) as a function of Number of Iterations, while Fig 5c demonstrates this boost as decrement in SSE as a function of Number of Iterations. Note that the stopping criteria for the algorithm is based on the unsupervised measure of SSE. Comparison to fuzzy cmeans and kernel kmeans is shown on the same dataset (the solid blue and black lines in Fig. 5a and 5b). The greatly improved performance of the SVMExternal approach over conventional clustering methods, consistent with results found in prior work [2], strongly motivates further developments along these lines. This completes the Relabeler as an unsupervised method for clustering, further specifics on the Method are left to that section.
Figure 5. (a) and (b) show the boost in Purity and Entropy as a function of Number of Iterations after the completion the completion of the Relabeler algorithm. (c) demonstrates SSE as an unsupervised evaluation mechanism that mimics purity and entropy as the measure of true clusters. The blue and black lines are the result of running fuzzy cmean and kernel kmean on the same dataset.
Perturbation, random relabeling and hybrid methods
It is found that the result of the Relabeler algorithm can be significantly improved by randomly perturbing a weak clustering solution and repeating the SVMexternal labelswapping iterations as depicted in Fig. 6. To explore this further, a hybrid SVMexternal approach to the above problem is introduced to replace the initial random labeling step with kmeans clustering or some other fast clustering algorithm. The initial SVMexternal clustering must then be slightly and randomly perturbed to properly initialize the relabeling step; otherwise the SVM clustering tends to return to the original kmeans clustering solution. A complication is the unknown amount of perturbation of the kmeans solution that is needed to initialize the SVMclustering well – it is generally found that a weak clustering method does best for the initialization (or one weakened by a sufficient amount of perturbation). The results are shown in Figure 7. Note how the hybrid method can improve over kmeans and SVMclustering alone.
Figure 6. The result of Relabeler Algorithm with Perturbation. The top plots demonstrate the various Purity and Entropy scores for each perturbed run. The spikes are drops followed by recovery in the validity of the clusters as a result of random perturbation. The bottom plot is a similar demonstration, by tracking the unsupervised quality of the clusters. Note that after 4 runs of perturbation best solution is recovered.
Figure 7. (a) and (b) represent the SSE and Purity evaluation of hybrid Relabeler with Perturbation on the same dataset. Data is initially clustered using kmeans to initialize the Relabeler algorithm. The first segment of the plot (right before the spike at 16) is the result of Relabeler after 10% perturbation, while the second segment is the result after 30% perturbation. Purity Number of Iterations.
SVdropper
As explained in the Discussion section, identifying the hardtocluster data leads to the best overall performance. These weakly clustering data points have the greatest chance of being identified as noise, and therefore dropped out of the dataset. Results from running SVdropper on the data are shown in Figure 8. The methods in the above results can be used together: the Hybrid Relabeler can be run to get a solution, then backed off, using the SV dropper method, to establish highly accurate cluster "cores". These cores can then be used to seed the SVMABC algorithm, an area of ongoing work that is described in the Discussion section.
Figure 8. Purity vs Percentage Dropped after 55 iterations. At about 78% drop the data set becomes too small and the statistics begin to have much greater variance.
Discussion
Comparison to conventional clustering approaches
The proposed SVMExternal approach to clustering appears to inherit the strengths of SVM classification – an amazing prospect. It is a nonparametric approach due to its manipulation of label assignment at the individual data instance level – thus there is not a clearly stated objective function (this is a strength). Solutions are global since they correspond to the final, global, solution of the SVM classification process. SVMExternal solutions scale with the size of the dataset according to how an SVMclassifier would scale, multiplied by the number of iterations of the SVM (approximately 10, as a rough rule). Thus, clustering scales quadratically with the size of the data, but can be processed in chunks by the SVM according the SVMs nice distributable training properties. For clustering on multiple classes, the process would proceed along the lines of a multiclass discrimination solution with a binary SVM classifier – by iterative binary decomposition until subclusters can't be split (the stopping criterion for the decomposition). Comparison to established clustering methods are shown in Fig. 5 (see Results) and are more extensively explored in [4], where the overall strengths of the SVMExternal clustering approach are clearly evident.
SumofSquares as a cluster evaluation
The fitness of the cluster identified by the SVMexternal algorithm is modeled using the cohesion of the clusters. In other words, the fitness of a cluster can be tracked using the compactness of that cluster as the algorithm progresses. It is necessary to note that the notion of compactness as a way to evaluate clusters favours spherical clusters over clusters spread over a linear region. However, this limitation does not normally affect our methodology, since this limitation is imposed over the kernel space, and not the input feature space.
Stopping criteria and convergence for Relabeler algorithm
As depicted in Algorithm 1 below, Relabeler consists of two main parts: doSVM() and doRelabel() functions. Since doRelabel() [Alg. 2] consists of sequential relabeling of a finite set of labels with a finite alphabet ({1,+1}) the convergence of this function is always guaranteed. However, the convergence of doSVM() depends on the underlying SVM algorithm used. The particular SVM implementation used in Relabeler Algorithm is Sequential Minimal Optimization, introduced by John C. Platt [8].
The basic stopping condition for the Relabeler Algorithm is when there are no relabelings left to be performed. In terms of SSE, the unsupervised clustering measure, the algorithm halts when SSE remains unchanged. However, a decrease in SSE does not necessarily mean significant improvement to the quality of the clustering. Therefore, hypothesis testing and thresholding may prove useful depending on the application and the nature and geometry of the clustering data.
SVMABC
New subtleties of classificationseparation are possible with support vector machines via their direct handling and direct identification of data instances. Individual data points, in some instances, can be associated with "support vectors" at the boundaries between regions. By operating on labels of support vectors and focusing on training on certain subsets, the SVMABC algorithm offers the prospect to delineate highly complex geometries and graphconnectivity:
Split the clustering data into sets A, B and C
▪ A: Strong negatives
▪ C: Strong positives
▪ B: Weak negatives and weak positives
▪ Train an SVM on Data from A (labeled negative) and B (labeled positive)
▪ The support vectors: SV_AB
▪ Similarly train a new SVM on Data from C (labeled positive) and B (labeled negative)
▪ The support vectors: SV_CB
▪ Our objective is that the SV_AB and SV_CB sets have their labels flipped to be set A and C
▪ Regrow set A and C into the weak 'B' region.
▪ If an element of SV_AB is also in SV_BC, then the intersection of these sets are the elements that should be flipped to class B (if not already listed as class B).
▪ Stop at the first occurrence of any of these events
• Set B becomes empty
• Set B does not change
Scoring binary classification conventions
In this paper we adopt the following conventions:
Bioinformatics researchers, in gene prediction for example [5], take as primary pair: {SN, SP}; ROC people, or people using a Confusion Matrix diagram, take as primary pair {SN, nSN}; Purity/Entropy researchers use {SP, nSP}; no one uses the pair {nSN, nSP} since it is a trivial label flip from being {SN, SP}. Label flipping leaves the sensitivity {SN, nSN} pair the same, similarly for the specificity pair {SP, nSP}. In other work we use the conventions that are commonly employed in gene prediction, and other instances where the signal identification is biased towards identification of positives. Specifically, they have two sets they focus on, the actual positives (genes) and the predicted positives (gene predictions). For gene prediction there is either a gene, or there is nogene (i.e., junk, or background noise). In situations where your objective is to make the sets of actual positives (AP) and predicted positives (PP) maximally overlap, SP = specificity and SN = sensitivity are natural parameters and can be nicely described with a prediction accuracy Venn diagram (similar to a confusion matrix). For the problem here, we are adopting the purity/entropy measures to be inline with other efforts in the clustering research community, so work with the pair {SP, nSP}.
Rayleigh's criterion
Rayleigh's criterion, also known as the Rayleigh Limit, is used for the resolution of two light sources. In the case of two laser beam sources falling upon a single slit, the resolution limit is defined by the singleslit interference pattern where one source's maximum falls on the first minimum of the diffraction pattern of the second source. This definition is used for resolving distant stars as singletons or identification of binary star systems, etc. In the case of laser optics, the resolution of two sources can be pushed beyond the Rayleigh limit due to tracking the statistics of the individual photons that arrive. The relevance of all of this is that resolving two sources is equivalent to saying that a binary clustering solution exists, i.e., that the data is separable to some degree. If the clustering algorithm tracks the data instances individually, as it does with our SVMexternal approach, we have a scenario analogous to the resolution in laser optics beyond the Rayleigh limit.
Conclusion
A novel SVMbased clustering algorithm is explored that clusters data with no a priori knowledge of input classes. The resolution limit (i.e., clustering limit) of light sources, the classical Rayleigh Limit (see Discussion above), was eventually circumvented in laseroptics by probing the instancebased quantumstatistical description of quanta of light. With an SVM we have a nonparametric, instancebased, tracking as well. We hope to build upon the novel and powerful SVM clustering approach described here to eventually show clustering resolution beyond the "Parametric Limit", a limit that is otherwise imposed by the typical, parameterized, clustering methods, where the data isn't directly tracked but contributes to a parameterized model. Along these lines, it is hoped that further development of the SVM_ABC Algorithm described in the Discussion can offer recovery of subtle graphlike connectedness between cluster elements, a weakness of manifoldlike separability approaches such as parametricbased clustering methods.
Methods
SVMrelabeler
The SVM classification formulation is used as the foundation for clustering a set of feature vectors with no a priori knowledge of the feature vector's classification. The nonseparable SVM solution guarantees convergence at the cost of allowing misclassification. The extent of slack is controlled through the regularization constant, C, to penalize the slack variable, ξ. If the random mapping ((x_{1}, y_{1}),..., (x_{m}, y_{m})) ∈ X^{m }× y is not linearly separable when ran through a binary SVM, the misclassified features are more likely to belong to the other cluster. Moreover, by relabeling those heavily misclassified features and by repeating this process we arrive at an optimal separation between the two clusters. The basics of this procedure is presented in Algorithm 1, where is the new cluster assignment for x and θ contains ω, α, y'.
Algorithm 1: SVMRelabeler
Require: m, x
1. ← Randomly chosen from {1,+1}
2. repeat
The doSVM() procedure can be any standard and complete implementation of an SVM classifier with support for nonlinear discriminator function. The idea is that doSVM() has to converge regardless of the geometry of the data, in order to provide the doRelabel() procedure with the hyperplane and other standard SVM outputs. After this procedure, doRelabel() reassigns some (or all) of the misclassified features to the other cluster. If D(x_{i}, θ) is the distance between x_{i }feature and the trained SVM hyperplane, then heavily misclassified feature, x_{j ∈ J }could be selected by comparing D(x_{j}, θ) to D(x_{j'}, θ) for all j' ∈ J. Algorithm 2 clarifies the basic implementation of this procedure.
Algorithm 2: doRelabel() Procedure
Require: Input vector: x
SVM model: θ
Confidence Factor: αIdentify misclassified features:
x'^{+ }← K misclassified features with = +1
x'^{ }← L misclassified features with = 1
1. for all i^{th }component of x'^{+ }do
2. if i/K ∑^{K}_{j = 1 }D(x'^{+}_{j}, θ) <αD(x'^{+}_{j}, θ) then
4. end if
5. end for
6. for all ith component of x'^{ }do
7. if 1/L ∑^{L}_{j = 1 }D(x'^{}_{j}, θ) <αD(x'^{}_{j}, θ) then
9. end if
10. end for
SVdropper
In most applications of clustering, the dataset is composed of leverage and influential points. Leverage points are subsets of the dataset that are highly deviated from the rest of the cluster, and removing them does not significantly change the result of the clustering. In contrast, influential points are those in the highly deviated subset whose inclusion or removal significantly changes the decision of the clustering algorithm. Effective, identification of these special points is of interest to improve accuracy and correctness of the clustering algorithm. A systematic way to manage these deviants is given by the SVDropper algorithm.
As depicted in Algorithm 3, SVM is initially trained on the clustered data; the weakest of the cluster data – those closest to the hyperplane, i.e., the support vectors – are dropped thereafter. This processed is repeated until the desired ratio of accuracy and number of data dropped is achieved.
Algorithm 3: SVDropper Algorithm
Require: Input vector: x
1. let:
x^{+ }← K features with y = +1
2. repeat
3. θ ← doSVM(x, y)
4. for all features, x_{j},
5. drop feature, x_{j}, if D(x_{j}, θ) < 1 end for
6. until desired ratio of SSE and number of data dropped
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
The initial submission was written by SM and SWH, with revisions by SWH. The core SVM pattern recognition software and SVMclustering method were developed by SWH. SM developed a separate SVM classifier for test validation and performed the SVMclustering dataruns.
Acknowledgements
Federal funding was provided by NIH K22 (PI, 5K22LM008794), NIH NNBM R21 (coPI), and NIH R01 (subaward). State funding was provided from a LaBOR Enhancement (PI), a LaBOR Research Competitiveness Subcontract (PI), and a LaBOR/NASA LaSPACE Grant (PI). Funding also provided by New Orleans Children's Hospital and the University of New Orleans Computer Science Department.
This article has been published as part of BMC Bioinformatics Volume 8 Supplement 7, 2007: Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/8?issue=S7.
References

Ratsch G, Tsuda K, Muller K, Mika S, Schölkopf B: An introduction to kernelbased learning algorithms.
IEEE Trans Neural Netw 2001, 12(2):181201. Publisher Full Text

WintersHilt S, Yelundur A, McChesney C, Landry M: Support Vector Machine Implementations for Classification & Clustering.
BMC Bioinformatics 2006, 7(Suppl 2):S4. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Vapnik VN: The Nature of Statistical Learning Theory. 2nd edition. SpringerVerlag, New York; 1998.

Burges CJC: A tutorial on support vector machines for pattern recognition.
Data Min Knowl Discov 1998, 2:12167. Publisher Full Text

Kumar V, Tan P, Steinbach M: Introduction to data mining. Pearson Education, Inc; 2006.

Burset M, Guigo R: Evaluation of gene structure prediction programs.
Genomics 1996, 34(3):353367. PubMed Abstract  Publisher Full Text

BenHur A, Horn D, Siegelmann H, Vapnik V: Suppor Vector Clustering.
Journal of Machine Learning Research 2001, 2:125137. Publisher Full Text

Platt J: Sequential Minimal Optimization: A Fast Algorith for Training Support Vector Machines.