Institute for Systems Biology, 401 Terry Ave N, Seattle, WA, 98109, USA

Center for Biophysics and Computational Biology, University of Illinois, Urbana, IL, 61801, USA

Abstract

Background

Relative expression algorithms such as the top-scoring pair (TSP) and the top-scoring triplet (TST) have several strengths that distinguish them from other classification methods, including resistance to overfitting, invariance to most data normalization methods, and biological interpretability. The top-scoring ‘N’ (TSN) algorithm is a generalized form of other relative expression algorithms which uses generic permutations and a dynamic classifier size to control both the permutation and combination space available for classification.

Results

TSN was tested on nine cancer datasets, showing statistically significant differences in classification accuracy between different classifier sizes (choices of N). TSN also performed competitively against a broad range of classification methods on the MAQC-II datasets.

Conclusions

TSN preserves the strengths of other relative expression algorithms while allowing a much larger permutation and combination space to be explored, potentially improving classification accuracies when fewer measured features are available.

Background

Relative expression algorithms such as the top-scoring pair (TSP) and the top-scoring triplet (TST) classify samples using the relative rank orderings of small numbers of features rather than their absolute values, which makes them resistant to overfitting, invariant to most data normalization methods, and biologically interpretable.

In this paper we present a new formulation of the relative expression classification algorithm that generalizes the idea of pairwise rank comparisons (TSP) and triplet rank comparisons (TST) into generic permutation rank comparisons, where the size of the classifier is not fixed in advance but is instead treated as a parameter N.

The classification accuracy of the existing relative expression algorithms has been demonstrated in several studies. Classifiers identified using relative expression algorithms have been used to distinguish multiple cancer types from normal tissue based on expression data.

We first demonstrate that both TSP and TST are special cases of the TSN algorithm. We illustrate the performance of a range of TSN classifier sizes on a set of nine cancer datasets. Finally, we demonstrate that TSN performs competitively when compared to a broad range of classification models, including artificial neural networks, classification trees, and support vector machines, using data and results from the FDA-sponsored Microarray Quality Control II project (MAQC-II).

Methods

Overview of relative expression algorithms TSP and TST

Given two classes of samples {C_1, C_2}, for which ranked expression data are available on M features {g_1, …, g_M}, the TSP algorithm identifies the pair of features {g_i, g_j} that maximizes the TSP score Δ_{i,j}, defined as:

Δ_{i,j} = |P(g_i < g_j | C_1) − P(g_i < g_j | C_2)|

The TSP algorithm identifies the best pair of features for which the rank of g_i falls lower than the rank of g_j in most or all samples in class C_1, and the rank of g_i falls lower than the rank of g_j in few or no samples of class C_2. The maximum score (Δ_{i,j} = 1) indicates a perfect classifier on the training set in which no samples deviate from this pattern. Classification is performed by comparing the ordering of features {g_i, g_j} in each sample of the test set to the orderings associated with the two classes. A variant on this algorithm known as k-TSP makes use of multiple disjoint pairs to improve classification accuracy.
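As an illustration, the TSP score can be computed directly from ranked data. The following is a minimal sketch, not the authors' implementation; the function name and data layout (`ranks` as a list of per-sample feature-rank vectors, `labels` as class assignments) are illustrative assumptions:

```python
def tsp_score(ranks, labels, i, j):
    """Sketch of the TSP score: |P(g_i < g_j | C1) - P(g_i < g_j | C2)|.

    ranks  -- list of per-sample feature-rank vectors
    labels -- class label (1 or 2) for each sample
    """
    hits = {1: [], 2: []}
    for row, c in zip(ranks, labels):
        hits[c].append(row[i] < row[j])  # does g_i rank below g_j in this sample?
    p1 = sum(hits[1]) / len(hits[1])  # P(g_i < g_j | C1)
    p2 = sum(hits[2]) / len(hits[2])  # P(g_i < g_j | C2)
    return abs(p1 - p2)
```

Searching for the top-scoring pair is then a loop over `itertools.combinations(range(M), 2)`, keeping the pair with the maximum score.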

The top-scoring triplet (TST) algorithm extends this comparison from pairs to triplets: the six possible rank orderings (permutations) π_1, …, π_6 of each feature triplet {g_i, g_j, g_k} are now considered explicitly, where the six orderings are:

(g_i < g_j < g_k), (g_i < g_k < g_j), (g_j < g_i < g_k), (g_j < g_k < g_i), (g_k < g_i < g_j), (g_k < g_j < g_i)

These permutation counts are accumulated for each sample of the training set, and the TST score Δ_{i,j,k} to be maximized is then calculated as follows:

Δ_{i,j,k} = Σ_{m=1}^{6} |P(π_m | C_1) − P(π_m | C_2)|

The top-scoring N algorithm

The top-scoring ‘N’ algorithm, as the name implies, extends these relative expression algorithms to a generic permutation size. Within the context of feature permutations, TSP and TST can be thought of as special cases of the TSN algorithm in which a fixed classifier size is used: N = 2 recovers TSP and N = 3 recovers TST.

**Figure S1.** Counting systems. **Figure S2.** Three complete conversions from permutation to decimal. **Figure S3.** Pseudocode for the core operation of the TSN algorithm on the GPU. **Figure S4.** Cancer dataset statistical tests for differences between values of N.


The Lehmer code

**The Lehmer code.** A complete translation from permutation to decimal, by way of the factoradic, for a permutation of size 4. Each permutation is mapped to a single unique decimal representation. Two additional translations from permutation to factoradic are shown in Additional file
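The permutation-to-factoradic-to-decimal translation described in the caption is straightforward to implement. A minimal sketch (function names are illustrative, not taken from the paper's software):

```python
from math import factorial

def lehmer_code(perm):
    """Lehmer code: digit k counts how many entries to the right of perm[k] are smaller."""
    n = len(perm)
    return [sum(perm[j] < perm[k] for j in range(k + 1, n)) for k in range(n)]

def lehmer_to_decimal(code):
    """Read the Lehmer code as a factoradic number: digit k is weighted by (n-1-k)!."""
    n = len(code)
    return sum(d * factorial(n - 1 - k) for k, d in enumerate(code))
```

This maps each of the n! permutations of size n to a unique integer in [0, n! − 1]; the identity permutation maps to 0 and the fully reversed permutation maps to n! − 1.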

The TSN algorithm works as follows: given two classes of samples {C_1, C_2} with rank values for M features {g_1, …, g_M}, and a classifier size N, identify the feature set {g_i, g_j, …, g_N} that maximizes the sum of the differences of the permutation probability distributions between the two classes:

Δ = Σ_{m=1}^{N!} |P(π_m | C_1) − P(π_m | C_2)|

where π_m is the m-th of the N! possible permutations of the classifier's features, and for each class Σ_m P(π_m | C_X) = 1.

In addition to the primary TSN score, a secondary score γ is calculated in the event of ties between two classifiers. γ is simply the distance in rank between the first and last elements of the classifier.

In the case where N = 2, the classifier is the pair {g_i, g_j} and P(π_1 | C) = P(g_i < g_j | C), so the TSN score reduces to the TSP comparison. In the case where N = 3, the classifier is the triplet {g_i, g_j, g_k} and the six permutation probabilities P(π_m | C) are exactly those of TST. Because the TSN algorithm uses factoradics to uniquely represent any permutation of any size classifier, it allows TSP and TST classifiers to be used in concert as well as allowing for even larger classifiers to be explored.
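Under these definitions, the TSN score for an arbitrary feature tuple can be sketched as follows. This is a simplified illustration assuming two classes labeled 1 and 2; note that with this sum-of-absolute-differences form, the N = 2 score equals twice the TSP score Δ, since the two pairwise permutation probabilities are complementary:

```python
from itertools import permutations

def tsn_score(ranks, labels, features):
    """Sketch of the TSN score: accumulate permutation frequencies per class,
    then sum absolute differences over all N! permutations."""
    n = len(features)
    counts = {1: {}, 2: {}}
    totals = {1: 0, 2: 0}
    for row, c in zip(ranks, labels):
        # the permutation this sample exhibits: argsort of the selected feature ranks
        key = tuple(sorted(range(n), key=lambda k: row[features[k]]))
        counts[c][key] = counts[c].get(key, 0) + 1
        totals[c] += 1
    score = 0.0
    for p in permutations(range(n)):
        score += abs(counts[1].get(p, 0) / totals[1] - counts[2].get(p, 0) / totals[2])
    return score
```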

The choice of N is made by cross validation on the training set: candidate classifier sizes are evaluated, and the value of N with the best cross validation performance is recommended.

Classification with TSN

Once the highest scoring classifier has been identified on the training set, classification proceeds as in TSP and TST: the permutation exhibited by the ranks of the classifier's features in a test sample is determined, and the sample is assigned to the class under which that permutation is more probable.
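Classification with a learned classifier can then be sketched as follows (a hypothetical sketch; `class_perm_probs` stands for the per-class permutation distributions estimated on the training set and is not a name from the paper):

```python
def classify(sample_ranks, features, class_perm_probs):
    """Assign a sample to the class giving higher probability to its observed permutation.

    class_perm_probs -- {class_label: {permutation_tuple: probability}} from training data
    """
    n = len(features)
    # the permutation this sample exhibits over the classifier's features
    key = tuple(sorted(range(n), key=lambda k: sample_ranks[features[k]]))
    return max(class_perm_probs, key=lambda c: class_perm_probs[c].get(key, 0.0))
```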

Inversions

**Inversions.** (**Top**) There are four inversions required to translate the sorted list [1 2 3 4] into the permutation [3 4 1 2]. The sum of the digits of the factoradic gives the number of inversions required to translate one permutation into another. (**Bottom**) The grey squares indicate the set of permutations that have a single inversion distance from the original (black) permutations.
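The inversion count described in the caption can be computed directly, and equals the sum of the factoradic (Lehmer-code) digits; a small sketch:

```python
def inversions(perm):
    """Count pairwise inversions: pairs (k, j) with j > k but perm[j] < perm[k].
    This equals the sum of the permutation's factoradic (Lehmer-code) digits."""
    n = len(perm)
    return sum(perm[j] < perm[k] for k in range(n) for j in range(k + 1, n))
```

For the figure's example, `inversions([3, 4, 1, 2])` returns 4, and a sorted list has zero inversions.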

Implementation of TSN

While the TSN algorithm can theoretically explore a very large permutation space, the computational requirements of the algorithm rise very quickly, and to avoid overfitting the number of permutations explored must be scaled to what is reasonable given the available number of samples. The complexity of TSN grows combinatorially: the number of candidate feature sets is C(M, N), and each candidate requires tracking up to N! permutation counts.
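To make the growth concrete, the search-space size can be computed from the number of candidate feature sets and the permutations tracked per candidate (a small illustration; the choice of M = 16 below mirrors the up-to-sixteen pre-filtered features used in this paper's analyses):

```python
from math import comb, factorial

def search_space(M, N):
    """Candidate feature sets C(M, N) and permutation counts N! tracked per candidate."""
    return comb(M, N), factorial(N)

# With M = 16 input features:
#   N = 2 -> 120 candidate pairs, 2 orderings each
#   N = 5 -> 4368 candidate sets, 120 orderings each
```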

The GPU is a specialized hardware device normally used in graphics rendering. Graphics rendering involves large numbers of vector and matrix operations performed in real time, so the GPU architecture emphasizes massive parallelism. Driven by the billion-dollar gaming industry, the GPU has developed into a powerful tool currently able to exceed 1 TFLOP (one trillion floating-point operations per second) on a single chip in single-precision operations. With NVIDIA’s release of the Compute Unified Device Architecture (CUDA) in 2007, general-purpose computation on the GPU became widely accessible. GPUs are increasingly being applied to computationally intensive scientific problems, including molecular dynamics simulations.

GPU vs. CPU running times

**GPU vs. CPU running times.** Running times for the GPU and CPU implementations of the TSN algorithm.

Results and discussion

Multiple values of N

TSN has been tested on nine cancer datasets that were used in the previous k-TSP and TST papers.

Shown in Figure

**Raw data for statistical significance testing of cancer results referenced in Figure.**


Results of TSN classification on cancer datasets

**Results of TSN classification on cancer datasets.** Results of 100 rounds of 5-fold cross validation over a range of classifier sizes N.

It is clear from Figure

Microarray quality control II datasets

Published in 2010, the Microarray Quality Control II (MAQC-II) project evaluated a large number of classification models, developed by independent analysis groups, on six datasets comprising nine endpoints (A through I), each divided into a training set and a validation set.

| **Dataset** | **Endpoint** | **Description** | **Platform** |
| --- | --- | --- | --- |
| **Hamner** | A | Lung tumorigen | Affymetrix Mouse 430 2.0 |
| **Iconix** | B | Non-genotoxic liver carcinogens | Amersham Uniset Rat 1 Bioarray |
| **NIEHS** | C | Liver toxicants | Affymetrix Rat 230 2.0 |
| **Breast Cancer** | D | Pre-operative treatment response | Affymetrix Human U133A |
| | E | Estrogen receptor status | |
| **Multiple Myeloma** | F | Overall survival milestone outcome | Affymetrix Human U133 Plus 2.0 |
| | G | Event-free survival milestone outcome | |
| | H | Gender of patient (positive control) | |
| | I | Random class labels (negative control) | |

| **Code** | **Name** | **Classification algorithm(s) used** |
| --- | --- | --- |
| **CAS** | Chinese Academy of Sciences | Naïve Bayes, Support Vector Machine |
| **CBC** | CapitalBio Corporation, China | k-Nearest Neighbor, Support Vector Machine |
| **Cornell** | Weill Medical College of Cornell University | Support Vector Machine |
| **FBK** | Fondazione Bruno Kessler, Italy | Discriminant Analysis, Support Vector Machine |
| **GeneGo** | GeneGo, Inc. | Discriminant Analysis, Random Forest |
| **GHI** | Golden Helix, Inc. | Classification Tree |
| **GSK** | GlaxoSmithKline | Naïve Bayes |
| **NCTR** | National Center for Toxicological Research, FDA | k-Nearest Neighbor, Naïve Bayes, Support Vector Machine |
| **NWU** | Northwestern University | k-Nearest Neighbor, Classification Tree, Support Vector Machine |
| **SAI** | Systems Analytics, Inc. | Discriminant Analysis, k-Nearest Neighbor, Machine Learning, Support Vector Machine, Logistic Regression |
| **SAS** | SAS Institute, Inc. | Classification Tree, Discriminant Analysis, Logistic Regression, Partial Least Squares, Support Vector Machine |
| **Tsinghua** | Tsinghua University, China | Classification Tree, k-Nearest Neighbor, Recursive Feature Elimination, Support Vector Machine |
| **UIUC** | University of Illinois, Urbana-Champaign | Classification Tree, k-Nearest Neighbor, Naïve Bayes, Support Vector Machine |
| **USM** | University of Southern Mississippi | Artificial Neural Network, Naïve Bayes, Sequential Minimal Optimization, Support Vector Machine |
| **ZJU** | Zhejiang University, China | k-Nearest Neighbor, Nearest Centroid |

The metric chosen by the MAQC-II consortium to rate the classification models was the Matthews correlation coefficient (MCC). The MCC has several advantages over the accuracy/sensitivity/specificity standard, as it is able to detect inverse correlations and is sensitive to the overall size of the training sets. MCC values range from +1 (perfect prediction) to −1 (perfect inverse prediction), with 0 indicating random prediction. Note that, unbeknownst to the original study participants, endpoints H and I were replaced by a positive control (gender of the study participants) and a negative control (random class assignments), respectively. Therefore, it was expected that endpoint H would result in a very high prediction MCC and endpoint I would result in an MCC close to zero. The MCC is calculated as follows:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

If any of the sums in the denominator of the MCC are zero, the denominator is set to be one, resulting in an MCC equal to zero.
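The MCC and the zero-denominator convention described above can be expressed compactly (a sketch using the standard confusion-matrix counts; the function name is illustrative):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient. A zero product in the denominator is
    replaced by one, yielding MCC = 0, per the MAQC-II convention."""
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        denom = 1
    return (tp * tn - fp * fn) / sqrt(denom)
```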

As stated above, only five of the six MAQC-II datasets are currently available from GEO; therefore, we were only able to compare TSN on these five. All filtering and classification was performed using only the training data for each dataset – the validation set was left completely out of these calculations. Where possible (Affymetrix platforms), the features of each training set were first filtered for a high percentage (66%) of present or marginal calls using a MATLAB implementation of the Affymetrix MAS5 call algorithm.

**Raw data for cross validation and test set MCC scores and ΔMCC scores for all MAQC-II participants and TSN, referenced in Figures.**


Results of TSN classification on MAQC-II Datasets

**Results of TSN classification on MAQC-II datasets.** MCC of MAQC-II endpoints A through I, based on models learned on the training set and then applied to the validation set. MCC values range from +1 (perfect prediction) to −1 (perfect inverse prediction), with 0 indicating random prediction. Boxplots show the MCC distribution of the models from the 15 groups, including TSN, that predicted all original and swap endpoints from the MAQC-II. The original and swap MCC values are averaged for each group. In addition to endpoints A through I, a boxplot showing the mean MCC over endpoints A through H is shown (ALL). We exclude endpoint I from this final boxplot because it is a negative control. The bottom and top of each box indicate the lower and upper quartiles of the data, respectively. The middle line represents the median. The whiskers indicate the extreme values. The asterisk represents the performance of TSN on that dataset. All raw data is included in Additional file

In addition to standard cross validation and validation set MCC, we also measured the statistical significance of different classifier sizes. As described with the cancer datasets above, we ran 100 iterations of TSN using fixed values of N.

**Raw data for statistical significance testing of MAQC-II results referenced in Figure S5.**


In order to test the amount of overfitting, we calculated the difference of the MCC values from each validation set and the corresponding MCC values from training set cross validation for each group. The cross validation performed for TSN was 5-fold cross validation, repeated 10 times, as recommended by the MAQC-II consortium. These results are presented in Figure

ΔMCC Results from MAQC-II Data

**ΔMCC Results from MAQC-II data.** Boxplots showing the distribution of ΔMCC values on the original data for each group, where ΔMCC = Cross Validation MCC − Validation Set MCC. This illustrates the amount of overfitting present during cross validation. The absolute value of each ΔMCC value was used in the calculations. The cross validation performed for TSN was 5-fold cross validation, repeated 10 times, as recommended by the MAQC-II consortium. Boxplots are sorted by the mean ΔMCC for each group (asterisk). All raw data is included in Additional file

For all analyses in this paper, up to sixteen differentially expressed genes were selected by the Wilcoxon rank sum test to input into the TSN algorithm. The fact that so few features were input to TSN in these analyses could explain the low levels of overfitting it exhibits. To test this, we ran all MAQC-II training sets (except for the negative control endpoint I, which would bias the results of ΔMCC towards zero) over a range of input feature sizes. For

**Raw data for ΔMCC values over a range of input feature sizes referenced in Figure S6.**


Conclusions

The goal of relative expression classification algorithms is to identify simple yet effective classifiers that are resistant to data normalization procedures and overfitting, practical to implement in a clinical environment, and potentially biologically interpretable. The top-scoring ‘N’ algorithm presented here retains these desirable properties while allowing a larger combination and permutation space to be searched than that afforded by earlier relative expression algorithms such as TSP and TST. TSN can also recommend the classifier size (N) through cross validation on the training data.

We have demonstrated the effectiveness of TSN in classification of the MAQC-II datasets in comparison with many other classification strategies, including artificial neural networks, classification trees, discriminant analysis, k-Nearest neighbor, naïve Bayes, and support vector machines, as implemented by several universities and companies from around the world. We do not claim that TSN is necessarily the best or most effective classifier for every circumstance. For example, TSN performs relatively poorly on endpoint H which, as the positive control in which classes were simply assigned according to the gender of the study participants, should be among the easiest to classify. A major strength of the algorithm is the level to which the MCC values for cross validation agree with the MCC values on the independent validation set (ΔMCC). Importantly, these results indicate a very low level of overfitting, and increase our confidence that results generated through cross validation on future datasets will be effective classifiers on independent validation sets. That is, when TSN performs well in cross validation on a dataset the result is relatively more likely to hold up; conversely, when it is going to fall short in independent validation it typically does not perform well in cross validation and so can be discarded as a candidate diagnostic early in the process. Analyses over a range of input sizes indicate that overfitting remains low even as input feature numbers increase, given sufficient sample sizes.

Of all the MAQC-II participants, including TSN, group SAS yielded the lowest mean ΔMCC score (0.074), indicating low levels of overfitting. Group SAI yielded the highest mean MCC (0.4893) for original and swap datasets, indicating high levels of validation set accuracy based on the training set. Both of these groups utilized multiple classification strategies across all endpoints. For example, group SAS used logistic regression for endpoints A, E, and I, support vector machines for endpoints B, G, and H, partial least squares regression for endpoints D and F, and a decision tree for endpoint C. Group SAI used support vector machines for endpoints A, B, E, F, G, and I, k-nearest neighbor for endpoints C and H, and a machine learning classifier for endpoint D. Group SAI also used a range of different feature selection methods for each endpoint. Both groups also used different classification strategies for the swap datasets. For example, group SAS used logistic regression for the original endpoint E data but partial least squares regression on swap endpoint E. Group SAI used a machine learning classifier for the original endpoint D, and discriminant analysis for swap endpoint D.

Abbreviations

CPU: Central processing unit; DEG: Differentially expressed genes; GPU: Graphics processing unit; MAQC-II: Microarray quality control II; MCC: Matthews correlation coefficient; TSN: Top-scoring ‘N’; TSP: Top-scoring pair; TST: Top-scoring triplet.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AM conceived of the study, wrote the software, and drafted the manuscript. NP participated in the study design and helped to write the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Dr. Don Geman and Bahman Afsari for valuable discussions during the development of this paper. This work was supported by a National Institutes of Health Howard Temin Pathway to Independence Award in Cancer Research [R00 CA126184]; the Camille Dreyfus Teacher-Scholar Program, and the Grand Duchy of Luxembourg-ISB Systems Medicine Consortium.