Department of Information and Computer Sciences, University of Hawai`i at Mānoa, Honolulu, HI 96822, USA

Abstract

Background

Classification is the problem of assigning each input object to one of a finite number of classes. This problem has been extensively studied in machine learning and statistics, and there are numerous applications in bioinformatics as well as many other fields. Building a multiclass classifier has been a challenge: the direct approach of altering a binary classification algorithm to accommodate more than two classes can be computationally too expensive. Hence the indirect approach of binary decomposition has been commonly used, in which a major issue is retrieving the class posterior probabilities from the set of binary posterior probabilities given by the individual binary classifiers.

Methods

In this work, we present an extension of a recently introduced probabilistic kernel-based learning algorithm called the Classification Relevance Units Machine (CRUM) to the multiclass setting to increase its applicability. The extension is achieved under the error correcting output codes framework. The probabilistic outputs of the binary CRUMs are preserved using a proposed linear-time decoding algorithm, an alternative to the generalized Bradley-Terry (GBT) algorithm whose application to large-scale prediction settings is prohibited by its computational complexity. The resulting classifier is called the Multiclass Relevance Units Machine (McRUM).

Results

The evaluation of McRUM on a variety of real small-scale benchmark datasets shows that our proposed Naïve decoding algorithm is computationally more efficient than the GBT algorithm while maintaining a similar level of predictive accuracy. A set of experiments on a larger-scale dataset for small ncRNA classification was then conducted with the Naïve McRUM and compared with the Gaussian and linear SVMs. Although McRUM's predictive performance is slightly lower than that of the Gaussian SVM, the results show that a similar true positive rate can be achieved at the cost of a slightly higher false positive rate. Furthermore, McRUM is computationally more efficient than the SVM, which is an important factor for large-scale analysis.

Conclusions

We have proposed McRUM, a multiclass extension of the binary CRUM. McRUM with the Naïve decoding algorithm is computationally efficient at prediction time, and its predictive performance is comparable to that of the well-known SVM, showing its potential in solving large-scale multiclass problems in bioinformatics and other fields of study.

Background

The problem of classifying an object to one of a finite number of classes is a heavily studied problem in machine learning and statistics. There are numerous applications in bioinformatics, such as cancer classification using microarrays.

Recently, a novel kernel-based learning algorithm called the Classification Relevance Units Machine (CRUM) for binary classification was introduced.

In this paper, we extend the CRUM algorithm into the more general multiclass setting, allowing for applications beyond binary classification. This is achieved by decomposing the multiclass problem into a set of binary classification problems using the error correcting output codes (ECOC) framework.

In this study, the McRUM is evaluated on two sets of experiments. First, the McRUM is applied to a variety of small-scale datasets from the UCI repository.

In the second set of experiments, the McRUM is applied to the problem of classifying small noncoding RNAs (ncRNAs) to validate the use of the method on a problem of a larger scale than that of the first set of experiments. This second set of experiments deals with a three-class classification problem: the identification of sequences from two classes of post-transcriptional gene regulatory ncRNAs, mature microRNA (miRNA) and piwi-interacting RNA (piRNA), from other ncRNAs. This is of interest to small RNA sequencing projects (reads under 40 nt), where novel miRNAs and piRNAs can be found amidst a set of unannotated reads. The miRNA case is especially interesting because the miRNA precursors may not be sequenced in such small ncRNA sequencing projects, closing the usual avenue of finding novel miRNAs via identification of their precursors.

The experimental results on datasets taken from the UCI repository, together with the preliminary results on small ncRNAs, show that, under certain settings, the McRUM can achieve comparable or higher accuracy than previous analyses of these problems. Thus the results suggest McRUM's potential in solving multiclass problems in bioinformatics and other fields of study.

Methods

Classification relevance units machine

The sparse kernel-based binary classification model called the Classification Relevance Units Machine (CRUM) obtains probabilistic predictions P(c_{+}|**x**) that an object **x **∈ Ψ is a member of the positive class c_{+ }using the following model

where the w_{i }are the weights of the corresponding relevance units **u**_{i}, and P(c_{-}|**x**) = 1 - P(c_{+}|**x**).

For a given training set **x**_{1}, **x**_{2},..., **x**_{N}, the CRUM training algorithm determines the relevance units **u**_{i }and the weights w_{i}.
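As a concrete illustration of this predictive model, the following sketch evaluates a sigmoid applied to a weighted kernel expansion over the relevance units. The Gaussian kernel choice and all parameter values are assumptions for illustration, not the trained values of an actual CRUM.

```python
import math

def gaussian_kernel(x, u, sigma=1.0):
    """Illustrative kernel choice: k(x, u) = exp(-||x - u||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, u))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def crum_posterior(x, units, weights, bias, sigma=1.0):
    """P(c+|x): sigmoid of a weighted kernel expansion over the relevance units u_i."""
    activation = bias + sum(w * gaussian_kernel(x, u, sigma)
                            for w, u in zip(weights, units))
    return 1.0 / (1.0 + math.exp(-activation))

# Toy relevance units and weights (illustrative values, not a trained model);
# the complementary posterior P(c-|x) is simply 1 - P(c+|x).
p_pos = crum_posterior([0.2, 0.1],
                       units=[[0.0, 0.0], [1.0, 1.0]],
                       weights=[1.5, -0.7],
                       bias=0.1)
```

Because the model is a sparse kernel expansion, prediction cost scales with the number of relevance units rather than the training set size.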

The multiclass classification problem and solutions

Multiclass classification is the generalization of binary classification to an arbitrary number of classes c_{1}, c_{2},..., c_{K}.

There are two major approaches to converting a binary classifier to a multiclass classifier: the direct approach and through the aggregation of multiple binary classifiers.

Direct approach

In the direct approach, the internals of the binary classifier are changed to reflect the multiclass nature of the problem.

where the **u**_{m }are the relevance units and the w_{mi }are the class-specific weights. This direct extension leads to a K^{3 }increase in the run-time complexity of the CRUM training algorithm compared to the binary case, due to the inversion of a matrix whose dimensions grow linearly with the number of classes K.

Likewise, reformulating the SVM for multiclass classification leads to high-cost training algorithms.

Decomposition of a multiclass problem into binary classification problems

The idea of the aggregation approach is to decompose the multiclass problem into multiple binary problems that can then be solved with binary classifiers. The most popular framework for this approach is the method of error correcting output codes (ECOC)

where each column of **M **specifies one binary classifier.

For example, the one-versus-rest (OVR) matrix for three classes is a 3 × 3 identity matrix:

There are three columns and thus this decomposition will require the training of three binary classifiers. The first binary classifier is trained with the training data belonging to class c_{1 }as the positive class set and the data belonging to classes c_{2 }and c_{3 }as the negative class set. The second binary classifier is trained with the training data belonging to class c_{2 }as the positive class set and the data belonging to classes c_{1 }and c_{3 }as the negative set. The third binary classifier is trained similarly. This decomposition is called one-versus-rest (OVR) because each binary classifier is trained with only one class serving as the positive class and all other classes serving as the negative class. In general, the OVR matrix for K classes is the K × K identity matrix.
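The construction of the OVR coding matrix can be sketched as follows; `ovr_matrix` is an illustrative helper name, not part of any released McRUM code.

```python
def ovr_matrix(K):
    """One-versus-rest coding matrix: the K x K identity matrix.
    Row k is the codeword of class c_k; column j defines binary classifier j,
    with class c_j as the positive class (1) and all other classes negative (0)."""
    return [[1 if j == k else 0 for j in range(K)] for k in range(K)]

M = ovr_matrix(3)  # the 3 x 3 identity matrix described above
```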

The all-pairs (AP) matrix for three classes is also a 3 × 3 matrix:

The Δ symbol denotes omission of the class in the training of the binary classifier. Therefore in this case, the first binary classifier is trained with the training data belonging to class c_{1 }as the positive class set, data from c_{2 }as the negative class set, and data from c_{3 }omitted. The next two binary classifiers are trained in a similar way. This decomposition is called one-versus-one or all-pairs (AP), as each binary classifier is trained with only a single class serving as the positive class and another single class as the negative class. Since there are K(K - 1)/2 distinct pairs of classes, the AP decomposition for K classes requires training K(K - 1)/2 binary classifiers.
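The AP coding matrix can be generated analogously; here Δ is represented by `None`, and `ap_matrix` is an illustrative helper.

```python
from itertools import combinations

DELTA = None  # Δ: the class is omitted from this binary classifier's training

def ap_matrix(K):
    """All-pairs coding matrix: one column per unordered pair (a, b) of classes,
    with class c_a positive, class c_b negative, and all other classes omitted."""
    pairs = list(combinations(range(K), 2))  # K(K - 1)/2 pairs
    return [[1 if k == a else (0 if k == b else DELTA) for (a, b) in pairs]
            for k in range(K)]

M = ap_matrix(3)  # 3 x 3: columns are (c1 vs c2), (c1 vs c3), (c2 vs c3)
```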

In general any coding matrix **M **defined by Equation (3) can be used under the following constraints:

1. All rows and columns are unique

2. No row is solely composed of Δ

3. Each column has at least one 1 and at least one 0
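These three constraints can be checked mechanically; the sketch below is an illustrative validator, with Δ again represented by `None`.

```python
DELTA = None  # Δ entries mark omitted classes

def is_valid_coding_matrix(M):
    """Check the three constraints on a coding matrix (rows: classes, columns: classifiers)."""
    rows = [tuple(r) for r in M]
    cols = [tuple(c) for c in zip(*M)]
    unique = len(set(rows)) == len(rows) and len(set(cols)) == len(cols)   # constraint 1
    no_all_delta_row = all(any(e != DELTA for e in r) for r in rows)       # constraint 2
    cols_have_both = all(1 in c and 0 in c for c in cols)                  # constraint 3
    return unique and no_all_delta_row and cols_have_both

valid_ovr = is_valid_coding_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])  # OVR: valid
valid_dup = is_valid_coding_matrix([[1, 0], [1, 0], [0, 1]])           # duplicate rows: invalid
```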

Aggregating the binary outputs

Given a coding matrix **M **and the outputs of the L binary classifiers, how do we compute the class posteriors P(c_{k}|**x**)? Let us first consider the simple case of hard decoding, leading to a hard decision. Assume that the binary classifiers g_{i}, defined by the columns of **M**, return hard decisions where an output of 1 denotes the positive class and 0 denotes the negative class. Then the collective output of the binary classifiers on **x **can be collected into a row vector **g**(**x**) = [g_{1}(**x**), g_{2}(**x**),..., g_{L}(**x**)]. The predicted class that **x **belongs to is determined by finding the row of **M **with the smallest distance to **g**(**x**). Let **y**, **z **∈ {0, 1, Δ}^{1 × L}.

where

Let **M**(k) denote the k-th row of **M **and,

Then the predicted class of **x **is the class whose row of **M **is closest to **g**(**x**). The rows of **M **can be interpreted as the unique codewords representing the K classes, and **g**(**x**) as one of those codewords corrupted by noise. In this context, the above algorithm decodes **g**(**x**) into the closest codeword, thus performing error correction on the corrupted bits and giving this approach to classification its name, ECOC.
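Hard decoding can be sketched as follows. The treatment of Δ entries (skipping them in the distance) is an assumed convention for illustration, standing in for the paper's distance of Equation (6).

```python
DELTA = None  # Δ entries in a codeword

def codeword_distance(y, z):
    """Distance between codewords; positions where either entry is Δ are skipped
    (an assumed convention standing in for the paper's Equation (6))."""
    return sum(1 for a, b in zip(y, z) if a != DELTA and b != DELTA and a != b)

def hard_decode(M, g):
    """Return the index of the row of M closest to the vector g of hard binary decisions."""
    distances = [codeword_distance(row, g) for row in M]
    return distances.index(min(distances))  # ties broken toward the first row

# OVR for three classes: the classifiers output [0, 1, 0], which matches row 2 exactly
M = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
predicted = hard_decode(M, [0, 1, 0])  # 0-based index of the predicted class
```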

Unfortunately, computing the posterior probabilities P(c_{k}|**x**) for all k is not as straightforward when the binary classifiers return probabilistic outputs, where g_{i}(**x**) is the probability of the positive class of the i-th binary problem.

Given the probabilistic outputs of the binary classifiers

Through these relations the posterior probabilities **p **= [p_{1}, p_{2},..., p_{K}]^{T }can be estimated by finding the **p **that minimizes the negative log-likelihood,

under the constraints that each p_{k }≥ 0 and that the p_{k }sum to 1.

Note that the optimization of Equation (13) must be done for every object **x **that we want to make a prediction on. This could be too expensive in large-scale prediction applications. Furthermore, the computational complexity of the algorithm is not completely characterized. While Huang et al.
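To make the per-object optimization concrete, the following sketch minimizes a negative log-likelihood of the generalized Bradley-Terry form over the probability simplex, using a softmax parametrization and numerical-gradient descent as a simple stand-in for the actual GBT algorithm. The exact likelihood form and all names here are assumptions for illustration.

```python
import math

def gbt_nll(p, pos_sets, neg_sets, r):
    """Negative log-likelihood of class posteriors p given binary outputs r_i,
    where classifier i compares the classes in pos_sets[i] against neg_sets[i]
    (a form assumed from the generalized Bradley-Terry model)."""
    nll = 0.0
    for i, ri in enumerate(r):
        qp = sum(p[k] for k in pos_sets[i])
        qn = sum(p[k] for k in neg_sets[i])
        mu = qp / (qp + qn)  # model's probability that classifier i outputs "positive"
        nll -= ri * math.log(mu) + (1.0 - ri) * math.log(1.0 - mu)
    return nll

def estimate_posteriors(pos_sets, neg_sets, r, K, steps=3000, lr=0.1):
    """Minimize the NLL over the probability simplex via a softmax parametrization
    and numerical-gradient descent (a simple stand-in for the GBT algorithm)."""
    def softmax(z):
        m = max(z)
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        return [v / s for v in e]

    z = [0.0] * K
    for _ in range(steps):
        base = gbt_nll(softmax(z), pos_sets, neg_sets, r)
        grad = []
        for k in range(K):
            z_eps = list(z)
            z_eps[k] += 1e-6
            grad.append((gbt_nll(softmax(z_eps), pos_sets, neg_sets, r) - base) / 1e-6)
        z = [v - lr * g for v, g in zip(z, grad)]
    return softmax(z)

# All-pairs, three classes: the binary outputs favor class 0 in both of its pairings
p = estimate_posteriors([[0], [0], [1]], [[1], [2], [2]], r=[0.9, 0.8, 0.7], K=3)
```

Note that this iterative optimization must be rerun for every query object, which is exactly the cost the Naïve decoding below avoids.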

We make the naive assumption that the output of each binary classifier is independent. Under the interpretation of error-correcting codes, the formulation below is a soft-decoding of the observed **g**(**x**) to the codewords in **M **under the assumption that bit errors are independent. Then we can compute the class posteriors as simple products of the binary posteriors, as follows

where the output of classifiers not trained on data from class c_{k }(i.e., Δ entries of **M**) is omitted from the product. For example, under the AP decomposition of Equation (5), P(c_{2}|**x**, **M**) = (1 - g_{1}(**x**)) g_{3}(**x**). Given the outputs of the binary classifiers, the algorithm is linear in the size of the coding matrix **M**.

The above formulation is a generalization, to any valid **M**, of the Resemblance Model proposed for the AP decomposition. However, the independence assumption rarely holds in practice, since the binary classifiers are trained on overlapping subsets of the data determined by **M**. Thus in general, this method is possibly only a crude approximation.

The following pseudocodes summarize the training and prediction processes of McRUM.

**Algorithm 1: **Training McRUM

Input: coding matrix **M **(K rows for the classes, L columns for the binary classifiers), labeled training data

1: **for **i ← 1 **to **L **do**

2:    P_{i }← ∅ (positive training set of classifier i)

3:    N_{i }← ∅ (negative training set of classifier i)

4:    **for **j ← 1 **to **K **do**

5:       **if M**_{ji }= 1 **then**

6:          Add data from class c_{j }to P_{i}

7:       **else if M**_{ji }= 0 **then**

8:          Add data from class c_{j }to N_{i}

9:       **end if**

10:    **end for**

11:    Set g_{i }← binary CRUM trained with P_{i }as positive data and N_{i }as negative data

12: **end for**

13: **return g **= [g_{1}, g_{2},..., g_{L}]

**Algorithm 2: **Prediction

Input: **M**, trained binary classifiers g_{i}, query object **x**

1: Set **p **= [p_{1}, p_{2},..., p_{K}] ← [1, 1,..., 1]

2: **for **k ← 1 **to **K **do**

3:    **for **i ← 1 **to **L **do**

4:       **if M**_{ki }= 1 **then **p_{k }← p_{k }· g_{i}(**x**) **else if M**_{ki }= 0 **then **p_{k }← p_{k }· (1 - g_{i}(**x**))

5:       **end if**

6:    **end for**

7: **end for**; normalize **p **so its entries sum to 1

8: **return p**
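A direct sketch of the Naïve decoding in Algorithm 2 follows, with Δ represented by `None` and the binary posteriors supplied as a list `r`; `naive_decode` is an illustrative name.

```python
DELTA = None  # Δ: classifier i was not trained on data from this class

def naive_decode(M, r):
    """Naive decoding (Algorithm 2): class posteriors as products of binary
    posteriors under the independence assumption; Δ entries are skipped."""
    K = len(M)
    p = [1.0] * K
    for k in range(K):
        for i, ri in enumerate(r):
            if M[k][i] == 1:        # class c_k was classifier i's positive class
                p[k] *= ri
            elif M[k][i] == 0:      # class c_k was classifier i's negative class
                p[k] *= (1.0 - ri)
    total = sum(p)
    return [v / total for v in p]   # normalize so the posteriors sum to one

# All-pairs decomposition for three classes, as in Equation (5)
M = [[1, 1, DELTA], [0, DELTA, 1], [DELTA, 0, 0]]
p = naive_decode(M, [0.9, 0.8, 0.7])
```

Given the binary outputs, the cost is a single pass over the entries of **M**, which is the source of the linear-time claim; no per-object optimization is needed.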

Optimal coding matrix

The next question is whether any theory can guide us in designing the optimal coding matrix, that is, the one that gives the smallest error. There is, but it is not practically useful. The following are some of the properties that a good ECOC-based classifier would satisfy:

1. The minimum distance (using Hamming distance, Equation (6)) between rows of **M **should be maximized

2. The number of **Δ **entries should be minimized

3. The average error of the L binary classifiers should be minimized

All the criteria are at odds with each other. Consider the OVR decomposition, Equation (4), again. Since all but one class is considered to be in the negative class, the training data is likely to be imbalanced. To see why this is a problem, let us consider an extreme case where 99% of the training data is negative and only 1% of the data is positive. Then a binary classifier that always predicts the negative class would achieve 1% error. Under the framework of empirical or structural risk minimization, classifier training would tend to converge to this solution as it provides low empirical risk under 0-1 loss. Therefore a large imbalance between the sizes of the positive and negative sets would bias the classifier against the smaller class. So while OVR does not have any Δ entries, the average error of the binary classifiers could be high.
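The arithmetic of this extreme case can be checked directly:

```python
# Extreme OVR-style imbalance: 99 negative examples, 1 positive example.
labels = [0] * 99 + [1]

# A degenerate classifier that always predicts the negative class...
predictions = [0] * len(labels)

# ...attains 1% empirical error under 0-1 loss while never detecting a positive,
# which is why risk minimization can be biased against the smaller class.
error = sum(int(pred != y) for pred, y in zip(predictions, labels)) / len(labels)
```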

In the case of the AP decomposition shown in Equation (5), each individual binary classifier has only a single class serving as the positive data and another single class serving as the negative. If the overall training set is balanced among all K classes, then each binary problem is balanced as well. However, the AP decomposition has many Δ entries, so each binary classifier is trained on only a fraction of the available data.

Therefore, knowing a priori which coding matrix is superior to another is not possible, and the choice of coding matrix **M **is application-dependent. We must experimentally try different matrices to find the one best suited to the particular application.

ncRNA dataset preparation and features

The ncRNA dataset is gathered from miRBase's collection of miRNAs, NONCODE's collection of piRNAs, and other ncRNA databases.

In the gathered data, miRNAs are observed to be 15 ~ 33 nt long and piRNAs 16 ~ 40 nt long. For the other ncRNAs, the training and evaluation of the McRUM do not necessarily use the entire sequence. We chose to use fragments of length 20 nt, which is in the overlapping range of lengths between miRNAs and piRNAs, so that a fragment could plausibly be an miRNA or piRNA had its identity been unknown. If an other-ncRNA sequence is longer than 20 nt, we take a random 20 nt fragment from the sequence instead of the full sequence. Due to the imbalance of the dataset among the three classes, the training set is a sample of the available data. After holding out 20% of the miRNA sequences for an independent test set, we are left with 7,552 miRNAs in the training set. Therefore we sample 7,552 piRNAs and 7,552 other ncRNAs to form a balanced 1:1:1 training set. Together with the held-out 1,887 miRNAs, the remaining 73,595 piRNAs and 87,257 other ncRNAs serve as an independent test set.

Since mature miRNAs and piRNAs lack strong secondary structures, internally the McRUM represents each ncRNA using k-mer features.
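The k-mer representation can be sketched as follows; k = 3 and the window normalization are illustrative choices, and the exact feature set used by McRUM may differ.

```python
from itertools import product

def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Normalized k-mer frequency vector over the RNA alphabet.
    k = 3 is an illustrative choice; the feature set in the paper may differ."""
    kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = max(1, len(seq) - k + 1)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:      # skip windows containing ambiguous bases
            counts[window] += 1
    return [counts[m] / windows for m in kmers]

features = kmer_frequencies("ACGUACGUACGUACGUACGU")  # a 20 nt fragment -> 4^3 = 64 features
```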

Performance measures

The Receiver Operating Characteristic (ROC) curve is a visualization of the performance of binary classifiers at various thresholds. On the x-axis is the false positive rate (FPR) and on the y-axis the true positive rate (TPR), where FPR = FP/(FP + TN) and TPR = TP/(TP + FN), with TP, FP, TN, and FN denoting the numbers of true positives, false positives, true negatives, and false negatives, respectively.
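Given posterior probabilities for the positive class and the true labels, the (FPR, TPR) points of an ROC curve at chosen thresholds can be computed as in this sketch:

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) at each threshold on the positive-class posterior, with
    FPR = FP / (FP + TN) and TPR = TP / (TP + FN)."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Posterior of the positive class for six objects, with their true binary labels
pts = roc_points([0.9, 0.8, 0.6, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0, 0], [0.5, 0.7])
```

Sweeping the threshold, as done from 0.30 to 0.99 in the figures below, traces out the curve.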

For classification of more than two classes, we can compute ROC curves by considering one class as the positive class and the remaining classes jointly as the negative class. For the small ncRNA experiment, we have three classes. For Figures

3-fold cross-validation results for miRNA being the positive class

**3-fold cross-validation results for miRNA being the positive class**. The ROC curves are generated from observed FPR and TPR under varying posterior probability thresholds from 0.30 to 0.99 in increments of 0.01 for the two McRUM models (AP and OVR settings) with the Naïve decoding algorithm, and for the Gaussian and linear SVMs with all-pairs decomposition.

3-fold cross-validation results for piRNA being the positive class

**3-fold cross-validation results for piRNA being the positive class**. The ROC curves are generated from observed FPR and TPR under varying posterior probability thresholds from 0.30 to 0.99 in increments of 0.01 for the two McRUM models (AP and OVR settings) with the Naïve decoding algorithm, and for the Gaussian and linear SVMs with all-pairs decomposition.

The timing results for the Naïve and GBT decoding algorithms in the benchmark experiments were obtained using MATLAB implementations on a PC with a 2.83 GHz Intel Core 2 Quad processor and 8 GB of memory.

Results and discussion

In this section we present two sets of experiments: benchmark experiments and small ncRNA experiments. The purpose of the benchmark experiments is to assess the performance of McRUM under four different decomposition settings and two different decoding algorithms. For these experiments, we use a group of datasets from the UCI Machine Learning Repository.

For both sets of experiments, we also run the multiclass SVM implemented in LIBSVM.

Benchmark experiments

For the experiments, we try the McRUM on five small datasets from the UCI Machine Learning Repository website, using the Gaussian kernel k(**x**, **u**) = exp(-||**x **- **u**||^{2}(2σ^{2})^{-1}), where σ is the kernel width parameter.

Throughout the benchmark experiments, we consider the following decompositions: (i) all-pairs (AP), (ii) one-versus-rest (OVR), (iii) random dense, and (iv) random sparse. Random coding matrices **M **are generated with and without Δ symbols for the random sparse and random dense cases, respectively. For each random type, 100 random **M **are generated and the **M **with the largest minimum distance among its rows is chosen. By controlling the number of columns in the random sparse matrix, we can aim to create a decomposition that is a compromise between AP and OVR. This is useful should the number of classes be large, since the number of binary classifiers required by AP grows quadratically with the number of classes.
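The selection of a random sparse matrix can be sketched as follows. The redrawing of invalid matrices and the Δ-skipping row distance are assumed details for illustration; the maximization of the minimum row distance follows criterion 1 above.

```python
import random
from itertools import combinations

DELTA = None  # Δ symbol for the random sparse case

def row_distance(y, z):
    """Distance between codewords; Δ positions are skipped (an assumed convention)."""
    return sum(1 for a, b in zip(y, z) if a != DELTA and b != DELTA and a != b)

def random_sparse_matrix(K, L, rng):
    """One random K x L matrix over {1, 0, Δ}, redrawn until every column
    contains at least one 1 and at least one 0."""
    while True:
        M = [[rng.choice([1, 0, DELTA]) for _ in range(L)] for _ in range(K)]
        if all(1 in col and 0 in col for col in zip(*M)):
            return M

def best_of_random(K, L, trials=100, seed=0):
    """Draw `trials` random matrices and keep the one whose minimum pairwise
    row distance is largest (criterion 1 above)."""
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(trials):
        M = random_sparse_matrix(K, L, rng)
        score = min(row_distance(M[a], M[b]) for a, b in combinations(range(K), 2))
        if score > best_score:
            best, best_score = M, score
    return best

M = best_of_random(K=4, L=6)
```

Omitting Δ from the sampled symbols yields the random dense case instead.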

The class label is assigned based on which class has the largest posterior probability, as determined by the Naïve and generalized Bradley-Terry (GBT) decoding algorithms.

Wine dataset

The three-class wine dataset contains 178 instances.

McRUM results on wine dataset using 10-fold cross-validation

| | **Naïve Train Acc** | **Naïve Test Acc** | **GBT Train Acc** | **GBT Test Acc** |
| --- | --- | --- | --- | --- |
| AP | 99.44 (0.20) | 97.78 (2.87) | 99.44 (0.20) | 97.78 (2.87) |
| OVR | 99.44 (0.20) | 97.75 (3.92) | 99.44 (0.20) | 97.75 (3.92) |
| Dense | 84.40 (1.09) | 83.63 (6.99) | 99.44 (0.20) | 98.30 (2.74) |
| Sparse | 91.01 (1.76) | 90.00 (6.31) | 99.50 (0.26) | 97.22 (4.72) |

Train/Test Acc is the mean accuracy on the training/test dataset. Standard deviations are shown in parentheses. (AP: all-pairs, OVR: one-versus-rest, Dense: random dense, Sparse: random sparse)

The prediction running-times are also measured for the AP and OVR decompositions where the Naïve and GBT algorithms show comparable predictive performance. As shown in Table

Prediction time of McRUM on benchmark datasets (in seconds)

| | **Naïve AP** | **Naïve OVR** | **GBT AP** | **GBT OVR** |
| --- | --- | --- | --- | --- |
| wine | 0.001868 (0.000206) | 0.001662 (0.000164) | 7.373405 (1.182947) | 1.073139 (0.197278) |
| iris | 0.001409 (0.000139) | 0.001376 (0.000141) | 6.625228 (0.678363) | 2.908828 (0.834858) |
| yeast | 0.040737 (0.000534) | 0.049292 (0.001219) | 2722.317544 (117.545388) | 2795.766035 (139.314294) |
| thyroid | 0.131689 | 0.123968 | 939.692526 | 179.426632 |
| satellite | 0.239550 | 0.139304 | 10612.598301 | 2816.632703 |

We provide the prediction time only for the AP and OVR cases, since their predictive performances are competitive with each other for all benchmark datasets while the performances of the random decompositions with the Naïve algorithm are much lower. The prediction time is averaged over 10-fold cross-validation for the first three datasets, while it is estimated once for the last two datasets as an explicit partitioning into training and test sets was given. The standard deviation of the prediction time is included in parentheses. (AP: all-pairs, OVR: one-versus-rest)

We observed mean accuracies of 99.69% (std = 0.33) and 98.89% (std = 2.34) from the Gaussian SVM on the training and test sets, respectively, which is comparable to the AP and OVR McRUM results of 99.44% (std = 0.20) and 97.78% (std = 2.87). In addition, the best mean accuracy reported for a 10-fold cross-validation using a multiclass RVM on this wine dataset is 96.24%.

Iris dataset

Table

McRUM results on iris dataset using 10-fold cross-validation

| | **Naïve Train Acc** | **Naïve Test Acc** | **GBT Train Acc** | **GBT Test Acc** |
| --- | --- | --- | --- | --- |
| AP | 97.56 (0.70) | 96.00 (5.62) | 97.56 (0.70) | 96.00 (5.62) |
| OVR | 97.85 (0.82) | 96.67 (5.67) | 97.85 (0.82) | 96.67 (5.67) |
| Dense | 67.56 (2.88) | 68.00 (16.27) | 97.70 (0.74) | 96.00 (6.44) |
| Sparse | 66.67 (1.16) | 66.67 (10.42) | 97.78 (0.92) | 95.33 (5.49) |

Train/Test Acc is the mean accuracy on the training/test dataset. Standard deviations are shown in parentheses. (AP: all-pairs, OVR: one-versus-rest, Dense: random dense, Sparse: random sparse)

The mean accuracies on the training and test sets observed from the Gaussian SVM are 97.85% (std = 0.65) and 96.67% (std = 3.51), respectively. The best mean accuracy reported for a 10-fold cross-validation using a multiclass RVM on this iris dataset is 93.87%.

Yeast dataset

Table

McRUM results on yeast dataset using 10-fold cross-validation

| | **Naïve Test Acc** | **GBT Test Acc** |
| --- | --- | --- |
| AP | 59.43 (6.27) | 59.10 (6.21) |
| OVR | 58.90 (5.29) | 59.51 (4.02) |
| Dense | 3.43 (1.49) | 57.75 (3.11) |
| Sparse | 2.02 (1.62) | 59.23 (2.79) |

Test Acc is the mean accuracy on the test dataset. Standard deviations are shown in parentheses. Note that, due to the large dataset size and the high computational complexity of the GBT algorithm, the mean accuracy on the training partitions cannot be provided. (AP: all-pairs, OVR: one-versus-rest, Dense: random dense, Sparse: random sparse)

For this dataset, the Gaussian SVM shows 60.85% accuracy with a standard deviation of 4.08. The best results from the AP McRUM (59.43%, std = 6.27) and the OVR McRUM (59.51%, std = 4.02) are again comparable to the SVM results considering the standard deviations. The results from both McRUM and SVM are an improvement over the 56.5% achieved in the dataset's original analysis using PSORT.

The prediction running-times in Table

Thyroid disease dataset

The results in Table

McRUM results on thyroid training and test datasets

| | **Naïve Train Acc** | **Naïve Test Acc** | **GBT Train Acc** | **GBT Test Acc** |
| --- | --- | --- | --- | --- |
| AP | 98.44 | 97.29 | 97.93 | 97.11 |
| OVR | 95.47 | 95.22 | 95.52 | 95.04 |
| Dense | 92.47 | 92.71 | 94.38 | 93.90 |
| Sparse | 24.68 | 25.50 | 97.11 | 96.18 |

Train/Test Acc is the accuracy on the training/test dataset. Note that an explicit partitioning of the data into training and test sets was provided; therefore, a cross-validation experiment was not performed and, as a result, no information on standard deviation is available. (AP: all-pairs, OVR: one-versus-rest, Dense: random dense, Sparse: random sparse)

As shown in Table

Landsat satellite (statlog) dataset

Table

McRUM results on satellite image training and test datasets

| | **Naïve Train Acc** | **Naïve Test Acc** | **GBT Train Acc** | **GBT Test Acc** |
| --- | --- | --- | --- | --- |
| AP | 89.85 | 88.05 | 89.76 | 87.70 |
| OVR | 89.56 | 87.65 | 89.58 | 87.50 |
| Dense | 10.60 | 11.85 | 80.56 | 77.95 |
| Sparse | 23.40 | 23.50 | 85.73 | 84.55 |

Train/Test Acc is the accuracy on the training/test dataset. Note that an explicit partitioning of the data into training and test sets was provided; therefore, a cross-validation experiment was not performed and, as a result, no information on standard deviation is available. (AP: all-pairs, OVR: one-versus-rest, Dense: random dense, Sparse: random sparse)

Table

Small ncRNA experiments

To validate the McRUM on a larger scale problem and to explore its use for the task of NGS data analysis, we investigated the classification of mature miRNAs and piRNAs from other ncRNAs. This is a problem of interest in the analysis of small RNA sequencing (RNA-seq) data. Further details of the dataset and sequence features used by the McRUM are given in the Methods section. For this experiment, two McRUM models are used for the AP and OVR settings using the Naïve decoding algorithm, and their performance is illustrated relative to the Gaussian and linear multiclass SVMs.

Cross-validation experiments

Figures

Figure

In contrast to miRNA, Figure

Finally, Figure

3-fold cross-validation results for non-miRNA and non-piRNA being the positive class

**3-fold cross-validation results for non-miRNA and non-piRNA being the positive class**. The ROC curves are generated from observed FPR and TPR under varying posterior probability thresholds from 0.30 to 0.99 in increments of 0.01 for the two McRUM models (AP and OVR settings) with the Naïve decoding algorithm, and for the Gaussian and linear SVMs with all-pairs decomposition.

In Additional file

**The fraction of unclassified sequences for cross-validation experiment**. It is a figure in tif format named 'MenorBaekPoisson-Figure S1.tif' showing the fraction of the validation set left unclassified for the AP and OVR McRUMs and the Gaussian and linear SVMs at different posterior probability threshold values.


Independent test experiments

Further evaluation of the McRUM and the multiclass SVMs on a larger, independent dataset was also conducted and the ROC curves are given in Figures

Evaluation results on independent test set for miRNA being the positive class

**Evaluation results on independent test set for miRNA being the positive class**. The ROC curves are generated from observed FPR and TPR under varying posterior probability thresholds from 0.30 to 0.99 in increments of 0.01 for the two McRUM models (AP and OVR settings) with the Naïve decoding algorithm, and for the Gaussian and linear SVMs with all-pairs decomposition.

Evaluation results on independent test set for piRNA being the positive class

**Evaluation results on independent test set for piRNA being the positive class**. The ROC curves are generated from observed FPR and TPR under varying posterior probability thresholds from 0.30 to 0.99 in increments of 0.01 for the two McRUM models (AP and OVR settings) with the Naïve decoding algorithm, and for the Gaussian and linear SVMs with all-pairs decomposition.

Evaluation results on independent test set for non-miRNA and non-piRNA being the positive class

**Evaluation results on independent test set for non-miRNA and non-piRNA being the positive class**. The ROC curves are generated from observed FPR and TPR under varying posterior probability thresholds from 0.30 to 0.99 in increments of 0.01 for the two McRUM models (AP and OVR settings) with the Naïve decoding algorithm, and for the Gaussian and linear SVMs with all-pairs decomposition.

In Additional file

**The fraction of unclassified sequences for independent test experiment**. It is a figure in tif format named 'MenorBaekPoisson-Figure S2.tif' showing the fraction of the test set left unclassified for the AP and OVR McRUMs and the Gaussian and linear SVMs at different posterior probability threshold values.


Recently, a Fisher Linear Discriminant (FLD) based classifier called piRNApredictor has been proposed for binary piRNA classification by Zhang et al.

The ROC curves generated from observed results of the FLD-based classifiers are presented in Figures

piRNA prediction results

**piRNA prediction results**. The ROC curves for FLD-based classifiers are generated from observed prediction results on our test dataset with the threshold varying from -0.2 to 0.46 in increments of 0.001. (-0.2 and 0.46 are the lower and upper bounds of the observed predicted values of FLD-based classifiers.) FLD_1 classifier was trained with the dataset used in

piRNA prediction results

**piRNA prediction results**. The ROC curves in Figure 7 are zoomed in to match the FPR range shown in Figure 5. The ROC curves for FLD-based classifiers are generated from observed prediction results on our test dataset with the threshold varying from -0.2 to 0.46 in increments of 0.001. (-0.2 and 0.46 are the lower and upper bounds of the observed predicted values of FLD-based classifiers.) FLD_1 classifier was trained with the dataset used in

Note that about 99% of the sequences in the positive training dataset for FLD_1 are from NONCODE 2.0's collection of piRNAs. Our positive test dataset is gathered from a later version of the NONCODE database and, as a result, 98.57% of the sequences in our positive test dataset are already included in the positive training dataset used for FLD_1. Therefore the prediction results may be biased in favor of FLD_1. It may then seem contradictory that FLD_2 shows better performance than FLD_1, given that the training set used for FLD_2 is independent of the test set. This may be because FLD_1 is not specifically trained on ncRNAs shorter than 25 nt: the training dataset for FLD_1 contains about 4.67% ncRNAs shorter than 25 nt, while our training dataset used for FLD_2 contains 66.41% such sequences. In the test dataset, 55.93% of the sequences are shorter than 25 nt, and correct prediction on these can be hard for FLD_1.

Conclusions

In this study, the binary CRUM model is generalized to the multiclass setting via the ECOC framework. The probabilistic nature of the binary CRUM is preserved using either the GBT algorithm or the proposed linear-time decoding algorithm. The proposed linear-time algorithm allows for efficient application to large-scale prediction settings, where the GBT algorithm's complexity is prohibitive, while still maintaining comparable predictive performance under certain decompositions of the given multiclass problem, as evidenced by the benchmark experiments. The applicability of the McRUM to larger-scale problems is demonstrated by an analysis of small ncRNA sequences. The results demonstrate that McRUM can be an advantageous solution to multiclass problems, especially when applied to large datasets.

The preliminary results on small ncRNA classification presented in this paper demonstrate that the McRUM has potential in addressing the problem of classifying small ncRNAs. In this study, we restricted the length of the other ncRNA fragments to a maximum of 20 nt, but we plan to conduct further experiments with various fragment lengths. We also plan to include short byproducts of small RNA biogenesis, such as miRNA*, in the class of other ncRNAs. In the future, we will also extend the current study by including other classes of small ncRNAs and optimizing the use of the McRUM for large-scale datasets such as those generated by NGS projects. Features other than simple k-mers will be considered to improve the predictive performance, especially for classifying mature miRNAs. Finally, the interesting preliminary results obtained by the multiclass Gaussian SVM on the problem of small ncRNA classification show that it could be an advantageous alternative to McRUM on smaller datasets, and thus we intend to develop both classifiers in tandem for further experiments. The resulting small ncRNA classifiers will be integrated into a combined prediction tool offering both the multiclass SVM and McRUM options, providing more alternatives to users.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MM implemented the McRUM and performed the experiments. All authors participated in the design of the study, development of CRUM and the data analyses. KB and GP supervised and coordinated the whole research work. All authors have read, revised and approved the final manuscript.

Acknowledgements

This work is supported in part by NIH Grants from the National Institute of General Medical Sciences (P20GM103516 and P20GM103466). The paper's contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.

Declarations

The publication costs for this article were funded by grant number P20GM103516 from the National Institute of General Medical Sciences, of the National Institutes of Health.

This article has been published as part of