
In this paper, sixty-eight research articles published between 2000 and 2017, as well as textbooks, which employed four classification algorithms: K-Nearest-Neighbor (KNN), Support Vector Machines (SVM), Random Forest (RF) and Neural Network (NN) as their main statistical tools, were reviewed. The aim was to examine and compare these nonparametric classification methods on the following attributes: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy. The performance, strengths and shortcomings of each algorithm were examined, and a conclusion was reached on which one performs best. It was evident from the literature reviewed that RF is too sensitive to small changes in the training dataset, is occasionally unstable and tends to overfit the model. KNN is easy to implement and understand but has the major drawback of becoming significantly slower as the size of the data in use grows, and the ideal value of K for the KNN classifier is difficult to set. SVM and RF are insensitive to noise and to overtraining, which shows their ability to deal with unbalanced data. Larger input datasets lengthen classification times for NN and KNN more than for SVM and RF. Among these nonparametric classification methods, NN has the potential to become a more widely used classification algorithm; however, because of its time-consuming parameter-tuning procedure, its high computational complexity, the numerous types of NN architectures to choose from and the high number of algorithms used for training, most researchers recommend SVM and RF as easier and more widely used methods which repeatedly achieve results with high accuracy and are often faster to implement.

In the last few decades, a large number of methods for classification have been developed [

Algorithms based on decision trees (DT) are easy to apply, as fewer parameters need to be estimated; hence, they have a high degree of automation [

Several studies demonstrate that machine learning algorithms (MLAs) are more accurate than statistical techniques such as discriminant analysis or logistic regression, especially when the feature space is complex or the input datasets are expected to have different statistical distributions [

The decision of which algorithm to use will depend on a number of factors, such as the number of examples in the training set, the dimensionality of the feature space, whether there are correlated features, and whether overfitting is a problem [

To avoid human-introduced biases, Raczko and Zagajewski [

The performance of the various classification methods, however, still depends greatly on the general characteristics of the data to be classified [

In this review paper, the performances, strengths and shortcomings of the KNN, SVM, RF and NN classifiers are examined and compared. Answers to the following questions are sought. What are the strengths and weaknesses of these algorithms on a set of classification problems? Which one performs better, and under what conditions does one classifier perform better than the others? The four nonparametric classification methods were therefore evaluated on the following: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy.

Support Vector Machines (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. The SVM algorithm developed by Cortes and Vapnik [

The SVM algorithm is often reported to achieve better results than other classifiers [

The SVM is an extension of the support vector classifier and results from enlarging the feature space in a specific way, using kernels [

The linear support vector classifier is represented as shown in Equation (1):

f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle (1)

where \alpha_1, \ldots, \alpha_n and \beta_0 are parameters estimated from the \binom{n}{2} inner products \langle x_i, x_{i'} \rangle between all pairs of training observations. Replacing the inner product with K(x_i, x_{i'}), where K is a function called the kernel, generalizes the classifier. The linear kernel is represented as shown in Equation (2):

K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j} (2)

A polynomial kernel of degree d (where d is a positive integer) can be represented as shown in Equation (3):

K(x_i, x_{i'}) = \left( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \right)^d (3)

The result of combining a non-linear kernel with the support vector classifier is called the SVM (Equation (3)).
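As an illustration, the kernels in Equations (2) and (3) are plain inner products of feature vectors; the following minimal sketch computes both (the function names and example vectors are our own, not from the reviewed literature):

```python
import numpy as np

def linear_kernel(xi, xj):
    # Equation (2): sum over the p features of x_ij * x_i'j
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, d=2):
    # Equation (3): (1 + <x_i, x_i'>)^d for a positive degree d
    return float((1.0 + np.dot(xi, xj)) ** d)

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])
print(linear_kernel(xi, xj))         # 1*3 + 2*0.5 = 4.0
print(polynomial_kernel(xi, xj, 2))  # (1 + 4)^2 = 25.0
```

With d = 1 and no constant term, the polynomial kernel reduces to the linear kernel, which is why the SVM contains the support vector classifier as a special case.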

The SVM classifier, originally designed for binary classification, is a kernel-based supervised learning algorithm that can be extended to separate data into two or more classes; it is not recommended when there are a large number of training examples [

The performance of the SVM classifier relies on the choice of the regularization parameter C which is also known as box constraint and the kernel parameter which is also known as the scaling factor. Together they are known as the hyperplane parameter [

SVMs have been successfully applied in many diverse fields including text and hypertext categorization [

In pattern recognition, the KNN algorithm is an instance-based learning method used to classify objects based on their closest training examples in the feature space. An object is classified by a majority vote of its neighbors; that is, the object is assigned to the class that is most common among its k nearest neighbors (

The KNN algorithm is implemented using the Euclidean distance metric to locate the nearest neighbors [

d(x, y) = \sqrt{ \sum_{i=1}^{N} (x_i - y_i)^2 } (4)

where N is the number of features, x = \{ x_1, x_2, x_3, \ldots, x_N \} and y = \{ y_1, y_2, y_3, \ldots, y_N \}.

The KNN classifier is one of many approaches that attempt to estimate the conditional distribution of Y given X and then classify a given observation to the class with the highest estimated probability. Given a positive integer K and a test observation x_0, the KNN classifier first identifies the K points in the training data that are closest to x_0, represented by N_0. It then estimates the conditional probability for class j as the fraction of points in N_0 whose response values equal j, as indicated in Equation (5):

\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j) (5)

where I(y_i = j) is an indicator variable that equals 1 if y_i = j and 0 otherwise.
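Equations (4) and (5) together give the whole KNN prediction rule; the following minimal sketch implements it in Python (the function name and toy dataset are illustrative only):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=3):
    # Equation (4): Euclidean distance from x0 to every training point
    dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    # N0: indices of the K training points closest to x0
    n0 = np.argsort(dists)[:k]
    # Equation (5): predict the class with the largest fraction of the K votes
    votes = Counter(y_train[i] for i in n0)
    return votes.most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.2])))  # 0 (all 3 neighbours are class 0)
print(knn_predict(X, y, np.array([5.5, 5.5])))  # 1
```

Note that all the work happens at prediction time: there is no training phase, which is exactly the trade-off discussed below.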

KNN is robust to noisy training data and is effective for large numbers of training examples. However, for this algorithm, the value of the parameter k (the number of nearest neighbors) and the type of distance to be used have to be determined. The computation time can be lengthy, as the distance from each query instance to all training samples must be computed, and the algorithm becomes significantly slower as the number of examples and/or predictors (independent variables) increases [

KNN’s main disadvantage of becoming significantly slower as the volume of data increases makes it an impractical choice in environments where predictions need to be made rapidly [

To select the K that is right for a dataset, the KNN algorithm is run several times with different values of K, and the K that minimizes the number of errors is chosen, while maintaining the algorithm's ability to make accurate predictions on data it has not previously seen [

As the value of K decreases to 1, the predictions become less stable. Inversely, as the value of K is increased, the predictions become more stable due to majority voting/averaging, and thus, more likely to make more accurate predictions (up to a certain point). Eventually, an increasing number of errors is witnessed. It is at this point that one recognizes that the appropriate value of K has been exceeded. The value of K is usually an odd number to have a tiebreaker in cases where a majority vote among labels is required, for example, picking the mode in a classification problem [
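The trial-and-error selection of K described above can be sketched as a small search over odd candidate values, keeping the K with the lowest error on a held-out validation set (all names and the toy data here are illustrative):

```python
import numpy as np
from collections import Counter

def knn_error(X_tr, y_tr, X_val, y_val, k):
    # fraction of validation points the K-nearest-neighbor vote gets wrong
    errors = 0
    for x0, y0 in zip(X_val, y_val):
        d = np.sqrt(((X_tr - x0) ** 2).sum(axis=1))
        neighbours = y_tr[np.argsort(d)[:k]].tolist()
        pred = Counter(neighbours).most_common(1)[0][0]
        errors += int(pred != y0)
    return errors / len(y_val)

def select_k(X_tr, y_tr, X_val, y_val, candidates=(1, 3, 5, 7)):
    # odd candidates act as a tiebreaker in the majority vote
    return min(candidates, key=lambda k: knn_error(X_tr, y_tr, X_val, y_val, k))

X_tr = np.array([[0.0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [6, 6]])
y_tr = np.array([0, 0, 0, 1, 1, 1, 1])
X_val = np.array([[0.5, 0.5], [5.5, 5.5]])
y_val = np.array([0, 1])
print(select_k(X_tr, y_tr, X_val, y_val))  # smallest K with the lowest error
```

On real data the validation set should be large enough for the error estimates to be stable, otherwise the chosen K tracks noise rather than the trend described above.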

Recently there has been a lot of interest in ensemble learning, that is, methods that generate many classifiers and aggregate their results. Two well-known methods are boosting [

An RF classifier consists of a number of trees, with each tree grown using some form of randomization (

The number of trees necessary for good performance grows with the number of predictors. The best way to determine how many trees are necessary is to compare the predictions made by the full forest to those made by a subset of the forest; when the subset works as well as the full forest, there are enough trees. For selecting m_{try}, Breiman suggested trying values around the default, as a different m_{try} may give better performance. A large number of trees is necessary to obtain stable estimates of variable importance and proximity. Since the algorithm falls into the “embarrassingly parallel” category, one can run several random forests on different machines and then aggregate their vote components to get the final result [
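The “embarrassingly parallel” aggregation mentioned above amounts to pooling the per-tree class votes, no matter which machine grew each tree; a minimal sketch (the vote lists are invented for illustration):

```python
from collections import Counter

def aggregate_forest_votes(*vote_blocks):
    # each block holds the class votes cast by the trees grown on one machine;
    # pooling the blocks is equivalent to growing one larger forest
    combined = Counter()
    for votes in vote_blocks:
        combined.update(votes)
    # the most frequent class across all trees wins the vote
    return combined.most_common(1)[0][0]

machine_a = ["cat", "cat", "dog"]  # votes from 3 trees grown on machine A
machine_b = ["dog", "cat"]         # votes from 2 trees grown on machine B
print(aggregate_forest_votes(machine_a, machine_b))  # cat (3 of 5 votes)
```

Because each tree is grown independently from its own bootstrap sample, no communication between machines is needed until this final vote count.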

The RF classifier adds an additional layer of randomness to bagging [

instead of splitting each node using the best split among all variables, each node is split using the best among a subset of predictors randomly chosen at that node [

RF is essentially a set of DTs combined, where each tree votes on the class assigned to a given sample, with the most frequent answer winning the vote [

The following types of scientific and engineering data are amenable to RF: DNA and micro-array data; spectral data such as NMR chemical data; molecular structure prediction; quality assessment of manuscripts published in a particular journal; and finding clusters of patients based on, for example, tissue marker data or symptoms of a particular disease.

A NN classifier can be described as a parallel computing system consisting of an extremely large number of simple processors with interconnections [

McCulloch and Pitts [

the premise that if two neurons are active at the same time, the strength of the connection between them should be increased [

With classification, accuracy alone is sometimes insufficient to assess the performance of a model. Consider a highly imbalanced dataset: for example, trying to determine whether a transaction is fraudulent when only 0.5% of the dataset contains fraudulent transactions. One could then predict that none of the transactions is fraudulent and obtain a 99.5% accuracy score, which is very misleading. So the sensitivity and specificity are usually used. In the fraud-detection problem, the sensitivity is the proportion of fraudulent transactions identified as fraudulent, and the specificity is the proportion of non-fraudulent transactions identified as non-fraudulent.

Therefore, in an ideal situation, both high sensitivity and high specificity are required, although the balance might change depending on the context. For example, a bank might prioritize sensitivity over specificity to make sure it identifies fraudulent transactions. The receiver operating characteristic (ROC) curve is a good way to display the two types of error described above. The overall performance of a classifier is given by the area under the ROC curve (AUC); ideally, the curve should hug the upper left corner of the graph and have an area close to 1.
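The fraud example above can be made concrete; the following minimal sketch computes accuracy, sensitivity and specificity for the "predict nothing is fraud" baseline (the counts are invented to mirror the 0.5% scenario in the text):

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    # confusion-matrix counts for the chosen positive class
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# 1000 transactions, 5 fraudulent (0.5%); a classifier that flags nothing:
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                 # 0.995 -- looks excellent
print(sensitivity_specificity(y_true, y_pred))  # (0.0, 1.0) -- catches no fraud
```

The 99.5% accuracy hides a sensitivity of zero, which is exactly why the ROC curve plots sensitivity against 1 - specificity instead of relying on accuracy.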

KNN classifies data based on a distance metric, whereas SVM needs a proper training phase. Due to the optimal nature of SVM, the separated data are guaranteed to be optimally separated [

KNN is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data in use increases [

Different factors affect the capacity of NN to generalize, that is, to predict new data from the learning carried out with training data. The intrinsic factors to network design include the number of neurons and network architecture [

Unlike most methods based on machine learning, RF only needs two parameters to be set for generating a prediction model, that is, the number of regression trees and the number of evidential features (m) which are used in each node to make regression trees grow [
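These two parameters map directly onto n_estimators (the number of trees) and max_features (m, the features tried at each node) in scikit-learn's RandomForestClassifier; a minimal sketch, assuming scikit-learn is installed (the toy dataset is invented):

```python
from sklearn.ensemble import RandomForestClassifier

# toy two-class data: the class is decided by the first feature
X = [[0, 1], [0, 2], [1, 1], [1, 0], [9, 1], [9, 2], [8, 1], [8, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# the two RF parameters discussed in the text:
#   n_estimators -> number of trees, max_features -> m features per split
clf = RandomForestClassifier(n_estimators=50, max_features=1, random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 1], [9, 0]]))  # expect classes 0 and 1
```

All other settings can stay at their defaults, which is the low-tuning burden the paragraph above credits RF with.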

Adding more data would lengthen NN training times to unacceptable levels, making it highly impractical to work with them. Larger input datasets lengthen classification times for NN more than for SVM and RF [

The performance characteristics and attributes of the four types of non-parametric classification algorithms are summarized in

| Algorithms | Positive attributes | Negative attributes |
| --- | --- | --- |
| SVM | 1) Kernel functions such as the Gaussian radial basis function or polynomials aid with non-linearly separable classes. 2) Works well as a linear classifier. 3) Performs well on datasets that have many attributes. 4) Guarantees optimal separation of the data. 5) Works well with heterogeneously distributed points. | 1) Computationally intensive when dealing with unlabeled datasets. 2) Limited in speed and size during both the training and testing phases. 3) Limited in speed with regard to the selection of the kernel function parameters. |
| KNN | 1) Easy to implement and understand. 2) Free of a training phase: although classification is computationally expensive, since the Euclidean distance from the input feature to every feature in the database must be calculated, no training is required. 3) Works well with heterogeneously distributed points. | 1) Scales badly when there are millions of labeled examples in the dataset. 2) Finding the K nearest neighbors takes a long time on very large datasets. |
| NN | 1) With a higher number of units in the hidden layer, the network's capacity to represent the training data patterns becomes greater. | 1) A hidden layer with too many units produces a loss in the network's generalization power. 2) More data lengthens NN training times to unacceptable levels, making large datasets highly impractical to work with. 3) Has a time-consuming parameter-tuning procedure. |
| RF | 1) The generalization error always converges, even as the number of trees increases. 2) It is not easy to overfit to one particular feature, although overfitting to the training data remains a problem. 3) Often achieves results faster. | 1) Larger input datasets lengthen classification times. |

The assessed algorithms differ in how difficult they are to train. DT-based algorithms such as RF involve the least difficulty in training; this applies to both simple regression trees and ensembles of trees. When data are very scarce, RF shows better performance than NN and SVM, which become more complex. SVMs are based on different kernel types, and the combination of parameters to be optimized differs accordingly. However, it must be emphasized that no broad generalization can be made about the superiority of any method for all types of problems, as the performance of each method may vary across datasets.

RF is too sensitive to small changes in the training dataset, is occasionally unstable and tends to overfit the model. KNN is easy to implement and understand but has the major drawback of becoming significantly slower as the size of the data in use grows, and the ideal value of K for the KNN classifier is difficult to set. The NN method involves a high level of computational complexity, which makes it less popular in classification applications. SVM and RF are insensitive to noise and to overtraining, which shows their ability to deal with unbalanced data. Among the nonparametric methods, SVM and RF are becoming increasingly popular in image classification research and applications. Larger input datasets lengthen classification times for NN and KNN more than for SVM and RF. NN has the potential to become a more widely used classification algorithm, but because of its time-consuming parameter-tuning procedure, the numerous types of neural network architectures to choose from and the high number of algorithms used for training NN, most researchers recommend SVM or RF as easier methods which repeatedly achieve results with high accuracy and are often faster.

The idea was developed by EYB and JO. Literature was reviewed by all authors. All authors contributed to manuscript writing and approved the final manuscript.

Acknowledgements
We thank the anonymous reviewers whose comments made this manuscript more robust.

Funding
This study attracted no funding.

Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.

Boateng, E.Y., Otoo, J. and Abaye, D.A. (2020) Basic Tenets of Classification Algorithms K- Nearest-Neighbor, Support Vector Machine, Random Forest and Neural Network: A Review. Journal of Data Analysis and Information Processing, 8, 341-357. https://doi.org/10.4236/jdaip.2020.84020