Basic Tenets of Classification Algorithms K-Nearest-Neighbor, Support Vector Machine, Random Forest and Neural Network: A Review

In this paper, sixty-eight research articles published between 2000 and 2017, as well as textbooks, which employed four classification algorithms, K-Nearest-Neighbor (KNN), Support Vector Machines (SVM), Random Forest (RF) and Neural Network (NN), as the main statistical tools were reviewed. The aim was to examine and compare these nonparametric classification methods on the following attributes: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy. The performances, strengths and shortcomings of each of the algorithms were examined, and finally, a conclusion was arrived at on which one has higher performance. It was evident from the literature reviewed that RF is too sensitive to small changes in the training dataset, is occasionally unstable and tends to overfit the model. KNN is easy to implement and understand but has the major drawback of becoming significantly slower as the size of the data in use grows, while the ideal value of K for the KNN classifier is difficult to set. SVM and RF are insensitive to noise or overtraining, which shows their ability in dealing with unbalanced data. Larger input datasets will lengthen classification times for NN and KNN more than for SVM and RF. Among these nonparametric classification methods, NN has the potential to become a more widely used classification algorithm, but because of its time-consuming parameter tuning procedure, the high level of complexity in computational processing, the numerous types of NN architectures to choose from and the high number of algorithms used for training, most researchers recommend SVM and RF as easier and widely used methods which repeatedly achieve results with high accuracies and are often faster to implement.

How to cite this paper: Boateng, E.Y., Otoo, J. and Abaye, D.A. (2020) Basic Tenets of Classification Algorithms K-Nearest-Neighbor, Support Vector Machine, Random Forest and Neural Network: A Review. Journal of Data Analysis and Information Processing, 8, 341-357. https://doi.org/10.4236/jdaip.2020.84020

Received: July 13, 2020
Accepted: November 20, 2020
Published: November 23, 2020

Copyright © 2020 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/

Algorithms based on decision trees (DT) are easy to apply because fewer parameters need to be estimated; hence, they have a high degree of automation [12]. However, this comparative advantage of DT over ANN can be offset by a tendency to overfit the data [13]. For these reasons, both ANN and DT have, in recent years, been replaced by more advanced, simpler to train machine learning algorithms (MLAs). During the past decade, the family of kernel methods such as SVM [14] [15] and ensembles of trees such as RF [16] [17] have emerged as very promising methodologies for classification purposes.
Several studies demonstrate that MLAs are more accurate than statistical techniques such as discriminant analysis or logistic regression, especially when the feature space is complex or the input datasets are expected to have different statistical distributions [4] [9]. As computational power has increased, MLAs have gained greater attention and the quality of pattern recognition systems has increased correspondingly [18]. Thus, in most classification studies, RF, KNN and SVM are reported as the foremost classifiers producing high accuracies [19].
The choice of algorithm depends on a number of factors, such as the number of examples in the training set, the dimensionality of the feature space, whether there are correlated features and whether overfitting is a problem [20]. Once these concerns have been addressed, the algorithm to use is then decided. Using methods of statistical physics, the generalization performance of SVMs, which were introduced as a general alternative to neural networks (NN), was investigated [21]. It was evident from the study that for nonlinear classification rules, the generalization error saturates on a plateau when the number of examples is too small to properly estimate the coefficients of the nonlinear part. When trained on simple rules, it was found that SVMs overfit only weakly [21]. The performance of SVMs is strongly enhanced when the distribution of the inputs has a gap in feature space.
To avoid human-introduced biases, Raczko and Zagajewski [22] used a 0.632 bootstrap procedure to evaluate three nonparametric classification algorithms (SVM, RF and ANN) in an attempt to classify the five most common tree species. The classification results indicated that ANN achieved the highest median overall classification accuracy (77%), followed by SVM with 68% and RF with 62%. Analysis of the stability of results concluded that RF and SVM had the lowest variance of overall accuracy and κ (kappa) coefficient (12 percentage points), while ANN had 15 percentage points variance in results. A study showed that there exist some data distributions where the maximal unpruned trees used in RF do not achieve as good a performance as trees with a smaller number of splits and/or smaller node size [23]. This refined the earlier report that RF does not overfit as the number of trees grows [10]. Thus, application of RF in general requires careful tuning of the relevant classifier parameters [24]. Bosch et al. [25] demonstrated that using random forests/ferns with an appropriate node test reduces training and testing costs significantly compared with a multi-way SVM while giving comparable performance.
The performance of the various classification methods, however, still depends greatly on the general characteristics of the data to be classified [26]. The exact relationship between the data to be classified and the performance of various classification methods still remains to be determined. Thus far, there has been no classification method that works best on every problem [26]. There have been various problems associated with classification methods in current use [20]. Therefore, to determine the best classification method for a certain dataset, a trial and error approach is used to decide on the best performance.
In this review paper, the performances, strengths and shortcomings of the KNN, SVM, RF and NN classifiers are examined and compared. Answers to the following questions are sought. What are the strengths and weaknesses of these algorithms on a set of classification problems? Which one performs better, and under what conditions does one classifier perform better than the others? The four nonparametric classification methods were therefore evaluated on the following: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy.

Support Vector Machines (SVM)
Support Vector Machines (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. The SVM algorithm developed by Cortes and Vapnik [8] tries to find the optimal hyperplane in n-dimensional classification space with the highest margin between classes (Figure 1).
The SVM algorithm is often reported to achieve better results than other classifiers [9], although it has been indicated that the main reason to use an SVM is that the problem might not be linearly separable [27]. In that case, an SVM with a non-linear kernel such as the Radial Basis Function (RBF) would be suitable. Another related reason to use SVMs is that one is working in a high-dimensional space. For example, SVMs have been reported to work better for text classification, although training requires a lot of time [28]. The SVM is an extension of the support vector classifier, obtained by enlarging the feature space in a specific way using kernels [29].
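The linear support vector classifier can be represented as shown in Equation (1):

$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle \tag{1}$$

where $x_1, \ldots, x_n$ are the training observations, $\langle x, x_i \rangle$ is the inner product, and $\beta_0$ and $\alpha_i$ are the parameters of the classifier. A polynomial kernel of degree d (where d is a positive integer) can be represented as shown in Equation (2):

$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^{d} \tag{2}$$

where p is the number of features. The classifier that results from combining a non-linear kernel with the support vector classifier is called the SVM (Equation (3)):

$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i) \tag{3}$$

where S is the set of indices of the support vectors.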
The SVM classifier, which was designed primarily for binary classification, is a kernel-based supervised learning algorithm that classifies data into two or more classes; it is not recommended when there is a large number of training examples [8]. A kernel function maps the training set so that it more closely resembles a linearly separable data set. The purpose of the mapping is to increase the dimensionality of the data set, and the kernel function does this efficiently. Some of the commonly used kernel functions are the linear, RBF, quadratic, Multilayer Perceptron and polynomial kernels [30]. The linear kernel function is also less prone to overfitting than the RBF kernel function [31].
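The kernels named above can be illustrated with a minimal sketch in plain Python. The function names, the default values of d and gamma, and the hand-picked coefficients in the usage example are assumptions made for illustration only; they are not fitted by any SVM training procedure and are not taken from the reviewed studies.

```python
import math

def linear_kernel(x, z):
    """Inner product <x, z> of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d=2):
    """Polynomial kernel (1 + <x, z>)^d for a positive integer degree d."""
    return (1.0 + linear_kernel(x, z)) ** d

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma is an assumed scaling factor."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, alphas, beta0, kernel):
    """Kernelized decision function: beta0 + sum_i alpha_i * K(x, x_i)."""
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))

# Hypothetical, hand-picked coefficients on an XOR-like layout (not trained).
svs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
alphas = [-1.0, -1.0, 1.0, 1.0]

def rbf2(x, z):
    return rbf_kernel(x, z, gamma=2.0)

print(decision_value((0.0, 1.0), svs, alphas, 0.0, rbf2) > 0)  # True
print(decision_value((0.0, 0.0), svs, alphas, 0.0, rbf2) > 0)  # False
```

The XOR-like layout cannot be split by a single straight line, yet the sign of the RBF decision value separates the two groups, which is the situation in which a non-linear kernel is suggested in the text.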
The performance of the SVM classifier relies on the choice of the regularization parameter C, which is also known as the box constraint, and the kernel parameter, which is also known as the scaling factor. Together they are known as the hyperplane parameters [32]. During the training phase, SVM builds a model, maps the decision boundary for each class and specifies the hyperplane that separates the different classes. Increasing the distance between the classes by increasing the hyperplane margin helps increase the classification accuracy. SVMs can also be used to effectively perform non-linear classification [33].

K-Nearest Neighbor (KNN)
In pattern recognition, the KNN algorithm is an instance-based learning method used to classify objects based on their closest training examples in the feature space. An object is classified by a majority vote of its neighbors, that is, the object is assigned to the class that is most common amongst its k-nearest neighbors (Figure 2), where k is a positive integer [44]. In the KNN algorithm, the classification of a new test feature vector is determined by the classes of its k-nearest neighbors.
The KNN algorithm is implemented using the Euclidean distance metric to locate the nearest neighbors [45]. The Euclidean distance $d(x, y)$ between two points x and y is calculated using Equation (4):

$$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \tag{4}$$

where N is the number of features, such that $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$. The KNN classifier is one of the many approaches that attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with the highest estimated probability. Given a positive integer K and a test observation $x_0$, the KNN classifier first identifies the K points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$. It then estimates the conditional probability for class j as the fraction of points in $\mathcal{N}_0$ whose response values equal j, as indicated in Equation (5):

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j) \tag{5}$$

where $I(\cdot)$ is the indicator function. KNN is robust to noisy training data and is effective for large numbers of training examples. For this algorithm, however, the value of the parameter k (number of nearest neighbors) and the type of distance to be used have to be determined.
The computation time can be lengthy, as one needs to compute the distance of each query instance to all training samples, and it gets significantly slower as the number of examples and/or predictors (independent variables) increases [24].
Nevertheless, there is no need to build a model, tune several parameters or make additional assumptions. KNN is a simple, versatile, easy to implement supervised MLA that can be used to solve classification, regression and search problems. The algorithm assumes that similar items exist in close proximity; in other words, 'birds of a feather flock together'. The KNN algorithm hinges on this assumption being true enough for it to be useful [6].
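The procedure described above (compute the distance from the query to every training sample, keep the K closest, and use the class fractions among them) can be sketched in a few lines of plain Python. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_class_probabilities(train_X, train_y, x0, k):
    """Fraction of the k nearest training points that belong to each class."""
    # Every query needs a distance to every training point, so the cost
    # grows with the size of the training set.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], x0))
    counts = Counter(train_y[i] for i in order[:k])
    return {label: n / k for label, n in counts.items()}

def knn_predict(train_X, train_y, x0, k):
    """Assign x0 to the class with the highest estimated probability."""
    probs = knn_class_probabilities(train_X, train_y, x0, k)
    return max(probs, key=probs.get)

# Toy data (hypothetical): two well separated groups.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_class_probabilities(train_X, train_y, (3.0, 3.0), 5))  # {'a': 0.6, 'b': 0.4}
print(knn_predict(train_X, train_y, (3.0, 3.0), 5))              # a
```

No model is built in advance: all the work happens at query time, which is why the classification time grows with the number of training examples.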
KNN's main disadvantage of becoming significantly slower as the volume of data increases makes it an impractical choice in environments where predictions need to be made rapidly [46]. Moreover, there are faster algorithms that can produce more accurate classification and regression results. However, provided there are sufficient computing resources to speedily handle the data for making predictions, KNN can still be useful in solving problems that have solutions that depend on identifying similar objects [46].
To select the K that is right for a dataset, the KNN algorithm is run several times with different values of K, and the K that gives the fewest errors is chosen, while maintaining the ability of the algorithm to make accurate predictions on data it has not seen before [47]. There are other ways of calculating distance, and one way might be preferable depending on the problem that is being solved. However, the straight-line distance, also called the Euclidean distance, is a popular and familiar choice [48].
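One way to carry out this search is sketched below. Leave-one-out validation is used here as an assumed stand-in for "data it has not seen before"; the reviewed sources do not prescribe a particular validation scheme, and the function names, the toy data and the rule of breaking ties in favour of the smaller K are likewise illustrative assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x0, k):
    """Majority vote among the k training points nearest to x0."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x0))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def loo_error(X, y, k):
    """Leave-one-out error rate: each point is predicted from all the others."""
    errors = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)

def select_k(X, y, candidates):
    """Return the candidate K with the lowest error (smaller K wins ties)."""
    errors = {k: loo_error(X, y, k) for k in candidates}
    best_k = min(candidates, key=lambda k: (errors[k], k))
    return best_k, errors

# Toy data (hypothetical): two clusters plus one mislabeled point inside cluster "a".
X = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5), (1.6, 1.4),
     (6.0, 6.0), (6.0, 7.0), (7.0, 6.0), (7.0, 7.0), (6.5, 6.5)]
y = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]

best_k, errors = select_k(X, y, [1, 3, 5])
print(best_k)  # 3
```

With K = 1 the single mislabeled point misleads its two closest neighbors as well as being misclassified itself; with K = 3 the vote of the surrounding points outweighs it and only the mislabeled point is counted as an error.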
As the value of K decreases to 1, the predictions become less stable. Conversely, as the value of K is increased, the predictions become more stable due to majority voting/averaging and are thus more likely to be accurate (up to a certain point). Eventually, an increasing number of errors is witnessed. It is at this point that one recognizes that the appropriate value of K has been exceeded. The value of K is usually an odd number so as to have a tiebreaker in cases where a majority vote among labels is required, for example, picking the mode in a classification problem [49]. The KNN algorithm can be used for classification, regression, and search problems. It is useful in solving problems that have solutions that depend on identifying similar objects.

Random Forest
Recently there has been a lot of interest in ensemble learning, that is, methods that generate many classifiers and aggregate their results. Two well-known methods are boosting [50] and bagging [51] of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction [10].
An RF classifier consists of a number of trees, with each tree grown using some form of randomization (Figure 3). The leaf nodes of each tree are labeled by estimates of the posterior distribution over the image classes. Each internal node contains a test that best splits the space of data to be classified [17]. An image is classified by sending it down every tree and aggregating the reached leaf distributions. Randomness can be injected at two points during training: in sub-sampling the training data so that each tree is grown using a different subset, and in selecting the node tests [11].
The number of trees necessary for good performance grows with the number of predictors. The best way to determine how many trees are necessary is to compare predictions made by a forest to predictions made by a subset of a forest. When the subsets work as well as the full forest, it indicates there are enough trees. For selecting m_try (the number of variables tried at each split), Breiman [10] suggests trying the default, half of the default, and twice the default, and then selecting the best. If one has a very large number of variables but expects only very few to be "important", using a larger m_try may give better performance. A lot of trees are necessary to get stable estimates of variable importance and proximity. Since the algorithm falls into the "embarrassingly parallel" category, one can run several random forests on different machines and then aggregate the vote components to get the final result [52].
The RF classifier adds an additional layer of randomness to bagging [10]. In addition to constructing each tree using a different bootstrap sample of the data, RFs change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables, while in an RF, each node is split using the best among a subset of predictors randomly chosen at that node [53]. This somewhat counterintuitive strategy turns out to perform very well compared with many other classifiers, including discriminant analysis, SVMs and NNs, and is robust against overfitting [4] [10]. In addition, RF is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values [9] [54].
RF is essentially a set of DTs combined, where each tree votes on the class assigned to a given sample, with the most frequent answer winning the vote [55]. This algorithm can handle categorical features very well, and can also handle high-dimensional spaces as well as a large number of training examples [10]. RFs are quite versatile, hence their popularity and application in diverse fields. A decision tree is a set of conditions organized in a hierarchical structure. It is a predictive model in which an instance is classified by following the path of satisfied conditions from the root of the tree until reaching a leaf, which will correspond to a class label. A DT can easily be converted to a set of classification rules [16].
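The two sources of randomness described above (a bootstrap sample per tree and a random subset of predictors per node) can be sketched with a deliberately reduced forest. As a simplifying assumption, each tree here is a one-node decision stump scored by misclassification count, so the per-node feature subset is drawn once per tree; a full RF grows deeper trees and typically uses an impurity measure such as the Gini index. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, features):
    """Fit a one-split tree using the best threshold among the candidate features."""
    majority = Counter(y).most_common(1)[0][0]
    # Fallback when no split is possible: always predict the majority class.
    best = (len(y) + 1, features[0], float("inf"), majority, majority)
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between consecutive observed values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]  # (feature, threshold, left_label, right_label)

def fit_forest(X, y, n_trees=25, m_try=1, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and m_try random features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        feats = rng.sample(range(p), m_try)         # random subset of predictors
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, x):
    """Simple majority vote over the trees."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two separated groups in two features.
X = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = fit_forest(X, y)
print(predict_forest(forest, (1.5, 1.5)), predict_forest(forest, (8.5, 8.5)))
```

Because the trees are built independently of one another, the loop in `fit_forest` could be distributed across machines and the votes aggregated afterwards.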
The following types of scientific and engineering data are amenable to RF: DNA data, micro-array data, spectral data: NMR chemical data and molecular structure prediction, quality assessment of manuscripts published in a particular journal, finding clusters of patients based on, for example, tissue marker data, symptoms of a particular disease among others.

Neural Networks
An NN classifier can be described as a parallel computing system consisting of an extremely large number of simple processors with interconnections [56] [57]. One commonly used type of neural network is the multilayered feed-forward perceptron, which consists of several layers of neurons connected with each other (Figure 4). The multilayered perceptron can separate data that are not linearly separable and generally consists of three or more types of layers [58].
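As a small illustration of how a layered network of simple threshold units handles data that a single linear boundary cannot separate, the sketch below wires three units into a network that computes XOR. The weights and biases are set by hand for illustration (an assumption; nothing is learned here), and the function names are hypothetical.

```python
def step(z):
    """Threshold activation: the unit fires (1) only when its input is positive."""
    return 1 if z > 0 else 0

def neuron(inputs, weights, bias):
    """A single unit: weighted sum of the inputs plus a bias, then the threshold."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_network(x1, x2):
    """Input layer -> two hidden units -> one output unit (hand-set weights)."""
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)    # fires if at least one input is 1
    h_and = neuron((x1, x2), (1.0, 1.0), -1.5)   # fires only if both inputs are 1
    return neuron((h_or, h_and), (1.0, -1.0), -0.5)  # OR but not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 0
```

No single threshold unit can reproduce this truth table, but the hidden layer re-expresses the inputs so that the output unit faces a linearly separable problem.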
McCulloch and Pitts [59] are generally credited as the designers of the first neural network and its earliest mathematical models. Many of their ideas, such as the idea that many simple units combine to give increased computational power and the idea of a threshold, are still used today. The first learning rule for NN was developed on the premise that if two neurons were active at the same time, the strength of the connection between them should be increased [60]. Further improvements and simulations were achieved [61]. During the 1950s and 1960s, many researchers worked on the perceptron amidst great excitement; however, by the year 1969, enthusiasm for NN research had waned [62]. Interest in NN research was rekindled in the mid-1980s [63]. Because of their ability to reproduce and model nonlinear processes, NNs have found applications in a wide range of sectors: computer vision, speech recognition, machine translation, social network filtering, playing board and video games, data mining, medical diagnosis (including cancers such as lung, prostate and colorectal cancers) and quantum chemistry, among others.

Assessing the Performance of a Model
With classification, accuracy alone is sometimes not an adequate measure of a model's performance. Consider analyzing a highly imbalanced data set, for example, trying to determine whether a transaction is fraudulent when only 0.5% of the transactions in the data set are fraudulent. One could then predict that none of the transactions is fraudulent and obtain a 99.5% accuracy score, which is very misleading. So usually the sensitivity and specificity are used. In the fraud detection problem, the sensitivity is the proportion of fraudulent transactions identified as fraudulent. The specificity is the proportion of non-fraudulent transactions identified as non-fraudulent.
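The fraud example can be worked through numerically. The sketch below assumes a hypothetical data set of 1000 transactions of which 5 (0.5%) are fraudulent, and a trivial classifier that labels every transaction as non-fraudulent; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    return tp, fn, tn, fp

def accuracy(y_true, y_pred):
    """Proportion of all cases that are classified correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def sensitivity(y_true, y_pred, positive=1):
    """Proportion of actual positives (fraud) identified as positive."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn)

def specificity(y_true, y_pred, positive=1):
    """Proportion of actual negatives (non-fraud) identified as negative."""
    _, _, tn, fp = confusion_counts(y_true, y_pred, positive)
    return tn / (tn + fp)

# Hypothetical data: 5 fraudulent (1) and 995 legitimate (0) transactions.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000  # "nothing is ever fraudulent"

print(accuracy(y_true, y_pred))     # 0.995
print(sensitivity(y_true, y_pred))  # 0.0
print(specificity(y_true, y_pred))  # 1.0
```

The accuracy looks excellent, yet the sensitivity of zero shows that not a single fraudulent transaction was caught.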
Therefore, in an ideal situation, what is required are high sensitivity and high specificity, although that might change depending on the context. For example, a bank might want to prioritize a higher sensitivity over specificity to make sure it identifies fraudulent transactions. The receiver operating characteristic (ROC) curve is a good way to display the two types of error metrics described above. The overall performance of a classifier is given by the area under the ROC curve (AUC). Ideally, the curve should hug the upper left corner of the graph and have an area close to 1.
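The AUC can be computed without drawing the curve, using the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counting one half). The sketch below uses that pairwise formulation; the function name and the example scores are hypothetical.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for four transactions (1 = fraudulent).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(labels, scores))                # 0.75
print(roc_auc(labels, [0.1, 0.2, 0.7, 0.9]))  # 1.0
```

A value of 1.0 corresponds to a curve that reaches the upper left corner, while 0.5 corresponds to scores that rank the cases no better than chance.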

Attributes of the Classification Algorithms
KNN classifies data based on the distance metric, whereas SVMs need a proper training phase. Due to the optimal nature of SVM, it is guaranteed that the separated data would be optimally separated [9]. Generally, KNN is used as a multi-class classifier, whereas the standard SVM separates binary data belonging to one class or the other. Although SVMs look more computationally intensive, once the model has been trained it can be used to predict classes even when applied to new unlabeled data [52]. In KNN, however, the distance metric is calculated each time a set of new unlabeled data is introduced. Hence, in KNN the distance metric always has to be defined [16]. SVMs have two major cases, in which classes might be linearly separable or non-linearly separable [46]. When the classes are non-linearly separable, a kernel function such as the Gaussian basis function or polynomials is used. Hence, in KNN, only the K parameter has to be set and the distance metric suitable for classification selected, whereas in SVMs the regularization parameter C, and also the parameters of the kernel if the classes are not linearly separable, have to be selected [8]. A main advantage of SVM classification is that it performs well on datasets that have many attributes, even when only a few cases are available for the training process [7]. However, disadvantages of SVM classification include limitations in speed and size during both the training and testing phases of the algorithm, and the selection of the kernel function parameters [53].
KNN is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data in use increases [24]. KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression) [5]. In both classification and regression, choosing the right K for a set of data is done by trying several values of K and picking the one that works best. However, KNN is less computationally intensive and easier to implement than SVM; hence it is mostly used in the classification of multi-class data [15]. Which algorithm guarantees reliable detection in unpredictable situations depends upon the data. If the data points are heterogeneously distributed, both KNN and SVM work well [18] [64]. For homogeneous data, one might be able to classify better by using a kernel in the SVM. For most practical problems, KNN is a bad choice because it scales badly; if there are a million labelled examples, it would take a long time (linear in the number of examples) to find the K nearest neighbors [14].
Different factors affect the capacity of NN to generalize, that is, to predict new data from the learning carried out with training data. The intrinsic factors to network design include the number of neurons and network architecture [6]. The problem of how to define the most suitable network architecture is related to the nature of the hidden layer. There is no rule for determining the number of hidden layers, but, theoretically, one single hidden layer can represent any Boolean function [65]. In general terms, the higher the number of units of the hidden layer, the greater the NN capacity to represent the training data patterns. However, the fact that the hidden layer has a high number of units also produces a loss in the networks' generalization power [4] [65] [66].
Unlike most methods based on machine learning, RF only needs two parameters to be set for generating a prediction model, that is, the number of regression trees and the number of evidential features (m) which are used in each node to make the regression trees grow [19]. It has been demonstrated that with RF, the generalization error always converges as the number of trees increases; hence, overtraining is not a problem [51]. On the other hand, reducing m reduces the correlation among trees, which increases the model's accuracy [49].
Adding more data would lengthen NN training times to unacceptable levels, so that it would be highly impractical to work with them. Larger input datasets will lengthen classification times for NN more than for SVM and RF [4]. NN has the potential to become a more widely used classification algorithm, but because of its time-consuming parameter tuning procedure, the numerous types of neural network architectures to choose from, and the high number of algorithms used for training NN, some researchers recommend SVM or RF as easier methods which repeatedly achieve results with high accuracies and are often faster [67] [68].
The performance characteristics and attributes of the four types of nonparametric classification algorithms are summarized in Table 1.

NN, shortcomings:
1) A hidden layer with a high number of units produces a loss in the network's generalization power.
2) More data would lengthen NN training times to unacceptable levels so that it would be highly impractical to work with them.
3) Has a time-consuming parameter tuning procedure.

RF, strengths:
1) Generalization error always converges even with an increasing number of trees.
2) It is not easy to overfit to one particular feature. However, overfitting to training data remains a problem.

RF, shortcomings:
1) Larger input datasets will lengthen classification times.

Conclusions
The assessed algorithms differ in how difficult they are to train. DT-based algorithms are less difficult to train. This applies to both simple regression trees and ensembles of trees (RF). When the data are very scarce, RF shows a better performance than NN and SVM, which become more complex. SVMs are based on different kernel types, according to which the combination of parameters to be optimized is different. However, it should be strongly emphasized that no broader generalizations can be made about the superiority of any method for all types of problems, as the performance of the methods might vary for other datasets. RF is too sensitive to small changes in the training dataset, is occasionally unstable and tends to overfit the model. KNN is easy to implement and understand but has the major drawback of becoming significantly slower as the size of the data in use grows, while the ideal value of K for the KNN classifier is difficult to set. The NN method involves a high level of complexity in computational processing, causing it to become less popular in classification applications. SVM and RF are insensitive to noise or overtraining, which shows their ability in dealing with unbalanced data. Among the nonparametric methods, SVM and RF are becoming increasingly popular in image classification research and applications. Larger input datasets will lengthen classification times for NN and KNN more than for SVM and RF. NN has the potential to become a more widely used classification algorithm, but because of its time-consuming parameter tuning procedure, the numerous types of neural network architectures to choose from and the high number of algorithms used for training NN, most researchers recommend SVM or RF as easier methods which repeatedly achieve results with high accuracies and are often faster.

Declarations

Authors' Contributions
The idea was developed by EYB and JO. Literature was reviewed by all authors.
All authors contributed to manuscript writing and approved the final manuscript.