Two important performance indicators for data mining algorithms are the accuracy of classification/prediction and the time taken for training. These indicators are useful for selecting the best algorithms for classification/prediction tasks in data mining. Empirical studies of these performance indicators in data mining are few. This study was therefore designed to determine how data mining classification algorithms perform as input data size increases. Three data mining classification algorithms (Decision Tree, Multi-Layer Perceptron (MLP) Neural Network and Naïve Bayes) were subjected to varying simulated data sizes. The training times of the algorithms and the accuracies of their classifications were analyzed for the different data sizes. Results show that Naïve Bayes takes the least time to train but has the lowest accuracy compared with the MLP and Decision Tree algorithms.

A large volume of data is poured into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of daily life. This explosive growth of available data volume emanates as a result of the computerization of our society and the fast development of powerful data collection and storage tools [

Data mining is used for the extraction of information (patterns, relationships, or significant statistical connections) from very large databases or data warehouses [

Efficiency and scalability are always considered when comparing data mining algorithms. Data mining algorithms must be efficient and scalable in order to effectively extract information from huge amounts of data in many data repositories or in dynamic data streams. In other words, the running time of a data mining algorithm must be predictable, short, and acceptable by applications. Efficiency, scalability, performance, optimization, and the ability to execute in real time are key criteria that drive the development of many new data mining algorithms [

This study set out to examine empirically the performance of three classification algorithms in terms of training time and prediction accuracy. The algorithms in question are Decision Tree (DT), Multi-Layer Perceptron (MLP) Neural Network and Naïve Bayes.

In the rest of this paper, related work is highlighted in Section 2 and the methodology adopted is discussed in Section 3. The results obtained from the experiment are discussed in Section 4, and conclusions are drawn in Section 5.

Classification has been identified as an important problem in the emerging field of data mining. Over the years, a considerable number of studies have examined classification algorithms [

While classification is a well-studied problem, in recent times there has been focus on algorithms that can handle large databases. Applications of classification arise in diverse fields, such as retail target marketing, customer retention, fraud detection and medical diagnosis. Several classification models have been proposed over the years, such as Artificial Neural Networks (ANNs), statistical models, decision trees and genetic models [

Scalability implies that as a system gets larger, its performance improves correspondingly. Data mining scalability connotes taking advantage of parallel database management systems and additional CPUs as one can solve a wide range of problems without needing to change the underlying data mining environment [

A Neural Network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples.
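The weight adjustment described above can be illustrated with a minimal sketch. It trains a single sigmoid unit by gradient descent, which is the simplest case of the idea; a full Multi-Layer Perceptron adds hidden layers and backpropagation. The function names, learning rate and epoch count are assumptions made for illustration and are not taken from the study or from WEKA.

```python
import math


def sigmoid(z):
    """Logistic activation mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_single_unit(rows, targets, learning_rate=0.5, epochs=2000):
    """Adjust the weights of one sigmoid unit, one training tuple at a time.

    rows    -- list of feature lists (numeric)
    targets -- list of 0/1 class labels
    The learning rate and epoch count are illustrative assumptions.
    """
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for row, target in zip(rows, targets):
            output = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
            error = target - output
            # Move each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, row)]
            bias += learning_rate * error
    return weights, bias


def predict_single_unit(weights, bias, row):
    """Return class 1 if the unit's output is at least 0.5, else class 0."""
    z = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0
```

Because every pass over the data updates every weight, training cost grows with both the number of tuples and the number of epochs, which is one reason neural networks tend to train slowly.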

Decision trees are powerful and popular for both classification and prediction. They are also useful for exploring data to gain insight into the relationships of a large number of candidate input variables to a target variable [

Naïve Bayes is a simple but important probabilistic model; naïve Bayesian classifiers are statistical classifiers. In simple terms, a naïve Bayesian classifier assumes that, within a class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. The naïve Bayesian classifier is one of the most popular data mining techniques for classifying large datasets. The classification task is to map the set of attributes of sample data onto a set of class labels, and the naïve Bayesian classifier is particularly suitable for this, having been described as a proven universal approximator [
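The independence assumption can be made concrete with a short sketch of a naïve Bayes classifier for 0/1 features. This is an illustrative implementation, not WEKA's: the function names and the use of Laplace (add-one) smoothing are assumptions introduced here.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels):
    """Estimate class priors and P(feature = 1 | class) for binary features.

    Laplace (add-one) smoothing is an assumption of this sketch; it keeps
    every probability strictly between 0 and 1.
    """
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    ones = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            ones[label][j] += value
    model = {}
    for label, count in class_counts.items():
        prior = count / len(rows)
        probs = [(ones[label][j] + 1) / (count + 2) for j in range(n_features)]
        model[label] = (prior, probs)
    return model


def predict_naive_bayes(model, row):
    """Pick the class with the highest log-posterior score.

    Each feature contributes its own independent term, which is the
    "naive" assumption.
    """
    best_label, best_score = None, -math.inf
    for label, (prior, probs) in model.items():
        score = math.log(prior)
        for value, p in zip(row, probs):
            score += math.log(p if value == 1 else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training needs only a single counting pass over the data, which is consistent with the short training times usually reported for this classifier.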

Daniela, Christopher and Roger [

Abirami, Kamalakannan and Muthukumaravel [

Gopala, Bharath, Nagaraju and Suresh [

Anshul and Rajni [

Performance evaluation is a multi-purpose tool used to measure actual performance against expected performance. Evaluating the performance of a data mining technique is a fundamental aspect of machine learning. The evaluation method is the yardstick for examining the efficiency and performance of any model.

The operating system used in this evaluation study was Windows 8.1, the latest version of Windows at the time of the study.

The Waikato Environment for Knowledge Analysis (WEKA) data mining tool (version 3.6.11) was used for the experiments. The classifiers were run on different sizes of the dataset, and their accuracy, performance metrics and time taken to build models were explored.

The data for this study were simulated. An application program written in the Java programming language was developed to simulate Ebola disease data, and the simulated data were stored in a MySQL database. The Ebola dataset has its own properties, such as the number of instances, the number of attributes and the number of classes.

The Ebola disease dataset used for the tests consisted of anonymous simulated data. The dataset consists of 250 to 10,000 instances (records) with nine attributes (representing symptoms). Each attribute was recoded as 0 for “No” and 1 for “Yes”. The target variable (that is, “Remark”) consists of two classes: “Yes” for Ebola-positive and “No” for Ebola-negative. The sample structure of the Ebola disease dataset is shown in

Computer System | Specifications |
---|---|
HP 250 | Processor: Intel(R) Pentium(R) CPU N3510 @ 1.99 GHz; Memory (RAM): 1.9 GB; System Type: 64-bit Operating System, x64-based processor; HDD: 320 GB; Windows: Windows 8.1 Single Language |

Fever | Nausea | Headache | Tiredness | Vomiting | Diarrhea | Coughing | Bleeding | Remark |
---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | Yes |
0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | No |
1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | No |
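Records with the structure shown in the sample table could be simulated along the following lines. This is only a sketch: the study used a Java program and a MySQL database whose details are not given, so the function name, the random seed and, above all, the labelling rule below are hypothetical and chosen purely for illustration.

```python
import random

# Column names taken from the sample table; "Remark" is the class label.
SYMPTOMS = ["Fever", "Nausea", "Headache", "Tiredness",
            "Vomiting", "Diarrhea", "Coughing", "Bleeding"]


def simulate_records(n_instances, seed=42):
    """Generate n_instances rows of 0/1 symptom flags plus a Yes/No remark.

    The labelling rule is HYPOTHETICAL (not the authors' rule): a record is
    marked "Yes" when bleeding is present and at least five symptoms occur.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n_instances):
        symptoms = [rng.randint(0, 1) for _ in SYMPTOMS]
        bleeding = symptoms[-1]
        remark = "Yes" if bleeding == 1 and sum(symptoms) >= 5 else "No"
        records.append(symptoms + [remark])
    return records
```

Calling `simulate_records(250)` through `simulate_records(10000)` would yield datasets spanning the range of sizes used in the experiments, with a fixed seed making each run repeatable.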

Three classification algorithms were applied to the Ebola disease dataset: Decision Tree (J48), Naïve Bayesian (Naïve Bayes) and Artificial Neural Network (ANN, Multi-Layer Perceptron). Each algorithm was trained on the Ebola disease data using a 66% percentage split and cross-validated with the 10-fold option. Training was carried out for different data sizes: 250, 500, 1000, 2000, 3000, 3500, 4500, 5000 and 10,000 instances.
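The two evaluation schemes named above can be sketched as index partitions. This is an illustrative approximation rather than WEKA's implementation: the function names and seed are assumptions, and the folds here are not stratified by class.

```python
import random


def percentage_split(n_instances, train_fraction=0.66, seed=1):
    """Shuffle instance indices and split them into training and test sets."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every instance appears in exactly one test fold; the other k - 1 folds
    form the training set for that round. No stratification is applied,
    which is a simplification of this sketch.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For the smallest dataset of 250 instances, the percentage split gives 165 training and 85 test instances, and 10-fold cross-validation gives ten test folds of 25 instances each.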

Two performance metrics, time to build the model (training time) and percentage accuracy (correct classifications), were obtained for each data size using the three classification algorithms. The performances were then compared statistically using Analysis of Variance (ANOVA) and simple correlations.
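As a rough illustration of how these two metrics can be collected outside WEKA, the sketch below times a training call and computes percentage accuracy. All names are hypothetical, and `train_majority_class` is only a placeholder classifier used to make the example self-contained.

```python
import time


def accuracy_percent(predicted, actual):
    """Percentage of instances whose predicted class equals the actual class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


def timed_training(train_fn, rows, labels):
    """Run train_fn(rows, labels) and return (model, elapsed seconds)."""
    start = time.perf_counter()
    model = train_fn(rows, labels)
    elapsed = time.perf_counter() - start
    return model, elapsed


def train_majority_class(rows, labels):
    """Placeholder 'classifier': remembers the most frequent class label."""
    return max(set(labels), key=labels.count)
```

Repeating `timed_training` for each algorithm and each data size would give one training time per combination, which is the form of data the statistical comparisons operate on.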

From

The rank correlation coefficients of the three algorithms between data sizes and time used for trainings are shown in

The ANOVA result showed a highly significant difference in the time complexities of the three algorithms (F = 13.669 and p = 0.0, where p is the level of significance). A further Tukey HSD test indicated significant differences in time complexity between MLP and J48 (p = 0.0) and between MLP and Naïve Bayes (p = 0.0), while J48 and Naïve Bayes did not differ significantly in their time complexities (p = 1.0). The mean difference was significant at the 0.05 level.

Algorithm | Correlation |
---|---|
J48 | 0.96 |
Naive Bayes | 0.85 |
MLP | 0.99 |
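A rank correlation between data size and training time can be computed as in the sketch below. The paper does not say which rank coefficient was used, so Spearman's coefficient is an assumption here, as are the function names.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A coefficient close to 1 means that training time rises almost every time the data size rises, regardless of whether the growth is linear.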

The rank correlation coefficients of the three algorithms between data sizes and percentage correct classifications are shown in

Algorithm | Correlation |
---|---|
J48 | 0.53 |
Naive Bayes | −0.82 |
MLP | 0.38 |

The ANOVA result showed a highly significant difference in the accuracies of the three algorithms (F = 202.96 and p = 0.0, where p is the level of significance). A further Tukey HSD test indicated significant differences in percentage accuracy between Naïve Bayes and J48 (p = 0.0) and between Naïve Bayes and MLP (p = 0.0), while J48 and MLP did not differ significantly in their accuracies (p = 0.96). The mean difference was significant at the 0.05 level.
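The F statistic in a one-way ANOVA is the ratio of between-group to within-group variance. The sketch below shows that calculation for any number of groups, for example one group of accuracies per algorithm. The function name is an assumption; the p-value and the Tukey HSD follow-up need distribution tables or a statistics package and are not covered here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F value indicates that the group means differ by more than the spread within each group would explain.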

Results from this study show that there is a trade-off between the accuracy and the time complexity of the three algorithms used (Multi-Layer Perceptron, Naïve Bayes and Decision Tree). Low time complexity went with low accuracy, and vice versa. For instance, Naïve Bayes, which has the lowest time complexity for training, has low accuracy, whereas Multi-Layer Perceptron and Decision Tree, with higher time complexity, had higher accuracy in their classifications. The Naïve Bayes classification algorithm tends to have a higher error rate as the number of data instances grows. This result indicates that users have to choose between accuracy and training time when selecting any of these three algorithms for classification tasks.

Neural networks usually have long training times and are therefore more suitable for applications where this is feasible. They require a number of parameters that are typically best determined empirically such as the network topology or “structure” [

Although decision tree classifiers have good accuracy, as confirmed in this study, their successful use may depend on the nature and size of the data at hand. While decision trees classify quickly, the time for building a tree may be higher than for another type of classifier. Decision trees also suffer from errors propagating through the tree, a problem that becomes very serious as the number of classes increases.

Naïve Bayesian models are popular in machine learning applications, due to their simplicity in allowing each attribute to contribute towards the final decision equally and independently from the other attributes. This simplicity equates to computational efficiency, which makes Naïve Bayesian techniques attractive and suitable for many domains [

Performance evaluation of data mining algorithms is essential, as it helps users choose the best algorithm for their classification/prediction tasks. In this study, the performances of the Decision Tree, Multi-Layer Perceptron and Naïve Bayes classification algorithms were studied with respect to their training times and prediction accuracies. The study shows that even though the Naïve Bayes algorithm takes less time to train, its accuracy decreases as data size increases.

S. Olalekan Akinola, O. Jephthar Oyabugbe (2015) Accuracies and Training Times of Data Mining Classification Algorithms: An Empirical Comparative Study. Journal of Software Engineering and Applications, 08, 470-477. doi: 10.4236/jsea.2015.89045