Classification of Hematological Data Using Data Mining Technique to Predict Diseases

Over the years, the volume of patient and medical information has grown substantially. Moreover, as the number of patients with blood diseases has increased, medical pathologists have relied on conventional diagnostic tests, which are low in cost but can result in inaccurate diagnoses. To recognize optimal disease patterns from hematological data, medical professionals need a reliable prediction methodology. Data mining approaches permit users to examine data from various dimensions, group it, and summarize the relationships identified. Classification is a vital data mining technique with extensive applications. Classification algorithms assign every item in a data set to one of a known set of classes. The objective of this paper is to compare different classification algorithms using the Waikato Environment for Knowledge Analysis (WEKA) and to identify the most effective algorithm for end users working with hematological data. The most efficient algorithm found is Random Forest, with an accuracy of 96.47% and a model construction time of 0.16 seconds, which is more efficient than existing works. In contrast, the Multilayer Perceptron classifier has the lowest accuracy, 75.29%, and takes 1.92 seconds to construct the model.


Introduction
Data mining is the process of finding useful and relevant information in various types of databases. Different approaches to data mining have been suggested to face the challenges of storing and processing all types of data [1]. Nowadays data mining has increasing applications in medical science, railways and so on [2]. Data mining helps doctors provide the necessary treatments, so that patients are treated better and health services become cheaper, and it is becoming more popular day by day [3] [4] [5] [6]. In pathology, it has become established as a strong technique for searching enormous volumes of pathological information for knowledge. Additionally, comparing different classification techniques using WEKA (Waikato Environment for Knowledge Analysis) on blood-related data is a demanding task in medical science research. It is hard to compare different classification algorithms across different collections of data in order to find the better ones [7]. The main concern here is the classification of hematological data to predict diseases. To this end, the hematological data analysis is divided into three phases: hematological data collection, classification algorithms, and evaluation of results and performance. The three major data mining techniques are regression, classification and clustering.
The application of data mining is now moving towards clinical research, such as AML (Acute Myeloid Leukemia), where predictive models play an important role [8] [9] [10] [11].
The remainder of this paper is organized as follows. Section 2 reviews the related works. Section 3 describes the material and methods. The dataset and preprocessing are explained in Section 4. In Section 5, the experimental results and discussion are presented. In Section 6, the conclusion is given.

Related Works
Several studies have evaluated the performance of data mining classification algorithms using WEKA. In the studies [3] [12], the researchers evaluated the performance of data mining classification algorithms in WEKA. Another study [1] compared different classification techniques using different datasets. The research in [2] compared the various clustering algorithms of the WEKA tools. Moreover, performance analysis and evaluation of various data mining algorithms used for cancer cell classification have been carried out [13].
Data mining is also used in artificial intelligence and in predicting abnormalities in peripheral blood smears [14] [15]. Data mining classifiers were used in the study [16] to develop an automated diagnosis of thalassemia [17]. An analysis of various data mining clustering algorithms in health informatics has also been performed [18].
The area of bioinformatics has also used data mining tools, and various classification techniques have been compared [19]-[24]. Data mining techniques were also used to differentiate between patients with a normal blood disease and patients with a blood tumor [25]. Another study contrasted two classification techniques, J48 and Random Tree, using WEKA to classify Sickle Cell Disease (SCD). More recently, anemia has been predicted using different data mining classification algorithms [12] [26] [27], where the J48 algorithm showed the best performance in classifying types of anemia [28]. WEKA has been used in this experiment because hidden predictive information can be extracted with it from large databases [29]. In addition, the experiment has been conducted on Complete Blood Count (CBC) data; extracting information from such data with the intended algorithms is a reasonable choice, as WEKA is widely employed for data mining.

Material and Methods
In

Dataset and Preprocessing
The dataset of experiment 1 comprises 425 samples and the dataset of experiment 2 consists of 298 samples. The attributes characterize the Complete Blood Count (CBC) features, as in Table 1.
In the preprocessing of the dataset, irrelevant attributes were eliminated, missing values were refilled, and outlier values in the outlier samples were removed or refilled. Table 2 presents the dataset attributes used in this investigation.
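The preprocessing steps described above can be sketched in a few lines of Python. The paper does not state which rules were used to refill missing values or to treat outliers, so the median imputation and the 1.5 × IQR clipping below are assumptions chosen for illustration, and the hemoglobin readings are made-up values rather than data from the study.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values (assumed rule)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical hemoglobin readings: one missing value and one implausible entry.
hb_raw = [13.5, None, 12.8, 14.1, 13.0, 45.0, 13.9]
hb_filled = impute_missing(hb_raw)
hb_clean = clip_outliers(hb_filled)
```

Clipping keeps every sample in the dataset; dropping the affected samples instead would correspond to the "removed" option mentioned in the text.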

Result and Discussion
In this study, the experiments that employ the data mining classifiers are separated into two branches: experimentation with full features and experimentation with reduced features.
The outcomes from these two branches, together with an in-depth classification accuracy analysis highlighting the classification errors, are presented in the following sections. Three experiments were conducted in each branch: the first measures the performance of the Random Forest tree classifier, the second measures the performance of the Bayesian Network classifier, and the third measures the performance of the Neural Network (Multilayer Perceptron) classifier.

Experiment with Full Features
In these experiments, all features of each sample were used. The Random Forest tree classifier gives an accuracy of 96.47%, the Neural Network (Multilayer Perceptron) gives an accuracy of 75.29%, and the Bayesian Network classifier gives an accuracy of 84.70%, as shown in Figure 1 and in Table 3.
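For readers unfamiliar with how such accuracy figures are obtained, the following minimal sketch computes accuracy from a confusion matrix as the share of correctly classified instances. The three-class matrix is hypothetical and is not taken from the study's results.

```python
def accuracy_percent(confusion):
    """Accuracy = correctly classified instances / all instances, in percent."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Hypothetical confusion matrix: rows are true classes, columns are predictions.
example_confusion = [
    [8, 1, 1],
    [0, 9, 1],
    [1, 0, 9],
]
example_accuracy = accuracy_percent(example_confusion)
```

The off-diagonal cells are the classification errors that an in-depth accuracy analysis would examine class by class.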

Experiment with Reduced Feature
The results from these experiments are given in Table 4. The Random Forest tree classifier achieves an accuracy of 86.44%, while the Neural Network classifier gives an accuracy of 52.54% and the Bayesian Network classifier gives an accuracy of 74.57%, as shown in Figure 2 and in Table 4.
Considering Figure 1, Figure 2 and Table 5, the maximum accuracy is 96.47% and the minimum accuracy is 52.54%. It can be concluded that the Random Forest tree classifier performs better than the other classifiers considered.
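The comparison can be reproduced from the accuracies reported in the text. The sketch below only rearranges those reported numbers; the dictionary layout and variable names are illustrative choices, not part of the study.

```python
# Accuracies (%) as reported in the text for the two experiment branches.
reported_accuracy = {
    "Random Forest": {"full": 96.47, "reduced": 86.44},
    "Bayesian Network": {"full": 84.70, "reduced": 74.57},
    "Multilayer Perceptron": {"full": 75.29, "reduced": 52.54},
}

# Classifier with the highest accuracy in each branch.
best_full = max(reported_accuracy, key=lambda name: reported_accuracy[name]["full"])
best_reduced = max(reported_accuracy, key=lambda name: reported_accuracy[name]["reduced"])

# Loss in accuracy (percentage points) when moving from full to reduced features.
accuracy_drop = {
    name: round(acc["full"] - acc["reduced"], 2)
    for name, acc in reported_accuracy.items()
}
```

Viewed this way, Random Forest leads in both branches, and the Multilayer Perceptron loses the most accuracy when the feature set is reduced.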

Conclusion
This paper evaluated and investigated three preferred classification algorithms using WEKA. On the hematological data, the best algorithm found is the Random Forest classifier, with an accuracy of 96.47% and a total model building time of 0.16 s. The Neural Network has an accuracy of 52.54% in the reduced-feature experiment, the lowest in comparison with the others, a finding this study counts among its positive outcomes. These results will help researchers obtain competent results for a particular dataset. The findings will help users analyze disease in minimal time, which is a useful contribution of this study.

F. Akter et al. DOI: 10.4236/jcc.2018.6400777 Journal of Computer and Communications

In this study, the freely available data mining tool WEKA (version 3.8.0) has been used. Two different data sets have been utilized and the performance of the classification algorithms (classifiers) has been examined. The analysis was carried out on a Sony VAIO computer running Windows 8, with an Intel® Core™ i3 central processing unit at 1.70 GHz and 4 GB of RAM. The data sets were selected so that they vary in size, predominantly in the number of attributes. The hematological parameters consist of White blood cell count (WBC), Red blood cell count (RBC), Hemoglobin (Hb), Hematocrit (Hct), Mean corpuscular volume (MCV), Mean corpuscular hemoglobin (MCH), Mean corpuscular hemoglobin concentration (MCHC), Platelet count (PLT), Neutrophil count (NEU), Lymphocyte (LYMP), Monocyte (MONO), Eosinophil (EO), and Basophil (BASO) (Sysmex 1000i, Sysmex Corporation, Kobe, Japan). The hematological data were evaluated manually by a medical technologist. The collected data are assigned to multiple tags: indicative of anaemia of unceasing disorder, Eosinophilia, Microcytic hypochromic anaemia, Normocytic anaemia, Neutrophil leucocytosis, Neutrophilia, Non-specific findings, and High ESR.
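As an illustration of how one sample with the listed parameters and tags might be represented, here is a minimal sketch. The class name, field names and the use of `None` for missing measurements are assumptions made for this example; the study itself worked with WEKA data sets, whose exact layout is not given in the text.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CBCSample:
    """One CBC record; None marks a measurement that is missing (assumed convention)."""
    wbc: Optional[float] = None   # White blood cell count
    rbc: Optional[float] = None   # Red blood cell count
    hb: Optional[float] = None    # Hemoglobin
    hct: Optional[float] = None   # Hematocrit
    mcv: Optional[float] = None   # Mean corpuscular volume
    mch: Optional[float] = None   # Mean corpuscular hemoglobin
    mchc: Optional[float] = None  # Mean corpuscular hemoglobin concentration
    plt: Optional[float] = None   # Platelet count
    neu: Optional[float] = None   # Neutrophil count
    lymp: Optional[float] = None  # Lymphocyte
    mono: Optional[float] = None  # Monocyte
    eo: Optional[float] = None    # Eosinophil
    baso: Optional[float] = None  # Basophil
    label: Optional[str] = None   # Class tag assigned by the technologist


# Class tags as listed in the text.
DIAGNOSTIC_LABELS = (
    "Anaemia of unceasing disorder",
    "Eosinophilia",
    "Microcytic hypochromic anaemia",
    "Normocytic anaemia",
    "Neutrophil leucocytosis",
    "Neutrophilia",
    "Non-specific findings",
    "High ESR",
)

FEATURE_NAMES = [f.name for f in fields(CBCSample) if f.name != "label"]
```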
The feed-forward back-propagation neural network classifier was configured with 500 training cycles, a learning rate of 0.3, and a momentum of 0.2.
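To make the roles of these three settings concrete, the sketch below applies the textbook gradient descent update with momentum, in which each step combines the current gradient (scaled by the learning rate) with a fraction of the previous step (the momentum). This is a generic illustration on a one-parameter toy objective, not WEKA's internal implementation; the function name and the quadratic objective are hypothetical.

```python
def train_with_momentum(gradient, w0, learning_rate=0.3, momentum=0.2, epochs=500):
    """Run `epochs` momentum updates: step = momentum * step - learning_rate * gradient(w)."""
    w = w0
    step = 0.0
    for _ in range(epochs):
        step = momentum * step - learning_rate * gradient(w)
        w += step
    return w


# Toy objective E(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_final = train_with_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

In a real network the same update is applied to every weight, with the gradient supplied by back-propagation.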

Table 4. Simulation result of experiment 2.

Table 5. Comparison of various classifiers.