Evaluation of TB Patients Characteristics Based on Predictive Data Mining Approaches

According to the World Health Organization, TB is the leading cause of death among infectious diseases. Given the high percentage of people infected with tuberculosis and the high number of deaths among these patients, this prospective study aimed to categorize patients and find the relationships between their clinical and demographic characteristics. The study was conducted on 600 patients from the Masih-e-Daneshvari tuberculosis research center during 2015-2016. The K-Means clustering algorithm and decision trees were used to perform the categorization and determine common indicators among patients. Two clusters were chosen as optimal according to the Dunn index. Common factors between clusters are provided in detail in the Findings section. The most important factors identified by the clustering include hemoglobin, age, sex, smoking, alcohol consumption and creatinine. The RBF neural network tree has 98% accuracy. According to the results of this study, the most important factors identified are sex, smoking, alcohol consumption, WBC and albumin.


Introduction
Almost a third of the world's population (around 2 billion people) is infected with TB and at risk of this fatal disease. According to the World Health Organization report, 9 million people have been infected by active TB and about 1.5 to 2 million people lose their lives annually due to this disease. TB, which is the biggest cause of death among uni-factorial infectious diseases (even more than AIDS, malaria and measles), ranks tenth among global diseases, and it is projected to maintain its current position until 2020 (or rise to seventh place) [1]. Tuberculosis is one of the oldest human diseases and has the highest mortality rate among infectious diseases, which has attracted worldwide attention. The death rate from tuberculosis has decreased by 41% since 1990, and a target of a 50% reduction by 2015 has been set.
However, the global burden of tuberculosis is still great. In 2011, there were an estimated 8.7 million new TB cases (13% co-infected with HIV). TB contributed to one third of the 1.2 million deaths from HIV/AIDS, HIV was responsible for 25% of the 1.5 million TB deaths, and 1.4 million people lost their lives due to tuberculosis.
Data mining is a relatively new field that extracts useful information and patterns from data using statistical methods. It represents a significant advance among the available analytical tools and is considered a reliable, sensitive and valid method for discovering patterns and relationships in data [2]. Medical data is one of the fields in which this knowledge can be used effectively to achieve remarkable results. Khajavi and Jayalakshy [3] [4] have shown that improved accuracy and reduced costs and human resource requirements are among the benefits of data mining in medical analysis. Applications of data mining in medicine include investigating the effect of drugs on diseases, identifying the side effects of medications, specifying the type of treatment, analyzing data available in Electronic Health Records (EHR), diagnosis and prognosis of diseases such as cancer, analysis of medical images such as mammography, ultrasound, X-ray and MRI, providing descriptive models based on medical data, controlling hospital infections and utilization of health services [5]. Alizadeh et al. (2014), for example, identified the most influential factors in osteoporosis using the C5.0 and CHAID algorithms and an artificial neural network. The characteristics affecting the disease were identified using data mining methods, and rules were derived using a decision tree that can serve as a model to predict a patient's status. The precision of the models built with the C5.0 and CHAID algorithms and the artificial neural network was compared. The results of this comparison show that all of these algorithms performed better in group prediction of people [6].
Many studies on lung diseases, particularly tuberculosis, have been conducted using data mining techniques. These studies can be divided into three main groups.
The first group concerns forecasting the TB type. In this group, Nagabhushanam et al. (2013) used multi-layer neural networks and ANFIS to predict tuberculosis with 97% precision [7]. The second group concerns TB diagnosis. In this group, Tamer et al. (2012) created a method using ANFIS and the hard sets algorithm to diagnose TB, with model precision of 97% and 92% respectively [8]. The last group concerns categorizing TB patients. Karahuka et al. (2011) classified TB patients based on laboratory and demographic characteristics using neural networks and ANFIS with 97% precision [9].
One of the most important causes of the failure of global efforts to control TB is delayed and incorrect diagnosis. The aim of this study is to investigate the features of TB patients in order to acquire new knowledge in the field and identify these people, in the hope that appropriate patterns will make it possible to diagnose TB faster and more accurately and, as a result, reduce the number of patients with multidrug-resistant tuberculosis bacilli (MDR-TB).
F. F. Jahantigh, H. Ameri

Proposed Classification Methodology
Our proposed method consists of two main steps, a preprocessing phase and a classification phase, as illustrated in Figure 1. In the first phase, preprocessing steps are performed to find the most important characteristics; these steps consist of applying the K-means algorithm to cluster features. In the second phase, different artificial neural network algorithms are used to classify those informative characteristics.

Materials and Methods
In addition to relevant data, an effective data mining project requires an appropriate process and appropriate data mining methods. The process covers all data mining steps, including data collection, data preparation, modelling and evaluation [10]. The data mining process of the current study therefore follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology.

Data Preprocessing
In this phase, we collect the data, describe and review them, and inspect and validate their quality. The required data were collected from the Masih-e-Daneshvari TB research center in Tehran during 2015-2016. There were 600 primary patient records; after filtering out records that lacked primary information, 525 final records remained.
The average age of patients was 53 years. Fifty percent of patients were men and the rest were women. Eighty-three percent of patients had been in contact with TB patients. Laboratory characteristics of patients were investigated and identified at this stage.

Feature extraction Classifiers
Data processing is used to remove inconsistencies and incomplete entries from the data. Many data processing techniques have been developed by Chin et al. and Hen et al. [10] [11] [12]. In this study, items with a zero value for laboratory and demographic characteristics were removed.
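The removal step described above can be sketched as follows. This is a minimal illustration, assuming records are held as dictionaries; the field names (`age`, `wbc`, `hb`) and the sample values are hypothetical and are not taken from the study data.

```python
# Hypothetical records: field names and values are illustrative only.
records = [
    {"age": 53, "wbc": 7.2, "hb": 13.1},
    {"age": 0, "wbc": 6.8, "hb": 12.4},   # zero age is treated as missing
    {"age": 47, "wbc": 0, "hb": 11.9},    # zero WBC is treated as missing
    {"age": 61, "wbc": 9.5, "hb": 14.0},
]

REQUIRED_FIELDS = ("age", "wbc", "hb")


def has_no_zero_values(record, fields=REQUIRED_FIELDS):
    """Return True when every listed field is present and non-zero."""
    return all(record.get(field) not in (None, 0) for field in fields)


# Keep only the records in which no required field is zero or absent.
clean_records = [r for r in records if has_no_zero_values(r)]
print(len(records), "->", len(clean_records))
```

Dropping whole records keeps the remaining data free of imputed values, at the cost of a smaller sample.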
Chen et al. [13] demonstrated that removing such items is an efficient alternative to replacing values with techniques such as the mean, random assignment, regression assignment and Bayesian models. The number of white blood cells (WBC), the amount of hemoglobin in blood (HB), platelet count (PLT), erythrocyte sedimentation rate (ESR), fasting blood sugar (FBS), creatinine and albumin are numerical variables that were coded as ranges based on scientifically valid resources and sites [14] [15] [16] [17] and with the approval of a physician. After refining the data, we obtained records with the characteristics listed in Table 1.
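Coding a numerical laboratory value as a range can be sketched as below. The cut points are hypothetical placeholders chosen only to make the example run; they are not the reference ranges used in the study, which were taken from the cited resources and approved by a physician.

```python
from bisect import bisect_right

# Hypothetical cut points, for illustration only; NOT the study's clinical ranges.
CUT_POINTS = {
    "hb": [12.0, 16.0],
    "fbs": [100.0, 126.0],
}


def encode_range(variable, value):
    """Return 0 for values below the first cut point, 1 for the next range, and so on."""
    return bisect_right(CUT_POINTS[variable], value)


# Three hemoglobin values falling in the low, middle and high range.
coded = [encode_range("hb", v) for v in (10.5, 13.2, 17.0)]
print(coded)
```

With `bisect_right`, a value equal to a cut point is assigned to the higher range; `bisect_left` would assign it to the lower one.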

Modeling and Assessment
There are many data mining methods for modeling. Clustering medical data into small yet meaningful clusters can aid the discovery of patterns: it supports the extraction of appropriate features from each cluster, which introduces structure into the data and aids the application of conventional data mining techniques. In this phase, we find the model and the optimum pattern using data mining techniques. Clustering is an unsupervised method, a form of learning by observation, in which similar samples are placed in the same group [10]. The data are entered into the K-means model to perform the clustering.
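To make the clustering step concrete, the following is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. It is not the software used in the study. As a simplifying assumption, the first k points are taken as initial centroids so that the result is deterministic; practical implementations usually choose the initial centroids at random. The sample points are invented.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100):
    """Return (labels, centroids) after iterating until the assignment is stable."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: deterministic initialisation
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Invented two-attribute data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```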
In its simplest form, the k-means algorithm assumes that each data point has a single comparable numeric value. When data points have multiple attribute values, as is the case in our patient data, the distance between data points is calculated using the Euclidean distance. Assume that two data points $d_1$ and $d_2$ have $n$ attribute values each: $d_1 = (a_{11}, a_{12}, \ldots, a_{1n})$ and $d_2 = (a_{21}, a_{22}, \ldots, a_{2n})$. Then the distance between these two points is calculated as follows:

$$\operatorname{dist}(d_1, d_2) = \sqrt{\sum_{k=1}^{n} \left(a_{1k} - a_{2k}\right)^2}$$

We obtain the optimal clustering by computing the Dunn index for the different numbers of clusters that we entered as input to the model. The index is used to obtain compact clusters with well-defined boundaries. In its standard form, the Dunn index for a partition into $m$ clusters $c_1, \ldots, c_m$ is calculated as follows:

$$D = \min_{1 \le i \le m} \; \min_{\substack{1 \le j \le m \\ j \ne i}} \frac{d(c_i, c_j)}{\max_{1 \le l \le m} \operatorname{diam}(c_l)}$$

where $d(c_i, c_j)$ and $\operatorname{diam}(c_i)$ are calculated as follows:

$$d(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} \operatorname{dist}(x, y), \qquad \operatorname{diam}(c_i) = \max_{x,\, y \in c_i} \operatorname{dist}(x, y)$$

The aim of the index is to maximize the inter-cluster distance while minimizing the within-cluster distance, so larger values of the index are more favorable. The number of clusters that maximizes the value of the index is the optimal number of clusters [18]. The optimum number of clusters according to the Dunn index is 2 (Figure 2).
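The Dunn index as commonly defined, that is, the smallest distance between points of different clusters divided by the largest cluster diameter, can be sketched in plain Python as follows. The two example partitions are invented, and the function names are illustrative rather than taken from any library.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dunn_index(clusters):
    """clusters is a list of clusters, each a list of points (tuples)."""
    # Smallest distance between any two points that lie in different clusters.
    min_separation = min(
        euclidean(x, y)
        for a, b in combinations(clusters, 2)
        for x in a
        for y in b
    )
    # Largest distance between any two points that lie in the same cluster.
    max_diameter = max(
        (euclidean(x, y) for c in clusters for x, y in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point
    return min_separation / max_diameter


# Invented data: the same four points partitioned in two different ways.
compact = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
stretched = [[(0.0, 0.0), (3.0, 0.0)], [(0.0, 1.0), (3.0, 1.0)]]
print(dunn_index(compact), dunn_index(stretched))
```

The compact, well-separated partition receives the larger value, which is why the candidate number of clusters with the highest index is selected.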
After modeling, the results must be evaluated. The assessment results are used to improve the model and make it usable. According to the Dunn index, 2 clusters were chosen as the optimal number. The most important factor in clustering is similarity, meaning that objects within a cluster are similar. The similarity of each cluster is evaluated based on the mean of the objects in that cluster. When the objects of each cluster fall into separate categories that do not overlap, the clustering is considered optimal. The more compact the clusters are, the more effective the clustering. After clustering, the most important features are extracted and then fed to the classifier model.
Building the model is not the end of a project; the aim of data mining projects is knowledge discovery and the application of the discovered knowledge in the future. The discovered knowledge should be organized and usable by others. The main objective of this project is to find the common features of tuberculosis patients and to categorize these patients.

Findings
The goal of data mining is to extract knowledge from information stored in the database and create a clear and understandable description of patterns. Factors recognized as important in the 2 optimum clusters using the K-Means clustering method include: the number of sputum tests, ESR, hemoglobin, night sweats, white blood cells, albumin, age, alcohol consumption, smoking and its duration, fever, AIDS, type of job, weight loss and gender.
Depending on the amount and range of its variation within a cluster, each feature is classified as important, not important or marginal. For example, the 264 records in cluster 1 comprise 262 women and 2 men, while the 260 records in cluster 2 comprise 257 men and 3 women, so the "sex" feature is determined to be an important factor in both clusters. Important factors for each cluster are shown in Table 2. It is evident from the clusters that men and women behave differently. After the important features were extracted, the cluster field was added as an output field. The data set was partitioned into two parts, 70% for training and 30% for testing the model. Different decision trees were used and their accuracy is compared in Table 3.
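A 70%/30% partition of this kind can be sketched as below. The shuffle seed and the placeholder rows are assumptions made for reproducibility of the example; the source does not state how the records were assigned to the two parts.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows standing in for patient records with a cluster label attached.
rows = list(range(20))
train, test = train_test_split(rows)
print(len(train), len(test))
```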
The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data [12].
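In code, accuracy in this sense is the fraction of held-out records whose predicted class label matches the true one. The labels below are invented for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions at which the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Invented cluster labels for five test records; four predictions are correct.
y_true = [1, 1, 2, 2, 1]
y_pred = [1, 2, 2, 2, 1]
acc = accuracy(y_true, y_pred)
print(acc)
```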
Table 3 shows that the neural network model with the prune method has the highest accuracy. Mathematically, neural nets are nonlinear: each layer represents a non-linear combination of non-linear functions from the previous layer. The number of hidden neurons is one of the most important parameters in terms of training and network capacity. A method proposed by Garson (1991) identifies the relative importance of explanatory variables for specific response variables in a supervised neural network by deconstructing the model weights. The basic idea is that the relative importance (or strength of association) of a specific explanatory variable for a specific response variable can be determined by identifying all weighted connections between the nodes of interest. That is, all weights connecting the specific input node that pass through the hidden layer to the specific response variable are identified. This is repeated for all other explanatory variables until the analyst has a list of all weights that are specific to each input variable. Table 4 shows the variable importance and its rates.
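The weight-deconstruction idea can be sketched for a network with one hidden layer and one output. The code follows one commonly cited form of Garson's formula, in which each input's share of the absolute input-to-hidden weights at a hidden neuron is scaled by the absolute hidden-to-output weight and then normalised across inputs. Other descriptions of the method normalise slightly differently, bias weights are ignored here, and the weights are invented, so this should be read as an illustration rather than the computation behind Table 4.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input for a single-output, single-hidden-layer net.

    input_hidden[i][j] is the weight from input i to hidden neuron j;
    hidden_output[j] is the weight from hidden neuron j to the output.
    """
    n_inputs = len(input_hidden)
    contributions = [0.0] * n_inputs
    for j, out_weight in enumerate(hidden_output):
        column_total = sum(abs(input_hidden[i][j]) for i in range(n_inputs))
        if column_total == 0.0:
            continue  # this hidden neuron receives no signal from any input
        for i in range(n_inputs):
            share = abs(input_hidden[i][j]) / column_total
            contributions[i] += share * abs(out_weight)
    total = sum(contributions)
    if total == 0.0:
        raise ValueError("all connection weights are zero")
    return [c / total for c in contributions]


# Invented weights: two inputs, two hidden neurons, one output.
weights_input_hidden = [[1.0, 2.0], [3.0, 2.0]]
weights_hidden_output = [2.0, 1.0]
importance = garson_importance(weights_input_hidden, weights_hidden_output)
print(importance)
```

Because absolute values are used, the result ranks inputs by strength of association only and says nothing about the direction of the effect.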

Discussion
In this study we tried to extract relationships between the different characteristics of patients with tuberculosis using data mining algorithms. For this purpose, we used the unsupervised K-Means clustering algorithm and decision trees. The most important factors identified using the neural network include sex, fever, job, night sweats, smoking and WBC.
In terms of the frequency and relationships of the TB patients' characteristics, reports were made available to us using the statistical software. In this field, no full report was available to us, so we investigated some works closer to our effort. Asha et al. (2011) used 700 real records collected from an urban hospital for TB diagnosis using clustering and classification techniques. The data used in that study included age, cough, weight loss, fever, night sweats, blood-tinged sputum, chest pain, AIDS, radiographic findings, sputum, wheezing and TB [19]. Baker et al. (2007) examined the records of 233 patients with tuberculosis. The features used include age, gender, weight loss, coughing for more than 3 weeks, night sweats, fever, sputum and blood-tinged sputum. The method used in that research is discretization using regression [20]. Abdullah et al. (2012) examined factors related to the epidemic of non-pulmonary tuberculosis in East Sudan. The mean age (SD) was not significantly different between the cases and controls. TB patients were those who had less education, and the infection was more common among male patients [21].
According to the conducted research, most of the factors that have been studied include age, cough, sputum, fever, night sweats, weight loss and AIDS. In this study, we have tried to examine these factors as well as clinical and laboratory factors. The results were approved by physicians.
Our proposal for future work is to examine the relation between these characteristics and the comorbidities of TB patients, with the aim of controlling risk factors and helping to reduce the incidence of these diseases in people with tuberculosis.

Figure 2. Determining the optimal number of clusters using Dunn index.

Table 1. Data and corresponding values after preprocessing.

Table 2. Important features by clusters.

Table 3. Comparison of the accuracy of different decision trees.

Table 4. Variable importance of TB patients.