Predicting the Relapse Category in Patients with Tuberculosis: A Chi-Square Automatic Interaction Detector (CHAID) Decision Tree Analysis

Predicting the outcome of treatment among TB patients is a major concern of the Department of Health. Data mining in the health care system can be used for decision making. The most widely used method for data exploration is the decision tree, which is based on the divide-and-conquer technique. The objectives of this article are to create a predictive data mining model of TB patient category in order to identify relapse treatment, and to classify the factors influencing relapse treatment, so as to provide assistance, guidance, and appropriate warning to TB patients who are at risk. The dataset of TB patient records was verified and analyzed with the CHAID classification tree algorithm using SPSS Statistics 17.0. The classification tree model identified two statistically significant independent variables (DSSM Result, Age) as predictors of patient category.


Introduction
Tuberculosis (TB) is a foremost public health problem in the Philippines and remains a major cause of death; the country is one of the nations with high TB incidence.
"Philippines ranked ninth among the 22 high TB burdened countries" [1]. In 2015, 14,000 Filipinos died from tuberculosis and 4.8 million from this number are mostly poor [2]. TB is a treatable and preventable disease, yet many people are still infected and continue to suffer.
In TB treatment, one major problem is ensuring that patients pursue their treatment, together with medication and medical checkups, until completion. Hence, there is a need to improve adherence and retention in care. There may be several reasons for the lack of persistence, and there may also be ways to improve completion of treatment programs by maintaining better contact between health workers and TB patients. Treatment outcomes serve as proxy measures of the quality of tuberculosis treatment provided by the health care system, and they are essential for assessing the effectiveness of the Directly Observed Therapy-Short course program in controlling the disease and reducing treatment failure, default, and death [3]. TB patients who relapse or fail following initial treatment success need immediate retreatment. "Results among patients getting a standard World Health Organization Category II retreatment routine are imperfect, resulting in increased risk of disease, transmission, and drug resistance" [4]. Tuberculosis relapse is divided into two classes: first, a patient whose first onset has been treated, but in whom the remaining Mycobacterium tuberculosis reactivates into a second onset of TB; and second, a patient reinfected with new Mycobacterium tuberculosis [5].
How to cite this paper: Dela Cruz, A.P. (2018) Predicting the Relapse Category in Patients with Tuberculosis: A Chi-Square Automatic Interaction Detector (CHAID) Open Journal of Social Sciences
In this study, the Chi-Square Automatic Interaction Detection (CHAID) decision tree algorithm is employed to predict the patient category relapse in Cabanatuan City, Philippines. "CHAID applications focused on the field of medical and psychiatric research although it can be employed also in researches of different fields. The technique was developed in South Africa and was published in 1980 by Gordon V. Kass, who had completed a PhD thesis on this topic" [6]. CHAID is used for prediction, for classification, and for detecting interactions among variables. Because of its usefulness, CHAID has been applied in several studies. The authors in [7] use CHAID to form associations and structured relationships between factors in classifying observations of housing eligibility quality in Kupang Regency, while the researchers in [8] use CHAID to extract information on the potential local food availability of corn in regencies and cities on Java Island.
Other studies use CHAID to: explore the adverse effects of social networking sites on students' academic performance in secondary schools [9]; compare the quality of the information obtained on tourism market segmentation [10]; and determine the anemic status of infants as well as the risk factors in a representative downtown area of Beijing [11].
The reasons for selecting decision trees as the base method of this study are: 1) the CHAID decision tree model is understandable, easy, and interpretable; and 2) it is fast to build relative to other predictive methods and, because it has low training time, it is highly appropriate and flexible for future changes in the data [12]. Furthermore, for this kind of task, the decision tree technique has high prediction accuracy in many fields, which makes it a preferable and trustworthy choice. With the support of this method, treatment category relapse patterns are exposed and a mostly accurate predictor is attained for treatment priority prediction.
This research aims to create a predictive data mining model for the relapse category rate of TB using Integrated Tuberculosis Information System (ITIS) data on reported cases, and to find the variables influencing the relapse treatment of TB.

Methodology
The study follows a quantitative research design. It used data mining as a tool, with the CHAID classification tree as the technique, to design the TB patient category relapse prediction model. According to [16], "data mining is the process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis. Data mining tools allow enterprises to predict future trends". Classification trees are broadly used in various fields such as botany, medicine, computer science, and psychology [13]. "These classification trees promptly give themselves to being presented graphically, assisting to make them easier to analyze than they would be if only a strict numerical interpretation were possible". "CHAID considered one of the classification tree algorithms is the name quantified to one version of the Automatic Interaction Detector that has been developed for categorical variables" [14]. In practice, CHAID is a method that partitions a population into distinct, mutually exclusive segments. These segments, called nodes, are split in such a way that the variation of the response variable is minimized within the segments and maximized between them. The output of the CHAID prediction model is presented as a hierarchical tree structure, in which the root is the population and the branches are the resulting segments. "The important step in CHAID prediction model structure is selecting the significant features for classification and the purpose of feature selection techniques supports the reduction of computation time and increases the predictive accuracy of the model" [15].
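The splitting idea described above can be illustrated with a minimal Python sketch of the Pearson chi-square statistic computed on a cross-tabulation of one categorical predictor against the outcome. This is only an illustration under stated assumptions: the function name `pearson_chi_square` and the toy data are hypothetical and are not the study's records, and the sketch omits the category merging and Bonferroni-adjusted p-values that a full CHAID implementation such as the one in SPSS performs.

```python
from collections import Counter


def pearson_chi_square(predictor, target):
    """Return (chi-square statistic, degrees of freedom) for two categorical lists."""
    n = len(predictor)
    row_totals = Counter(predictor)          # counts per predictor category
    col_totals = Counter(target)             # counts per outcome category
    observed = Counter(zip(predictor, target))
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n  # expected count under independence
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical toy data (NOT taken from the ITIS dataset).
age_group = ["<=51"] * 6 + [">51"] * 6
category = (["non-relapse"] * 5 + ["relapse"]
            + ["non-relapse"] + ["relapse"] * 5)

chi2, dof = pearson_chi_square(age_group, category)
print(f"chi-square = {chi2:.3f}, df = {dof}")
```

A larger statistic for a given number of degrees of freedom indicates a stronger association between the predictor and the outcome, which is why the predictor with the most significant test is chosen for the split.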

CHAID Algorithm in Predicting Patient Category

CHAID Analysis Application
To form a decision tree by means of the CHAID algorithm, a description of the variables used was first obtained, as follows. In terms of age, 106 patients were more than 51 years old. In terms of sex, the data revealed that there were more males than females. In terms of bacteriological status, 49% of the patients had clinically diagnosed TB while 51% had bacteriologically confirmed TB.
Almost 100% of the TB cases in the data were pulmonary, while in terms of patient category the majority were non-relapse.

Modeling Results Analysis
The most significant independent variable in Figure 1 is DSSM Result. Most of the respondents (231) go to node 1, where the value of DSSM Result is "2+" or "ODT". Another 87 respondents belong to node 2, where DSSM Result is equal to "1+". Node 3 has 130 respondents, where DSSM Result is equal to "0", and the rest of the participants (62) belong to node 4, where DSSM Result is equal to "3+". Within the first level of the tree, node 3 is a parent node, while nodes 1, 2, and 4 are all terminal.
The second level of the tree is formed by the variable Age, which is statistically significant. The independent variable Age is significant for the splitting of node 3 (chi-square = 144.870, df = 1, p-value = 0.001). Accordingly, two groups of respondents are found: TB patients aged 51 or younger (≤51) belong to node 5, while those older than 51 (>51) belong to node 6. All nodes in the final level of the decision tree are terminal.
Of the terminal nodes in the resulting tree, the four marked 1, 2, 4, and 5 correspond to Non-Relapse, whereas node 6 corresponds to Relapse. The paths from the root to the terminal nodes produce a set of rules for classifying TB patients into one of the defined categories of the variable Patient Category. This means that the developed model and the knowledge described in the decision tree can be formulated as if-then rules. Accordingly, the overall accuracy of the model is 90%: the model correctly categorized 459 out of 510 TB patients in the observed sample. Broken down by the categories of the dependent variable, significant differences in classification accuracy can be seen.
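The if-then rules read from the tree can be written out as a short Python sketch. The function name `classify_patient` and the string labels are illustrative assumptions; the node definitions, node sizes, and the 459-of-510 figure are the ones reported in the text, and the code is a restatement of those rules rather than the SPSS model itself.

```python
def classify_patient(dssm_result, age):
    """Classify a TB patient using the rules read from the reported tree."""
    if dssm_result in {"2+", "ODT", "1+", "3+"}:
        # Nodes 1, 2, and 4 are terminal and correspond to Non-Relapse.
        return "Non-Relapse"
    if dssm_result == "0":
        # Node 3 is split further by Age.
        if age <= 51:
            return "Non-Relapse"   # node 5
        return "Relapse"           # node 6
    raise ValueError(f"unexpected DSSM Result: {dssm_result!r}")


# Node sizes at the first level of the tree, as reported in the text.
node_sizes = {1: 231, 2: 87, 3: 130, 4: 62}
total_patients = sum(node_sizes.values())

# Overall accuracy from the reported count of correctly classified patients.
correctly_classified = 459
accuracy = correctly_classified / total_patients

print(classify_patient("0", 60))
print(f"{total_patients} patients, accuracy = {accuracy:.0%}")
```

The node sizes sum to the 510 patients in the sample, and 459 correct classifications out of 510 corresponds to the stated overall accuracy of 90%.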

Conclusions
In this study, the CHAID classification tree technique was applied to a dataset of 510 TB patients to predict and analyze patient category relapse. The CHAID prediction model was convenient and useful for evaluating the relationships among the variables used to predict relapse in the TB treatment category. A model was developed based on TB patient input variables gathered from the ITIS database of the City Health Office. The variables DSSM Result and Age are the strongest indicators for the prediction of patient category relapse treatment. From the classification matrix, the overall accuracy of the model is 90%, with a prediction risk of only 10%.
As future work, the author plans to create models with a three-year dataset to attain more precise results, and to apply additional techniques to the dataset.
The statistical method was used to quantify and analyze the data and to generalize results from a sample population. The study was conducted in Cabanatuan City, Nueva Ecija, Philippines. The data came from the TB patient records of Cabanatuan, which were extracted from the Integrated Tuberculosis Information System (ITIS) database of the Cabanatuan City Health Office in 2017. This confirms the correctness and comparability of the data, which are significant features in the CHAID model. The collected data were coded and analyzed using the Statistical Package for the Social Sciences (SPSS Statistics 17.0). The study protocol for data collection and interview was approved by the Office of the City Mayor and the director of the City Health Office.
DSSM Result of TB patients has the most power in dividing the observations into groups and is most strongly associated with the dependent variable. (The statistical significance of DSSM Result was determined using the following values: chi-square = 111.459, df = 3, p-value = 0.000.) As the first discriminator, DSSM Result splits the root node of 510 respondents into four groups, presented as node 1, node 2, node 3, and node 4.
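As a consistency check on the reported degrees of freedom, and assuming the test is Pearson's chi-square applied to the four DSSM Result groups against the two Patient Category values, the statistic and its degrees of freedom take the standard form below. The symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count), $r$ (number of predictor groups), $c$ (number of outcome categories), and $N$ (sample size) are notation introduced here for illustration.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row total})_i \, (\text{column total})_j}{N},
\qquad
df = (r - 1)(c - 1) = (4 - 1)(2 - 1) = 3
```

Under this assumption the result agrees with the reported df = 3 for the DSSM Result split.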
The CHAID (Chi-Square Automatic Interaction Detection) algorithm is one of the most prevalent statistically based methods of supervised learning for decision tree development, proposed by the statistician Kass in the late 1970s. The CHAID acronym denotes an automatic, iterative technique of tree development based on Pearson's Chi-square statistic and the corresponding p-value. The CHAID analysis builds a predictive model to help determine how variables are best combined to explain the outcome in the specified dependent variable.
Patient Category is defined as the dependent variable. Because Patient Category is a nominal variable with two values (non-relapse, relapse), model creation can be based on the Chi-square splitting criterion. As observed in Table 1, 404 patients were aged 51 and below and 106 were older than 51.

Table 1. Structure of variables used in CHAID analysis.
Legend: MS is Measurement Scales; NV is Nominal Variable; SV is Scale Variable; fi is frequency; % is Percentage; IV is Independent Variable; DV is Dependent Variable.

Table 3. Classification matrix.
The classification matrix is presented in Table 3; it contains the empirical and modeled (predicted) classifications by category of the dependent variable.