Tuberculosis (TB) remains a major public health problem worldwide, including in the Philippines. Treatment relapse continues to place a severe burden on patients and TB programs worldwide. A significant reason for relapse is poor compliance with medical treatment. The objectives of this research are to build a predictive data mining model that classifies treatment relapse among TB patients and to identify the features that influence the treatment relapse category. The TB patient dataset was applied to and tested with the J48 decision tree algorithm in WEKA. The J48 model identified three (3) significant independent variables (DSSM Result, Age, and Sex) as predictors of the treatment relapse category.
Tuberculosis (TB) remains the deadliest infectious disease worldwide, with 10.4 million infections and a death toll of 1.7 million people in 2016, according to the World Health Organization (WHO) statistics [
Infection with tuberculosis can lead to life-threatening complications. Unfortunately, there is little information about how patients come to be categorized as treatment relapse cases. Data mining is the exploration of large datasets to extract hidden and previously unknown patterns and relationships [
Technically, “data mining” is the process of finding patterns or relationships among many attributes in large relational databases.
Data mining has been widely used in studies on the prognosis and diagnosis of many diseases. Ferreira et al. [
However, none of the authors mentioned above used the J48 decision tree algorithm to predict the treatment relapse category of TB patients, which is the gap this paper addresses. J48 is well suited to this kind of study because it handles both categorical and continuous attributes when building a decision tree. To handle a continuous attribute, J48 splits its values into two partitions at a selected threshold, so that all values above the threshold form one child node and the remaining values form the other. It also handles missing attribute values. J48 uses the gain ratio as its attribute selection measure, which removes the bias of information gain toward attributes with many outcome values. J48 uses pessimistic pruning to remove unnecessary branches from the decision tree and improve classification accuracy [
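The threshold split and the gain ratio described above can be illustrated with a short sketch. This is not WEKA's J48 implementation, which includes refinements omitted here; the function names and the toy ages and labels are hypothetical and invented only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio_for_threshold(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` versus `value > threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    total = len(labels)
    weights = (len(left) / total, len(right) / total)
    # Information gain: entropy before the split minus weighted entropy after it.
    info_gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    # Split information penalizes splits by how they partition the records.
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


def best_threshold(values, labels):
    """Try each distinct value except the largest as a cut point; keep the best one."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio_for_threshold(values, labels, t))


# Hypothetical toy data, NOT taken from the study's dataset.
ages = [25, 30, 38, 41, 45, 52, 60, 63]
labels = ["Non-Relapse"] * 4 + ["Relapse"] * 4

if __name__ == "__main__":
    t = best_threshold(ages, labels)
    print(t, gain_ratio_for_threshold(ages, labels, t))
```

On this toy data the best cut point separates the two classes perfectly. On real data the chosen threshold is simply the candidate with the highest gain ratio.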
This article aims to build a predictive data mining model for treatment relapse and to identify which factors influence the TB treatment relapse category. The model was built by applying a data mining technique to data provided by the Integrated Tuberculosis Information System (ITIS) website of the Department of Health (DOH).
This research follows a quantitative design that uses computational, mathematical, and statistical tools to analyze and examine the data and to simplify the outcomes drawn from the dataset. The dataset of TB patients came from Cabanatuan City and was extracted in 2017 from the Integrated Tuberculosis Information System database of the City Health Office. This supports the comparability and accuracy of the data, which are essential for the J48 model. The collected data were examined and coded using WEKA (Waikato Environment for Knowledge Analysis), a collection of machine learning algorithms for data mining tasks. The study procedure of interview and data collection was approved by the Office of the City Mayor and the City Health Director of Cabanatuan City.
The researchers used data mining, with the J48 decision tree as the method, to design the prediction model for treatment relapse of TB patients. Data mining sorts through enormous datasets to identify patterns and establish relationships, solving problems through data analysis. Data mining tools are used by enterprises to forecast future trends [
The WEKA platform [
| Attribute | Value (modalities) | fi | % | Structure | Type of Attribute |
|---|---|---|---|---|---|
| Age | <10 | 166 | 33 | Numeric | Independent |
|  | 10 - 19 | 32 | 6 |  |  |
|  | 20 - 28 | 56 | 11 |  |  |
|  | 29 - 38 | 56 | 11 |  |  |
|  | 39 - 47 | 59 | 12 |  |  |
|  | 48 - 57 | 92 | 18 |  |  |
|  | 58 - 66 | 37 | 7 |  |  |
|  | 67 - 76 | 8 | 2 |  |  |
|  | >76 | 4 | 1 |  |  |
| Sex | Male | 343 | 67 | Nominal | Independent |
|  | Female | 167 | 33 |  |  |
| BacStatus (Bacteriological Status) | Bacteriologically-Confirmed TB | 258 | 51 | Nominal | Independent |
|  | Clinically-Diagnosed TB | 252 | 49 |  |  |
| DSSMResult (Direct Sputum Smear Microscopy) | ODT (Observed Direct Treatment) | 174 | 34 | Nominal | Independent |
|  | 0 | 130 | 26 |  |  |
|  | 1+ | 87 | 17 |  |  |
|  | 2+ | 57 | 11 |  |  |
|  | 3+ | 62 | 12 |  |  |
| Classification | Pulmonary | 507 | 99 | Nominal | Independent |
|  | Extra-Pulmonary | 3 | 1 |  |  |
| Patient Category | Non-Relapse | 455 | 89 | Nominal | Dependent |
|  | Relapse | 55 | 11 |  |  |
In the age attribute, 37 patients were in the 58 to 66 age group; 8 patients were in the 67 to 76 age group; and 4 patients were more than 76 years old. For the sex attribute, the data showed more males than females. For bacteriological status, 51% of the TB patients had bacteriologically-confirmed TB, while 49% had clinically-diagnosed TB. For the DSSM result, “ODT” and “0” together accounted for the highest percentage (60%) of TB patients. Almost all of the tuberculosis cases (99%) were classified as pulmonary, while in the patient category the majority were non-relapse.
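As a quick check on the descriptive statistics, the percentage column can be recomputed from the frequency counts. The sketch below does this for two attributes, using counts copied from the attribute table; the function name is hypothetical.

```python
def percentage_distribution(frequencies):
    """Convert a mapping of value -> frequency into rounded whole-number percentages."""
    total = sum(frequencies.values())
    return {value: round(100 * count / total) for value, count in frequencies.items()}


# Frequency counts copied from the attribute table (n = 510).
sex = {"Male": 343, "Female": 167}
patient_category = {"Non-Relapse": 455, "Relapse": 55}

# Combined share of the "ODT" and "0" DSSM results (174 and 130 patients).
odt_and_zero_share = round(100 * (174 + 130) / 510)

if __name__ == "__main__":
    print(percentage_distribution(sex))
    print(percentage_distribution(patient_category))
    print(odt_and_zero_share)
```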
The main objective of data visualization is to communicate information efficiently and clearly. It makes complex data more usable, accessible, and understandable.
J48 Model Rule Sets
Rule 1:
If (DSSMResult = “1+”) or (DSSMResult = “2+”) or (DSSMResult = “3+”) or
(DSSMResult = “ODT”) then Prediction = “Non-Relapse”
Rule 2:
If (DSSMResult = “0”) and (Age <= 41) then
Prediction = “Non-Relapse”
Rule 3:
If (DSSMResult = “0”) and (Age > 41) and (Sex = “F”) then
Prediction = “Non-Relapse”
Rule 4:
If (DSSMResult = “0”) and (Age > 41) and (Sex = “M”) then
Prediction = “Relapse”
This J48 model shows that the three important factors for predicting relapse are DSSMResult, age, and sex. The attribute DSSMResult appears as the first splitting attribute, which indicates its importance as a predictor. The model can be interpreted as follows: if the TB patient’s DSSMResult is equal to “1+”, “2+”, “3+”, or “ODT”, the model predicts non-relapse. However, if the patient’s DSSMResult is equal to “0”, the model examines the age of the patient. If the patient is 41 years old or younger, the model predicts non-relapse.
On the other hand, if the patient is older than 41 years, the model examines the sex of the patient. If the patient is female (F), the model predicts non-relapse. Otherwise, relapse is predicted. According to this model, when the DSSMResult is “0”, the age is greater than 41 years, and the sex is “M”, the patient is at high risk of having to repeat TB treatment, that is, of becoming a relapse case.
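The four rules above translate directly into a small function. The sketch below is a minimal transcription of the rule set, not the WEKA model itself; the function name and the choice to raise an error on values the rules do not cover are assumptions made for illustration.

```python
def predict_patient_category(dssm_result: str, age: float, sex: str) -> str:
    """Apply the four J48 rules to one patient record."""
    # Rule 1: any DSSM result other than "0" predicts non-relapse.
    if dssm_result in {"1+", "2+", "3+", "ODT"}:
        return "Non-Relapse"
    if dssm_result == "0":
        # Rule 2: smear result "0" and age at most 41.
        if age <= 41:
            return "Non-Relapse"
        # Rule 3: smear result "0", older than 41, female.
        if sex == "F":
            return "Non-Relapse"
        # Rule 4: smear result "0", older than 41, male.
        if sex == "M":
            return "Relapse"
    # Assumption: inputs outside the rule set are rejected rather than guessed.
    raise ValueError(f"No rule covers DSSMResult={dssm_result!r}, Sex={sex!r}")


if __name__ == "__main__":
    print(predict_patient_category("0", 55, "M"))
    print(predict_patient_category("2+", 55, "M"))
```

Only one path through the tree leads to a relapse prediction, so the model flags a single patient profile as high risk.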
In this research, the researchers used the Waikato Environment for Knowledge Analysis tool in
$$\text{Accuracy (ACC)} = \frac{TN + TP}{TP + FP + TN + FN}$$

$$\text{Accuracy (ACC)} = \frac{434 + 27}{27 + 28 + 434 + 21}$$

$$\text{Accuracy (ACC)} = 90.39\%$$
| Predicted (n = 510) | TB Treatment Category: Relapse | TB Treatment Category: Non-Relapse |
|---|---|---|
| Relapse | True Positive TP = 27 | False Positive FP = 28 |
| Non-Relapse | False Negative FN = 21 | True Negative TN = 434 |
Accuracy (ACC) is the proportion of all TB patient predictions that are correct. True Positive (TP) is the number of actual TB relapse cases that are correctly classified as relapse, and True Negative (TN) is the number of actual non-relapse cases that are correctly classified as non-relapse. The proportion of correctly classified instances is 90.39%.
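The accuracy figure can be reproduced from the four confusion-matrix counts. The following sketch uses only the counts reported in the confusion matrix; the function name is hypothetical.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Proportion of correct predictions: (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


# Counts as reported in the confusion matrix (n = 510).
acc = accuracy(tp=27, fp=28, tn=434, fn=21)

if __name__ == "__main__":
    print(f"{acc:.2%}")  # prints 90.39%
```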
This paper presents an efficient data mining procedure for predicting TB relapse from patients' medical records. The researchers built the J48 classifier using WEKA and trained it on a preprocessed TB dataset. The J48 classifier is used to increase the accuracy rate of the data mining procedure. The J48 algorithm predicted the patient category of the TB data with an accuracy of 90.39%, which is reasonable enough for the system to be relied on for predicting the relapse category. To obtain an unbiased estimate of the method's prediction accuracy, the 10-fold cross-validation procedure was used. The J48 prediction model was useful for assessing the consistency among the attributes used to predict the TB treatment relapse category. The model was built from input variables gathered from the ITIS database of the City Health Office of Cabanatuan. The attributes DSSM Result, Age, and Sex are the most critical factors for predicting the patient relapse category.
It is recommended that other IT and computer science experts undertake medically related studies using the J48 prediction model and continue to investigate and evaluate the technique [
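To make the 10-fold cross-validation procedure concrete, the sketch below shows how 510 records can be partitioned so that each record is used for testing exactly once. It is a plain shuffled k-fold split under assumed names and an assumed seed; WEKA's own procedure may additionally stratify folds by class, which this sketch does not do.

```python
import random


def k_fold_indices(n_records, k=10, seed=1):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(510, k=10))

if __name__ == "__main__":
    for train_idx, test_idx in splits:
        print(len(train_idx), len(test_idx))
```

With 510 records and 10 folds, each round trains on 459 records and tests on the remaining 51. The reported accuracy is then the share of correct predictions pooled over all ten test folds.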
The researchers are grateful for the support offered to them by the City Health Office of Cabanatuan and the Nueva Ecija Provincial Health Office.
The authors declare no conflicts of interest regarding the publication of this paper.
Cruz, A.P.D. and Tumibay, G.M. (2019) Predicting Tuberculosis Treatment Relapse: A Decision Tree Analysis of J48 for Data Mining. Journal of Computer and Communications, 7, 243-251. https://doi.org/10.4236/jcc.2019.77020