Hyperparameter Tuning Based Machine Learning Classifier for Breast Cancer Prediction
1. Introduction
Cancer is one of the leading causes of death worldwide. Breast cancer is the second most prevalent malignancy globally and predominantly affects women. In Bangladesh, nearly 22.5 new cases of breast cancer per 100,000 women have been reported [1]. Among Bangladeshi women aged 15 to 44, breast cancer has the highest incidence of any cancer type (19.3 per 100,000). According to WHO data published in 2020, breast cancer deaths in Bangladesh reached 6808, or 0.95% of total deaths. If breast cancer is detected early, it can be treated more easily and with fewer risks, which lowers the mortality rate by 25%.
To determine whether a patient has cancer, most clinicians perform a biopsy. A benign tumor is less harmful than a malignant one, so a benign result generally means the patient is not in immediate danger. Benign tumors can be treated, whereas malignant cancer is difficult to reverse and spreads to other parts of the body [2]. Neither a definitive cure nor fully effective outpatient care has yet been established for this cancer. At present, doctors can often save the lives of affected patients only by surgically removing the diseased tissue. Early detection and diagnosis are therefore all the more important in lowering the mortality rate from breast cancer.
After a breast tumor is found, the most difficult task is determining whether it is benign or malignant. Modern early detection of breast cancer (BC) uses a variety of machine learning (ML) methods. ML techniques can swiftly extract information from massive amounts of data, which is then used to predict outcomes. ML classification is therefore helpful in many sectors for early prediction and diagnosis. Many strategies are used to predict BC, and the prediction accuracy achieved with ML techniques continues to improve. ML classification involves four basic phases: data collection, model selection, model training, and testing.
Roy et al. applied six ML algorithms to the WDBC dataset to predict breast cancer: logistic regression (LR), K-nearest neighbors (K-NN), support vector machine (SVM), Naive Bayes (NB), decision tree (DT), and random forest (RF). SVM and LR proved to be the most accurate, each scoring 98.245% accuracy [2]. Their results could potentially be improved further with a different methodology and dataset.
Chaurasiya et al. [3] compared the accuracy of four popular ML classification models (LR, KNN, random forest tree (RDT), and SVM) on the WDBC dataset and found that RDT achieved the highest accuracy, 95%. A shortcoming of that study is that it did not evaluate other classification algorithms on varied and comparably large datasets, which would allow a more conclusive generalization and further reduce misclassification.
Kim et al. [4] presented an easy-to-use ML prediction tool for pathological complete response (pCR) in breast cancer patients treated with neoadjuvant chemotherapy (NAC), training it with the two-class Bayes point machine technique. They used clinical characteristics and gene expression patterns. The gene-based prediction model achieved an accuracy of 0.875 and an area under the ROC curve (AUC) of 0.909. A second model without gene data achieved an accuracy and an AUC of 0.800 each. The first limitation of that study is the small number of patients recruited; the second is that only internal validation was conducted.
A review of the approaches used by numerous researchers [2] - [14] to predict breast cancer with the WDBC dataset shows that they all evaluated model performance using accuracy, precision, recall, and F1 score. However, more attention is needed on raising accuracy through alternative methods, data preprocessing, and similar measures, because this illness is extremely harmful to patients and is becoming more prevalent. A higher accuracy rate would help healthcare professionals predict breast cancer early, before it becomes fatal.
The aim of this study is to apply five ML classifiers to the WDBC dataset for breast cancer prediction: logistic regression, decision tree, random forest, K-nearest neighbors, and Naive Bayes. To enhance performance, we select suitable classifier hyperparameters by tuning them with a grid search. Default classifier settings do not perform well on every dataset, which is why hyperparameter tuning is used. The parameters that performed best on this dataset were then selected to obtain a more accurate result.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the research methodology, including data collection, data pre-processing, and a general introduction to the algorithms used. Section 4 presents the experimental findings. Section 5 gives the overall conclusion together with suggestions for future research, followed by the acknowledgements and references.
2. Related Work
Cancer is among the world’s most dangerous and prevalent illnesses. It takes many forms, including breast, lung, ovarian, and brain cancer. Of these malignancies, breast cancer is the most damaging form of the disease globally [15]. This section summarizes the contributions and characteristics of existing breast cancer prediction techniques. Researchers have devised numerous machine learning classification strategies to predict breast cancer.
Bazazeh et al. [5] analyzed ML classifiers (SVM, RF, NB) on the WBC dataset for the detection and diagnosis of breast cancer and compared them on key measures such as accuracy, precision, recall, and the ROC curve. With reported accuracies of 96.6% for SVM, 99.9% for RF, and 99.1% for NB, RF was the most accurate of the three.
Chaurasiya et al. [3] examined the accuracy of four well-known ML classification models (LR, KNN, random forest tree (RDT), and SVM) on the WDBC dataset. Among these classifiers, RDT achieved the highest accuracy, 95%.
Assegie [6] proposed a model for detecting breast cancer using an optimized KNN. To increase detection accuracy, hyperparameter tuning with a grid search was used to identify the best value of K. The tuned model reached an accuracy of 94.35%, compared with 90.10% for KNN with the default hyperparameter value.
Nurul et al. [7] examined the efficacy of several ML techniques for predicting breast cancer survival. They applied 10-, 5-, 3-, and 2-fold cross-validation to ML approaches such as KNN, RF, SVM, and ensemble methods on the WBCD dataset to attain the highest predictive performance. The AdaBoost ensemble approach achieved accuracies of 98.77% with 10-fold, 98.41% with 2-fold, and 98.24% with 3-fold cross-validation. With 5-fold cross-validation, SVM had the lowest error rate and the highest accuracy, 98.60%.
Gupta et al. [8] applied deep learning (trained with Adam gradient descent) and machine learning (DT, KNN, RF, LR, SVM) to classify malignant and benign cells in the WBC dataset. The deep learning model produced the most accurate results (98.24%) with the least loss, because Adam combines the advantages of AdaGrad and RMSProp. RMSProp performs well with nonstationary signals, while AdaGrad is well suited to computer vision problems.
The objective of Ara et al. [9] was to analyze the WBC dataset and assess the effectiveness of several ML classifiers (DT, SVM, K-NN, LR, RF, and NB) for breast cancer prediction. With an accuracy of 96.5%, RF and SVM performed better than the other classifiers.
Amrane et al. [10] compared two ML classifiers, Naive Bayes (NB) and k-nearest neighbor (KNN), for breast cancer classification on the WBC dataset. Cross-validation was used to evaluate the two models and assess their accuracy. The findings show that KNN offers higher accuracy (97.51%) and a lower error rate than the NB classifier (96.19%).
Table 1 summarizes the literature reviewed. Column 1 gives the reference number, column 2 the year, column 3 the dataset, column 4 the algorithms employed, and column 5 the performance of those algorithms.
Table 1. Comparison of publicly available prediction models.
3. Methodology
To ascertain whether a tumor is cancerous (malignant) or harmless (benign), we set up a series of steps designed to give the most trustworthy results for decision-making. Our general methodology is presented in the following subsections: Dataset Description, Dataset Collection, Data Pre-processing, Feature Selection, and Algorithm Used.
As shown in Figure 1, the WDBC dataset was first collected. The data were then examined for duplicates and missing values. No missing data were found, so no missing-data handling was required. After these checks, the data were split into training (70%) and testing (30%) sets, and the features were scaled using standard scaling. Finally, to assess and compare performance, we built each algorithm both with its default settings and with tuned hyperparameters.
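The workflow above can be sketched in a few lines. This is a minimal illustration rather than the study's actual code: it assumes scikit-learn's bundled copy of WDBC in place of the Kaggle file, and the `stratify` and `random_state` settings, as well as fitting the scaler on the training split only, are illustrative choices that the text does not specify.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's copy of WDBC stands in for the Kaggle CSV.
# Note that it encodes the target as 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Check for duplicate rows and missing values before splitting.
n_duplicates = X.shape[0] - np.unique(X, axis=0).shape[0]
n_missing = int(np.isnan(X).sum())

# 70% training / 30% testing (stratify and random_state are illustrative).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Standard scaling: fit on the training split, then apply to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape, n_duplicates, n_missing)
```

With 569 records, a 30% test split yields 171 test samples and 398 training samples.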
3.1. Dataset Description
The WDBC dataset was created by Dr. William H. Wolberg of the University of Wisconsin Hospital in Madison, Wisconsin, United States. It contains 32 columns: the first is the “ID” and the second is the “diagnosis outcome” (0 = benign, 1 = malignant). The remaining columns (3 - 32) hold three measurements (mean, standard error, and “worst”, the mean of the three largest values) for each of 10 attributes. These attributes describe the size and shape of the cell nuclei in the sample. In a biopsy test, a sample of breast cells is taken using the Fine Needle Aspiration (FNA) technique, and each cell’s nucleus is examined under a microscope in a pathology lab to measure these traits. All feature values are recorded with four significant digits. No null values were observed in the sample. The ten real-valued attributes are given in Table 2.
3.2. Dataset Collection
The WDBC dataset was obtained from Kaggle and is used to predict breast cancer; it has 569 instances and 32 columns in total (an ID, the diagnosis, and 30 features). Here is a sample.
3.3. Data Pre-Processing
The WDBC dataset is first inspected, and unnecessary columns such as the ID and the unnamed column are removed. Because these columns carry no information for predicting breast cancer, they were removed from the dataset to improve performance and accuracy. The features were then scaled using standard scaling.
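To make the two pre-processing steps concrete, the sketch below drops the non-predictive columns and applies standard scaling, z = (x - mean) / standard deviation, to a few toy records. The values are made up, and the column names `id` and `Unnamed: 32` are assumptions about the layout of the Kaggle file.

```python
from statistics import mean, pstdev

# Toy records with made-up values; "id" and "Unnamed: 32" are the assumed
# names of the non-predictive columns in the Kaggle CSV.
rows = [
    {"id": 1, "radius_mean": 10.0, "texture_mean": 20.0, "Unnamed: 32": None},
    {"id": 2, "radius_mean": 14.0, "texture_mean": 18.0, "Unnamed: 32": None},
    {"id": 3, "radius_mean": 18.0, "texture_mean": 25.0, "Unnamed: 32": None},
]
DROP = {"id", "Unnamed: 32"}

# Step 1: remove the columns that carry no predictive information.
cleaned = [{k: v for k, v in row.items() if k not in DROP} for row in rows]


def standard_scale(records):
    """Rescale every feature to zero mean and unit (population) variance."""
    features = list(records[0].keys())
    stats = {}
    for name in features:
        values = [record[name] for record in records]
        stats[name] = (mean(values), pstdev(values))
    return [
        {name: (record[name] - stats[name][0]) / stats[name][1] for name in features}
        for record in records
    ]


# Step 2: standard scaling of the remaining features.
scaled = standard_scale(cleaned)
for record in scaled:
    print({name: round(value, 3) for name, value in record.items()})
```

The population standard deviation is used here because it matches the convention of common standard-scaling implementations; whether the study used this convention is not stated.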
3.4. Feature Selection
Benign vs. malignant cells: Of the 569 records in the dataset, 357 (62.7%) are benign and 212 (37.3%) are malignant. Figure 2 compares the benign and malignant cells in the study data. We chose not to apply a dedicated feature selection technique, because using all features gave good results compared with feature selection strategies such as the correlation coefficient, and because the data are medical in nature.
3.5. Algorithm Used
In this section, we explore which algorithm performs best on the small WDBC dataset. Five of the most popular ML algorithms are used. KNN and DT tend to perform well on small datasets, whereas RF, NB, and LR tend to perform well on large ones. The main goal is to benchmark the approaches against one another and determine the most efficient and robust technique for the WDBC dataset.
K-Nearest Neighbor (KNN): K-Nearest Neighbor is one of the simplest classification techniques, because it does not learn an explicit model from the dataset and its attributes [11]. During the training phase, the algorithm simply stores the data; a new sample is then assigned to the category of the stored samples it most resembles [24]. KNN can be a suitable option for smaller datasets but may not be practical for larger ones.
Decision Tree (DT): A decision tree is a supervised ML approach used for both classification and regression [25]. As its name suggests, it uses a tree structure to separate the classes. The tree has three kinds of elements: decision nodes, which test a feature of the dataset; branches, which represent the decision rules; and leaf nodes, which represent the output [2]. Each decision is made by answering a yes/no question. DT works well for classification problems with few class labels.
Random Forest (RF): Random Forest is an ensemble approach that builds numerous DTs on different subsets of the supplied dataset at training time and combines their outputs (averaging for regression, majority voting for classification) to increase prediction accuracy [26]. It is used for classification, regression, and other applications. Random Forest is well suited to large datasets.
Naive Bayes (NB): This is one of the most well-known and straightforward classification algorithms for predictive modeling. It is a probabilistic classifier suited to quick prediction, assigning a class based on the probability of each outcome [24]. It is a powerful algorithm that works well on large datasets.
Logistic Regression (LR): This is a method from statistics that is used in machine learning to solve classification problems [15]. It is mostly applied to binary classification problems and predicts a binary dependent variable using a logistic function. This algorithm works well on very large datasets.
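The store-then-compare behavior of KNN can be illustrated with a short from-scratch sketch. It assumes Euclidean distance and a simple majority vote, which are common defaults but are not stated in the text; the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just storage: distances are computed at prediction time.
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D points with made-up values (not WDBC measurements).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # benign
print(knn_predict(train_X, train_y, (2.9, 3.1)))  # malignant
```

Because every prediction scans all stored samples, the cost grows with the size of the training set, which is one reason KNN is less practical for large datasets.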
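In symbols, the Naive Bayes decision rule can be written as below. This is the standard textbook form, not a formula given in this study; here $x_1, \dots, x_n$ denote the feature values of a sample and $y$ a class label (benign or malignant). The "naive" part is the assumption that the features are conditionally independent given the class.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

For continuous features such as those in WDBC, each $P(x_i \mid y)$ is often modeled as a Gaussian; whether the study used this variant is an assumption.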
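The five classifiers described in this section can each be tuned with a grid search, as the following sketch suggests. The parameter grids, the 5-fold cross-validation, and the use of scikit-learn's bundled WDBC copy are all assumptions made for illustration; the grids actually searched in the study are not given.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Hypothetical search spaces. Keys follow make_pipeline's naming scheme:
# "<lowercased class name>__<parameter>".
search_spaces = {
    "LR": (LogisticRegression(max_iter=5000),
           {"logisticregression__C": [0.1, 1, 10]}),
    "DT": (DecisionTreeClassifier(random_state=42),
           {"decisiontreeclassifier__max_depth": [3, 5, None]}),
    "RF": (RandomForestClassifier(random_state=42),
           {"randomforestclassifier__n_estimators": [50, 100]}),
    "KNN": (KNeighborsClassifier(),
            {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}),
    "NB": (GaussianNB(),
           {"gaussiannb__var_smoothing": [1e-9, 1e-7, 1e-5]}),
}

results = {}
for name, (estimator, grid) in search_spaces.items():
    # Scaling lives inside the pipeline so each CV fold is scaled separately.
    pipeline = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipeline, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = (search.best_params_, search.score(X_test, y_test))

for name, (params, test_accuracy) in results.items():
    print(f"{name}: best={params} test accuracy={test_accuracy:.4f}")
```

The accuracies printed by this sketch depend on the split and grids chosen here and should not be expected to reproduce the values reported in Table 3 and Table 4.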
4. Experimental Results
In this section, we examine the performance of the ML algorithms on the dataset. This is done by running the trained algorithms on the held-out test set, which contained 30% of the total dataset. To determine the accuracy, precision, recall, F1 score, and AUC and ROC curve for each method, a confusion matrix (Figure 3) consisting of TP, FP, TN, and FN is constructed from the actual and predicted results. The terms are interpreted as follows.
TP: True Positive (Correctly Identified)
FP: False Positive (Incorrectly Identified)
TN: True Negative (Correctly Rejected)
FN: False Negative (Incorrectly Rejected)
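As a small illustration, the four counts can be tallied by comparing actual and predicted labels. The sketch treats the malignant class (label 1) as positive, and the labels are made up; the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Made-up labels: 1 = malignant (positive), 0 = benign (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))  # (3, 1, 3, 1)
```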
4.1. Accuracy
Accuracy tells you how often the ML model was correct overall. It is calculated as the number of correct predictions divided by the total number of instances in the dataset. Note that accuracy varies across testing sets and depends on the classifier’s threshold selection. Accuracy is calculated using Equation (1).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
4.2. Precision
Precision measures how reliable the model’s predictions of a specific category are. It is the proportion of predicted positives that are actual positives, as shown in Equation (2).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
4.3. Recall
Recall is the proportion of actual positives that the model correctly identified (found). The mathematical formula is shown in Equation (3).
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
4.4. F1 Score
The F1 score combines two measures that are normally in tension, precision and recall, into their harmonic mean. It summarizes the predictive capability of a model in a single number. The mathematical formula is shown in Equation (4).
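$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$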
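Equations (1) to (4) can be evaluated directly from the four confusion-matrix counts. The sketch below uses the standard definitions of these metrics; the counts are hypothetical and are not taken from the study's confusion matrices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 100 test samples (illustration only).
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 0.85, precision is 0.90, recall is about 0.818, and F1 is about 0.857, which shows how F1 sits between precision and recall.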
4.5. AUC & ROC Curve
The ROC curve is a graphical representation of the True Positive Rate (TPR) plotted against the False Positive Rate (FPR) for different threshold values of the model’s predicted probabilities. AUC is a metric that quantifies the area under the ROC curve. It has a value ranging from 0 to 1, where 0.5 indicates a random classifier, and 1 represents a perfect classifier. The performance of the tuned model is illustrated in Figure 4 using the AUC and ROC curve.
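The construction of the ROC curve and its AUC can be sketched without any library. The scores and labels below are made up, and the helper names `roc_points` and `auc_trapezoid` are hypothetical; the area is computed with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up predicted probabilities for the positive class (1 = positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(f"{auc_trapezoid(roc_points(y_true, scores)):.4f}")  # 0.8889
```

The result equals 8/9: of the nine positive-negative pairs in this toy example, eight are ranked correctly, which is the usual probabilistic reading of AUC.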
The results in Table 3 and Table 4 show that, with hyperparameter tuning, the KNN classifier performs best in this study in terms of accuracy, precision, recall, and F1 score. Based on these findings, KNN is the most accurate of the five classifiers considered for predicting breast cancer. Figure 5 shows a graphical representation of these results.
Figure 4. AUC and ROC curve after tuning.
Table 3. Performance evaluation without hyperparameter tuning.
Table 4. Performance evaluation with hyperparameter tuning.
Table 5 compares the proposed approach, hyperparameter-tuned BC prediction on the WDBC dataset, with existing work, using the accuracy of KNN only. Comparing the KNN results with the other state-of-the-art studies in Table 5, we conclude that the proposed method surpasses all other approaches mentioned in the literature. Figure 6 shows this comparison graphically.
Figure 6. Result comparison with existing work.
Table 5. Result comparison with existing work.
5. Conclusion
Breast cancer is a leading cause of mortality in women. This study proposed a method for predicting breast cancer in which five ML classifiers (LR, DT, RF, KNN, and NB) were applied to the WDBC dataset to build a breast cancer prediction model. The study differs from the conventional approach in that the hyperparameters are tuned using grid search. Without hyperparameter tuning, the accuracy rates of the DT, RF, KNN, NB, and LR classifiers are 93.56%, 97.08%, 96.49%, 95.91%, and 96.49%, respectively. With hyperparameter tuning, the DT, RF, KNN, NB, and LR classifiers achieve accuracy rates of 94.15%, 97.08%, 98.83%, 95.91%, and 97.08%, respectively. Comparing the classifiers, we found that KNN provides the highest accuracy (98.83%) and works well with the proposed approach. In future work, this accuracy could be made more robust by expanding the dataset, and the work could be extended beyond cancer prediction to detecting the stage of a patient’s cancer.
Acknowledgements
We would like to take this opportunity to express our gratitude to the Center for Artificial Intelligence and Robotics (CAIR).