1. Introduction
Heart disease is a major global health concern and the world's leading cause of mortality. The World Health Organization estimates that cardiovascular diseases caused 17.9 million deaths worldwide in 2016, accounting for 31% of all deaths [1]. Early detection and prevention can greatly reduce heart disease mortality and major complications. The importance of correctly predicting heart disease cannot be overstated, since a misdiagnosis can have grave consequences. An estimated 17.9 million deaths globally are attributed to heart disease each year [2]. Several risk factors, including obesity, hypertension, and high cholesterol, are responsible for the rise in heart disease [3]. However, diagnosing heart disease accurately can be difficult, since its symptoms often resemble those of other illnesses [4].
Machine learning has become increasingly important in the medical field, with the potential to improve the diagnosis and prediction of diseases [5]. Studies have shown that machine learning models can classify and predict heart disease diagnoses, with some achieving accuracy rates of up to 94.60% [6]. Algorithms such as CART [7], deep neural networks [8], random forest [9], and support vector machines [10] have been used to detect heart disease. Despite these promising results, heart disease prediction with machine learning faces challenges, chief among them high data dimensionality. High-dimensional data can cause overfitting, requiring feature engineering and selection techniques to reduce redundancy and processing time [11]. Dimensionality reduction techniques such as PCA can also be used to condense valuable information into new components [12].
Several studies [13]-[15] have been conducted using the heart disease 2020 project dataset, a popular benchmark dataset for heart disease prediction [16]. These studies used K-Nearest Neighbours, Random Forest, Decision Tree, Support Vector Machine, and Extreme Gradient Boosting classifiers, achieving high accuracy rates, including 83.2% for random forest, 86.1% for decision tree, and 82.7% for SVM, among others. However, there is still room for improvement, particularly in reducing the dimensionality of the dataset and applying appropriate classification techniques. Therefore, this study adopted random forest as the feature selection estimator and the Extra Tree machine learning classifier for the prediction of heart disease.
1.1. Research Problem and Objective of the Study
Heart disease is a significant global health issue, being the leading cause of death worldwide. Early and accurate prediction of heart disease is crucial to prevent severe complications and fatalities. Machine learning algorithms have shown promising results in predicting heart disease, but challenges remain. The main problem is the high dimensionality of the data used by these algorithms, which can lead to overfitting and require large amounts of memory for processing. Previous studies have used machine learning models such as Logistic Regression (LR), K-Nearest Neighbours (KNN), Decision Trees (DT), Extreme Gradient Boosting (XGBoost), Naïve Bayes (NB), Support Vector Machines (SVM), and Random Forest (RF) to predict heart disease using the heart disease 2020 project dataset. These models yielded accuracy scores of 91%, 93.3%, 86.5%, 91.4%, 84.7%, 91.5%, and 90.5% respectively. This leaves room for improvement, as no previous study has applied the Extra Tree machine learning classifier to this dataset. Therefore, this study develops a machine learning model using the Extra Tree classifier to predict heart disease on the heart disease 2020 project dataset.
The findings of this study are expected to contribute to the field of healthcare by providing a reliable and accurate method for predicting heart disease and also making heart disease diagnosis available to the public. This can aid healthcare professionals in early diagnosis and treatment planning, ultimately improving patient outcomes.
2. Review of Literature
A novel study on machine learning algorithm-based cardiovascular disease prediction was carried out in [17]. The authors aimed to develop a machine learning algorithm for accurate prediction and decision-making for cardiovascular disease (CVD) patients. Machine learning algorithms (DT, RF, LR, NB, SVM) were implemented for classification and prediction. The random forest (RF) algorithm achieved the highest accuracy, sensitivity, and receiver operating characteristic curve for CVD classification and prediction. However, the study did not report the sample size or representativeness of the study population, and there was no discussion of potential biases or limitations of the machine learning algorithms used. Another study, [18], performed predictive modeling of cardiovascular disease using machine learning techniques, comparing random forest, decision tree, gradient-boosted tree, and logistic regression; the gradient-boosted tree algorithm obtained more accurate results than the other methods.
In an attempt to enhance heart disease prediction through ensemble learning techniques with hyperparameter optimization algorithms, the study in [19] reported promising results, but the authors did not demonstrate a realistic approach leading to the development of the ensemble model. The study in [20] predicted cardiovascular disease using feature selection techniques, comparing the accuracy, precision, recall, and f-measure of six different supervised machine learning techniques. Logistic regression and SVM achieved better accuracy than the other classifiers, with scores of 89.6% and 89.8% respectively.
A study in [21] developed an ensemble model through a meta-analytic approach that assessed the predictive ability of machine learning algorithms on cardiovascular disease datasets. A comprehensive search strategy was executed within the MEDLINE, Embase, and Scopus databases. The study included 103 cohorts with a total of 3,377,318 individuals and showed that ensemble models have potential for predicting heart disease, with the best model benchmarked at a 90% accuracy score. The study in [22] aimed to predict death caused by cardiovascular disease within ten years of follow-up. The machine learning models used were Logistic Regression, Support Vector Machine, Random Forest, Naïve Bayes, Extreme Gradient Boosting, and Adaptive Boosting, evaluated with Accuracy, Precision, Recall, F1-Score, Specificity, and AUC. The study used clinical CVD risk factors and biochemical data, and the outcome showed that Logistic Regression was the most reliable algorithm, with 72.20% accuracy.
To implement early identification and treatment of cardiovascular disease, the authors in [23] used neural networks, random forests, Bayesian networks, C5.0, and QUEST on a Kaggle dataset of 70,000 patient records for training and validation. An accuracy score of 93% was recorded, a promising sign for early detection. The study in [24] proposed a fundamental model capable of running various heart disease analysis algorithms, with each model catering to a specific type of data. Optimization techniques were employed to enhance data classification and prediction accuracy. The outcome was a predictive framework for various heart diseases using decision tree, random forest, logistic regression, and KNN algorithms, offering an easy-to-use interface.
In the quest to improve the prediction accuracy of existing models, the study in [25] developed a model using k-modes clustering with Huang initialization to improve classification accuracy. Models utilized include random forest (RF), decision tree classifier (DT), multilayer perceptron (MP), and XGBoost (XGB). GridSearchCV was used to tune the hyperparameters of the applied models to improve the outcome. The proposed model was tested on a real-world dataset of 70,000 records from Kaggle. Models were trained on data with an 80:20 split and achieved the following accuracy: decision tree, 86.37% (with cross-validation) and 86.53% (without cross-validation); XGBoost, 86.87% (with cross-validation) and 87.02% (without cross-validation); random forest, 87.05% (with cross-validation) and 86.92% (without cross-validation); multilayer perceptron, 87.28% (with cross-validation) and 86.94% (without cross-validation). The proposed models had the following AUC (area under the curve) values: decision tree, 0.94; XGBoost, 0.95; random forest, 0.95; and multilayer perceptron, 0.95. The conclusion drawn from this research is that the multilayer perceptron with cross-validation outperformed all other algorithms, achieving the highest accuracy of 87.28%. The study in [26] provides a thorough analysis of the available machine learning approaches and examines how well they work for effective heart disease diagnosis, treatment, and prediction. It surveys techniques such as support vector machines (SVM), decision trees (DT), Naïve Bayes (NB), K-nearest neighbour (KNN), and artificial neural networks (ANN) that are employed to forecast the development of heart disease. The best and worst performing techniques were determined by calculating the average prediction accuracy for each technique. The results showed that the C4.5 decision tree technique produced the lowest average prediction accuracy of 74.0%, while the ANN achieved the best average prediction accuracy of 86.91%.
The reviewed studies show good accuracy scores for the prediction of heart disease. However, there is room for improvement, as some of the reviewed studies are weak in their feature selection approaches and in the tuning of the learning models. Also, few studies appear to have employed the Extra Tree Classifier. Therefore, this study focuses on selecting features using recursive feature elimination with a Random Forest estimator and on developing a predictive model using the Extra Tree classifier.
2.1. Artificial Intelligence and Machine Learning
Artificial Intelligence (AI) is a broad field that encompasses the development of intelligent machines that can perform tasks that typically require human intelligence. AI systems can mimic human cognitive functions such as learning and problem-solving through the use of mathematics and logic [27].
Machine Learning (ML), on the other hand, is a subset of AI. It is an application of AI that enables a computer system to learn without direct instruction, using mathematical models of data so that the computer can learn and improve on its own through experience [27]. The connection between AI and ML is that an "intelligent" computer uses AI to think like a human and perform tasks on its own, and machine learning is how a computer system develops that intelligence; AI systems are built using machine learning and other techniques. Machine learning models are created by studying patterns in the data, which data scientists then optimize, repeating and refining the process until the models' accuracy is high enough for the tasks at hand.
AI and ML have numerous capabilities that have become valuable in helping companies transform their processes and products. Some of these capabilities include predictive analytics, recommendation engines, speech recognition, natural language understanding, image and video processing, and sentiment analysis [27]. The connection between AI and ML offers powerful benefits for companies in almost every industry. Some of the top benefits that companies have already seen include improved efficiency, better decision-making, and the ability to create new products and services.
2.2. Machine Learning Algorithms
This section identifies and discusses the relevant machine learning algorithms that were considered in this study.
1) Logistic Regression (LR)
Logistic Regression is a statistical model used for binary classification problems. It uses a logistic function to model the probability of a binary response based on one or more predictor variables. The goal of logistic regression is to find the best fitting model to describe the data and to use it for prediction [27].
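In standard form (a textbook illustration rather than the study's specific model), the probability of the positive class is estimated as P(y = 1 | x) = 1 / (1 + e^-(β0 + β1x1 + … + βnxn)), and an observation is classified as positive when this probability exceeds a chosen threshold, commonly 0.5.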
2) K-Nearest Neighbours (KNN)
KNN is a type of instance-based learning algorithm that can be used for both classification and regression problems. It works by finding the k-nearest data points in the feature space to the new data point, and then assigning the new data point to the class that is most common among its k-nearest neighbours [27].
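As a toy illustration of this voting mechanism (a from-scratch sketch on made-up data, not the study's implementation):

```python
import numpy as np

# Classify a new point by majority vote among its k nearest training points.
def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of k closest
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                   # most common class

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([7.9, 7.8])))  # -> 1
```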
3) Decision Trees (DT)
Decision Trees are a type of supervised learning algorithm that is mostly used in classification problems. It works by partitioning the feature space into a series of rectangular regions, with each region corresponding to a leaf node in the tree. The goal of a decision tree is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features [27].
4) Extreme Gradient Boosting (XGBoost)
XGBoost is a gradient boosting algorithm that is used for both regression and classification problems. It works by building an ensemble of weak prediction models, typically decision trees, in a stage-wise fashion. It generalizes them by allowing optimization of an arbitrary differentiable loss function [27].
5) Naïve Bayes (NB)
Naïve Bayes is a family of probabilistic classification algorithms based on Bayes’ theorem. It is called “naïve” because it makes the “naïve” assumption that the features are conditionally independent given the class. Despite this assumption, Naïve Bayes often works well in practice [27].
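Concretely (in its standard textbook form), the classifier assigns the class C that maximizes P(C) · P(x1 | C) · … · P(xn | C), i.e., the class prior multiplied by the per-feature likelihoods, which follows from Bayes' theorem once the features are assumed conditionally independent.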
6) Support Vector Machines (SVM)
SVM is a supervised learning algorithm used for both classification and regression problems. It works by finding the hyperplane that maximally separates the data points of different classes. In cases where no hyperplane exists, SVM uses the kernel trick to transform the data into a higher dimensional space where a separating hyperplane can be found [27].
7) Random Forest (RF)
Random Forest is an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. By aggregating many trees in this way, Random Forest corrects for an individual decision tree's habit of overfitting to its training set [27].
8) Extra Tree Classifier (ETC)
An Extra Tree Classifier is a type of decision tree ensemble algorithm that is used for classification tasks. It differs from a classic decision tree in the way it selects the split at each node. Instead of choosing the best split among all possible splits, Extra Tree Classifier selects the split at random. This randomness helps to reduce overfitting and improves the robustness of the model [28]. The key parameters of the Extra Tree Classifier are:
criterion: The function to measure the quality of a split. The supported criteria are “gini” for the Gini impurity and “entropy” for the Shannon information gain.
splitter: The strategy used to choose the split at each node. The supported strategies are “best” to choose the best split and “random” to choose the best random split.
max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
max_features: The number of features to consider when looking for the best split.
random_state: Controls the randomness of the estimator, including the random draw of the max_features candidates considered at each split.
The Extra Tree Classifier also has a feature_importances_ attribute that can be used to determine the importance of each feature in the classification task.
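As an illustrative sketch (not the study's code), these parameters map directly onto scikit-learn, which ships both a single-tree ExtraTreeClassifier exposing the splitter parameter and the ensemble ExtraTreesClassifier; the toy data and parameter values below are assumptions for demonstration only:

```python
from sklearn.datasets import make_classification
from sklearn.tree import ExtraTreeClassifier

# Toy data purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

tree = ExtraTreeClassifier(
    criterion="gini",      # split quality: "gini" or "entropy"
    splitter="random",     # choose the best of randomly drawn splits
    max_depth=None,        # expand nodes until leaves are pure
    min_samples_split=2,   # minimum samples to split an internal node
    min_samples_leaf=1,    # minimum samples at a leaf node
    max_features="sqrt",   # number of features considered per split
    random_state=42,       # controls the random feature/split draws
)
tree.fit(X, y)
print(tree.feature_importances_)  # importance of each feature
```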
3. Methodology
The dataset used in this study was obtained from the University of California, Irvine (UCI) Machine Learning Repository. It contains over 300,000 patient records with 18 features (17 predictors and 1 target) and is provided as text data in comma-separated value (CSV) format. The dataset in its original form contains some irregularities that were handled through data preprocessing to get it into a machine-trainable format. The data features were normalized or standardized to ensure that each feature had equal weight in the analysis, which is crucial for the prediction accuracy and reliability of the machine learning algorithms. Relevant features contributing to the accurate prediction of heart disease were selected using Recursive Feature Elimination (RFE) and used to train the Extra Tree Classifier (ETC) model. The Extra Tree model was implemented using the Python programming language in the Google Colab environment. Thereafter, the Streamlit Python framework was used to develop an interface for the model. The developed model's performance was evaluated using metrics such as accuracy, precision, recall, and F1-score.
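As a purely hypothetical sketch of the interface layer, a minimal Streamlit front end might look as follows; the model file name, input fields, and feature ordering are illustrative assumptions, not the study's actual application:

```python
# app.py - hypothetical minimal Streamlit front end for a saved model.
import joblib
import streamlit as st

model = joblib.load("extra_tree_model.joblib")  # assumed saved model file

st.title("Heart Disease Prediction")

# Illustrative inputs mirroring a subset of the selected features.
bmi = st.number_input("BMI", min_value=10.0, max_value=60.0, value=25.0)
sleep = st.number_input("Sleep time (hours)", min_value=0.0, max_value=24.0, value=7.0)
stroke = st.selectbox("History of stroke?", ["No", "Yes"])

if st.button("Predict"):
    # Feature order must match the model's training data; only three
    # illustrative inputs are shown here.
    features = [[bmi, sleep, 1 if stroke == "Yes" else 0]]
    prediction = model.predict(features)[0]
    st.write("Heart disease likely" if prediction == 1 else "No heart disease predicted")
```

Such a script would be launched with streamlit run app.py.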
3.1. Experimental Set-Up
The data features were transformed using a label encoder and a scaler: the label encoder converted categorical features into numeric values, and the scaler brought the feature values to a range acceptable to the training algorithms.
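A minimal sketch of this step, assuming the CSV is loaded with pandas and that the binary target column is named HeartDisease (the file and column names are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("heart_2020.csv")  # hypothetical file name

# Encode every categorical (object-typed) column into integer labels.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# Separate predictors and target, then scale the predictors so that
# each feature carries equal weight during training.
X = df.drop(columns=["HeartDisease"])  # assumed target column name
y = df["HeartDisease"]
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
```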
Figure 1 describes the summary of the data preprocessing phase.
Feature selection was performed to avoid possible model over- or underfitting and to ensure that features of high importance are used. This study used the recursive feature elimination (RFE) method, with random forest as the estimator for proper estimation and selection of the features. Figure 2 shows the workflow of the RFE.
Figure 1. Data Preprocessing workflow.
Figure 2. RFE feature selection model.
RFE is a wrapper method for feature selection that utilizes an estimator (like a Random Forest) to iteratively remove features deemed least important. Here's a breakdown of the process as shown in Figure 2, with a code sketch following the steps:
1) Start with All Features: Begin with the entire set of features in your dataset.
2) Train a Random Forest: Create a Random Forest model and train it on the complete dataset.
3) Evaluate Feature Importance: Extract feature importance scores from the Random Forest. These scores indicate how much each feature contributes to the model’s predictions.
4) Remove Least Important Feature(s): Identify the feature with the lowest importance (or a subset of features if you’re removing multiple at once). Eliminate this feature (or features) from the dataset.
5) Repeat: Go back to step 2 and retrain the Random Forest on the reduced dataset.
6) Continue Recursively: Repeat steps 2 - 5 until you reach a desired number of features, a performance metric threshold, or another stopping criterion.
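These steps map directly onto scikit-learn's RFE wrapper. The sketch below, continuing from the preprocessing sketch in Section 3.1, retains the ten highest-ranked features:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Random Forest acts as the estimator whose importance scores
# drive the recursive elimination.
estimator = RandomForestClassifier(n_estimators=100, random_state=42)

# Keep the 10 highest-ranked features, dropping one per iteration.
rfe = RFE(estimator=estimator, n_features_to_select=10, step=1)
rfe.fit(X, y)

# Features with ranking_ == 1 are the retained, highest-ranked ones.
selected_features = X.columns[rfe.support_].tolist()
print(selected_features)
```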
The model was designed using the Extra Tree classifier, and the trained ETC was compared with existing models. Figure 3 shows the general framework of the model for heart disease prediction.
Figure 3. The general heart disease prediction framework.
3.2. Feature Selection Using Recursive Feature Elimination (RFE)
The data features were selected and reduced in order to retain those features that contribute most to the prediction of heart disease. The study used the RFE method with Random Forest as the estimator. Figure 4 shows the features that were selected after applying the feature elimination method, which ranked 10 features above the others by assigning them a value of 1.
Figure 4. RFE ranked features.
The selected features are: ["BMI", "Stroke", "PhysicalHealth", "MentalHealth", "AgeCategory", "GenHealth", "SleepTime"]. These features were used to train the Extra Tree model.
3.3. Model Development
In developing the model, this study implemented the existing models as discussed in the body of literature to verify the obtained results. The models that were verified are: Logistic Regression (LR), K-Nearest Neighbours (KNN), Decision Trees (DT), Extreme Gradient Boosting (XGBoost), Naïve Bayes (NB), Support Vector Machines (SVM), and Random Forest (RF).
Table 1 shows the models and their parameters used for the training.
Table 1. Models and set parameters.
| Models | Parameters & Values |
|---|---|
| Logistic Regression (LR) | solver = "liblinear", C = 0.1, penalty = "l1" |
| K-Nearest Neighbours (KNN) | n_neighbors = 5, metric = "euclidean" |
| Decision Trees (DT) | criterion = "gini", max_depth = 3, min_samples_split = 2, min_samples_leaf = 1 |
| Extreme Gradient Boosting (XGBoost) | objective = "binary:logistic", eval_metric = "auc", max_depth = 5, learning_rate = 0.1, n_estimators = 100 |
| Naïve Bayes (NB) | var_smoothing = 1.0 |
| Support Vector Machines (SVM) | kernel = "rbf", C = 1.0, gamma = 0.1 |
| Random Forest (RF) | n_estimators = 100, max_depth = 5, min_samples_split = 2, min_samples_leaf = 1 |
| Our Model (ExtraTree Classifier) | n_estimators = 100, max_depth = 5, min_samples_split = 2, min_samples_leaf = 1, oob_score = True |
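As a sketch of how the proposed model could be instantiated with the Table 1 parameters, continuing from the earlier sketches (note that in scikit-learn, oob_score = True additionally requires bootstrap = True, an assumption made explicit below):

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Keep only the RFE-selected features and hold out 20% of the data,
# matching the study's 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X[selected_features], y, test_size=0.2, random_state=42
)

etc = ExtraTreesClassifier(
    n_estimators=100,
    max_depth=5,
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,   # scikit-learn requires this when oob_score=True
    oob_score=True,   # out-of-bag estimate of generalization accuracy
    random_state=42,
)
etc.fit(X_train, y_train)
print("OOB score:", etc.oob_score_)
```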
4. Result and Discussion
The results obtained were compared with those of the individual models, with the comparison based on the use of a similar dataset, parameter tuning, and evaluation metrics. The evaluation metrics were used to determine the extent of the correctness of the predictions.
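These metrics can be computed as in the following sketch, assuming the fitted model etc and the held-out test split from the Section 3.3 sketch:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)

y_pred = etc.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```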
Table 2 shows the summary of the results of this study. From the results, the Extra Tree model outperformed the other models with accuracy, precision, recall, and f1-scores of 93.1%, 94.8%, 100%, and 98.9% respectively.
Table 2. Result summary.
| Models | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| Logistic Regression | 91.3 | 92.0 | 99.0 | 96.0 |
| K-Nearest Neighbour | 90.4 | 92.0 | 98.0 | 95.0 |
| Decision Tree | 86.3 | 83.4 | 89.0 | 88.0 |
| Extreme Gradient Boost (XGBoost) | 91.2 | 85.0 | 83.0 | 81.0 |
| Naïve Bayes | 84.3 | 82.0 | 86.0 | 83.0 |
| Random Forest | 91.3 | 93.2 | 100 | 98.1 |
| Support Vector Machine | 91.5 | 92.0 | 100 | 98.1 |
| Extra Tree | 93.1 | 94.8 | 100 | 98.9 |
The Extra Tree model, also called extremely randomized trees, shows superiority because it offers a good balance between bias and variance, making it effective for a variety of machine learning tasks. The tabulated results are presented in graphical form in Figure 5 for clearer visualization.
Figure 5. The graphical presentation of the obtained result.
Figure 6. Prediction output.
The Extra Tree model was adopted for the prediction of heart disease. Figure 6 shows the output of the first 20 predictions. The actual values are those in the test dataset, while the predicted values are produced by the Extra Tree model. The presence of heart disease is encoded as "1" and its absence as "0".
The evaluation metrics used for the models' assessment are accuracy, precision, recall, and f1-score. The outcome of this study shows that the Extra Tree model has the highest performance scores when compared with the other popular machine learning models used to benchmark its effectiveness. The Extra Tree model demonstrates strong performance, with accuracy, precision, recall, and f1-scores of 93.1%, 94.8%, 100%, and 98.9% respectively, on an imbalanced dataset with an 80:20 train/test split.
5. Conclusion
This study concludes that the Extra Tree model is a very useful tool for heart disease prediction. The model is especially well-suited for medical prediction tasks because of its resilience against overfitting and its capacity to handle large datasets with many features. As demonstrated by its high accuracy and robust performance metrics, the Extra Tree model may be effectively employed in clinical settings to support healthcare professionals in the early identification and prevention of heart disease. The model may be integrated into a clinical decision support system to help healthcare providers diagnose cardiac disease. Furthermore, feature importance analysis can help direct future research into the most significant risk factors for cardiovascular disease. It is also advised that the model be validated on larger and more varied datasets to confirm its generalizability and robustness.
Ethical Consideration
This study was guided by the principles and code of practice of the Babcock University Health Research Ethics Committee (BUHREC) and was conducted under thorough supervision by experts in the related areas of specialization. As a result, it guarantees that the intellectual property rights of the secondary dataset, machine learning algorithms, programming tools, and works of literature included in this study's documentation are correctly referenced in accordance with the Institute of Electrical and Electronics Engineers (IEEE) ethical standards.