Argumentative Comparative Analysis of Machine Learning on Coronary Artery Disease

Cardiovascular disease (CVD) is a leading cause of death across the globe. Approximately 17.9 million people die globally each year due to CVD, which accounts for 31% of all deaths. Coronary Artery Disease (CAD) is a common type of CVD and is considered fatal. Predictive models that use machine learning algorithms may assist health workers in the timely detection of CAD, which ultimately reduces mortality. The main purpose of this study is to build a predictive model that provides doctors and health care providers with personalized information to implement better and more personalized treatments for their patients. In this study, we use the publicly available Z-Alizadeh Sani dataset, which contains random samples of 216 cases with CAD and 87 normal controls with 56 different features. The binary variable “Cath”, which represents case-control status, is used as the target variable. We study its relationship with the other predictors and develop classification models using five different supervised classification machine learning algorithms: Logistic Regression (LR), Classification Tree with Bagging (Bagging CART), Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). These five classification models are used to investigate the detection of CAD. Finally, the performance of the machine learning algorithms is compared, and the best model is selected. Our results indicate that the SVM model predicts the presence of CAD more effectively and accurately than the other models, with an accuracy of 0.8947, sensitivity of 0.9434, specificity of 0.7826, and AUC of 0.8868.


Introduction
Coronary Artery Disease (CAD) is one of the most common types of cardiovascular disease (CVD). According to current statistics from the World Health Organization (WHO), 17.9 million people die globally each year due to CVD, which is 31% of all deaths. CVD was the number one cause of death around the world in 2020 [1]. CAD is considered a fatal illness that causes the death of millions of people every year globally. In the United States of America, 365,914 people died because of CAD in 2017. About 18.2 million (6.7%) Americans aged 20 and older have CAD, and CAD is the cause of death for 20% of Americans aged 65 and younger [2]. India is predicted to be the country hardest hit by CAD. By 2020, it is estimated that at least 1.4 million citizens will die of heart disease, and one out of four cardiac patients globally will be Indian [3].
These facts illustrate the importance of dealing with CAD. Numerous efforts have been made in recent years to use clinical decision support systems and artificial intelligence to predict CAD. Such predictive models provide doctors and health care providers with personalized information to implement better and more personalized treatments for their patients. Health care data are often very large, with many items of information collected over a large number of patients, which makes them impractical to analyze using standard statistical techniques. Machine learning approaches are powerful and efficient tools for studying and analyzing such large-scale, multi-dimensional datasets. For this reason, machine learning and other artificial intelligence methods have been used successfully for decades and have proven to be helpful in medicine [4].
Several studies have been conducted using various machine learning algorithms and different datasets in order to detect CAD. In 2019, M. Abdar et al. [5] compared 10 machine learning algorithms for CAD detection using accuracy and F1-score as the performance metrics. However, the authors did not report sensitivity, specificity, or the area under the receiver operating characteristic (ROC) curve, which are also critical for model comparison.
In 2017, H. Forssen et al. [6] compared six machine learning algorithms for CAD detection using accuracy, area under the ROC curve (AUC), sensitivity, and specificity as the performance metrics. The model they preferred has a very low specificity of 0.339. In 2020, A.B. Akella & V. Kaushik [7] showed that a neural network is the best machine learning algorithm to detect CAD. Their claim is based on the following metrics: Accuracy = 0.9303, Recall = 0.9380, F1 score = 0.8984, AUC = 0.796, and Mean = 0.88. Based on these metrics, it can be estimated that the specificity of their best model could be less than 0.60. In other words, the false positive rate (FPR), also known as the type I error, is about 40%, which is high. In 2020, I.C. Dipto et al. [8] claimed that a neural network is the best machine learning algorithm to detect CAD, with the algorithm achieving an average accuracy of 0.9325 and an AUC of 0.98. However, they applied the SMOTE algorithm to balance the dataset. In 2012, R. Alizadehsani et al. [9] used four machine learning algorithms: Naïve Bayes, C4.5, AdaBoost, and SMO.
It has been very challenging to determine which model type to apply to a machine learning task in order to make a precise prediction. Every model has some merits and demerits [10], and it can be difficult to compare the relative merits of the models. In this study, we implement five different supervised classification machine learning approaches to predict CAD using the publicly available Z-Alizadeh Sani dataset.
The rest of the article is organized as follows: In Section 2, we discuss the data description and preprocessing. In Section 3, the different classification machine learning algorithms are discussed, followed by model comparison and selection of the best model in Section 4. In Section 5, we discuss the results. In Section 6, we summarize the main findings and conclude the manuscript.

Data Source
In this study, we use the publicly available Z-Alizadeh Sani dataset obtained from the UCI Machine Learning Repository, which contains a large collection of datasets that have been widely used by the machine learning community. Detailed information about the dataset, such as name, type, level, and other relevant information, is provided in [11].

Data Description
The Z-Alizadeh Sani dataset contains the records of 303 random patients who

Feature Selection
A feature with a negligible effect on the response variable is called an irrelevant feature. A common example of an irrelevant feature is a serial number. In predictive modeling, we are often confronted with many inputs (explanatory variables), and some of these inputs may not have any relation to the target variable [14]. Irrelevant features increase the noise in the dataset. There are different ways to denoise [15], and dropping irrelevant features is one of the most common. Many feature selection methods drop irrelevant features automatically. We used the variable selection node available in SAS Enterprise Miner to drop the irrelevant features because it handles both categorical and numerical variables. We chose the Chi-square criterion because our target variable Cath is binary. The variables selected by the variable selection node under the Chi-square criterion, together with the response variable, are summarized with their role, type, level, and range in Table 1. The relative importance plot of the selected input variables with respect to the target is given in Figure 1.
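As a rough illustration of why a Chi-square criterion separates relevant from irrelevant inputs, the following Python sketch computes the Pearson Chi-square statistic between one categorical input and a binary target. It is not the SAS Enterprise Miner variable selection node used in this study; the function name and the toy data are hypothetical and are not drawn from the Z-Alizadeh Sani dataset.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic for the contingency table of feature vs. target."""
    n = len(feature)
    if n == 0 or n != len(target):
        raise ValueError("feature and target must be non-empty and of equal length")
    observed = Counter(zip(feature, target))  # cell counts
    row_totals = Counter(feature)             # counts per feature level
    col_totals = Counter(target)              # counts per target class
    statistic = 0.0
    for level, row_total in row_totals.items():
        for cls, col_total in col_totals.items():
            expected = row_total * col_total / n  # count expected under independence
            statistic += (observed[(level, cls)] - expected) ** 2 / expected
    return statistic


# Toy data (hypothetical): 6 cases and 6 controls.
cath = ["Cad"] * 6 + ["Normal"] * 6
informative = ["yes"] * 5 + ["no"] + ["no"] * 5 + ["yes"]  # tracks the target closely
serial_parity = ["odd", "even"] * 6                        # unrelated, like a serial number

informative_score = chi_square_statistic(informative, cath)
irrelevant_score = chi_square_statistic(serial_parity, cath)
```

An input whose levels are distributed identically across the two classes yields a statistic of zero, whereas an input associated with the target yields a larger value, which is the basis for ranking and dropping variables.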

Data Partition
The data are split into two parts, training and testing, in the ratio 3:1. First, we train the models on the training data, which contain 227 observations, and then evaluate them on the test data, which contain 76 observations. The training data are used to find the relationship between the target and the predictor variables, while the test data are used to assess the performance of the model. The main purpose of splitting the data is to guard against overfitting. If overfitting occurs, the machine learning algorithm may perform exceptionally well on the training dataset but poorly on the testing dataset.
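A minimal Python sketch of such a 3:1 random partition is shown below. The function name, the seed, and the use of simple (non-stratified) shuffling are assumptions made for illustration; the text does not state how the split in this study was drawn.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # seed is an arbitrary choice
    n_test = round(n_rows * test_fraction)   # 303 * 0.25 = 75.75, rounded to 76
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


# 303 patients in total, as in the Z-Alizadeh Sani dataset.
train_idx, test_idx = train_test_split_indices(303)
```

With 303 rows this reproduces the sizes reported above: 227 training observations and 76 test observations, with no row appearing in both parts.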

Machine Learning Algorithms
Various machine learning algorithms are available to solve classification problems, such as Logistic Regression, Random Forest, and Support Vector Machine. We have implemented the following approaches in this study:

Logistic Regression
The Logistic Regression (LR) model is used for predicting binary outcomes. It is a statistical model that, in its basic form, uses a sigmoid function to model a binary response variable taking the values 1 and 0 with probability π and 1 − π, respectively. In its standard form, with predictors x_1, ..., x_p and coefficients β_0, ..., β_p, the logistic regression model is given as:
\[ \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \]
LR is one of the most popular and commonly used methods for solving classification problems, especially when the response variable is binary [16]. The method is simple, and convenience always comes first in the mind of a statistician [17].
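To make the role of the sigmoid function concrete, the following Python sketch maps a linear combination of predictors to the probability π and then to a class label. The coefficient values and the 0.5 threshold are hypothetical; they are not estimates from this study.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(intercept, coefficients, features):
    """Probability pi that the response equals 1 for one observation."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(linear_predictor)


def predict_class(probability, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold (assumed 0.5)."""
    return 1 if probability >= threshold else 0


# Hypothetical one-predictor model: intercept -1.0, slope 0.5, observed x = 4.0.
pi = predict_probability(-1.0, [0.5], [4.0])
label = predict_class(pi)
```

Taking the log-odds of the returned probability recovers the linear predictor, which is the sense in which the model is linear on the logit scale.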

Classification Tree with Bagging
Classification trees (CART) have been described as a powerful alternative to more traditional classification approaches, for example in land cover classification. Trees provide a hierarchical and nonlinear classification method and are suited to handling non-parametric training data, as well as categorical or missing data. By revealing the predictive hierarchical structure of the independent variables, the tree allows for great flexibility in data analysis and interpretation [19]. CART is simple and useful for interpretation. It is a statistical model used to predict a qualitative response. In this model, each observation is predicted to belong to the most commonly occurring class of training observations in the region to which it belongs. To build the CART model, we used the Gini index to evaluate the quality of each split.
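The Gini index mentioned above can be sketched in a few lines of Python. The function names are hypothetical, and the split score shown is the usual size-weighted average of the two child impurities; the exact implementation used in this study is not described in the text.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of the two child nodes produced by a split."""
    n_left, n_right = len(left_labels), len(right_labels)
    total = n_left + n_right
    return (n_left * gini_index(left_labels) + n_right * gini_index(right_labels)) / total


# Root node with the class counts of the full dataset: 216 Cad and 87 Normal.
root_impurity = gini_index(["Cad"] * 216 + ["Normal"] * 87)

# A perfectly separating split would have zero weighted impurity.
pure_split = split_gini(["Cad"] * 3, ["Normal"] * 3)
```

A candidate split is preferred when its weighted impurity is lower than that of the parent node; a pure node has impurity zero and a balanced two-class node has impurity 0.5.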
CART is not robust, meaning that a small departure from the validity of the model can badly affect its performance. Bagging, however, is a machine learning algorithm obtained by aggregating CART models, and it can substantially improve the predictive performance of CART. In Bagging, we obtain n bootstrap samples from the existing training data. For each sample, a CART is fitted using all predictors. Finally, the resulting predictions are aggregated, by averaging or, for class labels, by majority vote. Bagging reduces the variance of the model and thereby helps to guard against overfitting [20]. We fitted the classification tree with Bagging (Bagging CART) model using 1000 bootstrap samples with the randomForest command of the R package [21].

Support Vector Machine
A small value of these tuning parameters overfits the data, whereas a large value underfits [20]. The 10-fold cross validation is used to choose the best tuning parameters. We used the grid technique to find the optimal values of the parameters cost and gamma by varying cost from 0.01 to 10 and gamma from 0.01 to 1, which yields cost = 0.02 and gamma = 0.01. An SVM model equipped with the linear kernel and the tuning parameters cost = 0.02 and gamma = 0.01 is fitted with the svm command of the R package [22].
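The grid technique combined with 10-fold cross validation can be sketched generically in Python as follows. This is not the R tuning routine used in the study: the function names, the grid points, and the seed are assumptions, and the evaluate callback is a hypothetical stand-in for fitting an SVM on the training rows and scoring it on the validation rows.

```python
import random
from itertools import product


def k_fold_indices(n_rows, k=10, seed=0):
    """Partition shuffled row indices into k folds of nearly equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search(param_grid, n_rows, evaluate, k=10, seed=0):
    """Return the parameter combination with the highest mean validation score."""
    folds = k_fold_indices(n_rows, k, seed)
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i, validation in enumerate(folds):
            # Train on every fold except the one held out for validation.
            training = [row for m, fold in enumerate(folds) if m != i for row in fold]
            scores.append(evaluate(params, training, validation))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical grid points inside the ranges stated in the text.
param_grid = {"cost": [0.01, 0.02, 0.1, 1, 10], "gamma": [0.01, 0.1, 1]}


def toy_evaluate(params, training_rows, validation_rows):
    """Placeholder score, NOT a real SVM: rigged to peak at cost=0.02, gamma=0.01."""
    return -abs(params["cost"] - 0.02) - abs(params["gamma"] - 0.01)


best_params, best_score = grid_search(param_grid, n_rows=227, evaluate=toy_evaluate)
```

In practice the callback would fit the classifier with the given cost and gamma on the training rows and return, for example, its accuracy on the validation rows. The toy score is rigged only so that the example has a known answer.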

K-Nearest Neighbors
The K-Nearest Neighbors (KNN) model takes a completely different approach from the other classification models. No distributional assumption is needed to fit a KNN model; it is completely nonparametric. KNN can outperform other classification models when the assumptions of those models are not met [16]. In KNN, the parameter k characterizes the tradeoff between variance and bias: small values of k overfit the data, and large values underfit. There is no strong basis for selecting the value of k [23]. It has been common practice to choose k = 10, so we fitted the KNN model using k = 10 with the KNN command of the R package [24].
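The voting rule behind KNN, and the effect of the choice of k, can be illustrated with a short Python sketch. The function name, the use of Euclidean distance, and the one-dimensional toy data are assumptions for illustration and do not reproduce the R implementation used in the study.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=10):
    """Classify query by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy one-dimensional data (hypothetical): four Cad points and two Normal points.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
labels = ["Cad", "Cad", "Cad", "Cad", "Normal", "Normal"]

local_vote = knn_predict(points, labels, (10.5,), k=1)   # follows the nearest neighbor
global_vote = knn_predict(points, labels, (10.5,), k=6)  # follows the overall majority
```

With k = 1 the query next to the Normal points is labeled Normal, whereas with k equal to the full sample size every query receives the majority class, which shows how a large k underfits.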

Model Comparisons
To determine which model performs best, each model was trained on the training dataset and applied to the test dataset, from which the following metrics were obtained: sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUC). We compute the confusion matrix for each model, as shown in Table 2.
In the following, TP, FN, TN, and FP denote the numbers of true positives, false negatives, true negatives, and false positives in the confusion matrix, respectively.
The proportion of the actual positive cases that is correctly predicted as positive is called sensitivity. It is also called the true positive rate (TPR) and is given in Equation (3).
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{3} \]
The proportion of the actual negative cases that is correctly predicted as negative is called specificity. It is also called the true negative rate (TNR) and is given in Equation (4).
\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{4} \]
The proportion of the actual negative cases that is incorrectly predicted as positive is called the type I error. It is also called the false positive rate (FPR) and is given in Equation (5).
\[ \text{Type I error} = \frac{FP}{FP + TN} \tag{5} \]
The proportion of the actual positive cases that is incorrectly predicted as negative is called the type II error. It is also called the false negative rate (FNR) and is given in Equation (6).
\[ \text{Type II error} = \frac{FN}{FN + TP} \tag{6} \]
The proportion of the cases that is predicted correctly is called the accuracy and is defined by Equation (7).
\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \tag{7} \]
There is a direct relation between the sensitivity, specificity, type I error, and type II error: sensitivity is 1 − type II error, whereas specificity is 1 − type I error.
Our goal is to minimize both types of errors. In other words, we want sensitivity and specificity to be as large as possible.
For a given model, sensitivity and specificity trade off against each other as the classification threshold varies: as the sensitivity increases, the specificity tends to decrease, and vice versa.
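The definitions above can be checked with a short Python sketch. The counts used here (TP = 50, FN = 3, TN = 18, FP = 5) are not taken from Table 2; they are an inferred set of counts for 76 test observations that is consistent with the SVM metrics reported in this paper, and should be read as an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute the confusion-matrix metrics defined in the text."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "type_i_error": fp / (fp + tn),         # false positive rate
        "type_ii_error": fn / (fn + tp),        # false negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Inferred counts (assumption): 53 actual positives and 23 actual negatives.
svm_like = classification_metrics(tp=50, fn=3, tn=18, fp=5)
```

Rounded to four decimals, these counts give an accuracy of 0.8947, a sensitivity of 0.9434, and a specificity of 0.7826, and the complementary relations between sensitivity and type II error and between specificity and type I error hold by construction.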
The receiver operating characteristic (ROC) curve is commonly used to characterize the sensitivity/specificity tradeoff of a binary classifier. The ROC curve is obtained by plotting the false positive rate (1 − specificity) on the x-axis against the sensitivity on the y-axis at various threshold settings.
The area under the ROC curve (AUC) is one of the most important metrics for measuring the performance of a model. Its value lies between 0 and 1. A model is said to be excellent if its AUC is close to 1; the higher the AUC, the better the model. We used the roc command of the R package to compute the AUC of the ROC curve of each model [25].
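How the ROC curve and its area arise from threshold settings can be sketched in Python as below. This is an illustration under assumptions, not the R roc command used in the study: the function names are hypothetical, the labels and scores are toy values, and the area is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example (hypothetical): 1 = Cad, 0 = Normal, with predicted scores.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
curve = roc_points(labels, scores)
auc = trapezoid_auc(curve)
```

Lowering the threshold moves along the curve from (0, 0) to (1, 1); a classifier that ranks every positive above every negative traces the top-left corner and attains an AUC of 1.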
The model with the highest sensitivity, specificity, accuracy, and AUC is considered the best model.

Results
The performance statistics of the five models are summarized in Table 3. From Table 3, it can be concluded that the SVM predicts the presence of CAD more effectively and accurately than the other models.
A possible cause of the poor performance of KNN is the skewed class distribution of Cath [26]. When one large class dominates, most of the votes among the nearest neighbors go to that class, so new data tend to be voted into the more popular class. Figure 3 confirms that the number of positive cases (Cad) is about two and a half times the number of negative cases (Normal), 216 versus 87. As a result, it is unsuitable to use KNN

Conclusion
In conclusion, we used logistic regression (LR), classification tree with Bagging (Bagging CART), random forest (RF), support vector machine (SVM), and k-nearest neighbors (KNN) to learn the detection of coronary artery disease (CAD), utilizing the publicly available Z-Alizadeh Sani dataset. The performance of the models is gauged by comparing the following performance metrics on the testing data: sensitivity, specificity, accuracy, and area under the ROC curve (AUC). Our results indicate that the SVM model predicts the presence of CAD more effectively and accurately than the other models, with an accuracy of 0.8947, sensitivity of 0.9434, specificity of 0.7826, and AUC of 0.8868. Further research may be necessary to improve the performance of the machine learning algorithms before this method is translated into a clinical solution. Such improvements might include, but are not limited to, using other machine learning algorithms such as artificial neural networks, using more data, or exploring other ways of extracting important features before feeding them to the machine learning algorithm.