Comparison of Different Machine Learning Algorithms for the Prediction of Coronary Artery Disease

Coronary Artery Disease (CAD) is the leading cause of mortality worldwide. It is a complex heart disease that is associated with numerous risk factors and a variety of symptoms. During the past decade, Coronary Artery Disease (CAD) has undergone a remarkable evolution. The purpose of this research is to build a prototype system using different Machine Learning Algorithms (models) and compare their performance to identify a suitable model. This paper explores three of the most commonly used Machine Learning Algorithms: Logistic Regression, Support Vector Machine and Artificial Neural Network. A clinical dataset has been used to conduct this research. The performance has been evaluated using the Confusion Matrix, Stratified K-fold Cross Validation, Accuracy, AUC and ROC, with the accuracy and AUC scores validated using the K-Fold Cross-validation technique. The dataset contains class imbalance, so the SMOTE Algorithm has been used to balance the dataset and the performance analysis has been carried out on both sets of data. The results show that the accuracy scores of all the models increased when trained on the balanced dataset. Overall, the Artificial Neural Network had the highest accuracy, whereas Logistic Regression was the least accurate among the trained Algorithms.


Introduction
Coronary Artery Disease is the number one cause of death worldwide. Of the 56.9 million deaths reported around the world in 2016, more than 54% were due to the top 10 causes of death, among which Ischaemic Heart Disease (Coronary Artery Disease) and Stroke were the biggest killers; they have remained the leading causes of death globally for the last 15 years [1].
To function properly, the heart requires a supply of blood, and the heart muscles receive blood from the Coronary Arteries. Coronary Artery Disease is the blockage or narrowing of the Coronary Arteries caused by hardening or clogging of these arteries due to the build-up of cholesterol or fatty deposits, called plaque, on the arteries' inner walls. The plaque can restrict the flow of blood by clogging the artery or by causing abnormal artery tone or function. Without a proper supply of blood, the heart becomes starved of oxygen and vital nutrients, resulting in chest pain. If the blood supply to a portion of the heart muscle is entirely cut off, or if the energy requirements of the heart exceed the supply of blood, the result is a heart attack [2]. Machine Learning is the technology used to develop computer algorithms capable of mimicking the intelligence of a human being. It draws on ideas from different fields such as Artificial Intelligence, Statistics and Probability, Computer Science, Information Theory, Psychology, Control Theory and Philosophy [3] [4] [5].
In this study, three different Supervised Machine Learning Algorithms are implemented to predict the presence of CAD in patients and, finally, the performances of the Algorithms are compared to select the ideal Algorithm. The dataset used is the Z-Alizadeh Sani dataset, which provides clinical records of 303 patients with a total of 54 features related to the disease.
The rest of this paper is organised as follows: in Section 2, a brief survey of existing literature related to the research topic is provided. The research methodology is described in Section 3 and the analysis of the results obtained after implementing the three Algorithms on the dataset is presented in Section 4. The findings from the research are discussed in Section 5 and, finally, the study is concluded in Section 6 with a number of recommendations for improving the existing research in the future.

Literature Review
In 2008, Kurt [6] compared five different Algorithms for the prediction of Coronary Artery Disease: Logistic Regression, Classification and Regression Tree, Artificial Neural Network, Radial Basis Function and Self Organising Feature Maps. The Algorithms were tested on a data set containing the records of 1245 patients, using various predictor variables such as age, sex and body mass index. The test results showed that the Neural Network was the best performer compared to the other Algorithms.
In 2009, Lavesson used Algorithms such as Bagging, AdaBoost and Naive Bayes on the CHAPS data set. The test was done for the prediction of Acute Coronary Syndrome, and the research found that Naive Bayes had the highest accuracy [7]. In another study in 2010, Babaoğlu used a data set containing the data of 480 patients with 23 features and applied the Support Vector Machine Algorithm to detect the presence of Coronary Artery Disease. In the study, a reduced set of features was obtained using Principal Component Analysis, which reduced the dimensionality of the data set. The test results show that the researchers achieved an accuracy score of 81.46%. The research indicates that applying Principal Component Analysis reduces the training error and the time taken to train and test the Support Vector Machine Algorithm [8]. Another study showed that a Neural Network combined with a Genetic Algorithm produced an accuracy of 93.85% [14].
In 2018, Meng built a hybrid Algorithm called two-layer Gradient Boosting Decision Tree, which was compared with two other commonly used Machine Learning Algorithms, namely Support Vector Machine and Logistic Regression.
The Algorithms were run on a purpose-built data set consisting of the routine blood test results of 15,000 patients. Using this data set, the researchers trained the Algorithms to classify healthy status, coronary heart disease and other diseases. The test results show that the proposed Algorithm predicted the presence of Coronary Heart Disease and other diseases more accurately than the other Algorithms trained on the data set [15].
In another study, conducted by Nassif in 2018, the researchers tested Support Vector Machine, Naive Bayes and K-Nearest Neighbour Algorithms using 10-fold cross-validation on the Cleveland Heart Disease data set to compare their performance in predicting Coronary Artery Disease. Three different feature selection methods were applied to the data set, and feature versus risk score graphs were plotted to identify the features most closely related to the risk of Coronary Artery Disease; the seven best features were selected as input variables for the Algorithms. The research found that the Naive Bayes Algorithm was the best performer, with an accuracy score of 84% [16].

Data Collection Method
The data set is collected from the UCI Machine Learning Repository which contains a collection of data sets that are widely used by the Machine Learning community. In the repository, the information of the donors and creators of the data set, data information, attributes of the data set and other relevant information are also provided [18].

Dataset Description
The Z-Alizadeh Sani Dataset, which contains the clinical records of 303 patients with 54 features, will be used for the research [14]. The data set contains one additional feature, "BBB", which stands for Bundle Branch Block. The research papers show that this feature was removed from the data set before running the Algorithms, from which it can be inferred that "BBB" is not related to Coronary Artery Disease; likewise, the feature will be dropped from the data set in the current research. The label or target variable is "Cath", with the value "CAD" for the presence of the disease and "Normal" for a normal patient. The dataset was donated to the UCI Machine Learning Repository on 2017-11-17 [18].

Design of the Experiment
A total of four steps will be followed to carry out the research. The first step involves the Exploratory Data Analysis, followed by data pre-processing.
After pre-processing, the data set is divided into training and test sets on which the Algorithms will be implemented, and the Algorithms will be evaluated in the final step. Figure 1 provides an overview of the experiment design and the following Sections describe the steps in further detail.

Exploratory Data Analysis
This step will be performed to gain useful insights into the collected data set through data visualisations and results from performing analysis. This step will help to find if the data set has any missing values, identify the Categorical features, numerical features and more.

Data Pre-Processing
Raw data are often not found in the form and shape required for the optimal performance of learning Algorithms. Therefore, the pre-processing step is the most important in Machine Learning Applications [19]. First, the Categorical Features of the dataset are encoded to transform them into numerical values. Then the matrix of features and the target variable are created, followed by dividing the dataset into training and test sets. In the following step, Feature Scaling is applied to the input features. Finally, the dataset is balanced using the SMOTE Algorithm. The following Subsections describe the sub-steps involved in the Pre-processing stage for the dataset used in this research.

Encoding Categorical Features
One important aspect of Machine Learning is feature engineering. The Algorithms that will be implemented are only able to read numerical values, so it is important to transform the categorical features into numerical values [20].
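As an illustration of this encoding step, the following is a minimal sketch using pandas. The three-row table and its values are invented for the example; only the column names "Age", "Sex" and "Cath" and the labels "CAD" and "Normal" follow the dataset. Using `get_dummies` with `drop_first=True` is an assumption about how such an encoding could be done, not a record of the code used in this research.

```python
import pandas as pd

# Hypothetical three-row excerpt: column names follow the dataset, values are invented.
df = pd.DataFrame({
    "Age": [53, 67, 54],
    "Sex": ["Male", "Female", "Male"],
    "Cath": ["CAD", "CAD", "Normal"],
})

# One-hot encode the categorical input feature. drop_first removes one dummy
# column per feature, so the remaining dummy columns are not linearly dependent.
X = pd.get_dummies(df.drop(columns="Cath"), columns=["Sex"], drop_first=True, dtype=int)

# Encode the target labels as integers (1 = CAD, 0 = Normal).
y = df["Cath"].map({"CAD": 1, "Normal": 0})

print(X)
print(y.tolist())
```

Dropping one dummy column per feature keeps the encoded inputs free of the redundancy known as the dummy variable trap; which column is dropped (first or last) does not change the information carried.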

Forming Features and Target Matrices
The matrix of features to be used as input variables and the target variable are extracted from the data set and stored in the variables X and y, respectively.

Partitioning the Dataset
In this part of the Pre-processing stage, the matrix of features X and the target variable y are split using the "train_test_split" function of the "model_selection" module of scikit-learn. 80% of the data will be used for training and the remainder will be used to test the Machine Learning Models that will be implemented in this research.
The training and test set of X will be stored into variables called "X_train" and "X_test". Likewise, the y training and test set will be stored in variables named "y_train" and "y_test".
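The split described above can be sketched as follows. The feature matrix here is random placeholder data with the dataset's row count (303), the label vector is a placeholder, and `random_state=0` is an assumed seed chosen only to make the sketch reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder inputs: 303 rows as in the dataset, 5 illustrative features.
rng = np.random.default_rng(0)
X = rng.normal(size=(303, 5))
y = np.array([1] * 216 + [0] * 87)  # placeholder labels (1 = CAD, 0 = Normal)

# Hold out 20% of the rows for testing; the remaining 80% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

With 303 rows, a 20% hold-out yields 61 test rows and 242 training rows, which agrees with the 61-patient test set referred to in the results.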

Feature Scaling
Most data sets contain features of varying ranges, which is a problem since many Machine Learning Algorithms use the Euclidean distance between two data points. If features are not scaled, such Algorithms will only take in the magnitude of the features and produce varying results, as features with higher ranges will weigh more in the distance calculations than features with lower ranges. Hence, feature scaling is applied to suppress this effect and bring all the features to the same magnitude [21].
To scale the features of the data set, Standardization will be used. The result of Standardization, also known as Z-score Normalization, is that the features will have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$, where $\mu$ is the mean and $\sigma$ is the standard deviation. The formula used to calculate the Standardization (Z-scores) is as follows:

$$z = \frac{x_i - \bar{x}}{\sigma} \tag{1}$$

where $x_i$ is the value of each feature, $\bar{x}$ is the mean of the feature values in a column and $\sigma$ is the standard deviation of the values in that column. The implementation will be performed using the "StandardScaler" class from the "preprocessing" module of scikit-learn.
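A minimal sketch of Standardization with scikit-learn's `StandardScaler` is given below. The small arrays are invented values, and fitting the scaler on the training set only (then reusing its mean and standard deviation on the test set) is an assumption about the workflow rather than something stated in this paper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values for two features on different scales.
X_train = np.array([[50.0, 70.0], [60.0, 80.0], [70.0, 90.0]])
X_test = np.array([[60.0, 85.0]])

scaler = StandardScaler()
# Learn the per-column mean and standard deviation from the training data.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, so no single feature dominates a distance calculation purely because of its units.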

Balancing the Dataset
A data set balancing Algorithm called Synthetic Minority Oversampling Technique (SMOTE) will be used. SMOTE, developed by [22], is an over-sampling technique used to generate synthetic minority samples. The technique combines informed oversampling of the minority class with random under-sampling of the majority class; the minority class is over-sampled by creating artificial samples from its k-nearest neighbours, as shown in Figure 2. SMOTE balances a data set by over-sampling the minority class (by creating artificial instances of the minority class) so that its size equals that of the majority class. The Algorithm is given as follows. For each minority sample:
• Find its k-nearest minority neighbours.
• Randomly select q of those neighbours.
• Randomly generate synthetic samples along the lines joining the minority sample and its q selected neighbours (q depends on the amount of oversampling desired) [23].
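The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used in this research: the function name `smote_sketch` is hypothetical, and instead of selecting q neighbours per minority sample it draws one random minority sample and one of its k nearest neighbours for every synthetic point.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                  # pick a minority sample
        j = neighbours[i, rng.integers(k)]   # pick one of its k neighbours
        gap = rng.random()                   # position along the joining line
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Invented minority-class points in two dimensions.
X_min = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 2], [3, 1]], dtype=float)
new_points = smote_sketch(X_min, n_new=4, k=3)
print(new_points)
```

Because every synthetic point lies on a line segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.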

Machine Learning Algorithms
There are various Supervised Machine Learning Algorithms such as K-Nearest Neighbours, Decision Tree, Naive Bayes, Support Vector Machine and many more, but throughout the medical literature Support Vector Machine and Neural Network Algorithms are the most commonly used [24]. In this research, Logistic Regression will be implemented along with these two Algorithms, since it can be seen from Figure ? that Logistic Regression is the third most commonly used Machine Learning Algorithm in Healthcare. Hence, implementing these three Algorithms will make the research more relevant.

Logistic Regression
The Logistic Regression (LR) Model is used for predicting binary outcomes. In fitting the LR equation, the maximum-likelihood ratio is used to determine the statistical significance of the variables [25]. LR is ideal for problems where the task is to predict the presence or absence of a characteristic or outcome based on the values of predictor variables. The LR model is similar to a Linear Regression model; however, it is suitable for models where the outcome is binary [6]. LR is based on the Logistic Function $P(y)$, defined as:

$$P(y) = \frac{1}{1 + e^{-y}} \tag{2}$$

The LR model for $P$ independent variables can be written as:

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_P X_P)}} \tag{3}$$

Here $P(Y = 1)$ is the probability of the presence of Coronary Artery Disease and $\beta_0, \beta_1, \ldots, \beta_P$ are regression coefficients. There is a linear model hidden within the LR model: the natural logarithm of the ratio of $P(Y = 1)$ to $1 - P(Y = 1)$ gives a linear model in $X$:

$$g(X) = \ln\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_P X_P \tag{4}$$

where $g(X)$ has some properties of a Linear Regression model, and the independent variables $X$ could be a combination of both continuous and categorical variables [25].
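The relationship between the Logistic Function and the hidden linear model can be checked numerically. In the sketch below, the coefficients and the input vector are hypothetical values chosen only for illustration; the point is that taking the log-odds of the predicted probability recovers the linear combination of the inputs.

```python
import math

def logistic(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, betas):
    """Probability of the positive class for one input vector x."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return logistic(z)

def log_odds(p):
    """Natural logarithm of p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and input, for illustration only.
beta0, betas = -1.0, [0.8, -0.5]
x = [2.0, 1.0]

p = predict_proba(x, beta0, betas)
print(p)            # probability of the positive class
print(log_odds(p))  # recovers the linear term beta0 + beta . x
```

Here the linear term is -1.0 + 0.8 × 2.0 - 0.5 × 1.0 = 0.1, and the log-odds of the predicted probability returns the same value up to floating-point error.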

Support Vector Machine
Support Vector Machine, or SVM for short, was invented by Vapnik in 1979 and proposed for solving Classification and Regression problems by Vapnik in 1995. It is a Supervised Learning Algorithm that uses a non-linear mapping to transform the original training data into a higher-dimensional space; within this new dimension, it searches for the linear optimal separating hyperplane (decision boundary) that separates the tuples of one class from another. With an appropriate non-linear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The Algorithm finds the hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors) [26].

Mathematics
Following [26], the mathematics is explained as follows. Let the data set $D$ be given as $(X_1, y_1), (X_2, y_2), \ldots, (X_{|D|}, y_{|D|})$, where $X_i$ are the training tuples with associated class labels $y_i$. Each $y_i$ can take one of two values, either +1 or −1 (i.e., $y_i \in \{+1, -1\}$). In an SVM, the idea is to find the hyperplane that maximises the minimum distance from any training data point, as shown in Figure 3. The hyperplane with the larger margin is expected to be more accurate at classifying future data tuples than the hyperplane with the smaller margin.
Hence, SVM searches for the hyperplane with the largest margin. A separating hyperplane can be written as $W \cdot X + b = 0$, where $W$ is a weight vector and $b$ is a bias. Thus, any point that lies above the separating hyperplane satisfies $W \cdot X + b > 0$. Similarly, any point that lies below the separating hyperplane satisfies $W \cdot X + b < 0$. However, Equation (5) is applicable only to data samples that are linearly separable. Where the data is linearly inseparable, a kernel function is used, which yields the effect of mapping the data into a higher-dimensional space without actually transforming them into that space. This notion is known as the kernel trick, which allows the transformation of data to large dimensions for Classification problems [26].
In situations where the data samples are not linearly separable, the resultant function is expressed through a non-linear mapping from the original space to the high-dimensional space [26]. The four basic kernels are given as follows, where $\gamma$, $r$ and $d$ are kernel parameters [27]:
• Linear: $K(X_i, X_j) = X_i^T X_j$
• Polynomial: $K(X_i, X_j) = (\gamma X_i^T X_j + r)^d$, $\gamma > 0$
• Radial Basis Function (RBF): $K(X_i, X_j) = \exp(-\gamma \lVert X_i - X_j \rVert^2)$, $\gamma > 0$
• Sigmoid: $K(X_i, X_j) = \tanh(\gamma X_i^T X_j + r)$
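The kernel trick described above can be illustrated with a small numerical check. The sketch below implements commonly used forms of the linear, polynomial, RBF and sigmoid kernels with parameters $\gamma$, $r$ and $d$; the function names, default parameter values and the explicit feature map `phi` are assumptions made for this example. For a degree-2 polynomial kernel with $\gamma = 1$ and $r = 0$ on two-dimensional inputs, the kernel value equals the dot product of the explicitly mapped vectors, without ever building those vectors.

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    return float((gamma * np.dot(x, z) + r) ** d)

def rbf_kernel(x, z, gamma=0.5):
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

def sigmoid_kernel(x, z, gamma=0.1, r=0.0):
    return float(np.tanh(gamma * np.dot(x, z) + r))

def phi(x):
    """Explicit degree-2 feature map for 2-D input (matches gamma=1, r=0, d=2)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

k_implicit = polynomial_kernel(x, z, gamma=1.0, r=0.0, d=2)  # computed in 2-D
k_explicit = float(np.dot(phi(x), phi(z)))                   # computed in 3-D
print(k_implicit, k_explicit)
```

Both routes give the same value (121 for these inputs), which is why a kernel can stand in for an inner product in the higher-dimensional space.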

Artificial Neural Network
Artificial Neural Networks are inspired by the incredible processing capabilities of the human brain. Multilayer Perceptrons are used, which contain an input layer, one or more hidden layers and an output layer [28].
Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function $f(\cdot): \mathbb{R}^m \rightarrow \mathbb{R}^o$, where $m$ is the number of dimensions for input and $o$ is the number of dimensions for output. Given a set of features $X = x_1, x_2, \ldots, x_m$ and a target $y$, an MLP can learn a non-linear function approximator for either classification or regression (Figure 4).

Steps in Backpropagation Algorithm
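The Backpropagation Algorithm (BA) is the most commonly used learning technique in Artificial Neural Networks. The steps, as described by [28], are as follows:
• All network weights are initialised to small random numbers.
• The error term $\delta_h$ is calculated for each hidden unit $h$ as below, where $w_{kh}$ denotes the network weight from hidden unit $h$ to output unit $k$, $o_h$ is the output of hidden unit $h$ and $\delta_k$ is the error term of output unit $k$:
$$\delta_h = o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k$$
• Each network weight is updated as follows, where $x_{ji}$ denotes the input from unit $i$ into unit $j$ and $\eta$ is the learning rate [3]:
$$w_{ji} \leftarrow w_{ji} + \eta \, \delta_j \, x_{ji}$$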
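The following is a minimal NumPy sketch of one Backpropagation update for a network with a single hidden layer. It assumes sigmoid units throughout, no bias terms, a squared-error objective and a learning rate `eta`; these choices, the network size and the single training example are all assumptions made for illustration and are not taken from the experiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hidden, W_out, eta=0.5):
    """One stochastic update for a one-hidden-layer network of sigmoid units."""
    # Forward pass: propagate the input through the network.
    o_h = sigmoid(W_hidden @ x)   # hidden unit outputs
    o_k = sigmoid(W_out @ o_h)    # output unit outputs
    # Error terms for the output units and then the hidden units.
    delta_k = o_k * (1 - o_k) * (t - o_k)
    delta_h = o_h * (1 - o_h) * (W_out.T @ delta_k)
    # Weight updates: each weight moves by eta * (error term) * (its input).
    W_out = W_out + eta * np.outer(delta_k, o_h)
    W_hidden = W_hidden + eta * np.outer(delta_h, x)
    return W_hidden, W_out

def squared_error(x, t, W_hidden, W_out):
    o_k = sigmoid(W_out @ sigmoid(W_hidden @ x))
    return float(0.5 * np.sum((t - o_k) ** 2))

# Weights initialised to small random numbers.
rng = np.random.default_rng(0)
W_hidden = rng.uniform(-0.05, 0.05, size=(3, 2))
W_out = rng.uniform(-0.05, 0.05, size=(1, 3))
x, t = np.array([1.0, 0.5]), np.array([1.0])  # one invented training example

error_before = squared_error(x, t, W_hidden, W_out)
for _ in range(100):
    W_hidden, W_out = backprop_step(x, t, W_hidden, W_out)
error_after = squared_error(x, t, W_hidden, W_out)
print(error_before, error_after)
```

Repeating the update on the same example drives the output towards the target, so the error after the loop is lower than before it.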

Designed MLP for the Research
The designed MLP consists of an Input Layer, a Hidden Layer and an Output Layer, as shown in Figure 5, where X denotes the input features, n denotes the index of the final feature and A is the activation function. The number of neurons in the input layer will be the same as the number of input features, such as "Age", "Weight" and so on, where X_1…X_n represent the input features. The neurons in the hidden layer are represented as A_1…A_n. As the Network will solve a binary classification problem, the output layer will consist of one neuron, shown as Y. The hidden layer neurons will be activated by the Rectified Linear Unit function and the neuron on the output layer will be activated by the Logistic Function, as shown by Equation (2).
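The forward pass of such a network can be sketched as follows. The weights are random placeholders rather than trained values, the number of features (4) is arbitrary, and the 0.5 decision threshold is an assumption; the sketch only shows how a ReLU hidden layer feeds a single logistic output neuron.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, W1, b1, W2, b2):
    """Forward pass: ReLU hidden layer followed by one logistic output neuron."""
    A = relu(X @ W1 + b1)                 # hidden activations
    return logistic(A @ W2 + b2).ravel()  # one probability per input row

n_features = 4  # arbitrary number of input features for the sketch
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n_features, n_features))  # input -> hidden
b1 = np.zeros(n_features)
W2 = rng.normal(scale=0.1, size=(n_features, 1))           # hidden -> output
b2 = np.zeros(1)

X = rng.normal(size=(3, n_features))  # three invented, already-scaled patients
probs = mlp_forward(X, W1, b1, W2, b2)
labels = (probs >= 0.5).astype(int)   # assumed threshold: 1 = CAD, 0 = Normal
print(probs, labels)
```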

Confusion Matrix
The Confusion Matrix is an evaluation metric which is used to describe the performance of a classifier by calculating evaluation parameters, and it is shown in Table 1.
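For a binary classifier, the four cells of the Confusion Matrix can be counted directly from the true and predicted labels. The sketch below uses invented labels, with 1 standing for CAD and 0 for Normal; the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP and TN for a binary classification problem."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["TP" if pred == positive else "FN"] += 1
        else:
            counts["FP" if pred == positive else "TN"] += 1
    return counts

# Invented labels for six patients (1 = CAD, 0 = Normal).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)
```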

Stratified K-Fold Cross Validation
The designed experiment uses two steps to evaluate the implemented Algorithms. Firstly, a stratified K-fold Cross-validation technique will be used for validating the implemented models. In this validation technique, the folds are selected so that the class labels are distributed equally across the folds. The target variable is binary, so the experiment comes under dichotomous classification; this means that each fold contains roughly the same proportions of the two types of class labels. The data set will be divided into $k$ subsets, where $k = 10$; each time, one of the $k$ subsets will be used as the test set and the other $k - 1$ subsets will be used as the training set. Therefore, every data point will be part of the test set exactly once and part of the training set $k - 1$ times. The results from the $k$ folds will be averaged to produce a single estimate. $k = 10$ is taken because 10 is the standard value typically used in research [30]. Figure 6 illustrates the technique when $K = 5$.
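A minimal sketch of stratified 10-fold splitting with scikit-learn's `StratifiedKFold` is shown below. The label vector is a placeholder with roughly the class proportions reported for the dataset, and `shuffle=True` with `random_state=0` is an assumed configuration; the sketch does not claim to reproduce the exact folds used in the experiment.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels: 216 positives (CAD) and 87 negatives (Normal).
y = np.array([1] * 216 + [0] * 87)
X = np.zeros(len(y))  # features are irrelevant for building the folds

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

test_folds = []
for train_idx, test_idx in skf.split(X, y):
    test_folds.append(test_idx)
    # Each test fold keeps roughly the overall share of positive labels.
    print(len(test_idx), round(float(y[test_idx].mean()), 3))
```

Each index appears in exactly one test fold, and the share of positive labels in every fold stays close to the overall share of about 71%.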

Accuracy
The performance of the Algorithms will be compared based on the accuracy obtained on the prediction of CAD, given by Equation (11). The accuracy of each model will be obtained in each fold at the end of training; with this technique, there will be 10 accuracies per model. The average of the accuracies will be obtained, and the standard deviation of the accuracies will also be calculated.
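As a worked example, accuracy can be computed from the cells of a Confusion Matrix using the standard definition (correct predictions divided by all predictions), and the per-fold accuracies can be summarised by their mean and standard deviation. The counts below are those reported later for Logistic Regression on the balanced dataset (TP = 39, TN = 13, FP = 5, FN = 4); the ten fold accuracies are invented, and using the population standard deviation is an assumption.

```python
import statistics

def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts reported for Logistic Regression on the balanced dataset.
acc = accuracy(tp=39, tn=13, fp=5, fn=4)
print(f"{100 * acc:.2f}%")

# Invented per-fold accuracies for one model (k = 10).
fold_accuracies = [0.88, 0.92, 0.84, 0.90, 0.96, 0.88, 0.92, 0.86, 0.94, 0.90]
mean_acc = statistics.mean(fold_accuracies)
std_acc = statistics.pstdev(fold_accuracies)  # population standard deviation
print(f"{100 * mean_acc:.2f}% ± {100 * std_acc:.2f}%")
```

The first value, 52 correct predictions out of 61, corresponds to the 85.25% accuracy quoted for that model.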

ROC and AUC
The Receiver Operating Characteristic (ROC) curve and the area under the curve (AUC) will be obtained. A ROC-AUC plot of each model will be generated to visualise the mean accuracy of each model. The ROC curve is based on two metrics, True Positive Rate (TPR) and False Positive Rate (FPR). True Positive Rate (TPR), also known as sensitivity, hit rate or recall, is given as:

$$\text{TPR} = \frac{TP}{TP + FN}$$

Intuitively, this metric corresponds to the proportion of positive data points that are correctly considered as positive, with respect to all positive data points.
The higher the TPR, the fewer positive data points will be missed. False Positive Rate (FPR), or fall-out, is given as:

$$\text{FPR} = \frac{FP}{FP + TN}$$

FPR can also be generated from specificity as:

$$\text{FPR} = 1 - \text{specificity}$$

where specificity is defined as:

$$\text{specificity} = \frac{TN}{TN + FP}$$

FPR corresponds to the proportion of negative data points that are mistakenly considered as positive, with respect to all negative data points.
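To make the link between TPR, FPR and the AUC concrete, the sketch below traces a ROC curve by lowering the decision threshold one sample at a time and then integrates the curve with the trapezoidal rule. The labels and scores are invented, the function names are hypothetical, and the sketch assumes that all scores are distinct (tied scores would need to be grouped).

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold downwards."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit samples from the highest score to the lowest.
    for _, label in sorted(zip(scores, y_true), key=lambda pair: -pair[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented labels (1 = CAD, 0 = Normal) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)
```

For this example, 8 of the 9 positive-negative pairs are ranked correctly, and the computed area equals 8/9, which illustrates the interpretation of AUC as the probability that a positive sample is scored above a negative one.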

Statistical Analysis
It was found from the literature that in previous experiments the feature "BBB" was not used so this feature was removed from the dataset. Figure 8 shows the results of running the code for obtaining the descriptive statistics of the modified dataset.
It is seen that the dataset contains a total of 303 rows and 55 columns, with column names from "Age" to "Cath". The data types include 5 floats, 29 integers and 21 objects. Hence, the total numbers of Numeric and Categorical Features are 34 and 21, respectively. The dataset does not contain any missing values.

Results of Data Pre-Processing
At the first stage, the raw dataset is processed by One-Hot encoding all the categorical variables into dummy variables; to avoid the dummy variable trap, the last columns were dropped. After encoding the features, it was found that the "Exertional CP" feature had only one unique value in its column, so it was dropped from the dataset. Dropping this feature will be beneficial for the implemented Machine Learning models, as the Algorithms cannot learn anything from a feature that contains only one value, as shown in Figure 9.
Then the further steps of forming the matrices of features, splitting the matrices and feature scaling are applied as mentioned in the Research Methodology chapter.
The Algorithms will be run on the scaled features; the X features will also be re-sampled, and the Algorithms will be tested on the re-sampled features as well.

Results of SMOTE
From the Exploratory Data Analysis section, it was found that 28.7% of the patients are Normal and 71.3% of the patients were diagnosed with Coronary Artery Disease. This shows that there is an unequal distribution of labelled classes in the dataset. In order to fix the issue, the SMOTE Algorithm is implemented.
The results of implementing SMOTE are shown in the following two figures. Figure 10 shows the distribution of the target classes before applying SMOTE and the result of applying SMOTE is shown in Figure 11 from which it is seen that the target class count of both the classes is 173.

Logistic Regression: Imbalanced Target Data
Logistic Regression is implemented first on the processed data with imbalanced classes. From Figure 12, the Confusion Matrix of the model shows TP = 40 and TN = 12, giving an accuracy score of 85.25%. The average accuracy score obtained from K-Fold cross-validation is 81.83% with a Standard Deviation of ±5.28%. The ROC-AUC curve shows that the mean AUC obtained is 0.88, with minimum and maximum values of 0.78 and 1, as shown in Figure 13.

Logistic Regression: Balanced Target Data
When Logistic Regression is run on the balanced dataset, however, the values of TP and FN decrease while the values of TN and FP increase by 1, and the accuracy stays the same, as shown in Figure 14.
From Figure 15, the average accuracy score increases to 89.61% and the standard deviation decreases when the Algorithm is run on the balanced dataset.
The mean AUC score achieved on the balanced dataset is 0.94.

Support Vector Machine: Imbalanced Target Data
The Confusion Matrix of Support Vector Machine in Figure 16 shows that the accuracy is about 87%. TP and TN were 41 and 12. Values of FN and FP are 2

Artificial Neural Network: Imbalanced Target Data
Like the Support Vector Machine, the Artificial Neural Network trained on the imbalanced dataset produced a TP of 41; the FP value is also the same, while FN is 7 and TN is 11 (Figure 20). The cross-validation score obtained is 83.08%, which is higher than both Logistic Regression and Support Vector Machine. The mean AUC obtained from the ROC-AUC curve is 0.90 with a standard deviation of ±0.05. This mean AUC is 0.01 less than that of the Support Vector Machine trained on the imbalanced target classes, as shown in Figure 21.

Artificial Neural Network: Balanced Target Data
The highest accuracy score from the Confusion Matrix is found when the Artificial Neural Network is trained on the balanced target classes with the model be-

Impacts of Training on Imbalanced and Balanced Datasets
The results obtained from training the models on the dataset with imbalanced classes show that the Support Vector Machine performed better than the other two Algorithms, as it had the highest accuracy. This is also evident from the ROC curves of all the Algorithms trained on the imbalanced dataset, in which the Support Vector Machine performed better, as shown in Table 2 and Table 3.
Drastic improvements in results are seen after applying SMOTE. The results from Table 4 and Table 5 show that the performance of all the Algorithms improved significantly: Logistic Regression and Support Vector Machine saw increases in accuracy of 7.78 and 6.19 percentage points, and the models' performance was much more stable, as depicted by the reduction in their Standard Deviations. The most significant improvement is observed in the performance of ANN, whose accuracy score increased from 83.08% to 93.35%, an increase of 10.27 percentage points. Moreover, ANN was stable during training on both forms of the dataset and was even more stable when trained on the balanced dataset, as the value of its Standard Deviation reduced. As expected with the improved results, the classifiers' AUCs improved, with SVM and ANN having the same values.
Results show that all the Algorithms trained on the balanced dataset produce better performance. According to [32], in certain areas such as fraud detection, medical diagnosis and risk management, severely imbalanced class distribution is relatively common and is a concerning problem. ML Algorithms are built to reduce errors. As the probability of instances belonging to the majority class is very high in imbalanced datasets, the Algorithms are most likely to assign new observations to the majority class. Also, in real life, the cost of a False Negative is usually much larger than that of a False Positive, yet ML Algorithms penalise both with a similar weight.

Ideal Machine Learning Model
To evaluate the performance of the implemented Algorithms, various metrics are used. One of them is the Confusion Matrix, which summarises a model's predictions and from which it is possible to calculate performance measures such as Accuracy, False Positive Rate and so on. However, the accuracy obtained from the Matrix is not enough to give an accurate measure of a model's accuracy, as the dataset is split at one particular point. Hence, K-Fold Cross Validation is used to split the dataset K times, where K = 10. The value 10 is chosen because it is the value commonly used in the existing literature. In the K-Fold Cross Validation method, the entire dataset is split K times, each time keeping one subset as a test set and the remainder as a training set, as discussed in the Methodology chapter; finally, the average of the accuracies obtained from every training run is calculated. These results have already been discussed in the previous Subsection. In this Subsection, the results obtained from the Confusion Matrix will be discussed. As the experiment found that an imbalanced dataset strongly reduces the predictive capabilities of Machine Learning models, only the results obtained from the Confusion Matrices of Algorithms trained on the balanced dataset are considered for comparison; they are provided in Table 6.
From the table, it can be seen that ANN and SVM have higher TP and TN values than LR. When selecting an ideal model, the FN and FP values must be taken into consideration. The FN of ANN is 1, meaning that, out of the test set (61 patients), the model predicted one patient to be normal when the patient actually has CAD. The FP value is 4, which means that four patients were classified as having the disease when they are actually normal. In contrast, LR results in values of 4 and 5 for FN and FP, with TN = 13 and TP = 39, whereas SVM shows results similar to ANN.

Conclusion and Future Recommendations
In this research, a prototype system for the detection of Coronary Artery Disease is built using Logistic Regression, Support Vector Machine and Artificial Neural Network in order to compare the Algorithms. The dataset used for the research contains medical records of patients who visited Shaheed Rajaei Cardiovascular, Medical and Research Center of Tehran, Iran. Statistical Analysis showed that the dataset does not contain any missing values; however, the Exploratory Data Analysis made it evident that there is a class imbalance in the dataset, as there are more patients with CAD than Normal patients. To solve this issue, the SMOTE Algorithm was applied to balance the dataset. The required pre-processing steps were carried out before the Algorithms were implemented, and the Algorithms were then compared on both the balanced and imbalanced datasets. Results show that the performance of Support Vector Machine and Artificial Neural Network improved significantly when trained on the balanced dataset; however, the overall accuracy of Logistic Regression stayed the same on both sets of data. Various performance metrics were used in the research; the accuracies were cross-validated and ROC curves were plotted for each fold. Overall, the Artificial Neural Network had the highest average accuracy of 93.35% ± 2.56% and AUC of 0.98 ± 0.02, whereas the Support Vector Machine came quite close with an accuracy of 91.37% ± 3.50% and the same AUC value. In contrast, Logistic Regression performed CAD prediction with an accuracy of 89.61% ± 4.96% and an AUC value of 0.94 ± 0.05.

Future Recommendations
A limitation of this research is the size of the dataset; hence, working on a larger dataset with more features could be a valuable extension to this research. A larger dataset containing patients from different geographic locations would therefore be ideal. High Blood Cholesterol is another risk factor which is not present in the dataset. Heavy drinking of alcohol and the use of drugs, which could lead to increased blood pressure, stroke and so on, could also be considered as contributing risk factors [33]. Risk factors for women could also be included, such as Menopause and the emerging non-traditional risk factors for women mentioned in the research work of [34]: preterm delivery, hypertensive disorders of pregnancy, gestational diabetes, autoimmune disease, breast cancer treatment and depression.
with whom I discussed all the relevant aspects of this research. Finally, I would like to express my gratitude towards Dr. Zahra Alizadeh Sani, Roohallah Alizadehsani and Mohamad Roshanzamir who are donors and creators of the dataset. They have published the data online and it is free to use for research purposes. Without their contribution, it would not have been possible to conduct this research.