Coronary Artery Disease (CAD) is the leading cause of mortality worldwide. It is a complex heart disease associated with numerous risk factors and a variety of symptoms. During the past decade, research on CAD has evolved remarkably. The purpose of this research is to build a prototype system using different Machine Learning Algorithms (models) and compare their performance to identify the most suitable model. This paper explores three of the most commonly used Machine Learning Algorithms: Logistic Regression, Support Vector Machine and Artificial Neural Network. To conduct this research, a clinical dataset has been used, and the performance has been evaluated using several methods, such as the Confusion Matrix, Stratified K-Fold Cross-Validation, Accuracy, AUC and ROC. The accuracy and AUC scores have been validated using the K-Fold Cross-Validation technique. Because the dataset contains a class imbalance, the SMOTE Algorithm has been used to balance the dataset, and the performance analysis has been carried out on both sets of data. The results show that the accuracy scores of all the models increased when trained on the balanced dataset. Overall, the Artificial Neural Network has the highest accuracy, whereas Logistic Regression is the least accurate among the trained Algorithms.

Coronary Artery Disease is the number one cause of death worldwide. Of the 56.9 million deaths reported around the world in 2016, more than 54% were due to the top 10 causes of death, among which Ischaemic Heart Disease (Coronary Artery Disease) and Stroke were the biggest killers, and they have remained the leading causes of death globally for the last 15 years [

To function properly, the Heart requires a supply of blood, and the Heart muscles receive blood from the Coronary Arteries. Coronary Artery Disease is the blockage or narrowing of the Coronary Arteries caused by the hardening or clogging of these arteries due to the build-up of cholesterol or fatty deposits, called plaque, on the arteries' inner walls. The plaque can restrict the flow of blood by clogging the artery or by causing abnormal artery tone or function. Therefore, without a proper supply of blood, the heart becomes starved of oxygen and vital nutrients, resulting in Chest Pain. If the blood supply is entirely cut off to a portion of the Heart muscle, or if the energy requirements of the heart exceed the supply of blood, the result is a heart attack [

Machine Learning is known as the technology used for the development of Computer Algorithms with capabilities of mimicking the intelligence of a human being. It draws on ideas from different fields such as Artificial Intelligence, Statistics and Probability, Computer Science, Information Theory, Psychology, Control Theory and Philosophy [

In this study, three different Supervised Machine Learning Algorithms are implemented to predict the presence of CAD in patients, and finally, the performances of the Algorithms are compared to select the ideal Algorithm. The dataset used is the Z-Alizadeh Sani dataset, which provides the clinical records of 303 patients with a total of 54 features related to the disease.

The rest of this paper is organised as follows: in Section 2, a brief survey of the existing literature related to the research topic is provided. The research methodology is described in Section 3, and the analysis of the results obtained after implementing the three Algorithms on the dataset is presented in Section 4. The findings from the research are then discussed in Section 5, and finally, the study is concluded in Section 6 with a number of future recommendations that would improve the existing research.

In 2008, Kurt [

In 2009, Lavesson used Algorithms such as Bagging, AdaBoost and Naive Bayes on the CHAPS data set. The test was done for the prediction of the Acute Coronary Syndrome. From the research, it was found that Naive Bayes had the highest accuracy [

In 2013, Alizadehsani used different Classification Algorithms on the Z-Alizadeh Sani data set, which consists of 303 randomly selected patients who visited the Shaheed Rajaei Cardiovascular, Medical and Research Center in Tehran, Iran. The data set contains 216 patients who had Coronary Artery Disease (CAD), with the rest of the patients free from the disease, and a total of 54 features. Alongside Sequential Minimal Optimisation (SMO), Naive Bayes, Bagging with SMO and a Neural Network, the researchers also introduced a feature selection Algorithm for the creation of three different features. From the test results, it was found that SMO produced the highest accuracy of 94.08% when tested with the three created features [

A hybrid model for the identification and prediction of Coronary Heart Disease (CHD) was proposed by Akila in 2015, and the model was tested on data from patients who were occupational drivers, collected from a medical college and hospital. The model consisted of two stages. In the first stage, risk identification was carried out by classification of physical and biomedical factors using the Decision Tree (DT) Algorithm. In the second stage, the instances identified as CHD risks by the Decision Tree were analysed using a Multi-Layer Perceptron (MLP) with habitation and medical history attributes. The Classification accuracies of DT and MLP were 98.66% and 96.66%, respectively [

In 2016, Lo collected four Heart Disease data sets from the University of California Irvine (UCI) Machine Learning Repository, combined them, removed the missing values and prepared a combined data set containing data on 822 patients, of whom 453 had Coronary Artery Disease (CAD) and the rest did not. The presence of the disease was identified using various risk factors of CAD in the Asian population. The authors used seven Machine Learning Algorithms: Naive Bayes, Artificial Neural Network, Sequential Minimal Optimization, K-Nearest Neighbour, AdaBoost, J48 and Random Forest. The seven methods were also compared against an Ensemble Method called the Voting Algorithm. From the study, it was found that the Voting method predicted the presence of CAD in patients with the highest accuracy [

Interesting research was carried out in 2016 by Alizadehsani, where the stenosis (narrowing) of the major Coronary Arteries of patients from the Z-Alizadeh Sani data set was studied. Two feature selection methods were used to extract the best features, and the variables to consider for stenosis of the major arteries were taken from a medical book. To predict the patients with stenosis, the Support Vector Machine Algorithm was used with various kernel methods, and promising results were obtained. In addition, the Apriori Algorithm was used to decide on whether the arteries were stenotic [

In 2017, Forssen systematically implemented and evaluated two Supervised Learning Algorithms, Penalized Logistic Regression and Random Forest, and compared them to traditional Logistic Regression approaches for Coronary Artery Disease prediction. The data was collected from the Clinical Cohorts in Coronary disease Collaboration (4C) study, containing 3409 recruited patients with acute or stable chest pain from four UK National Health Service (NHS) hospitals. In order to reduce the dimensionality of the data, Principal Component Analysis (PCA) was used; six principal components were selected, which captured more than 95% of the variance of the data. After running the Supervised Algorithms in both adjusted and unadjusted forms with PCA applied, it was found that Penalized Logistic Regression had the highest accuracy in both cases [

A hybrid approach was proposed by Arabasadi in 2017 where the researchers tried to enhance the performance of Neural Networks by using the Genetic Algorithm. The tests were carried out on the Z-Alizadeh Sani Dataset. The results showed that the Neural Network with the use of a Genetic Algorithm produced an accuracy of 93.85% [

In 2018, Meng built a hybrid Algorithm called the two-layer Gradient Boosting Decision Tree, which was compared with two other commonly used Machine Learning Algorithms, Support Vector Machine and Logistic Regression. The Algorithms were run on a created data set consisting of routine blood test results from 15,000 patients. Using this data set, the researchers trained the Algorithms to classify healthy status, coronary heart disease and other diseases. The test results showed that the prediction accuracy of the created Algorithm for the presence of Coronary Heart Disease and other diseases was higher than that of the other Algorithms trained on the data set [

In another study, conducted by Nassif in 2018, the researchers tested the Support Vector Machine, Naive Bayes and K-Nearest Neighbour Algorithms using 10-fold cross-validation on the Cleveland Heart Disease data set to compare the performance of the three Machine Learning Algorithms for the prediction of Coronary Artery Disease. Three different feature selection methods were applied to the data set, and feature-versus-risk-score graphs were plotted to identify the features closely related to the risk of Coronary Artery Disease; the seven best features were selected as input variables for the Algorithms. From the research, it was found that the Naive Bayes Algorithm was the best performer, with an accuracy score of 84% [

In 2019, Shamsollahi used a data set containing clinical data of 282 patients with 58 attributes to compare the performance of various Machine Learning methods for the prediction of Coronary Artery Disease. The researchers used both descriptive (Clustering) and predictive (Classification) methods on the data set. The K-Means Clustering Algorithm was used to cluster the data into three clusters of patients according to their amount of smoking. Then predictive (Classification) Algorithms such as C&RT and CHAID were applied to the Clusters. From the research, it was found that the C&RT Algorithm was the best performer, as it predicted with the highest accuracy figures for the three Clusters [

The data set is collected from the UCI Machine Learning Repository, which contains a collection of data sets widely used by the Machine Learning community. The repository also provides information on the donors and creators of each data set, data information, the attributes of the data set and other relevant information [

The Z-Alizadeh Sani Dataset will be used for the research which contains records of 303 random patients who visited Shaheed Rajaei Cardiovascular, Medical and Research Center of Tehran, Iran. The data set contains features that are related to the diagnosis of Coronary Artery Disease. 216 patients from the data set have the disease and the rest of the patients are normal. The features are grouped into four different categories. If a patient has stenosis (narrowing) of more than 50% in one of their coronary arteries then that patient is diagnosed with the disease [

The data set contains one additional feature, “BBB”, which stands for Blood-brain barrier; however, from the research papers it is found that this feature was removed from the data set before running the Algorithms. It can therefore be inferred that “BBB” is not related to Coronary Artery Disease, so the feature will likewise be dropped from the data set in the current research. The label or target variable is “Cath”, with the value “CAD” for the presence of the disease and “Normal” for a normal patient. The dataset was donated to the UCI Machine Learning Repository on 2017-11-17 [

There are a total of four steps which will be followed to carry out the research. The first step involves the Exploratory Data Analysis, followed by data pre-processing. After processing, the data set is divided into training and test sets on which the Algorithms will be implemented; finally, the Algorithms will be evaluated in the last step.

The following Sections describe the steps in further detail.

This step will be performed to gain useful insights into the collected data set through data visualisations and analysis results. It will help to determine whether the data set has any missing values and to identify the Categorical features, Numerical features and more.

Raw data are often not found in the form and shape that is required for the optimal performance of learning Algorithms. Therefore the preprocessing step is the most important in Machine Learning Applications [

One important aspect of Machine Learning is feature engineering. The Algorithms that will be implemented are only able to read numerical values, so it is important to transform the categorical features into numerical values [

The matrix of features to be used as input variables and the target variable will be taken: the input features will be stored in the variable X and the target in the variable y.

In this part of the Pre-processing stage, the matrix of features X and the target variable y are split using the “train_test_split” function from the “model_selection” module of scikit-learn. 80% of the data will be used for training and the remainder will be used to test the Machine Learning Models implemented in this research. The training and test sets of X will be stored in variables called “X_train” and “X_test”. Likewise, the y training and test sets will be stored in variables named “y_train” and “y_test”.
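The split described above can be sketched with scikit-learn's `train_test_split`; the feature matrix below is a toy stand-in, not the actual Z-Alizadeh Sani data:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Toy stand-in for the feature matrix X and target y.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 80/20 split as described; stratify keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```

The `stratify` argument is an assumption here; it preserves the CAD/Normal proportion across the two sets, which matters for an imbalanced dataset.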

Most data sets contain features of varying ranges, and this is a problem since many Machine Learning Algorithms use the Euclidean distance between two data points. If the features are not scaled, such algorithms only consider the magnitude of the features and produce skewed results, as features with wider ranges weigh more in the distance calculations than features with narrower ranges. Hence, feature scaling is applied to suppress this effect and bring all the features onto the same scale [

To scale the features of the data set, Standardization will be used. Standardization, also known as Z-score Normalization, transforms the features so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean and σ is the standard deviation. The formula used to calculate the standardized values (Z-scores) is as follows:

$z_i = \dfrac{x_i - \bar{x}}{\sigma}$ (1)

where $x_i$ is the value of each feature, $\bar{x}$ is the mean of the features in a column and $\sigma$ is the standard deviation of the values in that column. The implementation will be performed via the “StandardScaler” class from the “preprocessing” module of scikit-learn.
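A minimal sketch of this step with scikit-learn's `StandardScaler` on a toy feature matrix (the column values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training features, e.g. age and blood pressure.
X_train = np.array([[54.0, 120.0],
                    [60.0, 140.0],
                    [48.0, 100.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only

# Each column now has mean 0 and standard deviation 1, per Equation (1).
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

The same fitted `scaler` would then be applied to the test set with `scaler.transform(X_test)` so that no test-set statistics leak into training.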

A data set balancing Algorithm called the Synthetic Minority Oversampling Technique (SMOTE) will be used. SMOTE, developed by [

SMOTE balances a data set by over-sampling the minority class (by creating artificial instances of the minority class) so that its size equals that of the majority class. The Algorithm is given as follows: for each minority sample:

· Find its k-nearest minority neighbours.

· Randomly select q of those neighbours.

· Randomly generate synthetic samples along the lines joining the minority sample and its q selected neighbours (q depends on the amount of oversampling desired) [
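The steps above can be sketched directly in NumPy. In practice a library implementation such as imbalanced-learn's `SMOTE` class would likely be used; the `k` and `n_synthetic` values below are illustrative, not the paper's settings:

```python
import numpy as np

def smote_sample(minority, k=5, n_synthetic=10, seed=None):
    """Generate synthetic minority samples along lines to k-nearest neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        x = minority[i]
        # k-nearest minority neighbours of x (excluding x itself)
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = minority[np.argsort(dists)[1:k + 1]]
        # randomly pick one neighbour and interpolate along the joining line
        nb = neighbours[rng.integers(len(neighbours))]
        gap = rng.random()
        synthetic.append(x + gap * (nb - x))
    return np.array(synthetic)

minority = np.random.default_rng(0).normal(size=(20, 3))
new_points = smote_sample(minority, k=5, n_synthetic=15, seed=1)
print(new_points.shape)  # (15, 3)
```

Each synthetic point lies on the line segment between a minority sample and one of its minority-class neighbours, so the oversampled region stays inside the minority class's neighbourhood.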

There are various Supervised Machine Learning Algorithms such as K-Nearest Neighbours, Decision Tree, Naive Bayes, Support Vector Machine and many more, but throughout the medical literature, it is seen that Support Vector Machine and Neural Network Algorithms are most commonly used [

The Logistic Regression (LR) Model is used for predicting binary outcomes. In fitting the LR equation, the maximum-likelihood ratio is used to determine the statistical significance of the variables [

$P(y) = \dfrac{e^{y}}{1 + e^{y}} = \dfrac{1}{1 + e^{-y}}$ (2)

The LR model for P independent variables can be written as:

$P(y = 0) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_P X_P)}}$ (3)

Here $P(y = 0)$ denotes the probability of the presence of Coronary Artery Disease and $\beta_0, \beta_1, \cdots, \beta_P$ are regression coefficients. There is a linear model hidden within the LR model: the natural logarithm of the ratio of $P(y = 0)$ to $1 - P(y = 0)$ gives a linear model in X:

$g(X) = \ln\!\left(\dfrac{P(y=0)}{1 - P(y=0)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_P X_P$ (4)

where $g(X)$ has some properties of a Linear Regression model, and the independent variables X could be a combination of both continuous and categorical variables [
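Equations (2)-(4) can be illustrated with a short NumPy sketch; the coefficients below are hypothetical, not fitted values from the dataset:

```python
import numpy as np

def logistic_probability(X, beta):
    """P = 1 / (1 + exp(-(b0 + b1*x1 + ... + bP*xP))) -- Equations (2)-(3)."""
    g = beta[0] + X @ beta[1:]          # linear predictor g(X), Equation (4)
    return 1.0 / (1.0 + np.exp(-g))

# Hypothetical intercept and coefficients for two standardized features.
beta = np.array([-1.0, 0.8, 0.5])
X = np.array([[1.2, 0.4],
              [-0.3, -1.1]])
p = logistic_probability(X, beta)
print(p)  # probabilities in (0, 1)
```

Note that the log-odds `log(p / (1 - p))` of the returned probabilities recover the linear predictor exactly, which is the "linear model hidden within the LR model" of Equation (4).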

The Support Vector Machine (SVM) was invented by Vapnik in 1979 and proposed for solving Classification and Regression problems by Vapnik in 1995. SVM is a Supervised Learning Algorithm that uses a non-linear mapping to transform the original training data into a higher-dimensional space; within this new dimension, it searches for the linear optimal separating hyperplane (decision boundary) that separates the tuples of one class from another. With an appropriate nonlinear mapping to a sufficiently high dimension, data can always be separated by a hyperplane. The algorithm finds the hyperplane using support vectors (“essential” training tuples) and margins (defined by the Support Vectors) [

From the research paper of [

to class +1, and any tuple that falls on or below $H_2$ belongs to class −1. Combining the two inequalities gives $y_i(W \cdot X_i + b) \geq 1$ for all $i$. The above problem can be solved by introducing Lagrange multipliers $\alpha_i \geq 0$ ($i = 1, 2, \cdots, m$). The patterns $x_i$ which correspond to non-zero Lagrange coefficients are called support vectors. The resultant decision function has the following form:

$y(x) = \operatorname{sgn}\!\left[\sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b\right]$ (5)

However, Equation (5) is applicable to data samples that are linearly separable. In cases where the data is linearly inseparable, a kernel function is used to operate in a higher-dimensional space without explicitly transforming the data into that space. This notion is known as the kernel trick, which allows the implicit transformation of data to large dimensions for Classification problems [

$y(x) = \operatorname{sgn}\!\left[\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\right]$ (6)

where $K(x_i, x)$ is the kernel function, equal to $(\phi(x_i) \cdot \phi(x))$, and $\phi$ is the non-linear mapping from the original space to the high-dimensional space [

· Linear: $K(x_i, x_j) = x_i^{T} x_j$.

· Polynomial: $K(x_i, x_j) = (\gamma x_i^{T} x_j + r)^{d}$, $\gamma > 0$.

· Radial Basis Function (RBF): $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^{2})$, $\gamma > 0$.

· Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^{T} x_j + r)$.
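The four kernels can be written directly as NumPy functions; the γ, r and d values below are illustrative defaults, not the paper's tuned parameters:

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, gamma=1.0, r=1.0, d=3):
    return (gamma * (xi @ xj) + r) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(xi - xj) ** 2)

def sigmoid_kernel(xi, xj, gamma=0.1, r=0.0):
    return np.tanh(gamma * (xi @ xj) + r)

xi = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])
print(rbf_kernel(xi, xi))  # 1.0: identical points have maximal RBF similarity
```

In scikit-learn the same choices are exposed as `SVC(kernel='linear' | 'poly' | 'rbf' | 'sigmoid')`, with `gamma`, `coef0` (r) and `degree` (d) as parameters.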

Artificial Neural Networks (ANN) are inspired by the human brain, whose incredible processing capabilities are due to interconnected neurons. ANNs are built from processing units called Perceptrons. A single-layer Perceptron is only able to solve linearly separable problems; to solve non-linear problems, Multilayer Perceptrons are used, which contain an input layer, one or more hidden layers and an output layer [

Multi-layer Perceptron is a supervised learning algorithm that learns a function $f(\cdot): R^{m} \rightarrow R^{o}$, where $m$ is the number of dimensions for input and $o$ is the number of dimensions for output. Given a set of features $X = x_1, x_2, \cdots, x_m$ and a target $y$, an MLP can learn a non-linear approximator for either classification or regression.

The Backpropagation Algorithm (BA) is the most commonly used learning technique in Artificial Neural Networks; the following are the steps as described by [

· All network weights are initialised to small random numbers.

· Training data is received as input and the output is computed for each unit with the equation below, known as the Sigmoid Function, where $\bar{w}$ is the vector of weight values and $\bar{X}$ is the vector of input values in the network:

$o = \alpha(\bar{w} \cdot \bar{X}) = \alpha(y) = \dfrac{1}{1 + e^{-y}}$ (7)

· Then the error computation step starts. The BP algorithm works as follows: the error signal ($\delta$), calculated for each network output, is propagated back to all neurons in the network as input.

· The error term $\delta_k$ is calculated for each network output unit using the following equation, where $o_k$ is the network output for output unit $k$ and $t_k$ indicates the desired output for output unit $k$:

$\delta_k \leftarrow o_k(1 - o_k)(t_k - o_k)$ (8)

· The error term $\delta_h$ is calculated for each hidden unit $h$ as below, where $w_{kh}$ denotes the network weight from hidden unit $h$ to output unit $k$:

$\delta_h \leftarrow o_h(1 - o_h) \sum_{k \in \mathrm{outputs}} w_{kh}\, \delta_k$ (9)

· Each network weight is updated as follows, where $\Delta w_{ji} = \eta \delta_j x_{ji}$, $\eta$ is the learning rate and $x_{ji}$ denotes the input from unit $i$ into unit $j$ [

$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$. (10)
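The update rules in Equations (7)-(10) can be sketched for a single training example in NumPy; the network size, learning rate and iteration count below are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))  # Equation (7)

# Tiny network: 3 inputs -> 2 hidden units -> 1 output, small random weights.
W_hidden = rng.normal(scale=0.1, size=(2, 3))
W_out = rng.normal(scale=0.1, size=(1, 2))
eta = 0.5  # learning rate

x = np.array([0.2, -0.4, 0.7])
t = np.array([1.0])  # desired output t_k

for _ in range(100):
    # Forward pass
    o_h = sigmoid(W_hidden @ x)
    o_k = sigmoid(W_out @ o_h)
    # Error terms: Equations (8) and (9)
    delta_k = o_k * (1 - o_k) * (t - o_k)
    delta_h = o_h * (1 - o_h) * (W_out.T @ delta_k)
    # Weight updates: Equation (10), with delta_w = eta * delta_j * x_ji
    W_out += eta * np.outer(delta_k, o_h)
    W_hidden += eta * np.outer(delta_h, x)

print(float(o_k))  # the output moves towards the target as training proceeds
```

The `o_k * (1 - o_k)` and `o_h * (1 - o_h)` factors are the derivative of the sigmoid, which is why the error terms in Equations (8) and (9) take that form.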

The designed MLP consists of an Input Layer, Hidden Layer and an Output Layer as shown in

where $X_1 \cdots X_n$ represent the input features. The neurons in the hidden layer are represented as $A_1 \cdots A_n$. As the Network will solve a binary classification problem, the output layer consists of one neuron, shown as Y. The hidden-layer neurons will be activated by the Rectified Linear Unit function and the output-layer neuron by the Logistic Function, as shown by Equation (2).
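A minimal sketch of such a network with scikit-learn's `MLPClassifier`, using synthetic data in place of the processed CAD features; the hidden-layer size below is illustrative:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

# Synthetic binary data standing in for the processed CAD features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# One hidden layer with ReLU activation; for a binary target,
# MLPClassifier uses a logistic (sigmoid) output unit automatically.
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation='relu',
                    max_iter=1000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))
```

`activation` controls only the hidden layers in scikit-learn; the logistic output unit matching Equation (2) is implied by the binary target.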

It is an evaluation metric which is used to describe the performance of a classifier by calculating evaluation parameters and is shown in

The designed experiment uses two steps to evaluate the implemented Algorithms. Firstly, a stratified K-fold Cross-validation technique will be used for validating the implemented models. In this validation technique, the folds are selected so that the class labels are distributed equally in each fold. The target variable is binary, hence the experiment is a dichotomous classification; this means that each fold contains roughly the same proportions of the two class labels. The data set will be divided into k subsets, where k = 10; each time, one of the k subsets will be used as the test set and the remaining k − 1 subsets as the training set. Therefore every data point will be part of the test set exactly once and part of the training set k − 1 times. The average of the results from the k folds will be taken and a single estimation will be produced. k = 10 is taken because it is the standard value ideally used in research [

The performance of the Algorithms will be compared based on the accuracy obtained on the prediction of CAD, given by Equation (11). The accuracy of the models will be obtained in each fold at the end of training, so with this technique there will be 10 accuracies per model. The average of the accuracies will be taken, and the standard deviation of the accuracies will also be calculated to understand the variance.

 | Actual Positive | Actual Negative |
---|---|---|

Predicted Positive | TP | FP |

Predicted Negative | FN | TN |

$\text{Accuracy} = \dfrac{TP + TN}{TP + FP + FN + TN}$. (11)
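A minimal sketch of the described validation scheme with scikit-learn, using synthetic data in place of the actual dataset:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic binary data standing in for the processed CAD features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# k = 10 stratified folds: each fold preserves the class ratio.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='accuracy')

# Mean accuracy and standard deviation over the 10 folds.
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

Each of the 10 entries in `scores` is the Equation (11) accuracy on one held-out fold; the reported figures are their mean and standard deviation.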

The Receiver Operating Characteristic curve (ROC) and the area under the curve (AUC) will be obtained. A ROC-AUC plot of each model will be generated to visualise the mean AUC of each model. The ROC curve is based on two metrics, the True Positive Rate (TPR) and the False Positive Rate (FPR). The True Positive Rate (TPR), also known as sensitivity, hit rate or recall, is given as:

$\text{TPR} = \dfrac{TP}{TP + FN}$ (12)

Intuitively this metric corresponds to the proportion of positive data points that are correctly considered as positive, with respect to all positive data points. The higher the TPR, the fewer positive data points will be missed. False-positive rate (FPR) or fall-out is given as:

$\text{FPR} = \dfrac{FP}{FP + TN}$ (13)

FPR can also be generated from specificity as:

$\text{FPR} = 1 - \text{Specificity}$ (14)

where specificity is defined as:

$\text{Specificity} = \dfrac{TN}{TN + FP}$ (15)

This metric corresponds to the proportion of negative data points that are mistakenly considered as positive, with respect to all negative data points. In other words, the higher the FPR, the more negative data points will be misclassified. To combine FPR and TPR into one single metric, i.e. to generate the AUC, the two metrics are calculated at different thresholds and then plotted on a single graph with FPR values on the x-axis and TPR values on the y-axis. The resulting curve is called the AUROC, as shown in
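Equations (12) and (13) underpin scikit-learn's `roc_curve` and `roc_auc_score`; a short sketch with hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities from a classifier.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.7, 0.2])

# TPR and FPR (Equations (12)-(13)) computed at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0 here: every positive is scored above every negative
```

Plotting `fpr` against `tpr` gives the ROC curve described above; `auc` is the area under it, with 1.0 meaning perfect ranking and 0.5 meaning a random classifier.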

It was found from the literature that in previous experiments the feature “BBB” was not used so this feature was removed from the dataset.

It is seen that the dataset contains a total of 303 rows and 55 columns, with column names from “Age” to “Cath”. The data types include 5 floats, 29 integers and 21 objects. Hence the total numbers of Numerical and Categorical features are 34 and 21 respectively; the dataset also does not contain any missing values.

At the first stage, the raw dataset is processed by One-Hot encoding all the categorical variables into dummy variables; to avoid the dummy variable trap, the last dummy columns were dropped. After encoding the features, it was found that the “Exertional CP” feature contained only one unique value, so it was dropped from the dataset. Dropping this feature will be beneficial for the implemented Machine Learning models, as the Algorithms will not learn anything from a feature that contains only one value, as shown in
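A minimal sketch of this encoding step with pandas' `get_dummies` on a toy frame; the feature names below are illustrative and need not match the dataset's exactly:

```python
import pandas as pd

# Toy frame with one categorical and one numerical feature.
df = pd.DataFrame({'Sex': ['Male', 'Female', 'Male'],
                   'Age': [54, 61, 48]})

# drop_first=True removes one dummy column per feature, avoiding the
# dummy-variable trap (perfect multicollinearity among the dummies).
encoded = pd.get_dummies(df, columns=['Sex'], drop_first=True)
print(encoded.columns.tolist())  # ['Age', 'Sex_Male']
```

With the `Sex_Female` column dropped, `Sex_Male = 0` already encodes "Female", so no information is lost while the redundant column disappears.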

Then the further steps of taking the matrices of features, splitting the matrices and feature scaling are applied as mentioned in the Research Methodology chapter. The Algorithms will be implemented on the scaled features; the X features will also be re-sampled, and the Algorithms will additionally be tested on the resampled features.

From the Exploratory Data Analysis section, it was found that 28.7% of the patients are Normal and 71.3% were diagnosed with Coronary Artery Disease. This shows that there is an unequal distribution of labelled classes in the dataset. In order to fix this issue, the SMOTE Algorithm is implemented. The results of implementing SMOTE are shown in the following two figures.

Logistic Regression is implemented first on the processed data with imbalanced classes. From

Deviation of ±5.28%. The ROC-AUC curve shows that the mean AUC obtained is 0.88 with minimum and maximum figures of 0.78 and 1 as shown in

However, the values of TP and FP decrease while the values of TN and FN increase by 1, and the accuracy stayed the same when Logistic Regression is run on the balanced dataset, as shown in

From

The Confusion Matrix of Support Vector Machine in

and 6. The average cross-validation accuracy obtained is the same as the Logistic Regression. Compared to the Logistic Regression the average AUC obtained is 0.91 with a standard deviation of ±0.05. The maximum figure is 1 whereas the lowest figure is 0.82 as shown in

Significant improvements in performance are observed when the Support Vector Machine is trained on the balanced dataset. The values of TP and FN stayed the same, while TN improved from 12 to 14 and FP reduced from 6 to 4, as shown in

The improvement in performance is further displayed by the ROC-AUC curve showing a mean AUC of 0.98 with a standard deviation of 0.02 as illustrated in

Like the Support Vector Machine, the Artificial Neural Network trained on the imbalanced dataset produced a TP of 41; however, the FP value is the same, FN is 7 and TN is 11 (

The highest accuracy score from the Confusion Matrix is found when the Artificial Neural Network is trained on the balanced target classes, with the model scoring 91.80%. From

and TN are the same as Support Vector Machine trained on the balanced dataset.

The highest cross-validation score is achieved with a low standard deviation, and the mean AUC obtained is the same as the Support Vector Machine's, with the same standard deviation of the mean AUC value. The average cross-validation score obtained is 93.35% and the standard deviation is ±2.56%. The results show that the model was quite stable, with an AUC of 1 appearing five times, 0.99 and 0.97 each appearing twice, and the lowest score being 0.96 as shown in

The results obtained from training the models on the dataset with imbalanced classes show that the Support Vector Machine performed better than the other two Algorithms, as it had the highest accuracy. This is also evident from the ROC curves of all the Algorithms trained on the imbalanced dataset, where the Support Vector Machine again performed best, as shown in

Drastic improvements in results are seen after applying SMOTE. The results from

Algorithm | Average Accuracy and Standard Deviation |
---|---|

Logistic Regression | 81.83% ± 5.28% |

Support Vector Machine | 85.18% ± 7.99% |

Artificial Neural Network | 83.08% ± 5.34% |

Algorithm | Mean AUC and Standard Deviation |
---|---|

Logistic Regression | 0.88 ± 0.06 |

Support Vector Machine | 0.91 ± 0.05 |

Artificial Neural Network | 0.90 ± 0.05 |

Algorithm | Average Accuracy and Standard Deviation |
---|---|

Logistic Regression | 89.61% ± 4.96% |

Support Vector Machine | 91.37% ± 3.50% |

Artificial Neural Network | 93.35% ± 2.56% |

Algorithm | Mean AUC and Standard Deviation |
---|---|

Logistic Regression | 0.94 ± 0.05 |

Support Vector Machine | 0.98 ± 0.02 |

Artificial Neural Network | 0.98 ± 0.02 |

Results show that all the Algorithms trained on the balanced dataset produce better performance. According to [

To evaluate the performance of the implemented Algorithms, various metrics are used, one of which is the Confusion Matrix. It gives results on various aspects of a model, from which it is possible to calculate performance measures such as Accuracy, False Positive Rate and so on. However, the accuracies obtained from the Matrix are not enough to give an accurate measure of a model's accuracy, as the dataset is split at a particular point. Hence, K-Fold Cross Validation is used to split the dataset K times, where K = 10. The value 10 is chosen because it is the commonly used value found in the existing literature. In the K-Fold Cross Validation method, the entire dataset is split K times, where one set is kept as the test set and the remainder as the training set, as discussed in the Methodology chapter; finally, the average of the accuracies obtained from every training run is calculated. Those results have already been discussed in the previous Subsection. In this Subsection, the results obtained from the Confusion Matrix will be discussed. As the experiment found that an imbalanced dataset strongly reduces the predictive capabilities of Machine Learning models, only the results obtained from the Confusion Matrices of the Algorithms trained on the balanced dataset are considered for comparison; they are provided in the following

From the table, it can be seen that ANN and SVM have higher TP and TN values than LR. When selecting an ideal model, the FN and FP values must be taken into consideration. The FN of ANN is 1, meaning that out of the test set (61 patients), for one patient the model predicted that the patient is normal when the patient actually has CAD. The FP value is 4, which means that four patients were classified as having the disease although they are actually normal. In contrast, LR results in values of 4 and 5 for FN and FP, with TN = 13 and TP = 39, whereas SVM shows results similar to ANN.

Algorithm | TP | FN | FP | TN |
---|---|---|---|---|

Logistic Regression | 39 | 4 | 5 | 13 |

Support Vector Machine | 41 | 2 | 4 | 14 |

Artificial Neural Network | 42 | 1 | 4 | 14 |

In this research, a prototype system for the detection of Coronary Artery Disease is built using Logistic Regression, Support Vector Machine and Artificial Neural Network, and the Algorithms are compared. The dataset used for the research contains medical records of patients who visited the Shaheed Rajaei Cardiovascular, Medical and Research Center of Tehran, Iran. After performing Statistical Analysis on the data set, it was found that the dataset does not contain any missing values; however, from the Exploratory Data Analysis, it is evident that there is a class imbalance in the dataset, as there are more patients with CAD than Normal patients. To solve this issue, the SMOTE Algorithm is applied to balance the dataset. The Algorithms have then been compared on both the balanced and imbalanced datasets, with the required pre-processing steps carried out before the Algorithms were implemented. Results show that the performance of the Support Vector Machine and the Artificial Neural Network significantly improved when trained on the balanced dataset; however, the overall accuracy of Logistic Regression stayed the same on both sets of data. Various performance metrics were used in the research, the accuracies were cross-validated and the ROC curves were plotted for each fold. Overall, the Artificial Neural Network had the highest average accuracy of 93.35% ± 2.56% and an AUC of 0.98 ± 0.02, whereas the Support Vector Machine came quite close with an accuracy of 91.37% ± 3.50% and the same AUC value. In contrast, Logistic Regression performed CAD prediction with an accuracy of 89.61% ± 4.96% and an AUC value of 0.94 ± 0.05.

Future Recommendations

A limitation of this research is the size of the dataset, hence working on a larger dataset with more features could be a valuable extension to this research; a larger dataset containing patients from different geographic locations would therefore be ideal. High Blood Cholesterol is another risk factor which is not present in the dataset. Heavy drinking of alcohol and use of drugs, which can lead to increased blood pressure, stroke and so on, could also be considered as contributing risk factors [

There are a number of people to whom I would like to express my acknowledgements. First, I would like to thank my Supervisor, Mr. H M Mostafizur Rahman, for providing constructive feedback on which I have reflected to prepare better work. A special thanks to Mrs. Tanzila Islam, who inspired me to study Artificial Intelligence during the time I was completing my Higher National Diploma. I would also like to thank my dear friend Md Ashiqur Rahman, with whom I discussed all the relevant aspects of this research. Finally, I would like to express my gratitude towards Dr. Zahra Alizadeh Sani, Roohallah Alizadehsani and Mohamad Roshanzamir, the donors and creators of the dataset. They have published the data online, and it is free to use for research purposes. Without their contribution, it would not have been possible to conduct this research.

The authors declare no conflicts of interest regarding the publication of this paper.

Dipto, I.C., Islam, T., Rahman, H.M.M. and Rahman, M.A. (2020) Comparison of Different Machine Learning Algorithms for the Prediction of Coronary Artery Disease. Journal of Data Analysis and Information Processing, 8, 41-68. https://doi.org/10.4236/jdaip.2020.82003