This paper presents the design and implementation of an automatic heart disease diagnosis system in MATLAB. The Cleveland heart disease data set served as the main database for training and testing the developed system. Two systems were developed: the first is based on a Multilayer Perceptron (MLP) Artificial Neural Network (ANN), and the second on the Adaptive Neuro-Fuzzy Inference System (ANFIS) approach. Each system has two main modules, training and testing, for which 80% and 20% of the Cleveland data set were randomly selected, respectively. Each system also has an additional case-based module, in which the user enters values for the 13 attributes specified by the Cleveland data set in order to determine whether heart disease is present or absent in a particular patient. In addition, the effects of different values of important parameters were investigated in the ANN-based and Neuro-Fuzzy-based systems in order to select the parameters that yield the highest performance. On the training data set, the Neuro-Fuzzy system outperforms the ANN system, with best accuracies of 100% and 90.74%, respectively. On the testing data set, however, the ANN system outperforms the Neuro-Fuzzy system, with best accuracies of 87.04% and 75.93%, respectively.
Heart disease has recently become one of the most prevalent diseases from which people suffer. According to statistics, it is one of the leading causes of death worldwide (CDC’s report). Many factors, such as clinical symptoms and the relationship between the functional and pathologic manifestations of heart disease and those of other organs, complicate diagnosis and delay the correct diagnostic decision. Diagnosing heart disease is therefore an essential matter in the health care industry, and many researchers have tried to develop medical decision support systems (MDSS) to help physicians. These systems are developed to reduce diagnosis time and improve diagnostic accuracy, in addition to supporting an increasingly complicated diagnostic decision process [
Currently, hospital information systems that use decision support systems have various tools for obtaining data, but these tools remain limited. They can answer only simple queries such as “identify the single male patients below 20 years of age who have been treated for heart attack”. They cannot answer complex queries such as “given patient records, predict the probability of a patient getting heart disease” [
According to [
This paper presents a decision support system for heart disease classification using a neural network. The data set used is the Cleveland Heart Database, taken from the UCI Machine Learning Repository and donated by Detrano. The data set is divided into two classes: 0, corresponding to the absence of disease, and 1, corresponding to its presence.
The rest of the paper is organized as follows. Related work is presented in Section 2. In Section 3, research algorithms and concepts are described. The design and implementation details of the automated heart disease diagnosis system are presented in Section 4. In Section 5, experimental results are presented and discussed in detail. The study is concluded in Section 6.
Various classification algorithms have been applied to heart disease data sets, and high classification accuracies have been reported over the last decade. The Cleveland heart disease database is one of the most accurate existing databases. Robert Detrano created this database at the V.A. Medical Center, Long Beach and the Cleveland Clinic Foundation in 1988. Since then, researchers have worked extensively on classifying its data with various classification algorithms and have obtained different accuracy results. The work presented in [
In addition to artificial neural networks, fuzzy expert systems are also used in MDSS. For instance, a fuzzy expert system was proposed in 2007 to determine a patient’s heart disease risk, and the result of this system was 79% [
Important concepts, architecture theory, and algorithm for Neural Network and Neuro-Fuzzy are described in this section.
A Neural Network (NN), also referred to as an Artificial Neural Network (ANN), is a computational model whose functions and methods are based on the structure of the brain. A neural network follows a graph topology in which neurons are the nodes of the graph and weights are its edges. It consists of a finite number of layers, which keeps the problem-solving time bounded. In this paper, a neural network is used because of its potential for supporting medical decision support systems.
For large data sets, it offers cost-effective and flexible non-linear modeling because the optimization is easy. In addition, it is accurate in predictive inference. Another important factor is that these models can ease knowledge dissemination by providing explanations, for instance through rule extraction or sensitivity analysis [
In an ANN, neurons can be arranged in various ways and the weights (connections between neurons) can follow different patterns; this arrangement is called the neural network architecture. There are different types of architectures, such as feed-forward, feed-back, fully interconnected net, competitive net and so forth. Some of the most important architectures are introduced in [
A fully recurrent network is the simplest architecture, in which every neuron is connected to every other neuron. A simple recurrent network resembles a fully recurrent network, except that the neurons are not fully connected. A competitive network is the same as a single-layer feed-forward architecture, with additional connections between the outputs. Among the aforementioned architectures, the feed-forward architecture is the most suitable in terms of time for a large amount of data.
The process of training the network aims to achieve the expected output by changing the weights of the connections between network layers. There are three types of network training:
・ Supervised Training: A series of sample inputs is presented to the network and the resulting outputs are compared with the expected responses.
・ Unsupervised Training: This process is used when the outputs for the training input vectors are unknown.
・ Reinforcement Training: This process indicates only whether the output result is correct.
In this paper, supervised training is used since it is based on the Cleveland database, whereby all input and expected output data are available.
In this research, MLP is used as the neural network model since it follows the feed-forward architecture and supervised training. A perceptron network usually has an input layer, an output layer, and one or more hidden layers in between.
MLP uses back propagation as its training algorithm. This algorithm repeatedly presents the input data to the neural network. In each iteration, the output is compared with the desired output, and the error is computed and fed back (back propagated) to the network. This feedback is used to modify the weights of the neurons. Over successive iterations, the output approaches the desired one [
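The training loop described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used in this work; the single hidden layer, sigmoid activations, squared-error loss, learning rate, random seed, and the toy AND data set are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output for one input."""
    # Each weight row holds the input weights followed by one bias weight.
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden + [1.0])))
    return hidden, output

def mean_squared_error(data, w_hidden, w_out):
    return sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data) / len(data)

def train(data, n_hidden=3, epochs=500, lr=0.5, seed=0):
    """Train a one-hidden-layer MLP with per-sample back propagation."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Error term at the output (squared error through the sigmoid).
            delta_out = (output - target) * output * (1.0 - output)
            # Error fed back to each hidden neuron through the output weights.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Weight updates: output layer first, then hidden layer.
            for j, h in enumerate(hidden + [1.0]):
                w_out[j] -= lr * delta_out * h
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= lr * delta_hidden[j] * v
    return w_hidden, w_out

# Toy data (logical AND), used only to show that the error shrinks.
toy_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
initial_error = mean_squared_error(toy_data, *train(toy_data, epochs=0))
final_error = mean_squared_error(toy_data, *train(toy_data, epochs=500))
```

Calling `train` with `epochs=0` returns the untrained weights for the same seed, so `initial_error` and `final_error` can be compared directly.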
Another model used in this work is Neuro-Fuzzy, which combines fuzzy logic and neural networks in order to solve a wide variety of real-world problems effectively. The combination removes the limitations of each model: neural networks are good at recognizing patterns but not at explaining how they reach their decisions, whereas fuzzy logic systems can reason with inexact information and explain their decisions well but are not good at acquiring the rules they use to make those decisions [
In this work, a well-known architecture for the Neuro-Fuzzy approach, the Adaptive-Network-based Fuzzy Inference System (ANFIS), is used as introduced in [
This section highlights all aspects regarding the data set, design, and implementation for the automatic heart disease diagnosis system.
The data set is an important aspect of developing this kind of system. The Cleveland data set is well known and has been widely used as a benchmark for heart disease diagnosis systems. Therefore, the Cleveland data set for heart diseases is used in this project.
The Cleveland data set contains a total of 303 instances with 13 medical attributes (factors) that are acquired from the Cleveland heart disease data set [
Heart Disease (Cleveland) Data Set
No. | Property | Value |
---|---|---|
1 | Type | Classification |
2 | Features | 13 |
3 | Classes | 5 |
4 | Origin | Real World |
5 | (Real/Integer/Nominal) | (13/0/0) |
6 | Missing Values? | Yes |
7 | Total Instances | 303 |
8 | Instances without Missing Values | 297 |
Attribute | Description |
---|---|
Age | Age in years. |
Sex | Value (“1” = male, “0” = female). |
CP | Chest pain type (“1” = typical angina, “2” = atypical angina, “3” = non-anginal pain, “4” = asymptomatic). |
Trestbps | Resting blood pressure in mm Hg. |
Chol | Serum cholesterol in mg/dl. |
Fbs | Indicator of whether fasting blood sugar was > 120 mg/dl (“1” = yes, “0” = no). |
Restecg | Resting electrocardiographic results (“0” = normal, “1” = ST-T wave abnormality, “2” = probable or definite left ventricular hypertrophy). |
Thalach | Maximum heart rate achieved. |
Exang | Indicator of whether the angina is exercise induced (“1” = yes, “0” = no). |
Oldpeak | ST depression induced by exercise relative to rest. |
Slope | The slope of the peak exercise ST segment (“1” = up sloping, “2” = flat, “3” = down sloping). |
Ca | Number of major vessels colored by fluoroscopy. |
Thal | Summary of heart condition (“3” = normal, “6” = fixed defect, “7” = reversible defect). |
Num | The “Disease Diagnosis” field, which refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. H0 denotes the absence of heart disease, and H1, H2, H3, and H4 denote its presence. |
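A record holding the 13 input attributes listed above could be represented as in the following sketch. It is written in Python for illustration only (the system in this paper is implemented in MATLAB); the class name, the helper functions, and the example values are hypothetical, and only the coded value sets that do not depend on the sex encoding are checked beyond membership in {0, 1}.

```python
from dataclasses import dataclass, fields, replace

@dataclass
class PatientRecord:
    """One case with the 13 input attributes of the attribute table."""
    age: float
    sex: int
    cp: int
    trestbps: float
    chol: float
    fbs: int
    restecg: int
    thalach: float
    exang: int
    oldpeak: float
    slope: int
    ca: int
    thal: int

# Allowed codes for the categorical attributes, as listed in the table.
ALLOWED_CODES = {
    "sex": {0, 1},
    "cp": {1, 2, 3, 4},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "slope": {1, 2, 3},
    "thal": {3, 6, 7},
}

def invalid_attributes(record):
    """Return the names of coded attributes whose value is not in the table."""
    return [name for name, allowed in ALLOWED_CODES.items()
            if getattr(record, name) not in allowed]

def to_vector(record):
    """Flatten a record into a 13-element numeric input vector."""
    return [float(getattr(record, f.name)) for f in fields(record)]

# Made-up values for illustration; not taken from the data set.
example = PatientRecord(age=54, sex=1, cp=4, trestbps=130, chol=250, fbs=0,
                        restecg=0, thalach=150, exang=0, oldpeak=1.0,
                        slope=2, ca=0, thal=3)
bad_example = replace(example, thal=5)
```

A validation step of this kind would fit naturally in front of a case-based module, so that out-of-range codes are rejected before classification.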
It is important to highlight that the Cleveland data set was randomly divided into two subsets, training and testing, which comprise 80% and 20% of the total Cleveland data set, respectively.
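A random 80/20 division of this kind can be sketched as below. The Python code is illustrative only; the function name, the fixed seed, and the record count of 100 used in the example are assumptions, not details of the system described here.

```python
import random

def split_indices(n_records, train_fraction=0.8, seed=42):
    """Shuffle record indices and split them into training and testing parts."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(train_fraction * n_records)
    return indices[:n_train], indices[n_train:]

# Example with a hypothetical data set of 100 records.
train_idx, test_idx = split_indices(100)
```

Fixing the seed keeps the split reproducible across runs, which matters when several parameter settings are compared on the same partitions.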
Experiments with the Cleveland data set have concentrated on simply attempting to distinguish presence (values H1, H2, H3, and H4) from absence (value H0). Therefore, two main outputs are identified, where the value H0 means heart disease is absent from the patient, and the values H1, H2, H3, and H4 mean heart disease is present in the patient.
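The mapping from the five diagnosis values to the two output classes can be written as a short sketch. The Python function below is illustrative and its name is hypothetical; it only encodes the rule stated above (0 for H0, 1 for H1 to H4).

```python
def to_binary_label(num):
    """Map the diagnosis value (0..4) to 0 = absent, 1 = present."""
    if num not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected diagnosis value: {num}")
    return 0 if num == 0 else 1

labels = [to_binary_label(n) for n in [0, 1, 2, 3, 4]]
```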
The programming language used for developing the automated heart disease diagnosis system is MATLAB, a powerful language for data analysis and visualization. Many programming languages are used in data mining, so it is worth stating the reasons for choosing MATLAB as the data mining tool for this paper. The first advantage is portability: users have the same range of basic functions at their disposal. The second advantage is domain-specific representation: in a MATLAB implementation, all data take the form of matrices [
As stated earlier, there are two main systems, whereby the first one is based on ANN and the second one is based on Neuro-Fuzzy. Each system has three main modules namely: Training, Testing, and Case-Based Modules.
Because two approaches, ANN and Neuro-Fuzzy, were used to develop the automated heart disease diagnosis system, experiments were conducted on each system and the results were analyzed in order to compare the performance of the two approaches.
Disease Diagnosis | Classification | No. of Records | Training Data | Testing Data |
---|---|---|---|---|
H0 | Absent | 164 | 122 | 42 |
H1 | Present | 55 | 40 | 15 |
H2 | Present | 36 | 20 | 16 |
H3 | Present | 35 | 27 | 8 |
H4 | Present | 13 | 11 | 2 |
Total | | 303 | 220 | 83 |
Two main tests were conducted: the first in the training module, where the training data set is tested against the trained Neural Network and Neuro-Fuzzy systems, and the second in the testing module, where the testing data set is tested against them. Certain parameters were modified for both systems in order to optimize performance and to observe their effects on the overall performance of the systems. Finally, the best combination of parameters was selected and used in the systems for future tests. A third test was also conducted, in which users could input values for a specific case and classify whether heart disease is present or absent as shown in
Two main parameters have been explored in training the ANN system, which are the maximum number of epochs and number of hidden neurons. Maximum number of epochs ranges from 1000 to 5000 with an increment of 1000, whereas number of hidden neurons ranges from 5 to 15 with an increment of 5.
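The two ranges above define a grid of 5 × 3 = 15 parameter combinations. A minimal Python sketch of enumerating that grid follows; the dictionary keys are hypothetical names chosen for the example.

```python
from itertools import product

epoch_values = range(1000, 5001, 1000)   # 1000, 2000, ..., 5000
hidden_values = range(5, 16, 5)          # 5, 10, 15

# One entry per experiment, numbered from 1 in epoch-major order.
experiments = [
    {"exp_no": i, "max_epochs": epochs, "hidden_neurons": hidden}
    for i, (epochs, hidden) in enumerate(product(epoch_values, hidden_values), start=1)
]
```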
From
In most systems, the testing data are very important, because a system is evaluated on how well it performs on data from cases it was not trained on. From this perspective, the combination of 5000 epochs and 15 hidden neurons is selected since it achieves the highest accuracy on the testing data set (87.04%). The combination of 1000 epochs and 15 hidden neurons reaches the same testing accuracy, but with a lower training accuracy (83.33% versus 88.43%).
The ANN system also classified the user inputs for a specific case consistently: in all 15 experiments it classified the data as “Absent” of heart disease. Therefore, the case-based module achieved 100% accuracy as far as
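The selection can be reproduced from the reported ANN results with a short Python sketch. The values are copied from the ANN results table; ranking by testing accuracy first and training accuracy second is an assumed tie-breaking rule used for illustration, not a procedure stated in this paper.

```python
# (exp_no, max_epochs, hidden_neurons, training_acc, testing_acc)
ann_results = [
    (1, 1000, 5, 78.70, 77.78), (2, 1000, 10, 82.87, 77.78), (3, 1000, 15, 83.33, 87.04),
    (4, 2000, 5, 82.41, 77.78), (5, 2000, 10, 87.50, 85.19), (6, 2000, 15, 85.65, 83.33),
    (7, 3000, 5, 84.26, 85.19), (8, 3000, 10, 87.96, 85.19), (9, 3000, 15, 88.43, 85.19),
    (10, 4000, 5, 81.48, 70.37), (11, 4000, 10, 88.89, 79.63), (12, 4000, 15, 90.74, 85.19),
    (13, 5000, 5, 83.80, 83.33), (14, 5000, 10, 86.11, 85.19), (15, 5000, 15, 88.43, 87.04),
]

def selection_key(row):
    # Assumed rule: testing accuracy first, training accuracy as tie-breaker.
    return (row[4], row[3])

best = max(ann_results, key=selection_key)
best_exp_no, best_epochs, best_hidden = best[0], best[1], best[2]
```

Under this rule, experiments 3 and 15 tie on testing accuracy and the higher training accuracy of experiment 15 decides the outcome.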
Given separate sets of input and output data, the MATLAB function “genfis2” generates a Fuzzy Inference System (FIS) structure using fuzzy subtractive clustering. When there is only one output, “genfis2” may be used to generate an initial FIS for ANFIS training by first applying subtractive clustering to the data. It accomplishes this by extracting a set of rules that models the data behavior.
The rule extraction method first uses the MATLAB “subclust” function to determine the number of rules and antecedent membership functions, and then uses linear least squares estimation to determine each rule’s consequent equations. The function returns an FIS structure that contains a set of fuzzy rules to cover the feature space. The cluster radius passed to “genfis2” (referred to below as the “genfis2” value) is therefore the only parameter investigated in this research, and it ranges from 0.1 to 1.0 as shown in
From
Exp. No. | No. of Epochs | No. of Hidden Neurons | Recognition Results: Training Data Set (%) | Recognition Results: Testing Data Set (%) | Case-Based Classification |
---|---|---|---|---|---|
1 | 1000 | 5 | 78.70 | 77.78 | Absent |
2 | 1000 | 10 | 82.87 | 77.78 | Absent |
3 | 1000 | 15 | 83.33 | 87.04 | Absent |
4 | 2000 | 5 | 82.41 | 77.78 | Absent |
5 | 2000 | 10 | 87.50 | 85.19 | Absent |
6 | 2000 | 15 | 85.65 | 83.33 | Absent |
7 | 3000 | 5 | 84.26 | 85.19 | Absent |
8 | 3000 | 10 | 87.96 | 85.19 | Absent |
9 | 3000 | 15 | 88.43 | 85.19 | Absent |
10 | 4000 | 5 | 81.48 | 70.37 | Absent |
11 | 4000 | 10 | 88.89 | 79.63 | Absent |
12 | 4000 | 15 | 90.74 | 85.19 | Absent |
13 | 5000 | 5 | 83.80 | 83.33 | Absent |
14 | 5000 | 10 | 86.11 | 85.19 | Absent |
15 | 5000 | 15 | 88.43 | 87.04 | Absent |
Average Results | | | 85.37 | 82.32 | 100% Absent |
Exp. No. | genfis2 Value | No. of Generated Rules | Recognition Results: Training Data Set (%) | Recognition Results: Testing Data Set (%) | Case-Based Classification |
---|---|---|---|---|---|
1 | 0.1 | 215 | 100 | 74.07 | Absent |
2 | 0.2 | 214 | 100 | 74.07 | Absent |
3 | 0.3 | 204 | 100 | 74.07 | Absent |
4 | 0.4 | 186 | 100 | 74.07 | Absent |
5 | 0.5 | 164 | 100 | 75.93 | Absent |
6 | 0.6 | 138 | 100 | 61.11 | Present |
7 | 0.7 | 107 | 100 | 72.22 | Present |
8 | 0.8 | 42 | 100 | 50.00 | Absent |
9 | 0.9 | 23 | 100 | 55.56 | Present |
10 | 1.0 | 19 | 100 | 61.11 | Present |
Average Results | | | 100 | 67.22 | 60% Absent |
achieve higher accuracy using the testing data set. The best performance using the testing data set was at 0.5, which is 75.93%. In fact, “0.5” is the benchmark value for the “genfis2” parameter. It is also noticed that as the value for “genfis2” increases, the number of generated rules decreases.
The Neuro-Fuzzy system, however, did not classify the user inputs for a specific case consistently: in only 6 of the 10 experiments did it classify the data as “Absent” of heart disease, and in the other 4 experiments it failed to do so. Therefore, the case-based module achieved 60% accuracy as far as
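The inverse relation between the cluster radius and the number of clusters (and hence rules) can be illustrated with a simplified subtractive clustering sketch in Python. This is not MATLAB's “subclust” implementation: the single rejection threshold, the squash factor of 1.5, and the one-dimensional toy data are assumptions made for the example.

```python
import math

def subtractive_clustering(points, radius, squash=1.5, reject_ratio=0.15):
    """Simplified subtractive clustering; returns the selected cluster centers."""
    alpha = 4.0 / radius ** 2
    beta = 4.0 / (squash * radius) ** 2

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Potential of each point: high when many other points lie nearby.
    potential = [sum(math.exp(-alpha * sq_dist(p, q)) for q in points) for p in points]
    first_peak = max(potential)
    centers = []
    while True:
        best = max(range(len(points)), key=lambda i: potential[i])
        peak = potential[best]
        # Stop once the remaining peak is small relative to the first one.
        if peak < reject_ratio * first_peak:
            break
        centers.append(points[best])
        # Reduce the potential of points close to the newly chosen center.
        potential = [p_i - peak * math.exp(-beta * sq_dist(pt, points[best]))
                     for p_i, pt in zip(potential, points)]
    return centers

# Eleven evenly spaced points in [0, 1], used only as toy data.
points = [(i / 10,) for i in range(11)]
rules_small_radius = len(subtractive_clustering(points, radius=0.1))
rules_large_radius = len(subtractive_clustering(points, radius=1.0))
```

With a small radius each point suppresses only its immediate neighbourhood, so many centers survive; with a large radius the first center suppresses almost the whole data range.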
In most systems, the testing data are very important, because a system is evaluated on how well it performs on data from cases it was not trained on. From this perspective, the value of the “genfis2” parameter is fixed at “0.5”.
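The choice of 0.5 and the reported average can be checked against the Neuro-Fuzzy results with a few lines of Python. The values are copied from the Neuro-Fuzzy results table; the variable names are hypothetical.

```python
# (genfis2_value, generated_rules, testing_acc)
anfis_results = [
    (0.1, 215, 74.07), (0.2, 214, 74.07), (0.3, 204, 74.07), (0.4, 186, 74.07),
    (0.5, 164, 75.93), (0.6, 138, 61.11), (0.7, 107, 72.22), (0.8, 42, 50.00),
    (0.9, 23, 55.56), (1.0, 19, 61.11),
]

# Setting with the highest testing accuracy.
best_value, best_rules, best_test = max(anfis_results, key=lambda row: row[2])

# Average testing accuracy over the ten experiments, rounded to two decimals.
mean_test = round(sum(row[2] for row in anfis_results) / len(anfis_results), 2)
```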
This research developed two automatic heart disease diagnosis systems, one based on the ANN approach and one based on the Neuro-Fuzzy approach. From both