A consistency contribution based bayesian network model for medical diagnosis

This paper presents an effective Bayesian network model for medical diagnosis. The proposed approach consists of two stages. In the first stage, a novel feature selection algorithm with consideration of feature interaction is used to get an undirected network to construct the skeleton of BN as small as possible. In the second stage for greedy search, several methods are integrated together to enhance searching performance by either pruning search space or overcoming the optima of search algorithm. In the experiments, six disease datasets from UCI machine learning database were chosen and six off-the-shelf classification algorithms were used for comparison. The result showed that the proposed approach has better classification accuracy and AUC. The proposed method was also applied in a real world case for hypertension prediction. And it presented good capability of finding high risk factors for hypertension, which is useful for the prevention and treatment of hypertension. Compared with other methods, the proposed method has the better performance.


INTRODUCTION
Machine learning classification techniques have been applied to medical diagnosis tasks [1][2][3][4].With many advantages over other methods, Bayesian network has become a promising tool in medical diagnosis [5][6][7][8][9].It can reason under uncertainty by providing a concise representation of a joint probability distribution over a set of random variables [6,10].Due to its graphical representation, a Bayesian network is easily understandable [11].In addition, the diagnostic performance with Bayesian networks is often surprisingly insensitive to imprecision in numerical probabilities [12].
Bayesian network based diagnosis requires the learning of the Bayesian network structure, which is an op-timization problem in the search space of the DAGs (directed acyclic graph).In general, researchers treat this problem in two distinctly different ways: the constraintbased method [13] and the search-and-scoring method [14,15].The constraint-based methods are computationally effective.However, the search-and-scoring methods may produce more accurate results than the constraintbased methods [15].In addition, constraint-based methods may fail to provide orientations for all edges in a network [13].In this work we focus on structure learning and consider search-and-scoring methods.
When learning the structure of a Bayesian network model, however, search-and-scoring methods still face great challenges [16], because the true structure may not be identified by a finite number of observations, and it may be computationally intractable when the number of possible network structures grows super exponentially.To prune the search space, the existing methods depend on either setting a uniform bound on the number of parents [17], or conditional dependency test [18], which may degrade the performance of classification.As for the computation efficiency, hill-climbing search algorithm is the most commonly used because of its good trade-off between accuracy of the obtained model and ease of implementation.However, the well-known problem, myopia, may hinder the algorithm from finding the global optima.
In this paper, a new Consistency-Contribution based Bayesian Network algorithm, named CCBN, is proposed for medical diagnosis.To build the skeleton of Bayesian network, a novel feature selection method with the consideration of feature interaction is used to efficiently construct an undirected network.Further more, to enhance the performance of search algorithm, "AND" or "OR" strategy [19] is used to prune search space, while random start, beam-search and look-ahead help to overcome the optima of hill-climbing algorithm.To evaluate the effectiveness of the proposed algorithm in various medical problems, six disease datasets from the UCI machine learning repository were selected, and six offthe-shelf classifiers were used for comparison.Moreover, Copyright © 2010 SciRes.JBiSE a real world application on hypertension dataset was chosen to present its capability of finding high risk factors for hypertension.Its classification performance was further evaluated against other algorithms.The rest of this paper is organized as follows.Section 2 introduces the necessary preliminaries about Bayesian network.In Section 3, the proposed classifier is described in detail.Then experiments on six disease datasets from the UCI machine learning repository are presented in Section 4. Later in Section 5, the proposed method is applied in a real world case for hypertension prediction to further evaluate its performance.Consequently in Section 6, the main contribution of this paper is summarized.

Bayesian Network
A Bayesian network B = (G, Θ) consists of two parts [7].The first part is a network structure, which is a directed acyclic graph (DAG) G with each variable represented as a node.The edges in the DAG represent statistical dependencies between the variables.The second part, Θ, is a set of parameters describing the probability density.

i X
The problem of learning a BN can be stated as follows: given a finite data set of instances of variable X, find a minima set Ф to construct a graph structure G with proper parameters Θ that best matches D. In the commonly used search-and-scoring methods, a scoring function is usually used to evaluate how well the DAG G explains the data and then to search for the best DAG that optimizes the scoring function.In our proposed algorithm, Bayesian scoring criterion [14] is chosen.It computes the posteriori probability of a network given the data and prior knowledge [20].

Feature Selection
To get the skeleton of a BN, Ф, is a solution to feature selection [21].In real-world classification problems, feature selection is indispensable to identify the most useful relevant features from data, and remove the irrelevant and redundant features [22].Here, we define the feature relevant as that in [14] i F To find the relevant features, many effective algorithms have been developed [23,24], but they most often assume that features are independent.Due to attribute interaction [25], however, there are many features that can be considered as irrelevant when evaluated independently; but when they are combined with other features, their relevancy may be significant.Thus, unintentional removal of these features could lead to poor classification performance because of the loss of useful information.
Jakulin and Bratko [26,27] attempted to address this issue, but their methods can only deal with low order interaction.Later, Zhen Zhao [28] presented a more efficient algorithm by searching for interacting features.In this paper, we will use consistency measure to construct the skeleton of BN with consideration of feature interaction.

Overview
In this section, we present our consistency measure based Bayesian Network algorithm, named CCBN.The algorithm is a two-step process: Firstly, an undirected network is constructed as the prototype of a BN's skeleton Ф based on symmetrical uncertainty and consistency hypothesis.This process finds a set that is as small as possible, but can deal with feature interactions in variable selection, and contain all parents, children and co-parents (attribute interaction) of each variable.In the second step, several methods are integrated together to enhance the performance of search algorithm by restricting the search space and overcoming the myopia of Hill-climbing search.

Constructing the Skeleton of BN
To construct the skeleton of a BN, an undirected network will be discovered in two phases.In phase-one, all variables (features) are ranked with regard to their SU (symmetrical uncertainty) and stored in the candidate set PC in descending order.In the second phase, the backward phase, features with less consistency-contribution will be removed from the candidate set PC.This elimination process is based on the consistency measure and takes into account feature interaction when removing both irrelevant and redundant features.In the end, the undirected network is obtained with candidate set PC.It will be used as the skeleton of BN.
As a fast correlation measure [24], SU is based on mutual information to measure the common information shared by two variables.It attempts to increase the chance for a strongly relevant feature to remain in the selected subset.The SU between the class label C and a feature is formulated as: where H(F i ) and H(C) are the marginal entropies, and M(F i ,C) is the joint entropy.
This formula compensates for information gain's bias toward features with more values and normalizes its values to the range [0,1] with the value 1 indicating that Copyright © 2010 SciRes.JBiSE knowledge of the value of the feature F i can completely predict the value of class label C and the value 0 indicating that the feature F i and class label C is independent.After all features are ranked in descending order with regard to their SU values, a backward elimination method will be performed.However, in our method, SU will not be the criterion for elimination, because it can't guarantee that the interacting features are ranked highly.Instead, a novel consistency measure with the consideration of feature interaction is used to remove both irrelevant and redundant features.
In a dataset , there is inconsistency if at least two instances have the same value for a feature space , but they are categorized into different classes.Those inconsistent instances will be grouped together into a subset , named as inconsistent instance set.Every has its inconsistency count Here, m represents the number of instances in D.
If i F is irrelevant or redundant, removing it from F will have no effect on the number of inconsistent instantces.Consequently, there will be .
F is a relevant feature, no matter whether it has feature interaction or not, its removal will bring more inconsistent instances, { } ( ) ( ) . This unique character, named as consistency contribution, makes it a good metric for feature selection by taking the feature interaction into consideration.
In our method, a feature i F 's consistency contribution is formalized as: From Eq.3, we can conclude that: higher consistency contribution value means higher relevancy; and the zero value indicates that it is either irrelevant or redundant feature.Using certain threshold value δ those weak relevant feature [16] can also be removed from if PC ( ) i CC F   .It will help to restrict the search space.
Based on the consistency-contribution, the elimination operation will start from the end of the ranked feature list.If the consistency-contribution of a feature is less than a predefined threshold δ (0 < δ < 1), the feature is considered immaterial and can be removed; otherwise it is selected.The process is repeated until all features in the list are checked.A large δ is associated with a high probability of removing relevant features.δ is assigned 0.0001 in this paper if not otherwise mentioned.

DAG Search
After an undirected network is constructed as the prototype of Ф, a local hill climbing algorithm will be used to search for the best DAG that optimizes the scoring function in the restricted space of DAGs.
To make the greedy hill climbing algorithm more efficient, our method takes into account several aspects, like preprocessing, starting solution, how to acquire neighbors and score metrics for candidate neighbors.
Before performing the hill-climbing algorithm, the search space of DAGs is initially restricted by using "OR" or "AND" strategy [19], because of the likely asymmetric dependencies in the finite sample scenario (i.e., the edge a-b may be found on a but not on b).In "OR" strategy, an arc is excluded only if it was not found in either direction, while "AND" strategy requires edges to be found in both directions.
In addition, our DAG search will start with a random start DAG network rather than an empty one (i.e., the one with no edges).Based on the restricted search space for DAGs, a random number of edges will be chosen randomly.The edge will be added to the original DAG only if it is discovered in the process of constructing the undirected network.
To obtain the neighbor at each search step, the legal operators used are arc addition, arc deletion, and arc reversal.Of course, except in arc deletion, we have to make sure that there is no directed cycle introduced in the graph.For fast evaluation of candidate sub-graphs (neighbors), the Bayesian scoring criterion is used in our work.
To overcome the myopia of hill climbing algorithm, the 2-step look ahead in good directions (LAGD) hill climbing is used.Further more, 5 best evaluation operations will be considered at each search step.Following is the procedure of our DAG searching: 1) Restrict search space of DAG with "OR" or "AND" strategy; 2 To evaluate our method's effectiveness for medical diagnosis, six disease datasets from the UCI machine learning repository [29] is used in experiments.In this section, we will discuss the experiments in detail.

Datasets Description
Among the selected datasets, datasets sick and hypothyroid are chosen for the diagnosis of thyroid diseases.Dataset spectf_test is selected for diagnosis of heart disease.In addition, the datasets for mammographic, diabetes and horse's colic are also used in experiments.
Table 1 tabulates the detailed information of the experimental datasets.They are further analyzed, respectively, to verify the effectiveness of the proposed CCBN algorithm on medical diagnosis tasks.

Implementation Results
For comparison, three representative classification algorithms are chosen.They are NaiveBayes [1], C4.5 [4], and IB1 [30].All are available in the WEKA environment [31].Our CC-BN method is also implemented in the WEKA's framework.Moreover, Bayesian Networks with different local search algorithms [32], like hillclimbing, LAGD hill-climbing and simulated annealing are also chosen for comparison.They are relatively represented by HC-BN, LAGD-BN and SA-BN.To guarantee the impartiality of experiments, for HC-BN, LAGD-BN and SA-BN, the maximum allowed size for the candidate parents' set is the same as the size of features in each dataset.For the evaluation of performance, error rate and ROC [33] were used in experiments with10-fold cross validation.
Table 2 reflects the error rate generated by each method based on 10-fold CV.On average, CCBN (AND) has the best performance with error rate 12.2% and CCBN (OR) is the second best with 12.8%.We can see that both CCBN methods have distinctly better performance than others, and the difference between two CCBN methods is slight.Among the six datasets, at least one of CCBN methods has the lowest error rate in four of the datasets.For the other two datasets, CCBN methods still have the second lowest error rate.It proves that CCBN has a better consistent performance than the other methods used in the experiments.
Impressively, on datasets sick and hypothyroid, which have a large set of features and instances, all Bayesian Network methods have very low error rates, which are less than 60% of error rate of IB1 and not more than 25% of that of Naïve Bayes or C4.5.This shows the capability of Bayesian Network in dealing with high dimension data.But on other datasets, unlike CCBN, HC-BN, LAGD-BN and SA-BN have no great advantages over other conventional methods.This proves that our proposed, novel, method brings great enhancement to Bayesian Networks for coping with various problems.
Among BN methods, SA-BN has apparent advantage over HC-BN and LAGD-BN.However, simulated annealing search costs much more time than hill climbing methods.For example, on colic dataset, it takes less than 10 seconds for HC-BN and LAGD-BN to finish 10-fold CV testing, while it costs 2250 seconds for SA-BN.It proves that hill climbing algorithm has a good trade off between accuracy and efficiency.Further when compared with CCBN, SA-BN only wins on colic.It may be because CCBN gets benefit from our proposed method that brings improvement to HC and LAGD methods.
To study the impact of threshold δ on CCBN's performance, more experiments were performed with various thresholds δ.For each dataset, the error rate of LAGD-BN in Table 2 is used as reference.In Figure 1, the ratio of error rate between CCBN(AND) and LAGD-  Copyright © 2010 SciRes.
BN is shown to reflect the effect of threshold δ. meaning of the threshold, for example, from a statistical aspect, is another interesting topic for future work.It is likely that this turning point may not be independent of the dataset (problem) to be analyzed.If it is true, we hope to propose some feasible approaches that could tune the threshold automatically for a specific problem.
Figure 1 shows that on hypothyroid, colic, diabetes and mammographic, CCBN is better than that of LAGD-BN, even for δ = 0.Such an improvement should owe to random start which has depressed local suboptima.Also, with δ = 0.00001 or δ = 0.0001, CCBN shows much more improvement over LAGD-BN for all datasets except sick.This supports our claim that our method is much more efficient for feature selection due to the consideration of feature interaction.However, it should be noted that CCBN's advantage over LAGD-BN is not consistent.On hypothyroid, CCBN's performance is worse than LAGD-BN's when δ = 0.01.On sick, CCBN has no win.We will discuss about this more in next section.

APPLICATION ON HYPERTENSION PROBLEM
In this section, a real-case from medical diagnosis is presented.We will show that our proposed method is capable of finding the high risk factors for hypertension diagnosis and prediction.Also, classification result on hypertension is evaluated against other algorithms.

Problem Description
Also, when studying the impact of δ on CCBN's performance, we found that for δ with range from 0 to 0.001, the change of error rate for each dataset is less than 10%.This means that our method is insensitive to the threshold value δ.This improvement may benefit from random start.
As one of the most common disease in the world, hypertension is considered to be present when a person's systolic/diastolic blood pressure is consistently 140/ 90 mmHg or greater [34].It is affecting estimated 600 million people worldwide [35], and that number is expected to increase to 1.56 billion by 2025 [36].Therefore, it is meaningful to develop a computer-aid system for its diagnosis and prediction.
Although classification accuracy is our focus, ROC is also introduced in our experiment's evaluation.According to Table 3, CCBN method has highest average area under ROC (AUR).Although its AUR is not highest on colic, spectf-test and diabetes, the difference is slight.Also, it is apparent that BN methods have higher AUR than other traditional methods.
In our experiments, the data from the hypertension study were specifically collected by Hubei Provincial Center for Disease Control and Prevention in China.The investigation was conducted in compliance with the IRB

DISCUSSION
By comparing CCBN with other classification algorithms, experiments proved that CCBN can achieve better performance, but there are still some issues that should be addressed for improvement.
First, the error rate of CCBN (AND) on four of six datasets, is better than that of the other methods used in our experiments.On colic and mammogrpahic, however, this does not hold, although CCBN (AND) is still the second best.This suggests that there may be some properties Second, although on most dataset CCBN proves to be insensitive to threshold value, threshold value still shows big impact on spectf-test.Also, the best error rate for different datasets happens on different thresholds.It is obvious that the ideal choice of a threshold is the turning point that provides the lowest error rate.Exploring the   4).

Classification Results
In order to assess the effectiveness of the proposed approach, its performance was evaluated against that of some off-the-shelf algorithms.The performance of the different algorithms is listed in Besides error rate, CCBN methods also have the best specificity, PPV and AUR.On sensitivity, although Naive Bayes has a slightly better performance than CCBNs, the difference is not significant.Therefore, we can conclude that CCBN has better performance on hypertension classification than other methods.

Risk Factor Analysis
For the prediction of hypertension, CCBN tried to find those likely high risk factors.These high risk factors enabled the classifiers to achieve their best performance and also would give us important guide to hypertension prevention and treatment.
In experiment, CCBN marked age, BMI and blood pressure as the high risk factors in hypertension development.This result is consistent with many clinical studies.
As we know, lifestyle plays an important role in the development and treatment of hypertension.For the dietary intake, fish, meat, soybean products, eggs and diary are selected by CCBN, which is coherent with current knowledge about hypertension.According to exercise, the original data have three attributes: frequency of exercise in a week, taking exercise every day or not, number of days in a week that have more then 30 minutes spent on exercise.In CCBN, the first two attributes are ignored, because the third attribute has more or at least not less information about the impact of exercise on hypertension than other two attributes.
Among the attributes about social background, nationality and occupation are removed from BN. Remarkably education background and marital status are also selected as key factors in prediction of hypertension.CCBN shows that higher education will lead to less risk of hypertension, and the single is more susceptible to hypertension than the married.It is reasonable because people with higher education are more aware of the jeopardy of hypertension and have more knowledge about the prevention of hypertension, which in turn leads to good lifestyle.Similarly for marital status, the married is more likely to lead regular lifestyle and have more healthy diet than the single.

CONCLUSIONS
Bayesian network is a powerful learning technique to tackle the uncertainty in medical problems, but learning Copyright © 2010 SciRes.JBiSE its structure is a great challenge, which may hinder its wide application in medicine.In this paper, the CCBN algorithm is proposed for medical diagnosis.Our method firstly constructs the skeleton of BN with consideration of feature interaction, and then performs the efficient search for an optimal BN network.Case studies on disease datasets and application on hypertension predicttion show that the proposed method is a promising tool for computer-aided medical decision system, because it has better classification accuracy and consistency than other traditional methods.

Figure 1 .
Figure 1.The ratio of error rate between CCBN (AND) and LAGD for different thresholds on six UCI datasets.
Form an original DAG Mc with random starting and get the Bayesian scoring metric for Mc; 3) Obtain all neighbors of current candidate DAG Mc by getting all possible directional orientations for the edges in Mc and compute the Bayesian scoring metric Greedy operation: if new candidate Mnew has Bayesian scoring metric higher than that of current Mc, replace Mc with new candidate Mnew, and then go back to step 3) for further improvement; otherwise, it means that there is no better solution available and the search ends up with the final graph structure Mc; 8) Deliver the out-coming graph structure Mc.

Table 1 .
Summary of experimental disease datasets from UCI.

Table 2 .
Error rates of the classifiers on experimental disease datasets.

Table 3 .
AUR of the classifiers on experimental disease datasets.

Table 5
, which contains Error rate, Sensitivity (SN), Specificity (SP), Positive predictive value (PPV), and Negative predictive value (NPV).The result shows that CCBN with 'AND' strategy has the same performance as that with "OR" strategy for hypertension classification.Both CCBNs have the lowest error rate on the hypertension classification.

Table 4 .
Summaries of mean and standard deviations of some important features in hypertension dataset.

Table 5 .
Performance of classifiers for the hypertension dataset based on 10-fold CV.