Helicobacter pylori microbe and detecting with data mining algorithms

Physicians currently hold that the only definitive method for diagnosing the presence of the Helicobacter pylori microbe is endoscopy; however, the procedure is painful and hard to tolerate for young children. In this paper we therefore used data mining algorithms to diagnose the presence of this microbe, and we succeeded in predicting the presence of the bacterium in the stomach. This guides physicians to perform endoscopy only in cases where the likelihood of finding the bacterium is high.


INTRODUCTION
In recent years, the Helicobacter pylori microbe has gained a lot of attention and has become prevalent among children. It is a curved bacterium that usually lives in the stomach, and many people around the world are infected with it. This bacterium not only leads to disorders of the digestive system; if left untreated, it may also cause diseases such as gastric cancer or peptic ulcer.
Helicobacter pylori is a prevalent human pathogen. Scientists believe that it is more prevalent in unhygienic and crowded places, and they suspect that it is contagious and can be transmitted from one person to another [1].
There are different methods for diagnosing the presence of this bacterium, but in general they can be classified into two groups: invasive (aggressive) and non-invasive (non-aggressive) methods. Aggressive methods include endoscopy; non-aggressive methods include blood sampling, respiratory (breath) tests and urine analysis.
H. pylori is a Gram-negative, S-shaped, microaerophilic, spiral bacterium about 3.5 microns long and about 0.5 microns wide. The outer surface of the bacterium is smooth and bears many flagella.
In this paper we discuss the type of disease, methods of treatment, side effects, and the methods suggested for detecting this bacterium in the stomach.

History of Helicobacter pylori
Helicobacter pylori is also known as H. pylori. It grows in the stomach, and about half of the people in the world are infected with it. Only some of them, however, suffer the effects of H. pylori infection; most never feel any illness despite carrying the microbe. On the other hand, H. pylori infection can lead to peptic ulcer disease or gastric cancer. It is still an open question why the effects of this bacterium appear in some people while it causes no illness in others.
The prevalence of infection varies between countries and between populations within a country, and it is strongly related to the economic and social status of the people living there. For example, in developed countries H. pylori infection is uncommon and is not prevalent in children: only 5% of children under 5 years old are infected. In less developed countries, by contrast, Helicobacter pylori mostly appears in young children; a high percentage of children under 10 years old are infected, and it is not prevalent in adults.
To diagnose the presence of these microbes definitively, physicians must perform endoscopy and take samples from the stomach. The patient must then start drug therapy and afterwards undergo a urea breath test (UBT) to confirm that the infection has been eradicated [2].

Helicobacter pylori in Children
H. pylori usually does not cause any illness during infancy; however, if left untreated it can lead to digestive diseases such as gastritis (pain and inflammation of the stomach) or peptic ulcer (an ulcer in the stomach or in the upper part of the small intestine, called the duodenum). Moreover, Helicobacter pylori may produce no symptoms in children, which makes diagnosing the bacterium harder [3].

Contagion
Scientists suspect that H. pylori infection is contagious because it is more prevalent in families that live in unhygienic and crowded places. Research also shows that the infection can pass from one person to another; however, it is still not clear how it is transmitted. Because of the passive and mysterious nature of the H. pylori bacterium, there is no vaccine or guideline for preventing the spread of infection.

Diagnosis
There are a variety of methods for diagnosing H. pylori, but in general they can be classified into two groups: aggressive and non-aggressive methods.
Aggressive methods include endoscopy; non-aggressive methods include blood tests, respiratory tests and urine analysis. Selecting a suitable method depends on the clinical status of the patient.

Aggressive Methods
In this method physicians need to look directly at the gastrointestinal tract, so the procedure requires a sedative and the insertion of an endoscope (a small, flexible tube with a small camera at the end) through the throat into the stomach and duodenum.
During the procedure, physicians take samples for the laboratory to examine for microscopic signs of infection and the presence of H. pylori.

Non-Aggressive Methods
In general, non-aggressive methods can be classified into three groups [5]: blood sample, respiratory tests, and urine analysis.

Blood Sample
This method helps to identify the presence of H. pylori antibodies. Taking a blood sample is easy, but a positive result only indicates that H. pylori was present in the past and cannot show whether the patient has an active infection at present.

Respiratory Tests
In this test the patient drinks a solution that helps physicians identify carbon that has been broken down by the H. pylori bacterium. Respiratory tests are useful for indicating the presence of H. pylori infection, but they cannot provide information about the extent of infection; moreover, performing this test in children is not easy.

Urine Analysis
In urine analysis we can identify the presence of H. pylori protein in the urine. Like the respiratory test, urine analysis can only show the presence of the H. pylori bacterium; it cannot help us determine the extent of infection.
Children may, however, suffer gastric pain for a variety of reasons, such as dyspepsia, viral infection, depression and anxiety, or appendicitis, and most gastric pain is not related to H. pylori. It is nevertheless crucial to diagnose this disease correctly and quickly.

MOTIVATIONS AND RESEARCHES
It is now clear that endoscopy is the only method that can definitively diagnose H. pylori infection, but performing endoscopy in children is painful and hard to tolerate. In this paper we tried to diagnose Helicobacter pylori infection with data mining algorithms. Specifically, we tried to predict the probability of disease and the presence of Helicobacter pylori infection before endoscopy, so that endoscopy can be reserved for cases where the likelihood of finding the bacterium is high.
The data for this paper were collected as follows. First, we gathered and consolidated real medical data collected from patients' blood tests at Namazi Hospital in Shiraz. Then we converted the data to a format readable by data mining algorithms, and finally we analyzed the converted data with those algorithms. In this paper we mention only the algorithms that performed better than the others.

METHOD OF EVALUATING MODELS
To evaluate the models described in this paper, we used cross-validation.

Cross Validation
In k-fold cross-validation (written here with $n$ folds), the initial data are randomly partitioned into $n$ subsets $X_1, \dots, X_n$. Training and testing are performed $n$ times. In iteration $i$, partition $X_i$ is reserved as the test set, and the remaining partitions are collectively used to train the model. For example, in the first iteration, subsets $X_2, \dots, X_n$ collectively serve as the training set in order to obtain a first model, which is tested on $X_1$; the second iteration is trained on subsets $X_1, X_3, \dots, X_n$ and tested on $X_2$; and so on. Each sample is thus used the same number of times for training and once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the $n$ iterations, divided by the total number of tuples in the initial data.
We generally use the 10-fold variant of this method on a given dataset.
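The procedure described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the tool used in this study; the function names (`k_fold_indices`, `cross_val_accuracy`, `fit_majority`) and the placeholder majority-class model are assumptions introduced here.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition the sample indices into k folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_majority(train_x, train_y):
    """Placeholder model (an assumption): always predict the most frequent training label."""
    majority = max(sorted(set(train_y)), key=train_y.count)
    return lambda x: majority


def cross_val_accuracy(xs, ys, fit, k=10, seed=0):
    """Correct predictions summed over all k test folds, divided by the number of samples."""
    folds = k_fold_indices(len(ys), k, seed)
    correct = 0
    for i, test_idx in enumerate(folds):
        # Every fold except fold i is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        predict = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        correct += sum(predict(xs[j]) == ys[j] for j in test_idx)
    return correct / len(ys)


# Toy data (hypothetical): 15 positive and 5 negative samples.
toy_x = list(range(20))
toy_y = [1] * 15 + [0] * 5
toy_accuracy = cross_val_accuracy(toy_x, toy_y, fit_majority, k=10)
```

Any real classifier can be substituted for `fit_majority`, provided it takes the training data and returns a prediction function.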

THE PROPOSED METHOD
Below we describe some of the algorithms that performed better than the others.

RBF Network Algorithm
An RBF (radial basis function) network is an artificial neural network that uses radial basis functions as activation functions. RBF networks have three layers: an input layer, a hidden layer and an output layer. One neuron in the input layer corresponds to each predictor variable; for categorical variables, n - 1 neurons are used, where n is the number of categories. The hidden layer has a variable number of neurons, each consisting of a radial basis function centered on a point with the same dimensions as the predictor variables. The output layer forms the network outputs as a weighted sum of the outputs of the hidden layer. This algorithm uses the k-means clustering algorithm to provide the basis functions and learns either a logistic regression (discrete class problems) or a linear regression (numeric class problems) on top of that. Symmetric multivariate Gaussians are fit to the data from each cluster [8].
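The forward pass of such a network for a two-class problem might look as follows. This is only a sketch: the centers, the shared width `sigma`, the weights and the bias are hypothetical values set by hand, whereas in the algorithm described above the centers would come from k-means clustering and the output weights from logistic regression.

```python
import math


def gaussian_rbf(x, center, sigma):
    """Symmetric Gaussian basis function centered on `center`."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def rbf_network_output(x, centers, sigma, weights, bias):
    """Hidden layer of Gaussian units, then a weighted sum passed through a sigmoid."""
    hidden = [gaussian_rbf(x, c, sigma) for c in centers]
    z = bias + sum(w * h for w, h in zip(weights, hidden))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters: two hidden units in a two-dimensional input space.
centers = [[0.0, 0.0], [3.0, 3.0]]
weights = [2.0, -2.0]
p_near_first_center = rbf_network_output([0.0, 0.0], centers, 1.0, weights, 0.0)
p_near_second_center = rbf_network_output([3.0, 3.0], centers, 1.0, weights, 0.0)
```

With these hand-picked weights, inputs close to the first center receive a probability above 0.5 and inputs close to the second center receive one below 0.5.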
After running the RBF network algorithm, we found that it predicts only 63% of the data correctly, which indicates that the RBF network does not perform acceptably in detecting the presence of the H. pylori bacterium.

Naive Bayes Algorithm
Bayesian classifiers are statistical classifiers. The naïve Bayes classifier relies on the fact that probabilities may be multiplied when the events are independent. The naïve Bayes algorithm performs well in text classification and medical diagnosis, and its performance is comparable with that of neural networks and decision trees.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers [9].
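The independence assumption means that a class score is simply the class prior multiplied by one conditional probability per attribute. A minimal sketch for categorical attributes is given below; the add-one (Laplace) smoothing, the function names and the toy records are assumptions made for illustration and are not taken from the study.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict_naive_bayes(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical records with two yes/no attributes each.
toy_rows = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "yes")]
toy_labels = ["positive", "positive", "negative", "negative"]
toy_model = train_naive_bayes(toy_rows, toy_labels)
```

Working with sums of logarithms rather than products of probabilities is a common way to avoid numerical underflow when there are many attributes.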
This algorithm needs prior knowledge of a number of probabilities; most of the time, however, this knowledge is unavailable and we have no alternative but to estimate it. In practice we can draw on background information, past data, or assumptions about the underlying probability distributions.
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)} \qquad (1)$$
We explain the above formula with an example. In diagnosing a disease there are two hypotheses: (i) the patient has cancer; (ii) the patient is healthy.
Laboratory data indicate that 0.008 of the population has this disease. Because laboratory tests can be inaccurate, we characterize the test as follows: (i) in 98% of cases in which the person is definitely sick, the result is correctly positive; (ii) in 97% of cases in which the person is definitely healthy, the result is correctly negative. For a positive test result, denoted $+$, the numerator of Formula (1) for the hypothesis that the patient has cancer is
$$P(+ \mid \text{cancer})\, P(\text{cancer}) = 0.98 \times 0.008 = 0.0078,$$
and for the hypothesis that the patient is healthy it is
$$P(+ \mid \neg\text{cancer})\, P(\neg\text{cancer}) = 0.03 \times 0.992 = 0.0298.$$
On the other hand, this algorithm has a weakness: it does not differentiate among the attributes of an instance, treating all of them in the same manner and considering them unrelated.
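The arithmetic of the cancer-test example can be checked with a short script. The sketch below also divides by P(D), here the total probability of a positive result, to obtain the normalized posterior; that last step is not spelled out in the text, and the function and variable names are assumptions.

```python
def bayes_positive_test(prior, sensitivity, specificity):
    """Apply Bayes' rule to a positive test result.

    Returns the two numerators P(+|h) P(h) and the normalized posterior
    probability of disease given a positive result.
    """
    joint_sick = sensitivity * prior                      # P(+|cancer) P(cancer)
    joint_healthy = (1.0 - specificity) * (1.0 - prior)   # P(+|~cancer) P(~cancer)
    evidence = joint_sick + joint_healthy                 # P(+)
    return joint_sick, joint_healthy, joint_sick / evidence


joint_sick, joint_healthy, posterior = bayes_positive_test(
    prior=0.008, sensitivity=0.98, specificity=0.97
)
print(round(joint_sick, 4), round(joint_healthy, 4), round(posterior, 2))
```

Because the numerator for the healthy hypothesis is the larger of the two, the healthy hypothesis remains the more probable one even after a positive result; the normalized probability of cancer is about 0.21.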
After running the naïve Bayes algorithm, we found that it predicts about 70% of the data correctly in detecting the presence of the H. pylori bacterium.

Part Algorithm
PART is a class for generating a decision list. The algorithm is used to discover knowledge and patterns and to generate rules [10] (Figure 1).
We ran the PART algorithm on a dataset with different attributes and generated rules whose importance experts (physicians) can readily appreciate.
After running the PART algorithm on the data, we found that it predicts about 72.20% of the data correctly in detecting the presence of the H. pylori bacterium.
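A decision list is an ordered set of rules in which the first rule whose conditions all hold determines the class, with a default class when no rule fires. The sketch below shows only how such a list is applied, not how PART learns it. The rules, attribute names and class labels are hypothetical and purely illustrative; they are not the rules extracted in this study.

```python
def classify_with_decision_list(rules, default, record):
    """Return the label of the first rule whose conditions all match the record."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default


# Hypothetical rules for illustration only (not results of the study).
example_rules = [
    ({"rapid_urease_test": "positive"}, "infected"),
    ({"abdominal_pain": "yes", "nausea": "yes"}, "infected"),
]

example_record = {"rapid_urease_test": "negative", "abdominal_pain": "yes", "nausea": "yes"}
example_label = classify_with_decision_list(example_rules, "not infected", example_record)
```

Because the rules are checked in order, their ordering matters: an earlier, more specific rule takes precedence over later ones.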

Decision Tree Algorithm
A decision tree is a flowchart-like tree structure in which each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is the root node. In artificial intelligence, trees are used to represent various concepts such as sentence structure, equations, etc. [11].
Decision tree induction is a well-known inductive learning algorithm that has been tested successfully in different applications. Decision trees are useful for problems whose answer is a category or class name. For example, we can build a decision tree that answers the question: is the patient susceptible to the disease? In general, the decision tree algorithm is useful for problems in which the output value can be given as a YES or NO answer.
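The structure just described (internal nodes that test an attribute, branches for the outcomes, leaves that hold a class label) can be represented directly as nested dictionaries. The tree below is a made-up example, assuming hypothetical attribute names; it is not the tree learned from the study data, and the sketch shows only classification, not tree induction.

```python
def classify_with_tree(tree, record):
    """Walk from the root to a leaf, following the branch that matches each test."""
    node = tree
    while isinstance(node, dict):          # internal node: test an attribute
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]   # follow the matching branch
    return node                            # leaf: the class label


# Hypothetical tree for illustration only.
example_tree = {
    "attribute": "rapid_urease_test",
    "branches": {
        "positive": "YES",
        "negative": {
            "attribute": "abdominal_pain",
            "branches": {"yes": "YES", "no": "NO"},
        },
    },
}

answer = classify_with_tree(
    example_tree, {"rapid_urease_test": "negative", "abdominal_pain": "no"}
)
```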

After running the decision tree algorithm on the data, we found that it predicts about 78.50% of the data correctly in detecting the presence of the H. pylori bacterium.

Logistic Regression
Researchers working on problems that involve several factors pursue a specific goal in order to obtain better results. In statistics we do the same with different regression methods to obtain the desired result.
In regression with several independent variables, the aim is to estimate the response variable. Logistic regression is applicable when the response variable has two or more states. This type of regression is useful in medical and sociological research [14].
Logistic regression is a mathematical model used to describe the relationship between several X variables and a dependent (Y) variable with two or more states. A two-state (binary) variable has just two possible values, such as dead or alive, present or absent, related or unrelated, etc. Such a variable is usually coded with zero and one: code one indicates the positive state (success) and code zero the negative state (failure).
This paper used logistic regression to find the relationship between the response variable (Y) and a set of predictor variables $X_1, X_2, X_3, \dots, X_n$. After applying the logistic regression algorithm, we found that it predicts about 83% of the data correctly in detecting the presence of the H. pylori bacterium, the best performance among the algorithms considered.
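For a binary response coded 0/1, the model passes a weighted sum of the predictors through the logistic (sigmoid) function to obtain a probability. The sketch below fits the weights with plain batch gradient descent on toy data; the fitting method, learning rate, number of epochs and data are all assumptions chosen for illustration and do not reflect the software or settings used in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """P(Y = 1 | x) under the logistic model."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss (an illustrative choice)."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = predict_proba(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(xs) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(xs)
    return weights, bias


# Toy data (hypothetical): one predictor, response coded 0 (failure) or 1 (success).
toy_xs = [[0.0], [1.0], [2.0], [3.0]]
toy_ys = [0, 0, 1, 1]
fitted_weights, fitted_bias = fit_logistic(toy_xs, toy_ys)
```

A predicted probability above 0.5 is then read as the positive state (code one) and a probability below 0.5 as the negative state (code zero).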

PRACTICAL RESULTS
Over 6 months we distributed a questionnaire among patients who needed to undergo endoscopy to diagnose H. pylori infection. The questionnaire contains 22 questions based on the following parameters: sex (male or female), abdominal pain, nocturnal awakening, nausea, vomiting, halitosis, heartburn, bloating, belching, GI bleeding, constipation, diarrhea, weight loss, fatigue, epigastric tenderness, weight, height, duration of symptoms, previous treatment, previous endoscopy, previous family history of acid peptic disease, and rapid urease test before therapy.
After collecting the data and running the algorithms, we used cross-validation to evaluate the algorithms and then compared them with each other; the results are shown in Table 1 and Figure 2. Logistic regression showed the best performance in detecting H. pylori bacteria among the algorithms.

CONCLUSIONS
Helicobacter pylori infection has recently become very prevalent among children under 10 years old, and the only definitive method for diagnosing it is endoscopy, which is painful and hard for children to tolerate.
In this paper we tried to eliminate the unnecessary use of endoscopy by offering a non-aggressive alternative for children, and we succeeded in predicting the presence of this bacterium with data mining algorithms with about 83% accuracy.
With these algorithms we are able to predict the presence of the Helicobacter pylori bacterium. Using data mining algorithms to identify H. pylori helps us make better decisions in confronting this bacterium.
Now, if we face a new patient whose laboratory result is positive, is the patient susceptible to cancer?

Figure 1. Rule extraction with the PART algorithm.
Table 1. The percentage of correctly and incorrectly classified instances.