Random forest classifier combined with feature selection for breast cancer diagnosis and prognostic

As the incidence of this disease has increased significantly in the recent years, expert systems and machine learning techniques to this problem have also taken a great attention from many scholars. This study aims at diagnosing and prognosticating breast cancer with a machine learning method based on random forest classifier and feature selection technique. By weighting, keeping useful features and removing redundant features in datasets, the method was obtained to solve diagnosis problems via classifying Wisconsin Breast Cancer Diagnosis Dataset and to solve prognosis problem via classifying Wisconsin Breast Cancer Prognostic Dataset. On these datasets we obtained classification accuracy of 100% in the best case and of around 99.8% on average. This is very promising compared to the previously reported results. This result is for Wisconsin Breast Cancer Dataset but it states that this method can be used confidently for other breast cancer diagnosis problems, too.


INTRODUCTION
The high incidence of breast cancer in women has increased significantly in the recent years.It is the cause of the most common cancer death in women (exceeded only by lung cancer) [1].As reported by WHO, [2] there are about 1.38 million new cases and 458000 deaths from breast cancer each year.Breast cancer is by far the most common cancer both in the developed and developing countries.Breast cancer survival rates vary greatly worldwide, ranging from 80% or over in North America, Sweden and Japan to around 60% in middle-income countries and below 40% in low-income countries.
The use of expert systems and machine learning techniques in medical diagnosis is increasing gradually.There is no doubt that evaluation of data taken from patient and decisions of experts are the most important factors in diagnosis.But, expert systems and different artificial intelligence systems for diagnoses also help experts in a great deal.With the help of automatic diagnostic systems, the possible errors medical experts made in the course of diagnosis can be avoided, and the medical data can be examined in shorter time and more detailed as well.
This study aims to build a computer-aided diagnostic system to distinguish benign breast tumor from malignnant one.This method involves two stages in which a backward elimination approach of feature selection and a learning algorithm random forest are hybridized.The first stage of the whole system conducts a data reduction process for learning algorithm random forest of the second stage.This provides less training data for random forest and so prediction time of the algorithm can be reduced in a great deal.With a selected feature set, the explanation of rationale for the system can be more readily realized.
Our proposed method averagely obtained 99.82% and 99.70% classification accuracy in test phase and these results are the highest one among the studies applied for Wisconsin Breast Cancer Diagnosis Dataset (WBC-DD) and Wisconsin Breast Cancer Prognostic Dataset (WBCPD) classification problem so far.It also indicates that the proposed method can be applied confidently to other breast cancer problems with different data sets especially with ones that have a higher number of training data.
The used data source is Wisconsin Diagnosis Breast Cancer Dataset taken from the University of California at Irvine (UCI) Machine Learning Repository [3].This data set is commonly used among researchers who use expert systems and machine learning methods for breast cancer diagnosis and so it provides us to compare the performance of our system with other conducted studies related with this problem.
The rest of the paper is organized as follows.Section 2 summarizes the methods and results of previous research on breast cancer diagnosis.Section 3 reviews theoretical background.Section 4 describes the proposed method.Section 5 presents experimental result from using the proposed method to diagnose and prognosis breast cancer.Finally, Section 6 concludes the paper along with outlining future directions.
"Word 97-2003 & 6.0/95-RTF" for the PC, provides authors with most of the formatting specifications needed for preparing electronic versions of their papers.All standard paper components have been specified for three reasons: 1) ease of use when formatting individual papers; 2) automatic compliance to electronic requirements that facilitate the concurrent or later production of electronic products; and 3) conformity of style throughout a journal paper.Margins, column widths, line spacing, and type styles are built-in; examples of the type styles are provided throughout this document and are identified in italic type, within parentheses, following the example.Some components, such as multi-leveled equations, graphics, and tables are not prescribed, although the various table text styles are provided.The formatter will need to create these components, incorporating the applicable criteria that follow.

RELATED WORK
There has been a lot of research on medical diagnosis of breast cancer Hui-Ling Chen et al. [4] used a rough set (RS) based supporting vector machine classifier (RS_ SVM) and the reported accuracy was 96.87% in average.Based on neuro-fuzzy rules Ali Keles et al. [5] presented a decision support system for diagnosis breast cancer.As be reported, the system has high positive predictive rate (96%) and specificity (97%).In A. Marcano et al. [6] proposed a method named AMMLP based on the biological meta-plasticity property of neurons and Shannon's information theory.As reported by authors, the AMMLP obtained total classification accuracy of 99.26%.In Murat Karabatak and M. Cevdet Ince [7] an automatic diagnosis system for detecting breast cancer based on association rules (AR) and neural network (NN) was proposed.The reported correct classification rate of proposed system was at 95.6%.In Kemal Polat and Salih Güne [8] breast cancer diagnosis was conducted using least square support vector machine (LS-SVM) classifier algorithm that obtained classification accuracy was at 98.53%.In Ubeyli et al. [9] multilayer perceptron neural network, four different methods, combined neural network, probabilistic neural network, recurrent neural net-work and SVM were used, respectively, highest classification accuracy of 97.36% was obtained by SVM.In Sahan et al. [10] a new hybrid method based on fuzzyartificial immune system and knn algorithm was used and the obtained accuracy was 99.14%.Abonyi and Szeifert et al. [11] applied supervised fuzzy clustering (SFC) technique and obtained 95.57% accuracy.Quinlan [12] reached 94.74% classification accuracy using 10fold cross validation with C4.5 decision tree method.However, it should be noted that all above researches were tested on Wisconsin Breast Cancer Dataset, this dataset only contains 699 samples with 9 attributes and it is totally different with WBCDD and WBCPD that sometimes makes researcher confuse.
Relating to medical diagnosis of breast cancer with both WBCDD and WBCPD, in M. M. R. Krishnan, S. Banerjee, et al. (2010) [13] a support vector machine based classifier for breast cancer detection was used and the reported accuracy was 93.726 % on WBCDD.In R. Stoean and C. Stoean (2013) [14].Support vector machines and evolutionary algorithms was hybridized, the method obtained correct classification of 97% for diagnostic and 79% for prognostic.In T. T. Mu and A. K. Nandi (2007) [15] a combination of support vector machines, radial basis function networks and self-organizing maps was used, the method achieved classification accuracy of 98% on WBCDD.In Z. W. Zhang, Y. Shi and G. X. Gao (2009) [16] used rough set-based multiple criteria linear programming approach for breast cancer diagnosis and the reported accuracies are 89% and 65% on WBCDD and WBCPD, respectively.In D.-C.Li, C.-W. Liu, et al. (2011) [17] a three-stage algorithm was proposed.Firstly a fuzzy-based non-linear transformation method to extend classification related information from the original data attribute values for a small data set.Secondly, based on the new transformed data set, applies principal component analysis (PCA) to extract the optimal subset of features.Finally, authors used the transformed data with these optimal features as the input data for a learning tool, a support vector machine.The highest reported accuracy of the method was 96.35% on WBCDD.In S. N. Ghazavi and T. W. Liao (2008) [18] three fuzzy modeling methods including the fuzzy k-nearest neighbor algorithm, a fuzzy clustering-based modeling, and the adaptive network-based fuzzy inference system were used and the reported accuracy was 97.17% on WBCDD.

Random Forest
The random forest (RF) algorithms form a family of classification methods that rely on the combination of several decision trees (Figure 1).The particularity of such Ensembles of Classifi rs (EoC) is that their tree-e based components are grown from a certain amount of randomness.Based on this idea, RF is defined as a generic principle of randomized ensembles of decision trees [19].The basic unit of RF (the so-called base learner) is a binary tree constructed using recursive partitioning (RPART).
The RF tree base learner is typically grown using the methodology of CART [20], a method in which binary splits recursively partition the tree into homogeneous or near homogeneous terminal nodes (the ends of the tree).A good binary split pushes data from a parent tree-node to its two daughter nodes so that the ensuing homogeneity in the daughter nodes is improved from the parent node.RF is often a collection of hundreds to thousands of trees, where each tree is grown using a bootstrap sample of the original data.RF trees differ from CART as they are grown non-deterministically using a two-stage randomization procedure.In addition to the randomization introduced by growing the tree using a bootstrap sample of the original data, a second layer of randomization is introduced at the node level when growing the tree.Rather than splitting a tree node using all variables, RF selects at each node of each tree, a random subset of variables, and only those variables are used as candidates to find the best split for the node.The purpose of this two-step randomization is to de-correlate trees so that the forest ensemble will have low variance, a bagging phenomenon.RF trees are typically grown deeply.In fact, Breiman's original proposal called for splitting to purity.
Although it has been shown that large sample consistency requires terminal nodes with large sample sizes [21], empirically, it has been observed that purity or near purity is often more effective when the feature space is large or the sample size is small.This is because in such settings, deep trees grown without pruning generally yield lower bias.Thus, Breiman's approach is generally favored in genomic analyses.In such cases, deep trees promote low bias, while aggregation reduces variance.The construction of RF is described in the following main steps: 1) Draw ntree bootstrap samples from the original data.
2) Grow a tree for each bootstrap data set.At each node of the tree, randomly select mtry variables for splitting.Grow the tree so that each terminal node has no fewer than nodesize cases.
3) Aggregate information from the ntree trees for new data prediction such as majority voting for classification.
4) Compute an out-of-bag (OOB) error rate by using the data not in the bootstrap sample.
RF [22] can handle thousands of variables of different types with many missing values.Figure 2 presents bootstrapped sample and out of bag sample in random forest algorithm.For a tree grown on a bootstrap data, the OOB data can be used as a test set for that tree.As the number of trees increases, RF provides an OOB data-based unbiased estimate of the test set error.OOB data are also used to estimate importance of variables.These two estimates (test set error estimate and variable importance) are very useful byproducts of RF.
Variable importance: There are four variable importance measures implemented in the RF software code [23,24].Two measures, based on the GINI index of node impurity and classification accuracy of OOB data, are usually used.Given a node t and estimated class probabilities   1, , where Q is the number of classes.node the decrease in the GINI index is calculated for variable xj used to make the split.The GINI index-based variable importance measure j  is then given by the average decrease in the GINI index in the forest, where the variable xj is used to split a node.

N-Fold Cross Validation
In n-fold cross validation approach [26][27][28][29], we randomly partition into N sets of equal size and run the learning algorithm N times.Each time, one of the N sets is the test set, and the model is trained on the remaining N − 1 sets.The value of K is scored by averaging the error across the N test errors.We can then pick the value of K that has the lowest score, and then learn model parameters for this K.A good choice for N is N = M − 1, where M is the number of data points.This is called Leave-one-out cross-validation.

Bayesian Probability
Bayes' rule [30,31] really involves nothing more than the manipulation of conditional probabilities.As we know, the joint probability of two events, A & B, can be expressed as In Bayesian probability theory, one of these "events" is the hypothesis, H, and the other is data, D, and we wish to judge the relative truth of the hypothesis given the data.According to Bayes' rule, we do this via the relation The term P(D|H) is called the likelihood function and it assesses the probability of the observed data arising from the hypothesis.Usually this is known by the experimenter, as it expresses one's knowledge of how one expects the data to look given that the hypothesis is true.The term P(H) is called the prior, as it reflects one's prior knowledge before the data are considered.The specification of the prior is often the most subjective aspect of Bayesian probability theory, and it is one of the reasons statisticians held Bayesian inference in contempt.But closer examination of traditional statistical methods reveals that they all have their hidden assumptions and tricks built into them.Indeed, one of the advantages of Bayesian probability theory is that one's assumptions are made up front, and any element of subjectivity in the reasoning process is directly exposed.The term P(D) is obtained by integrating (or summing) P(D|H)P(H) over all H, and usually plays the role of an ignorable normalizing constant.Finally, the term P(D|H)P(H) is known as the posterior, and as its name suggests, reflects the probability of the hypothesis after consideration of the data.Another way of looking at Bayes' rule is that it represents learning.That is, the transformation from the prior, P(H), to the posterior, P(H|D), formally reflects what we have learned about the validity of the hypothesis from consideration of the data [32,33].

PROPOSED METHOD
The proposed method can be understood as a two-phase method.In the phase one, the learning algorithm RF was trained and tested on the training set and validation set in order to select the best features (overall view of the proposed method is presented in Figure 3).The most important procedure in phase one is to estimate feature ranking value for each feature based on Bayesian probability and feature impurity, after that all these features will be rank in ascending order based on feature ranking value.A backward elimination approach was applied to evaluated contribution of each feature to the classifier through oneby-one eliminating feature and comparing classification accuracy before and after eliminating the feature.Output of the phase one is a set of selected features.In the phase two, also on the same dataset only the selected features were used to train the classifier so that classification accuracy was improved.
The proposed method is descripted as four-step classification algorithm as following:  on class k th is calculated as Equation (4) below: • Backward elimination approach to eliminate feature, start from the smallest feature in the feature ranking list; For example, we have a confusion matrix with two classes are positive class and negative class, at j th crossvalidation we have • Evaluate the important of the eliminated feature though variance of classification accuracy with and without eliminating feature.If subtraction of classification accuracy before eliminating the feature and after eliminating the feature is positive then the feature should be kept, otherwise the feature is the redundant feature and it should be deleted., , , where , , is the trait importance of the learning accuracy of class k th and the test accuracy of class k th respectively.For example, if we need to classify a dataset into two classes, using random forest with n-fold cross validation at j th cross validation we will obtain a set of In Step 2, we will setup a feature ranking formula that is use to rank all features in the dataset.This step is the most important step in our algorithm.It is indispensable to mention that our proposed method uses feature ranking formula as key factor to determine as which feature should be eliminated first.In other words, the feature ranking formula will help us in determining which feature may be a noisy/redundancy feature.If a feature has high ranking in the dataset then it will be a useful feature for classifier and otherwise.The weakness of feature ranking formula will lead to the weakness of proposed algorithm because this problem will lead time-consuming and classification accuracy of algorithm.This problem will be discussed in Step 3 in detail.
In reality, a simply method usually will use when we judge whether the feature is useful to the classifier or not.The method can best be understood as follow: we add a feature into the training set; let a learning algorithm learn on the training set; assess classification accuracy on validation set before and after adding the feature.However, in our situation the question is that how can we have a good estimation of classification accuracy?In order to deal with this issue, within the scope of this paper we will use Bayesian probability to estimate classification accuracy.Now assume that there are m classes, 1 2 .Given an entity X, X is depicted by n features, X = (X 1 , X 2 ,···,X n ).According to Bayesian probability, the probability that X belongs to the class P(X) is constant for all classes because we know that the probability of an entity can be classified in to a class are the same, so that only     i i need to be estimated.According to Bayes' suggestion in case the prior probabilities of the class are unknown, then it is commonly assumed that prior probabilities of all the classes are equally or in other word we have P(C 1 ) = P(C 2 ) = ··· = P(C m ), and we therefore only need to estimate We know that with the given dataset of many attributes, it would be extremely computationally expensive to estimates   i P X C .In order to reduce computation in evaluating   i , the naive assumption of conditional independence of class is made.This presumes that the values of the attributes are conditionally independent of one another.Thus, From ( 4) and (10) we propose a way to estimate average of classification accuracy and average classification inaccuracy on the learning dataset as following: Average of classification accuracy: Average of classification inaccuracy: Similarly, average classification accuracy and average classification inaccuracy the on the test dataset: Average of classification accuracy: Average of classification inaccuracy: To evaluate classification accuracy we use fraction of average of classification accuracy and average of classification inaccuracy that are calculated as following.learn learn learn We propose a new feature ranking formula for feature i th at j th cross validation as following: rank , , , learn test 1 , , where: +k = 1,···, n is the number of cross validation folders, +F i,j is GINI index in case of using decision tree algorithms learn , j k K  and test , j k K are evaluating parameters of classifier performance on learning set and test set, The feature ranking formula includes two elements: 1) the first element is GINI index, the element decreases for each feature over all trees in the forest when we train data by learning algorithm random forest; 2) the second element is fraction, nominator of the fraction is constant, equals to 1, denominator of the fraction equals , presents the variance between classification accuracy on the learning set and the test set.That means the smaller variance, the better features we have.The combination between GINI index and the fraction presents our expectation: higher ranking feature are better feature.After finishing the Step 2, we have an ordered list of ranking features.The list will be used in Step 3 to determine optimal features for the classifier.One should be noted that feature assessing procedure is the correlation among features.We know that a feature may have a low position in feature ranking list but when it is use concurrently with other features they will bring a great contribution to classification accuracy.One feasible way to deal with this issue is to use feature elimination strategy which is the next step (Step 3) of our proposed method.
In Step 3, we use backward elimination approach to eliminate noisy/redundant features.In this step, we will use feature ranking list as a standard criterion to determine which feature should be eliminated first.In this proposed method the feature of lowest position in feature ranking list will be eliminated first.At each step in feature eliminating procedure we will validate the classification accuracy.Purpose of the validation is to determine whether the eliminated feature is actually redundancy/ noisy feature or not.We can perform the validation by comparing the classification accuracy before and after eliminating the feature.If classification accuracy before eliminating feature is greater than that of after eliminating feature then the feature will be kept, otherwise it will be eliminated.This iteration will terminate whenever classification accuracy of new subset is higher than classification accuracy of previous subset.Our algorithm will stop when we cannot find better classification accuracy or no feature to eliminate.In this case the current subset is the best subset we can have.Otherwise, in term of n-fold cross validation the procedure will jump back to Step 1 (Step 4).

Data Description
Two different sets of data have been used taken from the Machine Learning Repository of the University of California, Irvine, USA (ftp://ftp.ics.uci.edu/pub/machine-learning-databases,2012).

Wisconsin Breast Cancer Diagnosis Dataset
The features of the diagnostic collection describe characteristics of the cell nuclei present in a digitized image of a fine needle aspirate (FNA) of a breast mass [34].Every cell nucleus is defined by ten traits and for every trait the mean, the standard error and the worst (mean of the three largest values) are computed, resulting in a total of 30 features for each image:  Radius (mean of distances from center to points on the perimeter):

Experimental Results
In this section we present experimental results of the proposed method through fifty times of trials.The proposed method for breast cancer diagnosis and prognosis were implemented by using the R program language version 2.1.5.2 with RF package.The both of datasets were divided randomly into training set and validation set in the ratio of 1 to 1.The parameters of the proposed method in this experiment were determined as follows:  Number of trees in RF: 25;  Number of remaining features: 15-the proposed method will be stop if number of remaining features in dataset is greater or equal number of remaining features.
The proposed method was experimented 50 times on both WBCDD and WBCPD.Table 1 shows the results of the proposed method.
Figures 4 and 5 present performances of proposed method on training set and validation set of WBCDD and WBCPD.
Table 2 shows number of original features and number of features were selected.The result indicates that the feature subsets selected by the proposed approach have a better classification performance than that produced by the original RF.
Table 3 presents time-consuming of the proposed method.It should be noted that the proposed method was     carried out on a laptop computer with the central processing unit Intel Core 2 Duo 2.13 GHz so that timeconsuming is not a challenge of the proposed method.
In this paper ROC is used to evaluate performance of the proposed method.The sensitivity and specificity of the proposed method are presented in Table 4. Figures 6  and 7 show ROC curve those are built based on sensitiv-   ity and specificity of the propose method on WBCDD and WBCPD, respectively.The results indicate that the proposed method is a reliable diagnostic tool for breast cancer diagnosis and prognostic.
For comparison purposes, Table 5 gives the classification accuracies of our method and previous methods.As we can see from the results, our method obtains the highest classification accuracy so far.

CONCLUSIONS
W ith the non-stop improvements in expert systems and From the above results, we conclude that the proposed method obtains very promising results in classifying the possible breast cancer.We believe that the proposed system can be very helpful to the physicians for their as a second opinion for their final decision.By using such an efficient tool, they can make very accurate decisions.Our method is proved to be superior to the state-of-the-art algorithms applied to the Wisconsin Breast Cancer Database, and shows that it can be an interesting alternative.

Figure 1 .
Figure 1.General Architect of random forest.

Figure 2 .
Figure 2. Bootstrapped sample and out of bag sample in random forest.

Figure 3 .
Figure 3. Overall view of the proposed method.
 .The classification accuracy th , number of features at cross validation were classified correctly on class Total features of class

Figure 4 .
Figure 4. Result of 50 trials of proposed method on training set and validation set of WBCDD.

Figure 5 .
Figure 5. Results of 50 trials of proposed method on training set and validation set of WBCPD.

Figure 6 .
Figure 6.ROC curve of proposed method on WBCDD.

Figure 7 .
Figure 7. ROC curve of proposed method on WBCPD.

Table 1 .
Results of 50 trials on WBCDD and WBCPD.

Table 2 .
A comparison on classification accuracy and number of features selected between original method and proposed method (50 trails).

Table 4 .
The sensitivity and specificity of the proposed method.

Table 5 .
Classification accuracies obtained with our proposed system and other classifiers from literature.[4] Chen, H.-L., Yang, B., Liu, J. and Liu, D.-Y.(2011) A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis.Expert Systems with Applications, 38, 9014-9022.doi:10.1016/j.eswa.2011.01.120 machine learning tools, the effects of these innovations are entering to more application domains day-by-day and medical field is one of them.Decision making in medical field can be a trouble sometimes.The expert systems and machine learning tools that are used in medical decision making provide medical data to be examined in shorter time and more detailed.In order to significantly improve predictive accuracy for breast cancer diagnosis and prognosis we developed a classifier model by combining the random forest classifier and feature selection technique.The classification accuracies of the proposed model are 99.82% on WBCDD and 99.7% on WBCPD.