With the abundance of exceptionally High Dimensional data, feature selection has become an essential element of the Data Mining process. In this paper, we investigate the problem of efficient feature selection for classification on High Dimensional datasets. We present a novel filter based approach that ranks the features by a score, and we then measure the performance of four different Data Mining classification algorithms on the resulting data. In the proposed approach, we partition the sorted feature list and search for important features in the forward and the reverse direction, starting simultaneously from the first and the last feature of the sorted list. The proposed approach is highly scalable and effective because it parallelizes over both attributes and tuples simultaneously, which allows us to evaluate many potential features for High Dimensional datasets. Experiments on real and synthetic High Dimensional datasets show that the proposed framework improves the precision of the selected features. We have also compared its classification accuracy against various feature selection methods.

Data Mining is a multidisciplinary task that aims to uncover hidden nuggets of information in data. In recent years, as technology has advanced in various fields, the data generated in these fields has grown increasingly large in both the number of instances and the number of features. The proliferation of High Dimensional data in various applications poses challenges to the Data Mining field. This enormity causes serious problems for many Data Mining and Machine Learning algorithms with respect to scalability and learning performance [

High Dimensional datasets are becoming more and more common in the learning process. This has made traditional search algorithms too expensive in terms of time and memory, and several modifications or enhancements to local search algorithms have been proposed to deal with this problem. Feature selection is therefore indispensable for the Data Mining and Machine Learning process when managing High Dimensional datasets. Various established search techniques have shown promising results on a number of feature selection problems, but only a few techniques deal with High Dimensional data. The central hypothesis is that important attributes are strongly correlated with the target class, and uncorrelated attributes are less important. Further, when an attribute is strongly correlated with other attributes, only one of them needs to be kept and the others can be removed. Likewise, if two or more attributes have the same importance for the target class values, it is sufficient to consider only one of them. As the number of attributes in a particular application increases, the dimension of the dataset increases. Finding the best subset then becomes intractable for a feature selection algorithm, and the problem can become NP-hard.

Feature selection tries to find a subset of the original features that carries the same information as the whole dataset, without loss of generality. Here, the main goal is to identify, from thousands of genes, a few features/genes that are specific to particular diseases. However, as the number of attributes becomes extremely large, most existing techniques face the problem of infeasible computation time. In this context, the main difficulty with this type of data is that the number of instances is small, within a hundred, while the number of features is in the order of thousands or even millions. The major challenge in these types of applications is to extract a set of informative features, as small as possible, that lets the learning algorithm classify accurately [

From our study, no single feature selection method handles all the requirements present in inconsistent real world datasets, so hybrid methods have also been proposed to improve efficiency. Ranking of features is also applicable for managing a large set of features. After ranking all the features, we select only those that are above some threshold value and then apply traditional Data Mining approaches to the reduced feature set to check the correctness and accuracy of the trained model.

The motivation for investigating feature subset selection algorithms came from the need to support application domain experts with quantified evidence that the selected features are robust to variations in the training data. This requirement is particularly decisive in biological applications, e.g. DNA microarrays, genomics, proteomics, and mass spectrometry. These applications are generally characterized by high dimensionality; the goal is to find a small output set of highly uncorrelated variables on which biomedical and Data Mining experts will subsequently invest considerably less time and research effort.

The remainder of this paper is organized as follows. Section 2 gives the related work and the background on feature selection techniques required for our proposed algorithm. Section 3 details the methodology and correlation based feature subset selection for High Dimensional data using SU. Section 4 presents our framework and algorithm. Section 5 analyzes the complexity of the proposed algorithm. Section 6 analyzes the results of our algorithm on synthetic as well as real world data, and Section 7 concludes.

A current problem in Machine Learning and Data Mining is discovering a representative set of attributes from which to construct a model for classification or clustering for a specific task. Feature selection aims at selecting a small subset of features that meets certain criteria given by the user [

In the literature, a large number of feature selection algorithms have already been proposed and applied to different fields: bioinformatics [

In the process of feature selection, the key operation is determining how clearly the individual features discriminate. Various methods have been proposed for evaluating the discrimination power of an attribute, of which information gain is one of the oldest and most often used techniques [

A typical feature selection method consists of four fundamental steps, as depicted in

Almuallim and Dietterich proposed FOCUS [

1) Limitation of difficulties in target class;

2) Data is free from the noise.

But the main problem with High Dimensional data is the computational complexity, which can be as large as O(2^{p}); for example, when all the features are relevant, the search may be intractable. In their paper, Devijver and Kittler review heuristic search algorithms, including Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS). These algorithms are based entirely on a heuristic: at every step of the iteration, SFS selects the attribute that is best to add, and SBS rejects the attribute that is best to remove. These techniques cause problems with High Dimensional data because they do not consider the interaction among attributes.
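The greedy step behind SFS can be illustrated with a short Python sketch. The function name `sequential_forward_selection`, the `score` callback and the toy weights are hypothetical and are not taken from Devijver and Kittler; any subset-evaluation function could be plugged in. SBS mirrors this loop by starting from the full set and removing one attribute per step.

```python
def sequential_forward_selection(features, score, k):
    """Greedy SFS: at each step add the feature that maximizes score(subset)."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Evaluate every candidate extension of the current subset.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy criterion: additive weights, with a penalty when the
# redundant pair ("a", "b") is selected together.
WEIGHTS = {"a": 3.0, "b": 2.0, "c": 1.0}


def toy_score(subset):
    total = sum(WEIGHTS[f] for f in subset)
    if "a" in subset and "b" in subset:
        total -= 2.5
    return total


chosen = sequential_forward_selection(["a", "b", "c"], toy_score, k=2)
print(chosen)
```

Each step evaluates every remaining candidate once, so selecting k of p features needs at most k·p subset evaluations instead of the 2^{p} subsets of an exhaustive search.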

Relief [

How can the correlation between two or more attributes be measured on labeled data? Mutual information (MI) is a basic technique that measures how much information two attributes share. It is defined as the difference between the sum of the marginal entropies and their joint entropy. For two totally independent variables the mutual information is always zero. In [

Consider the High Dimensional data

We can use the relation between entropy and mutual information to solve the problem in different ways, as follows. Let H(X) denote Shannon’s entropy of X; then

$$H(X) = -\sum_{x \in X} p(x) \log_{2} p(x) \tag{2}$$

The entropy is related to mutual information as follows:

$$MI(X, Y) = H(X) + H(Y) - H(X, Y)$$

The feature values are considered to be discrete. Here, the marginal entropies are represented by H(X) and H(Y),

As a feature selection criterion, the best feature maximizes the mutual information MI(X, Y), where X is the feature vector and Y is the class indicator. This is a nonlinear statistic of the correlation between feature values and class values. The symmetric uncertainty (SU) [

Symmetric uncertainty can be used to calculate the fitness of a feature for feature selection by computing it between the feature and the target class. A feature with a high SU value gets high importance. Symmetric uncertainty is defined as

$$SU(X, Y) = \frac{2 \, MI(X, Y)}{H(X) + H(Y)} \tag{5}$$

where H(X) is the entropy of a discrete random variable X. If the prior probability of each element of X is p(x), then H(X) can be calculated by equation (2).

Symmetric uncertainty, equation (5), treats a pair of variables symmetrically; it compensates for mutual information’s bias towards features with a large number of different values and normalizes its value to the range [0, 1]. An SU(X, Y) value of 1 indicates that knowledge of the value of either variable completely predicts the value of the other, and a value of 0 indicates that X and Y are independent. In this paper, we also deal with continuous features by discretizing them into a proper discrete form.
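The quantities described in this section can be computed directly from value counts. The following Python sketch is illustrative only: the function names and the toy data are hypothetical, base-2 logarithms are assumed, and returning 0 when both variables are constant is an assumption made to avoid a division by zero.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for two aligned discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * MI(X, Y) / (H(X) + H(Y)), a value in [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    if denom == 0.0:  # both variables constant: treated as uncorrelated (assumption)
        return 0.0
    return 2.0 * mutual_information(xs, ys) / denom


# Hypothetical toy data: one feature that copies the class and one that is
# independent of it.
labels = [0, 0, 1, 1]
copy_feature = ["a", "a", "b", "b"]
indep_feature = ["a", "b", "a", "b"]

su_copy = symmetric_uncertainty(copy_feature, labels)
su_indep = symmetric_uncertainty(indep_feature, labels)
print(su_copy, su_indep)
```

On this toy data the copying feature reaches the upper end of the SU range and the independent feature the lower end, which matches the interpretation of the values 1 and 0 given above.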

A relevant feature is defined as follows: F_{i} is relevant to the target concept C if and only if there exists some

1) When_{i} is fundamentally relevant to the target class;

2) When

It can be concluded that

Given SU(X, Y), the symmetric uncertainty of features X and Y, the correlation between two attributes is referred to as the F-correlation. The correlation between any pair of attributes

relation of

In this section, we propose the framework of our feature subset selection technique, which can improve classification and clustering. To select features that are important for classification or clustering accuracy, we need to answer the following questions:

· How to decide which attributes are relevant to a particular class and which are not?

· How to decide which of the relevant attributes are redundant?

· How to decide whether two attributes are closely correlated?

Using the symmetric uncertainty (SU) as the fitness function, we derive an algorithm and framework for selecting important features for Data Mining tasks. The framework and algorithm are based entirely on the correlation analysis of attributes in supervised High Dimensional datasets.

These questions can be answered by applying appropriate approaches. For the first question, we can use a user-defined threshold value, as is common in the filter approach to feature selection. For example, consider a dataset D having M features, N instances and a set of C classes. Given the SU value between each feature F_{i} and the class C, a subset D’ of the important features can be selected with a user-defined threshold value, which is the second step in our framework. It can be defined as:

The answer to the next question is important because it is the main question on which we focus. For this, we have to analyze the pairwise correlations among all attributes, but the time complexity of computing all pairwise correlations is O(M^{2}), where M is the number of attributes, which is very large in High Dimensional data.
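The relevance step and the cost of the exhaustive pairwise analysis can be sketched in Python as follows. The names `select_relevant` and `pair_count`, the toy SU values and the threshold of 0.3 are hypothetical; the sketch assumes the SU value of every feature with the class has already been computed and that features at or above the threshold are kept.

```python
def select_relevant(su_with_class, threshold):
    """Return the features whose SU with the class is at least `threshold`,
    sorted by decreasing SU (ties broken by feature name)."""
    kept = [(name, su) for name, su in su_with_class.items() if su >= threshold]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in kept]


def pair_count(m):
    """Number of unordered attribute pairs among m attributes: m(m-1)/2."""
    return m * (m - 1) // 2


# Hypothetical SU values between four features and the class.
su_scores = {"f1": 0.62, "f2": 0.05, "f3": 0.41, "f4": 0.30}
relevant = select_relevant(su_scores, threshold=0.3)
print(relevant)
print(pair_count(len(su_scores)), pair_count(len(relevant)))
```

For a dataset with 15,154 attributes, an exhaustive pairwise analysis would need `pair_count(15154)` = 114,814,281 SU computations, which is why the relevance filter is applied before any pairwise comparison.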

Correlation between attributes is also captured by symmetric uncertainty values, but to differentiate between relevant and redundant attributes we need a justification for selecting a particular threshold value. In other words, we need to define whether the value of correlation or symmetric uncertainty between two attributes in

The correlation between a feature

there should not be any

If there exists such

where

According to the above definitions, an attribute is good if it is predominant in predicting the class value, and feature selection for classification is a process that determines all attributes predominant to the class value and removes the other attributes.

We make an assumption in developing this framework: if two attributes appear redundant to each other and one of them has to be removed, we remove the attribute that is less relevant to the class value and keep the one that carries more information for predicting the class. The attribute with the highest
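The removal rule for a redundant pair can be written as a small Python helper. This is a sketch under assumptions: the redundancy test used here (the SU between the two attributes is at least the SU between the weaker attribute and the class) follows the FCBF-style criterion and is not confirmed by the text above, and the function name and values are hypothetical.

```python
def attribute_to_drop(f_p, f_q, su_class, su_pq):
    """Return the attribute to remove from a redundant pair, or None.

    su_class maps an attribute name to its SU with the class;
    su_pq is the SU between f_p and f_q.
    """
    # Order the pair by relevance to the class.
    if su_class[f_p] >= su_class[f_q]:
        stronger, weaker = f_p, f_q
    else:
        stronger, weaker = f_q, f_p
    # Assumed redundancy test: the pair is redundant when the attributes are
    # at least as correlated with each other as the weaker one is with the class.
    if su_pq >= su_class[weaker]:
        return weaker
    return None


su_class = {"x": 0.7, "y": 0.4}
print(attribute_to_drop("x", "y", su_class, su_pq=0.5))
print(attribute_to_drop("x", "y", su_class, su_pq=0.2))
```

The helper never returns the more relevant attribute, so the attribute that carries more information about the class always survives the comparison.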

Having discussed the methodology so far, we now propose a framework and algorithm. We use SU, which compensates for Information Gain’s bias toward attributes with more values and normalizes its values to the range [0, 1], where the value 1 indicates that knowledge of either value completely predicts the value of the other and the value 0 indicates that X and Y are independent. The main advantage of using symmetric uncertainty is that it treats a pair of features symmetrically.

We use the SU value for two main reasons. First, as step two shows, it lets us remove the attributes whose SU value is below the predefined threshold λ, because attributes having a high

the SU value for

Given a High Dimensional dataset with M different attributes and a class label C, the approach finds a subset of attributes that are predominant for the class values and rejects all other attributes as irrelevant. It can be divided into two parts. In the first part, it calculates the SU value between each attribute F_{i} and the class C, which is represented by

The process starts with the calculation of SU for each attribute; after that, we select the first and last elements and continue as follows. We calculate the middle index of the sorted elements and divide the attributes into two parts. In the first part, we proceed from the first element to the middle index, and in the second part from the last element back to the middle index. For all the remaining feature
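The two-ended traversal described above can be sketched in Python. This is only one possible reading of the description, not the authors' algorithm: all function names are hypothetical, the redundancy test is the assumed FCBF-style criterion (an attribute is redundant when its SU with a more relevant attribute is at least its SU with the class), and the rule used in the reverse pass in particular is an assumption.

```python
def forward_pass(part, su_class, su_pair):
    """Walk from the most relevant attribute toward the middle index."""
    remaining = list(part)
    kept = []
    while remaining:
        f_p = remaining.pop(0)
        kept.append(f_p)
        # Delete every remaining f_q that is redundant to f_p.
        remaining = [f_q for f_q in remaining
                     if su_pair(f_p, f_q) < su_class[f_q]]
    return kept


def reverse_pass(part, su_class, su_pair):
    """Walk from the least relevant attribute back toward the middle index."""
    remaining = list(part)  # still sorted by decreasing relevance
    kept = []
    while remaining:
        f_p = remaining.pop()  # least relevant attribute left
        # Assumed rule: drop f_p if a more relevant remaining f_q covers it.
        if not any(su_pair(f_q, f_p) >= su_class[f_p] for f_q in remaining):
            kept.append(f_p)
    kept.reverse()
    return kept


def two_ended_selection(su_class, su_pair, threshold):
    """Relevance filter, ranking, split at the middle index, two passes."""
    ranked = sorted((f for f in su_class if su_class[f] >= threshold),
                    key=lambda f: (-su_class[f], f))
    mid = len(ranked) // 2
    front, back = ranked[:mid], ranked[mid:]
    # The two halves are treated individually, so the passes are independent.
    return (forward_pass(front, su_class, su_pair)
            + reverse_pass(back, su_class, su_pair))


# Hypothetical SU values: "e" is irrelevant, "b" duplicates "a", "d" duplicates "c".
su_class = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4, "e": 0.1}
pair_su = {frozenset(("a", "b")): 0.85, frozenset(("c", "d")): 0.45}


def su_pair(f, g):
    return pair_su.get(frozenset((f, g)), 0.1)


selected = two_ended_selection(su_class, su_pair, threshold=0.3)
print(selected)
```

Because neither pass reads the other half of the ranked list, the two passes could also be run concurrently; the sketch runs them one after the other for simplicity.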

We now analyze the time complexity of the proposed algorithm. Computing the symmetric uncertainty (SU) value of each feature has linear time complexity in the number of features M, which is also called the dimensionality of the dataset. Since this task is performed only once and the result is stored in D’, its cost is considered negligible compared with the subsequent search for important features. In the second part, lines (14 - 23) and (24 - 33), the proposed algorithm can delete in each round a large number of attributes that are redundant to f_{p}. In the best case, all of the remaining f_{q} are redundant, so all of these attributes are removed and the time complexity is of order O(M). In the worst case, when all the f_{q} are retained in D’, the time complexity is O(M^{2}). In the average case, we can assume that half of the remaining attributes are deleted in each iteration, so the time complexity is of order O(M log M), where M is the number of attributes. We divide D’ into two parts and treat them individually; on average, lines (14 - 23) and (24 - 33) can each be computed in O((M/2) log(M/2)). Since lines (1 - 7) calculate the SU value of a pair of attributes in time proportional to the number of instances N in the data, the overall complexity of the proposed algorithm is O(N M log M).

In our experimental work, we evaluate the effectiveness of the proposed technique in terms of speed, number of selected attributes, and predictive accuracy of a particular classifier on the selected features. The algorithm is compared against some existing techniques, Information Gain (IG), Chi Square, ReliefF and FCBF, on 5 benchmark High Dimensional datasets. Because our approach selects fewer features than Information Gain and Chi Square, and a number comparable to FCBF and ReliefF, the time taken by the subsequent mining algorithm is reduced. The datasets used are listed in Table 1, which contains the 5 benchmark High Dimensional datasets along with their characteristics: the number of attributes, the number of instances, and the number of classes. All of these datasets are taken from the UCI Repository [

For each dataset in Table 1, we run our algorithm and record the running time in Table 2 and the number of features selected in Table 3. We perform the same analysis for some traditional algorithms, ReliefF, Information Gain, Chi Square and FCBF, and record the time required and the number of selected features for each algorithm in Table 2 and Table 3.

To validate our proposed algorithm, we have tested the classification accuracy with different classifiers. Mainly, decision tree, SVM and NB classifiers are used to check the classification accuracy with all 5 feature selection methods.

Datasets | Number of attributes | Number of instances | Number of classes |
---|---|---|---|
Lung-cancer | 57 | 32 | 3 |
Chemical | 151 | 936 | 3 |
Isolet | 618 | 1560 | 26 |
Leukemia | 7129 | 72 | 2 |
Ovarian | 15154 | 253 | 2 |

Datasets | IG | Chi Square | ReliefF | FCBF | Our techniques |
---|---|---|---|---|---|
Lung-cancer | 238 | 325 | 62 | 25 | 20 |
Chemical | 2766 | 2432 | 2622 | 130 | 103 |
Isolet | 19930 | 19851 | 18085 | 3098 | 2830 |
Leukemia | 29987 | 27883 | 21090 | 4143 | 3716 |
Ovarian | - | - | 36561 | 7613 | 7207 |

Datasets | IG | Chi Square | ReliefF | FCBF | Our techniques |
---|---|---|---|---|---|
Lung-cancer | 16 | 15 | 9 | 7 | 8 |
Chemical | 23 | 21 | 11 | 10 | 9 |
Isolet | 37 | 39 | 22 | 21 | 25 |
Leukemia | 52 | 62 | 36 | 33 | 33 |
Ovarian | - | - | 107 | 96 | 100 |

Datasets | Full data | IG | Chi Square | ReliefF | FCBF | Our techniques |
---|---|---|---|---|---|---|
Lung-cancer | 81.26 | 89.32 | 88.35 | 84.50 | 93.73 | 92.24 |
Chemical | 94.13 | 92.13 | 92.78 | 93.27 | 95.36 | 95.83 |
Isolet | 79.54 | 78.21 | 75.53 | 75.02 | 77.32 | 78.37 |
Leukemia | 74.25 | 76.86 | 82.47 | 83.89 | 86.64 | 85.61 |
Ovarian | 72.87 | - | - | 77.31 | 78.27 | 79.34 |

Datasets | Full data | IG | Chi Square | ReliefF | FCBF | Our techniques |
---|---|---|---|---|---|---|
Lung-cancer | 88.53 | 92.75 | 92.17 | 86.22 | 94.73 | 94.33 |
Chemical | 95.56 | 93.80 | 95.53 | 93.53 | 96.65 | 95.30 |
Isolet | 82.37 | 83.93 | 77.73 | 77.05 | 81.62 | 80.18 |
Leukemia | 78.84 | 81.23 | 85.42 | 81.47 | 88.77 | 90.50 |
Ovarian | 77.29 | - | - | 74.32 | 82.55 | 84.83 |

Datasets | Full data | IG | Chi Square | ReliefF | FCBF | Our techniques |
---|---|---|---|---|---|---|
Lung-cancer | 83.42 | 90.75 | 90.97 | 84.32 | 93.18 | 94.63 |
Chemical | 95.32 | 92.50 | 93.16 | 92.61 | 96.42 | 96.19 |
Isolet | 80.21 | 80.26 | 77.49 | 76.56 | 80.86 | 78.77 |
Leukemia | 74.23 | 79.47 | 84.35 | 81.28 | 88.27 | 88.50 |
Ovarian | 72.94 | - | - | 73.98 | 80.75 | 83.29 |

FCBF method with respect to classification accuracy. In terms of time consumption, however, our approach outperforms all other methods.

In this paper, we have proposed an algorithm for feature subset selection on High Dimensional datasets. Our approach is based on a correlation based feature ranking measure, symmetric uncertainty (SU). Our future plan is to explore this approach on very High Dimensional data, such as the ovarian dataset. We have noticed that the algorithm generally works well with numerical data; we can also try to extend it to mixed-type data (containing both numerical and categorical attributes) without first normalizing them into discrete values. This may also solve the problem of feature selection for High Dimensional data and biological datasets with millions of features. For example, next generation sequencing techniques in biological analysis can produce data with several million features in a single computation. Existing approaches make it hard to process data of this dimensionality, which raises simultaneous challenges of computational power, algorithm stability and accuracy.