A Minimal Subset of Features Using Feature Selection for Handwritten Digit Recognition

Many systems of handwritten digit recognition built using the complete set of features in order to enhance the accuracy. However, these systems lagged in terms of time and memory. These two issues are very critical issues especially for real time applications. Therefore, using Feature Selection (FS) with suitable machine learning technique for digit recognition contributes to facilitate solving the issues of time and memory by minimizing the number of features used to train the model. This paper examines various FS methods with several classification techniques using MNIST dataset. In addition, models of different algorithms (i.e. linear, non-linear, ensemble, and deep learning) are implemented and compared in order to study their suitability for digit recognition. The objective of this study is to identify a subset of relevant features that provides at least the same accuracy as the complete set of features in addition to reducing the required time, computational complexity, and required storage for digit recognition. The experimental results proved that 60% of the complete set of features reduces the training time up to third of the required time using the complete set of features. Moreover, the classifiers trained using the proposed subset achieve the same accuracy as the classifiers trained using the complete set of features.


Introduction
Handwriting recognition is the ability of recognizing handwritten text from a scanned file, image, touch-screen or other tools and converting it into an editable text [1]. Handwriting recognition is a quite complex problem. The ideal goal How to cite this paper: Alsaafin, A. and The recognition of handwritten digits is not a new problem. It is a problem that has continued to be an active topic in research for several reasons [5]:  The problem is suitable for image processing and pattern recognition using a small number of classes.  Availability of standard benchmark datasets reduces the effort and time spent in preprocessing the data.  A lot of work and research have been done in this field that can be cited and built on.  The practical applications encourage more research to be done.
 Improvements in classification accuracy using existing approaches continue to be achieved using new approaches.
In addition, handwritten digit recognition is an active subject in OCR (Optical Character Recognition) applications and pattern classification/learning research [6]. OCR applications like postal mail sorting, bank check processing, and form data entry require high accuracy and speed techniques to achieve a satisfactory performance. The aforementioned reasons motivate us to proceed with this study.
In this paper, we employ a Feature Selection (FS) method in order to select a subset of relevant features using the MNIST dataset. The FS method used in this paper is called Feature Importance. This paper also implements various classification techniques in order to study their suitability for digit recognition. It presents a statistical analysis that compares models of different algorithms, The goal behind using classifiers of different algorithms is to find out the most suitable and efficient algorithm for digit recognition in terms of training time, accuracy, and confusion matrix. In addition, the purpose of introducing this comparison is to limit the algorithms that can be used to solve such a problem and devote more effort to develop the suitable techniques by doing more experiments and research.
The rest of this paper is organized as follows. Section 2 reviews related literature. Sections 3 and 4 include description of the dataset, FS, and classification techniques used in this paper. The stages of the performed experiment and the A. Alsaafin, A. Elnagar Journal of Intelligent Learning Systems and Applications obtained results are presented in Section 5. Section 6 includes some discussion before concluding this paper in Section 7.  LeCun et al. [7] reported different accuracies obtained from various classifiers.

Related Work
The highest accuracy, 99.30%, was obtained via boosted convolutional neural network (CNN) that is trained using distorted data. Simard et al. [8] enhanced distorted sample generation as well as the implementation of CNN. This led to a small improvement in test accuracy, 99.60%. Liu et al. [6] used gradient direc-Journal of Intelligent Learning Systems and Applications tion features to implement several classification methods. The test accuracies that have been achieved are 99.42% by polynomial classifier, 99.58% by SVM classifier, and over 99% by many other classifiers. Holmstrom et al. [9] compared different statistical and neural classifiers based on PCA features. However, the PCA feature does not perform satisfactorily. Liu et al. [10] claim that training classifiers using MNIST dataset without using feature extraction methods shows inferior performance. This motivates us to compare the performance of various classification models trained with a subset of features against the complete set of features available in the MNIST dataset.

1) MNIST dataset
In this paper, we use MNIST dataset. The MNIST is a dataset developed by LeCun, Cortes and Burges for evaluating machine learning models on the handwritten digit classification problem [11]. It has been widely used in research and to design novel handwritten digit recognition systems. The MNIST dataset contains 60,000 training cases and 10,000 test cases of handwritten digits (0 to 9). Each digit is normalized and centered in a gray-scale (0 -255) image with size 28 × 28. Each image consists of 784 pixels that represent the features of the digits. Some examples from the MNIST dataset are shown in Figure 1. The MNIST dataset is balanced over the ten classes (0 -9). Figure 2 shows the percentage of each class in the MNIST dataset.  2) Feature selection methods Features that are used in any machine learning method have a huge influence on the obtained results. Having Irrelevant or partially relevant features can negatively affect performance of many models, especially linear algorithms like logistic regression [12].
Feature selection (FS) is a process of selecting features that contribute most to the prediction variable. An FS method selects a subset of relevant features and discards irrelevant/redundant features. The benefits of performing FS before data modeling are [12]:  Reduces Overfitting: Less redundant data implies less chance to make decisions based on noise.  Improves Accuracy: Less misleading data implies modeling accuracy improves.  Reduces Training Time: Less data implies that algorithms train faster.
In general, there are two types of FS methods: filter method and wrapper method. The filter method selects a subset of the most relevant features based on the characteristics of the dataset. This type of FS methods is independent of the classification algorithms unlike the wrapper methods that use a predetermined classifier in order to evaluate the selected subset of features. The wrapper method is more computationally expensive than the filter method [13]. In this research, we selected a wrapper FS method called Feature Importance. The Feature Importance is a method that estimates the importance of features using bagged decision trees like random forest and extra trees [12].
Our goal in this project is not to improve machine leaning algorithms but compare the performance of different algorithms using a subset of features Journal of Intelligent Learning Systems and Applications against the complete set of features. Therefore, we use a simple FS method that serves our goals.

Classification Techniques
Classification is a process of finding a model/function that recognizes, describes, and differentiates two or more classes. The purpose of using a classification model is to predict the class of an object where its class label is unknown. Classification has a predictive nature-for a given set of features, the goal is to attempt to predict the value of another feature [14]. A classifier model is used to classify the actual data into defined classes. Ultimately, patterns need to exist in the data that can be exploited. The classification techniques used in this paper are explained below.

1) Multinomial Logistic Regression:
A Logistic Regression is a binary classifier that can distinguish between two classes. Multinomial Logistic Regression (MLR) is a logistic regression that is designed to solve multiclass/multinomial classification problems. In other words, Multinomial Logistic Regression is a model that predicts the probabilities of different possible outcomes of a categorically distributed dependent variable, given a set of independent variables [15]. It is used when the dependent variable is nominal and falls into one of many (i.e. three or more) classes/categories (e.g. the MNIST dataset has nine classes).
MLR classifier is commonly used as an alternative to naive Bayes classifier since it does not assume statistical independence of the random features that used as predictors. The MLR classifier is simple and requires a little time to learn, however, it becomes slow when use a very large number of classes to learn.

2) Decision Tree:
A Decision Tree (DT) classifier is a flow-chart-like structure, where the topmost node represents the root node of the tree, each intermediate node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf node indicates a class label. The paths from root to leaf represent classification rules.
DT is a simple model to understand and interpret. However, computation can get very complex if many outcomes are liked (i.e. a decision tree reaches a great depth). The DT classifier becomes biased in favor of attributes with more levels/depth; however, it can be easily combined with other decision techniques in order to make better and accurate decisions [16].

3) Random Forest:
A Random Forest (RF) classifier [17] is one of the most effective techniques that is used for predictive analytics. It is an ensemble learning classifier that combines decisions from a sequence of base classifiers in order to make a decision. RF is correct for decision trees' habit of overfitting to their training set [18].
Decision Trees that have great depth tend to overfit their training sets thus have low bias but very high variance. RF classifier reduces the variance by averaging is an additive technique that is developed by Friedman [19]. It combines predictions from a sequence of base classifiers in order to make a prediction. Typically it is constructed of a number of simple/weak DT models where the output is the sum of the decisions of these base learners. The purpose of BT method is to achieve better predictive performance by minimizing the loss function when adding trees. Generally adding more trees makes the model more resistance to overfitting. Therefore, keep adding trees until no further improvement is observed can lead to build an efficient model that gives very accurate predictions. 5) Convolutional Neural Network: A Convolutional Neural Network (CNN) is a deep learning method that can be used for the purpose of feature extraction as well as classification. Hence, CNN acts as a feed-forward network that extracts topological properties from images [20]. It extracts features from raw images (i.e. contain the intended pattern) in its first layers, and then classifies the pattern with its last layers [20].
CNN is a powerful technique for image processing as well as natural language processing. The main advantage of CNN is the high accuracy in its results. However, it requires high computational cost. In addition, it needs a lot of data to be trained. The complexity of the CNN slows down the training process thus it is necessary to use a good GPU to overcome this problem.

Experimental Results
As previously mentioned, we use the MNIST dataset to perform the experiment of this study. In addition, we use a FS method to be examined with different classifiers: MLR, DT, RF, and BT. At the last stage of this study, we compare the same classifiers used at the previous stage in addition to CNN. In this section, we describe the tools that used throughout the experiment of this study, describe all the experiment stages, and finally present and discuss the obtained results.
1) Hardware and Software Tools Since we consider some factors (e.g. training time) that are affected by the tools used in the experiment, it is worth mentioning the hardware and software that have been used throughout this research. For the hardware tools, we used a laptop of MacBook Pro, 2.8 GHz Intel Core i7 with 4 GB RAM. All the setup and computations of this research have been performed on top of the CPU. For the software tools, GraphLab Create 1 is used for the implementation of this paper. GraphLab Create is a machine learning framework that is python-based libraries. It is designed to handle the major properties of real world data: scalable, variety, and complexity. The biggest advantage of using GraphLab is that it supports scalability, which allows using large datasets. It also supports different data Journal of Intelligent Learning Systems and Applications source such as JSON, CSV, HDFS/S3 and many more. Graphlab is a complete framework that has rich libraries for data transformation and manipulation. It is worth mentioning that we import some libraries from scikit-learn for the purpose of implementing the FS method used in this paper.

2) Pre-processing
Since the MNIST dataset is divided into standard training and testing sets, 80% and 20% respectively, we divide the MNIST training set into two proportions, 80% as a training set and 20% as a validation set. Therefore, the MNIST dataset is divided into three sets: training set (60%), validation set (20%), and testing set (20%). The training set is used to train the classifiers for the purpose of recognizing handwritten digits while the test set is used to assess the trained classifiers after fine-tuning them using the validation set.
Before training the classifiers, we select subsets of various sizes (i.e. number of features) using Feature Importance method. Table 1    The last step of this study is to compare the performance of the classifiers using the subset of features (60%) against the complete set of features. As shown in Table 2 and Table 3    a) Train and fine-tune the classification techniques In order to have a fair and valuable comparison, all the classification techniques are trained using the complete set of features. The MLR, DT, RF, and BT classifiers are trained as same as they trained in the Study I. A one layer CNN has been trained to classify an input to one of the ten classes (0 -9) available in the dataset. Similar to the other classifiers, the CNN is fine-tuned by increasing the number of iterations in order to fulfill the training criteria. b) Compare the classification techniques Table 3 presents the training accuracy, validation accuracy, test accuracy, and training time for all techniques after fine-tuning them based on the validation set. In this comparison, we did not consider the time required for testing since we use single classifiers (i.e. not combined classifiers) that do not require more than few seconds to predict any new input. As shown in Table 3, the MLR, DT, and RF have low test accuracy compared with the BT and CNN. However, they outperform the BT and CNN in terms of training time. The DT has the lowest training time (71.11 sec), although it has the lowest test accuracy (90.94%) among the four techniques. The CNN outperforms all the techniques in terms of test accuracy (98.36%). The BT has the highest training accuracy (100%) and the second highest test accuracy (97.91%). However, the BT requires the longest time to be trained among all other techniques (1396.03 sec). Figure 4 shows the confusion matrix of the BT and the CNN since they have the highest test accuracy compared with other techniques. The confusion matrix shows a more detailed breakdown of correct and incorrect classifications of each class on the MNIST test set. It composed of target (actual) label, predicted label, and the count of predicting the target label as the predicted label. As shown in Figure 4, the BT and CNN has misclassified digit "5" 19 times, mostly it was classified as digit "3". The CNN outperforms the BT in classifying all the digits except digit "6" and "8". The highest misclassified digit using the BT classifier is digit "7" that has been classified as digit "2" 15 times, while digit "6" is the most misclassified digit using the CNN, 16 mistakes. Using the CNN, digit "6" was often classified as digit "0". However, the CNN was the best in classifying digit "0", only 3 mistakes. On the other hand, the BT classifier was the best in classifying digit "1", 9 mistakes. In general, we suggest that most of the mistakes made by the BT and CNN were due the same shape that some digits have (e.g. "0" and "6"). Figure 5 shows the common and individual mistakes of the CNN and BT when classify digit "7" as digit "2".
A. Alsaafin, A. Elnagar Journal of Intelligent Learning Systems and Applications

Discussion
It is worth mentioning that the results obtained in the previous section can be optimized using different methods. We convinced of these results since our target was to find the most suitable classification techniques that can be used to solve the problem of recognizing handwritten digits. From this experiment, we can conclude that the CNN and the BT are the most suitable classification techniques among all other techniques used in this research or similar to them. Since the BT classifier requires a long time to be trained using the complete set of features, we suggest training the BT using a subset of 60% of features using FS method. This reduces the required time, complexity of the trained model, and required storage for digit recognition while still provides very similar performance to the model trained with the complete set of features.
As we discussed in the related work section, Liu et al. in [10] claim that training classifiers using the complete set of features available in the MNIST dataset gives unsatisfactory performance. However, the obtained results in the implementation section (Study I) show that the models trained with the complete set of features have accuracies not less than the accuracies obtained for the models trained using FS. Therefore, there is about 40% of features exist in the MNIST dataset are irrelevant/redundant features.
It is worthless to compare our results to other results since most of the related work has some limitations. Most of the related work available in the literature does not specify the type of the reported accuracy: training, validation, or testing accuracy. However, it is very important to identify the type of the obtained accuracy since the training accuracy could reach 100% if we kept fine-tuning the model until perfectly fit the training dataset. Although this leads to overfitting problem that causes a reduction in the test accuracy. We can notice from Table   3 that the BT classifier achieves 100% training accuracy without affecting the test accuracy. This is because the BT is an ensemble classifier that tends to be robust to overfitting, however, it will eventually overfit.
Many studies fine-tune and assess models using the same data (testing set).
This leads the test results to be overly optimistic, which means that the model is trained to give a good estimation on the test results. Therefore, assessing the models using the test set will be worthless since the model will give bad results once used to predict new data. To avoid this problem, the model should be fine-tuned using the validation set and then assessed using the test set.
The last point of this discussion is that some studies that use un-standard datasets do not present the distribution of the instances/examples available in the dataset. Imbalance datasets lead to have a high test accuracy with many false positive predictions. Therefore, the accuracy metric is not enough to evaluate the classification techniques since it does not capture everything. The confusion matrix is a very useful metric that shows the details of the predictions made by the trained models.

Conclusion & Future Work
This paper proposed a subset of 60% of features to train classification techniques for digit recognition. In addition, it has implemented and compared models of different algorithms: linear, non-linear, ensemble, and deep learning to study their suitability for digit recognition. The obtained results show that the proposed subset of features is not only minimized but also reduces the required time for training the model. Moreover, it reduces the computational complexity and lowers the required storage for digit recognition. The BT and CNN proved their suitability for digit recognition in terms of accuracy. However, the BT classifier requires using a subset of features since it needs a long time to be trained.
We evaluated the performance in terms of accuracy, training time, and confusion matrix.
In future, we plan to examine more FS methods with different classification techniques. Combining various FS methods is another area to be explored. Based on our observations from the confusion matrices of the BT and CNN, some digits are well recognized by some classifiers while misclassified by other classifiers. This observation motivates us to try FS with a combined machine learning techniques in order to enhance the required time as well as accuracy.