Bangla Handwritten Character Recognition Using Extended Convolutional Neural Network

The need to recognize handwritten characters is increasing day by day because of its many applications. The objective of this paper is to provide an effective and efficient way to recognize and classify Bangla handwritten characters. To this end, an extended convolutional neural network (CNN) model is proposed. Our CNN model is tested on the "BanglaLekha-Isolated" dataset, which has 10 classes for digits, 11 classes for vowels and 39 classes for consonants. Our model achieves a recognition accuracy of 99.50% for Bangla digits, 93.18% for vowels, 90.00% for consonants and 92.25% for the combined classes.


Introduction
In this present era of digitization, the importance of handwritten character recognition is increasing and its application is prevalent in computer vision. With the improvement of computer technology, governments are trying to computerize their information repository which includes a large amount of handwritten scripts. The traditional method is manually retyping everything which requires huge manpower and a considerable amount of time.
Handwritten character recognition can automate this process, and this automation will help in many areas, e.g. postal code identification, passport and document verification, handwritten license plate recognition, automatic processing of bank cheques, converting hard-copy data to soft copy, ID card reading and signature verification. The challenge is that recognizing handwritten characters is far more difficult than recognizing printed characters. The main reason is that the size and shape of handwritten characters vary from person to person; moreover, writing style and inclination also differ.
Approximately 260 million people speak Bangla worldwide which ranks it as the seventh most spoken language in the world and second in the Indian subcontinent. Bangla handwritten characters have versatility in size, shape, stroke and writing style for different people. Therefore, a sophisticated model like CNN is necessary which is able to extract the features from images automatically without any explicit description.
Researchers have proposed some notable methods for recognizing Bangla handwritten characters. The majority of these works extract features explicitly from the character images using various methods and create feature vectors [1] [2] [3] [4] [5]. The feature vectors are then fed into classifiers such as SVM and KNN. These explicit feature extraction methods face difficulties because of the complex shapes and the similarity of Bangla characters. Some similar characters differ from one another by just a single dot mark. The feature extraction task becomes more challenging because different individuals write with distinct strokes and varied spacing. Moreover, some of the works have treated similar characters as the same class and thereby reduced the number of classes [3] [4].
These factors affect the classification accuracy negatively. Some researchers have used convolutional neural networks in their works [6] [7] [8], but some of these works do not consider all the character classes. Besides, works that consider all character classes achieve lower overall recognition accuracy than they do on the individual categories (vowels, consonants, numerals).
In this paper we intend to solve the problem of explicit feature extraction and propose a method that automatically selects and extracts features from the character images irrespective of individual writing style and spacing. We consider 50 classes of basic letters (11 vowel classes and 39 consonant classes) and 10 classes of digits, and aim for better overall accuracy than existing methods.
CNN was first used in [9] for character recognition. It differs from traditional machine learning algorithms in that it does not require an explicit specification of the features to extract. It can extract the necessary features automatically, which is why it is widely used for classification tasks. In a CNN, visual patterns are recognized directly from the image pixels. It is also largely robust to translation and, to a lesser extent, to changes of scale. For these reasons we have used a convolutional neural network to recognize Bangla handwritten characters.
The rest of the paper is organized as follows: Section 2 reviews previous work in character recognition, Section 3 presents the system model, in which a modified CNN is used to recognize Bangla characters, Section 4 provides results based on the model of Section 3 and Section 5 concludes the work.

Literature Review
This section reviews the state of the art in character recognition based on machine learning. Shopon et al. [6] recognized Bangla handwritten digits employing unsupervised pre-training. They used an autoencoder with a deep CNN. Their proposed architecture contains three convolutional layers and one max-pooling layer. The authors obtained 99.5% accuracy for digit recognition, which is the best accuracy so far. Purkayastha et al. [7] proposed a convolutional deep model for recognizing Bangla handwritten characters including digits and other character classes. A distinctive feature of their study was the recognition of the 20 most frequently used compound characters. Their CNN model consists of two convolutional layers and two pooling layers followed by three densely connected layers.
According to Pak and Kim [10], the convolutional neural network is the most prominent deep learning approach for image processing and pattern recognition. They compared the successful and popular deep learning architectures AlexNet, VGG, GoogLeNet and ResNet. Vaidya et al. [11] developed a CNN-based system for handwritten English character recognition. Their system has two parts: an Android application for capturing an image of the handwritten text to be recognized, and a backend server hosting a trained neural network model.
Ryo, Karungaru and Terada [12] proposed a smartphone-based system for recognition and interpretation of road navigation signs. They provided techniques for character candidate domain extraction, one-character extraction and removal of noise from the image captured by the smartphone. A CNN was used for training the model and recognizing characters. Ashiquzzaman and Tushar [13] proposed a deep neural network based algorithm for recognizing handwritten Arabic numerals. They showed that their neural network model achieved significantly better accuracy than the existing methods for recognizing Arabic numerals.
Selmi et al. [14] presented a deep learning based system which can detect and recognize license plates. The system detects the plate, segments it and recognizes its characters. The authors claimed that their system succeeds in recognizing license plates under various complex conditions such as low-quality and distorted images, intense daylight and dark night-time environments. The authors argued that their model requires fewer steps for image preprocessing. Joshi and Risodkar [15] proposed a system which converts Gujarati handwritten characters into a machine-editable format. They used a deep neural network for recognizing the characters. The authors also used the K-nearest neighbor and NNC classifiers, which are popular methods in the field of OCR.
Tajane et al. [16] analyzed three approaches used for coin recognition, namely electromagnetic, mechanical and image processing. They proposed a new approach for detecting and recognizing Indian coins based on a deep learning model. The authors selected features such as texture, color and shape for training a popular CNN architecture. Li et al. [17] described recognition accuracy and inference performance as the key challenges in classifying images for any real-time application. The authors proposed a solution to accelerate the residual network (ResNet) framework for inference on an FPGA (Field Programmable Gate Array) using the OpenCL programming language. They provided a converter to transform any ResNet in the CAFFE framework to the FPGA platform.

System Model
This section describes the basic construction of a CNN, the dataset used for analysis and the extended CNN used for Bangla handwritten character recognition.

Convolutional Neural Network
The CNN is a deep learning approach which is able to extract different features of objects by applying learnable weights and biases that differentiate one object from another. This methodology has proved successful in the field of image classification, as explained in [18].
A CNN model takes an image as input, processes the image and assigns it to one of the predefined classes. A CNN model is first trained with a large number of images of different categories. In this phase a general model of each category is built. Then, in the testing phase, images are tested against the general models of the different categories to determine which category each image belongs to. Inside the network, a convolutional layer first applies learnable kernels to the image to produce feature maps. The ReLU layer then introduces non-linearity to the feature maps, and the pooling layer downsamples them. Pooling is used for dimensionality reduction: it reduces the spatial size of the feature maps while keeping the important information and the spatial relationship among pixels intact.
Finally, in the fully connected layer, the feature maps are flattened and converted into a vector which is fed into a traditional feed-forward network trained with the backpropagation algorithm. Next, the softmax layer produces the probability distribution on the basis of which an image is categorized. Hence a CNN is used for recognizing Bangla handwritten characters in this paper.

Dataset
For our study of Bangla handwritten character recognition we have used the BanglaLekha-Isolated dataset, which is the largest isolated Bangla character dataset. This dataset contains 50 classes of basic letters (11 vowel classes and 39 consonant classes), 10 classes of digits and 24 classes of compound letters. From the dataset, the digit, vowel and consonant classes are taken for the study, which gives a total of 60 character classes. The character classes that we have picked for recognition from BanglaLekha-Isolated are given in Tables 1-3.

Proposed System Architecture
In this paper, we have proposed a deep convolutional neural network framework for Bangla handwritten character recognition. The model used for this purpose is shown in Figure 1. It has one image input layer, three convolutional layers and three fully connected layers. The image input layer takes Bangla handwritten character images of dimension 32 × 32 × 1.
For every model, preprocessing the inputs is necessary to give the images a common form before feeding them to any classifier. The images of BanglaLekha-Isolated are in png format. We have converted the images into tif format. The images in BanglaLekha-Isolated are of different sizes; therefore, the most important preprocessing task was to make the images equal in size. All of the images are resized to 32 × 32 pixels and, to reduce the computational time, we preferred white letters on a black background. Then the inputs are fed to the model.
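The preprocessing described above can be illustrated with a minimal sketch in plain Python. The paper does not state which resampling method or tool was used, so nearest-neighbour sampling, the helper names `resize_nearest`, `invert` and `preprocess`, and the assumption that the raw scans are dark strokes on a light page are all illustrative choices rather than the authors' pipeline.

```python
def resize_nearest(image, size=32):
    """Resize a 2D list of grey values to size x size by nearest-neighbour sampling."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r * rows // size][c * cols // size] for c in range(size)]
        for r in range(size)
    ]


def invert(image, max_value=255):
    """Turn dark-on-light strokes into light-on-dark strokes (assumed input polarity)."""
    return [[max_value - value for value in row] for row in image]


def preprocess(image, size=32):
    """Bring an image of arbitrary size to a size x size white-on-black form."""
    return invert(resize_nearest(image, size))


# Hypothetical 64 x 48 scan: white page (255) with a dark vertical stroke (0).
sample = [[0 if 20 <= c < 24 else 255 for c in range(48)] for _ in range(64)]
out = preprocess(sample)
print(len(out), len(out[0]))  # 32 32
```

In practice an image library with a proper interpolation filter would normally replace the hand-written resize; the sketch only shows the two operations the text names.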
In the 1st convolutional layer, input images are padded with zero padding of size 1. Then 8 kernels, each of dimension 3 × 3 × 1, are applied to extract eight different features. Each kernel performs the convolution operation on the entire image, which results in one activation map. If the kernel has m rows and n columns, then the convolution operation is:
$$Z(i, j) = \sum_{a=1}^{m} \sum_{b=1}^{n} W(a, b)\, I(i + a - 1,\; j + b - 1)$$
where I = input matrix, W = kernel matrix, Z = output matrix.
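As a hedged illustration of the convolution step for a single kernel, the sketch below pads a single-channel image with zeros and slides a kernel over it. It uses the sliding-window (cross-correlation) form that CNN layers commonly compute, without flipping the kernel; the function names `zero_pad` and `conv2d` and the all-ones test data are assumptions made for the example, not part of the proposed model.

```python
def zero_pad(image, p):
    """Surround a 2D list with p rows and columns of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0.0] * width for _ in range(p)]
    for row in image:
        padded.append([0.0] * p + list(row) + [0.0] * p)
    padded.extend([0.0] * width for _ in range(p))
    return padded


def conv2d(image, kernel, padding=1, stride=1):
    """Apply one m x n kernel to a single-channel image and return one activation map."""
    x = zero_pad(image, padding)
    m, n = len(kernel), len(kernel[0])
    out_rows = (len(x) - m) // stride + 1
    out_cols = (len(x[0]) - n) // stride + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            acc = 0.0
            for a in range(m):
                for b in range(n):
                    acc += kernel[a][b] * x[i * stride + a][j * stride + b]
            row.append(acc)
        output.append(row)
    return output


# A 32 x 32 image of ones and a 3 x 3 kernel of ones (illustrative values only).
image = [[1.0] * 32 for _ in range(32)]
kernel = [[1.0] * 3 for _ in range(3)]
activation = conv2d(image, kernel)
print(len(activation), len(activation[0]))  # 32 32
print(activation[0][0], activation[1][1])   # 4.0 9.0
```

With padding 1 and stride 1, the 3 × 3 kernel leaves the 32 × 32 spatial size unchanged; the corner value is smaller than the interior value because part of the window falls on the zero padding.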
Stacking the activation maps, we get a 32 × 32 × 8 feature map. The dimension of the feature map is:
$$\frac{N + 2P - F}{S} + 1$$
where N = dimension of input image, P = padding, F = dimension of kernel, S = stride. This feature map is then passed through a batch normalization layer where it is normalized across mini-batches. The normalization layer makes the training process faster and less sensitive to the network initialization.
Next, our feature map is passed through the ReLU layer. The purpose of this layer is to introduce non-linearity. The convolution operation may create negative values in the feature map. We ensure non-negative values using the ReLU activation function:
$$f(x) = \max(0, x)$$
where x denotes the value of a pixel.
After that, our model employs a max-pooling layer with kernel size 2 × 2. This kernel scans across the whole feature map with stride 2 and returns the maximum pixel value from each region it covers. If the kernel size is k × k, then it covers a region M of dimension k × k of the feature map. Max-pooling is then done using the following formula:
$$V = \max(M)$$
where V = maximum value of the k × k region, M = k × k region of the feature map.
The dimension of the max-pooling layer output is:
$$\frac{N - F}{S} + 1$$
where N = dimension of input to pooling layer, F = dimension of filter, S = stride.
The pooling layer reduces the height and width of the activation map. The convolutional layer, batch normalization layer, ReLU layer and max-pooling layer form the 1st layer of our model. There are two more such layers in our proposed model.
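The pooling step can be sketched in a few lines of plain Python. The function name `max_pool` and the 4 × 4 example values are illustrative assumptions; the output size follows the usual relation (N − F)/S + 1 for an input of size N, window size F and stride S.

```python
def max_pool(feature_map, k=2, stride=2):
    """Return the maximum of every k x k window, moving the window by `stride`."""
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    out_rows = (n_rows - k) // stride + 1
    out_cols = (n_cols - k) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(k)
                for b in range(k)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


# Illustrative 4 x 4 feature map; a 2 x 2 window with stride 2 halves each side.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 1],
]
pooled = max_pool(feature_map)
print(pooled)  # [[4, 5], [3, 7]]
```

Applied to a 32 × 32 activation map, the same function would return a 16 × 16 map, which matches the halving of height and width described in the text.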
The resulting features of the 1st layer of our model are then fed to the 2nd convolutional layer with padding 1. In this convolutional layer, we employ 16 kernels each of dimension 3 × 3 × 8, which gives 16 feature maps each of dimension 16 × 16 × 1. These feature maps are stacked and treated as an input of dimension 16 × 16 × 16 to the next layer. After that, our model has a batch normalization layer and a ReLU layer followed by a 2 × 2 max-pooling layer with stride 2. This 2nd max-pooling layer outputs a feature map of dimension 8 × 8 × 16.
This feature map is then fed into the 3rd convolutional layer with padding 1 as before. In this layer, 32 kernels each of dimension 3 × 3 × 16 are used for convolution, which results in 32 feature maps each of dimension 8 × 8 × 1. Next, the 8 × 8 × 32 feature map is fed through the batch normalization layer, ReLU layer and max-pooling layer. As a result, we get a feature map of dimension 4 × 4 × 32. This 3rd max-pooling layer extracts the features which will be used for recognizing the true classes of the characters.
Now, the feature map is flattened into a column vector of 512 × 1 and fed to the fully connected (FC) layer. The next layer of our model is the softmax layer, where the probability of each of the predefined character classes is calculated. The equation of the softmax layer is:
$$p(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{j} e^{x_k}}$$
where x_i refers to each element of the logits vector, p(x_i) refers to the probability of x_i, and j is the number of elements in the logits vector.
Finally, a classification layer follows the softmax layer. The task of the classification layer is to specify the class of a character based on the results obtained from the softmax layer. This completes a single epoch of the training process. After each epoch the loss is calculated, and it is used to update the parameters of each of the layers using backpropagation. After several epochs, the model is trained well enough to distinguish and recognize the classes of the handwritten characters. The flowchart of the aforementioned model is shown in Figure 2 and the parameter chart is given in Table 4.
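The layer dimensions quoted above can be checked with a short calculation, and the softmax step can be sketched alongside it. The code below assumes the usual output-size relations, (N + 2P − F)/S + 1 for convolution and (N − F)/S + 1 for pooling; the function names and the three example logits are illustrative, and subtracting the maximum logit is a common numerical-stability device rather than something the paper specifies.

```python
import math


def conv_out(n, f=3, p=1, s=1):
    """Spatial size after a convolution with kernel f, padding p and stride s."""
    return (n + 2 * p - f) // s + 1


def pool_out(n, f=2, s=2):
    """Spatial size after pooling with window f and stride s."""
    return (n - f) // s + 1


def trace_shapes(n=32, kernel_counts=(8, 16, 32)):
    """List (layer, height, width, depth) after each convolution and pooling stage."""
    shapes = []
    for count in kernel_counts:
        n = conv_out(n)
        shapes.append(("conv", n, n, count))
        n = pool_out(n)
        shapes.append(("pool", n, n, count))
    return shapes


def softmax(logits):
    """Convert a list of logits into probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


shapes = trace_shapes()
for shape in shapes:
    print(shape)

_, height, width, depth = shapes[-1]
flattened = height * width * depth
print(flattened)  # 512

probs = softmax([2.0, 1.0, 0.1])  # illustrative logits for three classes
print(sum(probs))
```

The trace reproduces the sequence 32 × 32 × 8, 16 × 16 × 8, 16 × 16 × 16, 8 × 8 × 16, 8 × 8 × 32 and 4 × 4 × 32, so the flattened vector has 4 × 4 × 32 = 512 elements.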

Result and Discussion
Because of the versatility of Bangla handwritten characters in shape and size, recognizing them is quite a complex task in comparison to other languages. Our purpose was to classify and recognize Bangla digits, vowels and consonants separately, and finally to recognize the combined classes using the same classifier. All the Bangla characters are taken from the "BanglaLekha-Isolated" dataset. Some Bangla handwritten digits, vowels and consonants, taken randomly before preprocessing, are shown in Figure 3. We have kept the batch size as a variable parameter, and users are free to change it. At first we trained and tested our model separately for digits, vowels and consonants to determine the category-wise recognition performance. Then the model was trained with all the classes at once for the combined recognition performance. We have used "Accuracy" as our performance evaluation metric. Accuracy is the proportion of characters that our model has classified correctly. Accuracy can be calculated by the following formula:

$$\text{Accuracy} = \frac{T_p + T_n}{T_p + F_p + T_n + F_n}$$
where T_p, T_n, F_p and F_n denote the numbers of true positives, true negatives, false positives and false negatives respectively.
[...] Figure 5(a) and Figure 4(b) respectively. Next, for Bangla handwritten consonants, we got training and validation losses of 0.0039 and 0.3738 respectively. The training accuracy is 99.97% and the validation accuracy is 90.00% after 10 epochs. The variation of loss and accuracy for consonants is shown in Figure 6(a) and Figure 6(b).
Finally, we ran the proposed model on the combined Bangla handwritten character set (digits, vowels and consonants mixed together) and obtained a training loss of 0.0344 and a validation loss of 0.3204. The training accuracy is 99.08% and the validation accuracy is 92.25% after 10 epochs, as shown in Figure 7(a) and Figure 7(b). The results of this section are collected in Table 5 for visualization at a glance.
Validation accuracy indicates how the model will perform in predicting the class of a character which it has not seen before. Training accuracy tells us how accurately the model fits the training set. From Table 5 we can see that the validation accuracy of our model for Bangla digits is 99.5%, which means we estimate a 99.5% probability that our model will correctly classify an unseen digit. The probability of correctly predicting an unseen vowel and consonant is 93.18% and 90.00% respectively. For the combined classes, this probability is 92.25%, which is higher than that of some state-of-the-art methods. The training accuracies for the different categories show that our model fits the training set very accurately. Another noticeable point of Table 5 is that the training accuracy and the validation accuracy are fairly close, which suggests that our model is not heavily overfitted. It fits the training data very well and also generalizes, which enables it to make accurate predictions for unseen data.
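As a small illustration of the accuracy metric used in this section, the sketch below computes it in two equivalent ways: as the fraction of correctly classified samples, and from counts of true and false positives and negatives for the two-class case. The function names, label lists and counts are made-up examples, not results from the paper.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def binary_accuracy(tp, tn, fp, fn):
    """Accuracy from confusion-matrix counts: (Tp + Tn) / (Tp + Fp + Tn + Fn)."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical labels for five characters drawn from three classes.
true_labels = [0, 1, 2, 2, 1]
predicted_labels = [0, 1, 2, 1, 1]
print(accuracy(true_labels, predicted_labels))  # 0.8

# Hypothetical confusion-matrix counts for a two-class problem.
print(binary_accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```

For a multi-class problem such as the 60-class combined task, the first form is the more direct one, since it simply counts matches between predicted and true labels.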

Conclusion
This paper shows an effective use of the proposed version of the CNN model to classify and recognize Bangla handwritten characters, and it provides better results than some previous methods such as MLP and SVM. In this paper we ignored the compound handwritten symbols, which will be included in future work with some modification of the proposed CNN. There is also scope to extend our work to fields such as face detection, facial expression identification, fingerprint and other biometric recognition, iris recognition, and vehicle detection from both still images and videos.