
Recognizing digits from natural images is an important computer vision task with many real-world applications, including check reading, street number recognition, and transcription of text in images. Traditional machine learning approaches to this problem rely on hand-crafted features. However, such features are difficult to design and do not generalize to novel situations. Recently, deep learning has achieved extraordinary performance in many machine learning tasks by automatically learning good features. In this paper, we investigate using deep learning for handwritten digit recognition. We show that a simple network achieves 99.3% accuracy on the MNIST dataset. In addition, we use the deep network to detect digits in images that contain multiple digits. We show that deep networks are able not only to classify digits but also to localize them.

Text recognition from images is an important task that has multiple real-world applications such as text localization [

Recent techniques in deep learning have allowed efficient automatic learning of features that are superior to hand-designed features. As a result, we are able to train a classifier that is significantly more accurate than previous methods. In this paper, we investigate using deep learning to classify handwritten digits, and show that with a simple deep network, we can classify digits with near-perfect accuracy.

We test our methods on the MNIST dataset [

We also investigate classifying multiple digits, where more than one digit is present in an image. An example of this task is shown in

A supervised learning task consists of two components, the input x and the label y. For example, the input can be images of handwritten digits, or images of natural objects, and the label is the corresponding digit class or object class. The goal is to learn the correct mapping f from input x to label y. To accomplish this, a learner is provided with examples of the correct mapping (x_{i}, y_{i}), i = 1, ⋯, N, where x_{i} is an example input and y_{i} is the corresponding label provided by human annotators. Ideally, after learning, f should map each input in the dataset (x_{i}, y_{i}), i = 1, ⋯, N to the correct label, i.e. f(x_{i}) = y_{i}. The hope is that the learner can learn the correct mapping between x and y based on these examples, so that the learned f also classifies unseen data correctly.

Usually f is selected from a class of functions indexed by a parameter θ. For example, the class of functions can be quadratic functions

$$f(x) = a x^{2} + b x + c$$

In this example, θ = (a, b, c) are the parameters. We denote the function selected by a parameter choice θ as f_{θ}.

To encourage the learner to select an f_{θ} that maps each x_{i} to the correct y_{i}, we define a loss function such as

$$L_{\theta} = \sum_{i=1}^{N} \left( f_{\theta}(x_{i}) - y_{i} \right)^{2}$$
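As a minimal sketch of the quadratic function class and the squared loss above, the following evaluates the loss for two parameter choices. The helper names and the toy data (generated from y = x² + 1) are illustrative assumptions, not taken from the paper.

```python
def make_quadratic(theta):
    """Return f_theta(x) = a*x^2 + b*x + c for theta = (a, b, c)."""
    a, b, c = theta
    return lambda x: a * x * x + b * x + c


def squared_loss(theta, xs, ys):
    """L_theta = sum over i of (f_theta(x_i) - y_i)^2."""
    f = make_quadratic(theta)
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys))


# Toy data generated by y = x^2 + 1 (hypothetical, for illustration only).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [5.0, 2.0, 1.0, 2.0, 5.0]

loss_exact = squared_loss((1.0, 0.0, 1.0), xs, ys)     # parameters that generated the data
loss_constant = squared_loss((0.0, 0.0, 1.0), xs, ys)  # constant model f(x) = 1
print(loss_exact, loss_constant)
```

The parameters that generated the data give zero loss, while the constant model is penalized for every point it misses.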

In general, any loss function that takes a smaller value when f(x_{i}) is closer to y_{i} can be used. For classification tasks we use a special class of functions f_{θ} that output a probability distribution. That is, for each x_{i}, f_{θ}^{j}(x_{i}) is the probability that the input belongs to the j-th class. Then we can use the cross-entropy loss

$$L_{\theta} = -\sum_{i=1}^{N} \sum_{j=1}^{K} I(y_{i} = j) \log f_{\theta}^{j}(x_{i})$$

where K is the number of classes, and I(y_{i} = j) = 1 if y_{i} = j and 0 otherwise.
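A small sketch of the cross-entropy loss, assuming it is taken as the negative log-probability of the true class summed over the examples (so that a smaller value is better). The probability vectors below are made-up inputs, not outputs of the paper's network.

```python
import math


def cross_entropy(probs, labels):
    """Negative log-likelihood: -sum_i log f_theta^{y_i}(x_i).

    probs[i][j] is the predicted probability that example i is in class j,
    labels[i] is the true class index y_i. The indicator I(y_i = j) simply
    selects the entry of the true class.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))


label = [1]  # the true class of the single example below
confident_right = [[0.05, 0.90, 0.05]]
unsure = [[1 / 3, 1 / 3, 1 / 3]]
confident_wrong = [[0.90, 0.05, 0.05]]

for name, p in [("right", confident_right), ("unsure", unsure), ("wrong", confident_wrong)]:
    print(name, cross_entropy(p, label))
```

The loss grows as probability mass moves away from the true class, which is the property the text asks of a loss function.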

To train the model, we use gradient descent on the loss function L_{θ}. This is described by the following process:

1) We start from an initial parameter θ, which can be chosen arbitrarily (for example, at random).

2) We compute the gradient of the loss function ∇_{θ}L_{θ}. Computation of this gradient is discussed in the next section.

3) We update the parameters by θ := θ − α∇_{θ}L_{θ}. This moves θ in the direction in which the loss L_{θ} decreases fastest. α is the learning rate that controls the step size. The larger the step size, the faster θ changes. However, a step size that is too large may lead to instability or even divergence. Therefore, the learning rate α is an important hyperparameter that is selected based on the specific problem.

4) We repeat from Step 2 until θ stops changing.

With a sufficiently small learning rate, the above algorithm reduces L_{θ} during each iteration. The hope is that when L_{θ} is minimized, f_{θ}(x_{i}) will be close to y_{i}, that is, the function f_{θ} we selected can correctly predict the label y_{i} given x_{i} on the training set.
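The four steps can be sketched on the quadratic example with the squared loss, where the gradient has a simple closed form. The toy data, the learning rate of 0.01, the stopping tolerance, and the iteration cap are assumptions chosen for this illustration.

```python
import random


def grad_squared_loss(theta, xs, ys):
    """Gradient of sum_i (a*x_i^2 + b*x_i + c - y_i)^2 with respect to (a, b, c)."""
    a, b, c = theta
    ga = gb = gc = 0.0
    for x, y in zip(xs, ys):
        r = a * x * x + b * x + c - y  # residual of example i
        ga += 2 * r * x * x
        gb += 2 * r * x
        gc += 2 * r
    return (ga, gb, gc)


def gradient_descent(xs, ys, alpha=0.01, tol=1e-10, max_iters=10000, seed=0):
    rng = random.Random(seed)
    theta = tuple(rng.uniform(-1, 1) for _ in range(3))  # Step 1: arbitrary start
    for _ in range(max_iters):
        grad = grad_squared_loss(theta, xs, ys)           # Step 2: gradient
        new_theta = tuple(t - alpha * g for t, g in zip(theta, grad))  # Step 3: update
        if max(abs(n - t) for n, t in zip(new_theta, theta)) < tol:    # Step 4: stop
            return new_theta
        theta = new_theta
    return theta


# Toy data generated by y = x^2 + 1 (hypothetical).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [5.0, 2.0, 1.0, 2.0, 5.0]
theta = gradient_descent(xs, ys)
print(theta)  # close to (1, 0, 1)
```

For this data a learning rate above roughly 0.027 makes the iteration diverge, which illustrates why α has to be chosen per problem.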

However, even if f_{θ} correctly predicts every example we provided, this does not mean that f_{θ} will classify correctly on new data. For example, f_{θ} may have only memorized the training dataset. Therefore we need additional examples (x_{j},y_{j}), j = 1 , ⋯ , M that the learner has not seen during training. The learner should only be able to classify these new examples correctly if it has learned the correct mapping between x and y. We can compute the testing accuracy by dividing the number of examples f_{θ} correctly classifies by the total number of examples. This is the final measurement of performance that we use to evaluate our learner.
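To make the difference between memorizing and learning concrete, the sketch below computes accuracy on training and held-out examples for two hypothetical learners on a toy parity task. Neither learner nor the data comes from the paper.

```python
def accuracy(predict, examples):
    """Fraction of (x, y) pairs for which predict(x) equals y."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)


train = [(0, "even"), (1, "odd"), (2, "even"), (3, "odd")]
test = [(4, "even"), (5, "odd"), (6, "even"), (7, "odd")]

# A learner that only memorizes the training set and guesses "even" otherwise.
memory = dict(train)
memorizer = lambda x: memory.get(x, "even")

# A learner that has captured the actual mapping.
parity = lambda x: "even" if x % 2 == 0 else "odd"

print(accuracy(memorizer, train), accuracy(memorizer, test))
print(accuracy(parity, train), accuracy(parity, test))
```

Both learners are perfect on the training set, but only the second keeps its accuracy on examples it has not seen.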

In the previous section we left an open question: which class of functions {f_{θ}} to select from during training. This section introduces an important function class of deep networks [

The key idea of deep learning is to compose very simple functions g^{1}_{θ_{1}}, g^{2}_{θ_{2}}, ⋯, g^{d}_{θ_{d}} into a very complex function f_{θ}(x) = g^{d}_{θ_{d}}(g^{d−1}_{θ_{d−1}}(⋯ g^{2}_{θ_{2}}(g^{1}_{θ_{1}}(x)))). Each function g^{i}_{θ_{i}} is a simple function with parameters θ_{i}. The parameters of f are then simply the combined parameters of all the layers, θ = (θ_{1}, ⋯, θ_{d}). Common functions used in deep learning include

1) Matrix multiplication g(x) = Ax + b where the parameters are matrix A and vector b.

2) Rectified Linearity (ReLU) [

$$g(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases}$$

This function does not contain a parameter.

3) Softmax function: the softmax function “squashes” an n-dimensional vector of arbitrary real values into an n-dimensional vector of real values in the range [0, 1] that add up to 1. The function applied to an n-dimensional input vector z is given by

$$\mathrm{Softmax}(z)_{j} = \frac{e^{z_{j}}}{\sum_{k=1}^{n} e^{z_{k}}}$$

Note that softmax naturally produces a distribution because the outputs sum to 1

$$\sum_{j} \mathrm{Softmax}(z)_{j} = 1$$

4) Convolution [

$$o_{x,y,c} = \sum_{i=0}^{S-1} \sum_{j=0}^{S-1} \sum_{k=1}^{K} w^{c}_{i,j,k} \, z_{x+i,\,y+j,\,k}, \quad \forall \; 1 \le x \le M, \; 1 \le y \le N, \; 1 \le c \le C$$

where S is called the size of the filter map, and each w^{c} is an array of size S × S × K. All the w^{c} combined, {w^{c}, c = 1, ⋯, C}, form the set of parameters of the convolution function.

5) Pooling: pooling is a process that reduces an M × N × K array into a smaller array, e.g. of size M/2 × N/2 × K. Usually we keep the number of “channels” unchanged. For example, in the digit classification task, pooling reduces a high-resolution image to a coarser one, making it easier to process in the later steps.
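The five building blocks listed above can be sketched in plain Python on nested lists. This is an illustration under stated assumptions rather than the implementation used in the experiments: the convolution covers only positions where the filter fits entirely inside the input (no padding, stride 1), and pooling is assumed to be 2 × 2 max pooling, which the text does not specify.

```python
import math


def affine(A, b, x):
    """g(x) = Ax + b for a matrix A (list of rows) and vectors b, x."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]


def relu(x):
    """Elementwise max(0, x); no parameters."""
    return [max(0.0, v) for v in x]


def softmax(z):
    """Map real scores to probabilities that sum to 1."""
    m = max(z)  # subtracting the maximum avoids overflow and leaves the result unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def conv2d(z, w):
    """z[x][y][k] is an M x N x K input, w[c][i][j][k] holds C filters of size S x S x K.

    Returns o[x][y][c] of size (M-S+1) x (N-S+1) x C (no padding, an assumption).
    """
    M, N, K = len(z), len(z[0]), len(z[0][0])
    C, S = len(w), len(w[0])
    return [[[sum(w[c][i][j][k] * z[x + i][y + j][k]
                  for i in range(S) for j in range(S) for k in range(K))
              for c in range(C)]
             for y in range(N - S + 1)]
            for x in range(M - S + 1)]


def max_pool_2x2(z):
    """Halve both spatial dimensions, keeping the number of channels."""
    M, N, K = len(z), len(z[0]), len(z[0][0])
    return [[[max(z[2 * x + dx][2 * y + dy][k] for dx in (0, 1) for dy in (0, 1))
              for k in range(K)]
             for y in range(N // 2)]
            for x in range(M // 2)]


# A 4 x 4 single-channel input with entries 0..15 and one 2 x 2 filter of ones.
z = [[[float(4 * x + y)] for y in range(4)] for x in range(4)]
w = [[[[1.0] for _ in range(2)] for _ in range(2)]]
o = conv2d(z, w)
pooled = max_pool_2x2(z)
print(o[0][0][0], pooled)
```

Each function is simple on its own; the expressive power of the network comes from composing many of them.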

We denote the output of g^{i}_{θ_{i}}(⋯(g^{1}_{θ_{1}}(x))) as h_{i}. Then f_{θ}(x) = g^{d}_{θ_{d}}(h_{d−1}), h_{d−1} = g^{d−1}_{θ_{d−1}}(h_{d−2}), and so on, down to h_{1} = g^{1}_{θ_{1}}(x). Intuitively, the network must map raw input images into highly abstract and meaningful labels, which is a highly complex mapping. The network accomplishes this through a sequence of simple mappings composed together. Each function g^{i}_{θ_{i}} can be viewed as one “layer” of the network. It processes the output of the previous layer, h_{i−1}, into a higher-level representation h_{i}. The network can therefore be viewed as processing the input through a sequence of “layers” whose outputs become increasingly high-level and abstract, until we finally reach the output layer, which corresponds to the labels. This intuition is illustrated in

Now that we have defined our model class, to implement the algorithm in Section 2.1, we must be able to compute the gradient ∇_{θ}L_{θ}. This is accomplished with the back-propagation algorithm [

The back-propagation algorithm sequentially computes ∇_{y}L_{θ}, ∇_{h_{d−1}}L_{θ}, ⋯, ∇_{h_{1}}L_{θ}. Intuitively, this tells us how each hidden layer must change to reduce the loss L_{θ}. When all the g^{i}_{θ_{i}} are simple functions, we can compute ∇_{h_{i−1}}L_{θ} from ∇_{h_{i}}L_{θ} analytically, and this can be computed automatically by software such as TensorFlow. Given the gradient ∇_{h_{i}}L_{θ} over each layer h_{i}, we can correspondingly compute the gradient ∇_{θ_{i}}L_{θ} over the parameters θ_{i} analytically. This can also be computed automatically by TensorFlow.

Intuitively, the computation flows “backward” through the network (hence the name back-propagation). We compute the gradients in the following sequence

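$$\nabla_{y} L_{\theta}, \; \nabla_{\theta_{d}} L_{\theta}, \; \nabla_{h_{d-1}} L_{\theta}, \; \nabla_{\theta_{d-1}} L_{\theta}, \; \cdots, \; \nabla_{h_{1}} L_{\theta}, \; \nabla_{\theta_{1}} L_{\theta}$$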
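The backward sequence can be written out by hand for a very small network. The sketch below assumes a hypothetical three-layer network (affine, ReLU, affine to a single output) with a squared loss, which is much smaller than the network used in the experiments, and checks one analytic gradient against a finite-difference estimate. All parameter values are made up.

```python
import copy


def forward_backward(p, x, t):
    """Return the loss and the gradients for the network A1,b1 -> ReLU -> w2,b2."""
    # Forward pass: keep every intermediate h_i, since the backward pass needs them.
    h1 = [sum(a * xi for a, xi in zip(row, x)) + b for row, b in zip(p["A1"], p["b1"])]
    h2 = [max(0.0, v) for v in h1]
    y = sum(w * h for w, h in zip(p["w2"], h2)) + p["b2"]
    loss = (y - t) ** 2

    # Backward pass, in the order: output, last parameters, h2, h1, first parameters.
    d_y = 2.0 * (y - t)
    d_w2 = [d_y * h for h in h2]
    d_b2 = d_y
    d_h2 = [d_y * w for w in p["w2"]]
    d_h1 = [g if v > 0 else 0.0 for g, v in zip(d_h2, h1)]  # ReLU passes gradient only where h1 > 0
    d_A1 = [[g * xi for xi in x] for g in d_h1]
    d_b1 = list(d_h1)
    return loss, {"A1": d_A1, "b1": d_b1, "w2": d_w2, "b2": d_b2}


params = {
    "A1": [[0.5, -0.2], [-0.3, 0.1]],
    "b1": [0.1, 0.0],
    "w2": [1.5, -2.0],
    "b2": 0.3,
}
x, t = [1.0, 2.0], 1.0
loss, grads = forward_backward(params, x, t)

# Finite-difference check of the gradient with respect to A1[0][0].
eps = 1e-6
plus, minus = copy.deepcopy(params), copy.deepcopy(params)
plus["A1"][0][0] += eps
minus["A1"][0][0] -= eps
numeric = (forward_backward(plus, x, t)[0] - forward_backward(minus, x, t)[0]) / (2 * eps)
print(grads["A1"][0][0], numeric)
```

Frameworks such as TensorFlow derive the same quantities automatically, so in practice only the forward computation needs to be written.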

In many real-world problems, such as car plate detection [

Previously we trained a classifier f_{θ}(x) that takes an image as input and outputs a probability distribution over all possible digits. We observe that when the input is an image that does not contain any digit, the output is a distribution with high entropy, that is, the network is not highly confident that any digit has been observed. On the other hand, when presented with an image that contains a digit, the output is a distribution with low entropy, and the network generally outputs the correct digit with very high confidence.

We can then take advantage of this property. We measure the difference between the highest probability score and the second highest probability score. If the image contains a digit, the top prediction should have a much higher probability score than the second highest. If the image does not contain a digit, all the possible predictions should be assigned similar probabilities and there should not be a significant difference. Our experiments show that this approach works well in practice and accurately discovers digits in an image.
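A minimal sketch of this test on two made-up probability vectors follows. The threshold of 0.5 is a hypothetical value chosen for illustration; the text does not state which threshold is used.

```python
import math


def confidence_margin(probs):
    """Difference between the highest and the second highest probability."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs):
    """Shannon entropy in nats; high for near-uniform distributions."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def contains_digit(probs, threshold=0.5):
    """Declare a digit present when the margin is large (threshold is an assumption)."""
    return confidence_margin(probs) >= threshold


digit_like = [0.01] * 9 + [0.91]  # one class dominates
background = [0.1] * 10           # nearly uniform output

print(confidence_margin(digit_like), entropy(digit_like), contains_digit(digit_like))
print(confidence_margin(background), entropy(background), contains_digit(background))
```

The margin and the entropy move in opposite directions here: the confident output has a large margin and low entropy, while the uniform output has zero margin and the maximum entropy of log 10.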

We use 50,000 digit images from the MNIST training dataset for training. Each example is a 28 by 28 grayscale image. Our network architecture is as follows

1) A convolution layer with filter map of size 5 that takes as input the 28 × 28 × 1 image and outputs a feature map of shape 28 × 28 × 32

2) A pooling layer that reduces the size from 28 × 28 × 32 to 14 × 14 × 32

3) A ReLU layer

4) A convolution layer with filter map of size 5 that outputs a feature map of shape 14 × 14 × 64

5) A pooling layer that reduces the size from 14 × 14 × 64 to 7 × 7 × 64

6) A matrix multiplication layer that maps a vector of size 7 × 7 × 64 to 1024

7) A ReLU layer

8) A matrix multiplication layer that maps a vector of size 1024 to 10

9) A softmax layer
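The shapes in the nine layers above can be traced with a few lines of arithmetic. The sketch assumes stride-1 convolutions with padding that preserves the spatial size (implied by the 28 × 28 input and 28 × 28 output of the first layer), 2 × 2 pooling, convolution parameters consisting only of the filters w^{c}, and matrix multiplication layers with both A and b. The resulting parameter count follows from these assumptions and is not reported in the paper.

```python
def conv_same(shape, filter_size, out_channels):
    """Stride-1 convolution whose padding keeps the spatial size unchanged."""
    h, w, k = shape
    n_params = filter_size * filter_size * k * out_channels  # the filters w^c
    return (h, w, out_channels), n_params


def pool_2x2(shape):
    h, w, k = shape
    return (h // 2, w // 2, k), 0


def no_params(shape):
    """ReLU and softmax keep the shape and have no parameters."""
    return shape, 0


def dense(shape, n_out):
    n_in = 1
    for d in shape:  # flatten the input
        n_in *= d
    return (n_out,), n_in * n_out + n_out  # matrix A plus vector b


def trace(input_shape):
    steps = [
        ("1) conv 5x5 -> 32", lambda s: conv_same(s, 5, 32)),
        ("2) pool", pool_2x2),
        ("3) ReLU", no_params),
        ("4) conv 5x5 -> 64", lambda s: conv_same(s, 5, 64)),
        ("5) pool", pool_2x2),
        ("6) matrix multiplication -> 1024", lambda s: dense(s, 1024)),
        ("7) ReLU", no_params),
        ("8) matrix multiplication -> 10", lambda s: dense(s, 10)),
        ("9) softmax", no_params),
    ]
    rows, shape = [], input_shape
    for name, layer in steps:
        shape, n_params = layer(shape)
        rows.append((name, shape, n_params))
    return rows


rows = trace((28, 28, 1))
for name, shape, n_params in rows:
    print(f"{name:34s} {str(shape):14s} {n_params:>9,d}")
total = sum(n for _, _, n in rows)
print("total parameters:", total)
```

Under these assumptions almost all parameters sit in the first matrix multiplication layer, which maps the flattened 7 × 7 × 64 = 3136 features to 1024 units.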

We train our network with gradient descent with a learning rate of 1e−4 for 20,000 iterations. We also use a new adaptive gradient descent algorithm known as Adam [

For multi-digit classification, we first extract all 28 by 28 image patches with a stride of 2. Then we run our classification network on all the patches and take the most confident digit prediction in a region as our digit class prediction.
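The patch extraction and selection step can be sketched as a sliding window. The sketch treats the whole image as a single region, and `dummy_classify` is a hypothetical stand-in for the trained network that becomes more confident in class 3 as the patch gets brighter. Both are assumptions made only to keep the example self-contained.

```python
def extract_patches(image, size=28, stride=2):
    """Yield (row, col, patch) for every size x size window of a 2-D list."""
    height, width = len(image), len(image[0])
    for r in range(0, height - size + 1, stride):
        for c in range(0, width - size + 1, stride):
            yield r, c, [row[c:c + size] for row in image[r:r + size]]


def most_confident(image, classify, size=28, stride=2):
    """Return (probability, digit, row, col) of the most confident patch."""
    best = None
    for r, c, patch in extract_patches(image, size, stride):
        probs = classify(patch)
        digit = max(range(len(probs)), key=probs.__getitem__)
        if best is None or probs[digit] > best[0]:
            best = (probs[digit], digit, r, c)
    return best


def dummy_classify(patch):
    """Hypothetical classifier: confidence in class 3 grows with patch brightness."""
    brightness = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    conf = 0.1 + 0.8 * brightness
    probs = [(1.0 - conf) / 9] * 10
    probs[3] = conf
    return probs


# A 32 x 40 image that is dark except for a bright 28 x 28 block at row 4, column 12.
image = [[1.0 if r >= 4 and c >= 12 else 0.0 for c in range(40)] for r in range(32)]

n_patches = len(list(extract_patches(image)))
best = most_confident(image, dummy_classify)
print(n_patches, best)
```

With a stride of 2 the 32 × 40 image yields 3 × 7 = 21 patches, and the selected patch is the one aligned with the bright block.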

After training our network, we use a separate set of 10,000 test images to measure its accuracy. We achieved a testing accuracy of 0.993, which means that the network makes a mistake on only about 7 out of every 1000 digits. We show the training curve in

For the multi-digit classification, we show in

We also show in

This paper applies deep networks to digit classification. Instead of using hand-designed features, we learn them automatically with a deep network and the back-propagation algorithm. We use a convolutional neural network with ReLU activations. In addition, we use pooling layers to remove unnecessary detail and learn higher-level features.

We train our network with stochastic gradient descent. Training progresses quickly: we are able to achieve 90% accuracy with only 1000 iterations. After 100k iterations, we achieve test performance of 99.3% on the MNIST dataset.

We also study multi-digit classification and propose a method to detect digits in an image with multiple digits. We utilize the fact that our classifier produces a probability distribution. We observe that when the input contains a digit, the classifier produces a distribution with low entropy and high confidence on the correct label. On the other hand, when the input does not contain a digit, the classifier produces an almost uniform distribution. We use this difference to detect whether an image patch contains a digit. We experiment on multiple digit detection and our method is able to successfully localize digits and classify them.

Future work should further improve accuracy and handle digits of different sizes in the multi-digit detection task.

Yang, R.Z. (2018) Classifying Hand Written Digits with Deep Learning. Intelligent Information Management, 10, 69-78. https://doi.org/10.4236/iim.2018.102005