Tuning Recurrent Neural Networks for Recognizing Handwritten Arabic Words

Artificial neural networks can learn by example and are capable of solving problems that are hard to solve using ordinary rule-based programming. They have many design parameters that affect their performance, such as the number and sizes of the hidden layers. Large networks are slow and small networks are generally not accurate. Tuning the neural network size is a hard task because the design space is often large and training is often a long process. We use design of experiments techniques to tune the recurrent neural network used in an Arabic handwriting recognition system. We show that the best results are achieved with three hidden layers and two subsampling layers. To tune the sizes of these five layers, we use a fractional factorial experiment design to limit the number of experiments to a feasible number. Moreover, we replicate each experiment configuration multiple times to overcome the randomness in the training process. The accuracy and time measurements are analyzed and modeled. The two models are then used to locate network sizes that are on the Pareto optimal frontier. The approach described in this paper reduces the label error from 26.2% to 19.8%.


Introduction
Artificial neural networks are richly connected networks of simple computational elements. They are capable of solving problems that linear computing cannot [1]. Recurrent neural networks (RNN) have demonstrated excellent results in recognizing handwritten Arabic words [2,3]. Their advantage comes from using the context information as they contain memory elements and have cyclical connections.
A neural network has a fixed number of input, hidden, and output nodes arranged in layers. The number and sizes of these layers determine the performance of the network, among other network parameters. Small networks often suffer from limited information processing power. However, large networks may have redundant nodes and connections and high computational cost [4,5]. On the other hand, the size of the network determines its generalization capabilities. Based on what the network has learned during the training phase, generalization determines its capability to decide upon data unknown to it.
To achieve good generalization, the network size should be 1) large enough to learn the similarities within samples of the same class and, at the same time, what makes one class different from other classes and 2) small enough not to learn the insignificant differences among the data of the same class [6]. The latter condition avoids the problem of overfitting or overtraining. Overfitting is the adaptation of the network to small differences among specific training samples, resulting in false classification of the test samples [7].
In this paper, we tune an RNN that is used in a system built for recognizing handwritten Arabic words. We show how the RNN size is tuned to achieve high recognition accuracy and reasonable training and recognition times. As the design space of the RNN sizes is huge and each training experiment takes a long time, we use design of experiments techniques to collect as much information as possible with a small number of experiments. The results of the conducted experiments are analyzed and modeled. The derived models are used to select a network size that is on the Pareto optimal frontier and has excellent accuracy and time cost.
The rest of this section reviews related work on neural network tuning. Section 2 describes the Arabic handwriting recognition system used. Section 3 describes the design of experiments techniques used in this paper. Section 4 presents the experimental work, results, and their analysis. Finally, the conclusions are presented in Section 5.

Related Work
The accuracy of a neural network depends on the settings of its parameters, e.g., the number and sizes of the hidden layers and the learning scheme. Setting these parameters can be accomplished by many approaches including trial and error, analytical methods [8], pruning techniques [9-11], and constructive techniques [12,13]. Finding optimal settings of these parameters is often a time-consuming process.
Analytical methods employ algebraic or statistical techniques for this purpose [8]. The disadvantage of these methods is that they are static and do not take the cost function into consideration.
Constructive and pruning (destructive) algorithms can be used to obtain network structures automatically [14,15]. The constructive algorithm starts with a small network, and connections are added dynamically to expand the network. Fahlman and Lebiere started with input and output layers only [12]. Hidden neurons are added and connected to the network. The network is trained to maximize the correlation between the new units and the output units, and the residual error is measured to decide whether a new unit should be added.
Lin et al. proposed a self-constructing fuzzy neural network which is developed to control a permanent magnet synchronous motor speed drive system [14]. It starts by initially implementing only input and output nodes. The membership and rule nodes are dynamically generated according to the input data during the learning process.
On the other hand, the destructive algorithm starts with a large network, and connections with little influence on the cost are deleted dynamically. Le Cun et al. and Hassibi et al. calculate the parameter sensitivities after training the network [9,10]. Parameters with small or insufficient contribution to the formation of the network output are removed. Weigend et al. introduced a method based on cost function regularization by including a penalty term in the cost function [11].
Teng and Wah developed a learning mechanism that reduces the number of hidden units of a neural network during training [16]. Their approach was applied to classification problems with binary output. The learning time is long; however, the resulting network is small and fast when deployed in target applications. The stopping criterion in this technique is based on a pre-selected threshold.
Genetic algorithms were also used to find the optimal size of neural networks to meet certain application needs. Leung et al. applied a genetic algorithm to tune neural networks [17]. An improved genetic algorithm was used to reduce a fully-connected neural network to a partially-connected network, thereby reducing its cost. This approach was applied to forecasting sunspots and tuning associative memory.
Another approach is pattern classification. Weymaere and Martens applied standard pattern classification techniques to a fairly general two-layer network [18]. They show that it can be easily improved to a near-optimum state. Their technique automatically determines the network topology (hidden layers and direct connections between hidden layers and output nodes) yielding the best initial performance.
The above approaches suffer from long learning times and complex implementations. On the other hand, the statistical techniques of design of experiments (DoE) can be applied for better selection of the parameters of artificial neural networks. The application of DoE techniques to optimize neural network parameters was reported in the literature [1,19-22]. DoE techniques can estimate optimum settings in less time with a small number of experimental runs.
Balestrassi et al. applied DoE to determine the parameters of a neural network in a problem of non-linear time series forecasting [23]. They applied classical factorial designs to set the parameters of the neural network such that the minimum prediction error could be reached. The results suggest that identifying the main factors and interactions using this approach can perform better than nonlinear auto-regressive models.
Behmanesh and Rahimi used DoE to optimize the training of an RNN for modeling production control processes and services [24]. Packianather et al. applied the Taguchi DoE to optimize a neural network that classifies defects in birch wood veneer [21].
Bozzo et al. applied DoE techniques to optimize the digital measurement of partial discharge to support diagnosing defects of electric power components [25]. The measuring process is influenced by several factors and there is no simple mathematical model available. DoE solved the latter problem by analyzing the results of 81 tests performed on a simple physical model, which quantified the interaction of factors and allowed deriving criteria to select optimal values for these factors.
Staiculescu et al. optimize and characterize a microwave/millimeter wave flip chip [26]. Two optimization techniques are combined in a factorial design with three replicates. Olusanya quantified the effect of silane coupling agents on the durability of titanium joints using a DoE technique [27].
In this paper, we use fractional factorial DoE with replication to select the sizes of the hidden layers of a recurrent neural network.

System Overview
Figure 1 shows the processing stages of our system for recognizing handwritten Arabic words (JU-OCR2). An earlier version of this system (JU-OCR) participated in the ICDAR 2011 Arabic handwriting recognition competition [28]. This system now achieves state-of-the-art accuracy and is described in detail in Ref. [29].
The five stages are: sub-word segmentation, grapheme segmentation, feature extraction, sequence transcription, and word matching. Each stage consists of one or more steps and is briefly described below.

Processing Stages
The first stage segments the input word into sub-words. This stage starts by estimating the word's horizontal baseline and identifying the secondary bodies above and below the main bodies. The main bodies are extracted as sub-words along with their respective secondary bodies.
These sub-words are then segmented into graphemes in two steps: morphological feature points such as end, branch, and edge points are first detected from the skeleton of the main bodies, then these points are used in a rule-based algorithm to segment the sub-words into graphemes. These segmentation algorithms are described in Ref. [30]. Efficient features are then extracted from the segmented graphemes. Although some of these features are extracted in the segmentation process, the majority of features are extracted in the feature extraction stage. A total of 30 features are used including statistical, configuration, skeleton, boundary, elliptic Fourier descriptors, and directional features. Using feature statistics from the training samples, the feature vectors are normalized to zero mean and unit standard deviation.
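The normalization step at the end of this stage can be illustrated with a minimal sketch. The function names and the toy vectors below are hypothetical, and the use of the population standard deviation (and the guard for constant features) is an assumption rather than a detail given in the paper.

```python
from statistics import mean, pstdev

def fit_normalizer(train_vectors):
    """Compute per-feature mean and standard deviation from training vectors."""
    columns = list(zip(*train_vectors))
    means = [mean(c) for c in columns]
    # Assumption: a constant feature keeps a divisor of 1.0 to avoid division by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds

def normalize(vector, means, stds):
    """Shift and scale one feature vector with the training statistics."""
    return [(v - m) / s for v, m, s in zip(vector, means, stds)]

# Hypothetical two-feature training set; the real system uses 30 features.
train = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_normalizer(train)
print(normalize([3.0, 10.0], means, stds))
```

The same training statistics would be reused for validation and test vectors, so that no information from the test set leaks into the normalization.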
The normalized feature vectors of the graphemes are then passed to the sequence transcription stage. The sequence transcription stage maps sequences of feature vectors to sequences of recognized characters. This stage uses a recurrent neural network and is further described in the following subsection.
Finally, the word matching stage uses the dictionary of valid words to correct transcription errors.

Transcription Using RNN
Our sequence transcription is carried out using a recurrent neural network (RNN) with the bidirectional Long Short-Term Memory (BLSTM) architecture [31]. Connectionist Temporal Classification (CTC) [32] is used in the output layer.
Our experiments on BLSTM-CTC were carried out with the open source software library RNNLIB [33]. This library was selected because it has been used in recognition systems that have won three handwriting recognition competitions [3,34,35].
RNNs exploit the sequence context through cyclic connections in the hidden layer [36]. In order to have access to future as well as past context, bidirectional RNNs (BRNNs) are used. In BRNNs, the training sequence is presented forwards and backwards to two separate recurrent hidden layers. This layer pair is connected to the same next hidden layer or to the output layer.
The BLSTM architecture provides access to long-range context in both input sequence directions. This architecture consists of the standard BRNN architecture with LSTM blocks used in the hidden layer. The LSTM blocks replace the non-linear units in the hidden layer of simple RNNs [37]. Figure 2 shows an LSTM memory block which consists of a core memory cell and three gates. The input gate controls storing into the memory cell and allows holding information for long periods of time. The output gate controls the output activation function, and the forget gate affects the internal state.
The CTC output layer is used to determine a probability distribution over all possible character sequences, given a particular feature sequence. A list of the most probable output sequences is then selected and passed along to the final word matching stage of recognition.
To improve accuracy, multiple levels of LSTM RNN hidden layers can be stacked on top of each other. However, this leads to a very large number of connections between the forward and backward layers of successive levels and, consequently, increases the computational cost. As shown in Figure 3, subsampling layers are used to control the number of connections between successive levels. A subsampling layer works as an intermediate layer between two levels: one level feeds forward to the subsampling layer, which in turn feeds forward to the next level. This way, the number of weights is reduced and is controlled by the size of the subsampling layer.
The performance and computational cost of our RNN are determined by many factors including its topology, manifested by the number and sizes of the hidden layers and subsampling layers. In this paper, we use an experimental approach to determine the RNN topology.

Design of Experiments
In this section, we give an introduction to design of experiments and describe some DoE techniques that maximize information with a limited number of experiments.

Introduction to DoE
The goal of DoE is to obtain the maximum information with the minimum number of experiments [38]. This is particularly important when each experiment is very long, such as an experiment to train and evaluate a large RNN using tens of thousands of handwritten samples. DoE is often needed when the performance of a system is a function of multiple factors and it is required to select the optimal levels for these factors or to evaluate the effect of each factor and the interactions among the factors.
An experimental design consists of specifying the number of experiments and the factor level combinations for every experiment. In the simple design, we start with a base configuration and vary one of the factors at a time to find out how each factor affects performance. This type of DoE requires $1 + \sum_{i=1}^{k} (n_i - 1)$ experiments, where $k$ is the number of factors and $n_i$ is the number of levels of factor $i$. However, this technique is not efficient and cannot evaluate interactions among factors.
A technique that allows evaluating all effects and interactions is the full factorial design, which includes all possible combinations of all levels of all factors. This would sum up to a total of $\prod_{i=1}^{k} n_i$ experiments. The drawback of this technique is the large number of experiments when the number of factors and levels is large.
An alternative technique is the fractional factorial design, which consists of a fraction of the full factorial experiments. Although this technique saves time compared with the full factorial design, it offers less information and the evaluation of factor effects and interactions is less precise. Further detail about factorial DoE is in the following subsections.
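To make the difference in experiment counts concrete, the following minimal sketch computes the number of runs required by the simple design and by the full factorial design. The function names and the example of five factors with four levels each are hypothetical and serve only as an illustration.

```python
from math import prod

def simple_design_runs(levels):
    """Runs of the simple (one-factor-at-a-time) design: 1 + sum(n_i - 1)."""
    return 1 + sum(n - 1 for n in levels)

def full_factorial_runs(levels):
    """Runs of the full factorial design: the product of all n_i."""
    return prod(levels)

# Hypothetical example: five factors with four levels each.
levels = [4, 4, 4, 4, 4]
print(simple_design_runs(levels))   # 16
print(full_factorial_runs(levels))  # 1024
```

With only two levels per factor, the full factorial count for five factors drops to 32, which is the starting point of the two-level designs discussed next.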

Factorial Design $2^k$
One variant of the full factorial design is the $2^k$ factorial design. This design reduces the number of experiments to $2^k$ and allows the evaluation of factor effects and interactions. This design works well when the system response is a unidirectional function of each factor.
In this design, only two levels are considered for each factor. The two levels are usually the minimum level (referred to by −1) and the maximum level (+1). Table 1 shows this design for two factors A and B. The table illustrates, for each of the four experiments, the levels of factors A and B and the measured response $y_i$. The unit vector (I) in this table is needed for estimating the average response, and the vector (AB) is the product of A and B and is needed for estimating the interaction between A and B. From the experimental results, the following model can be derived.
$$y = q_0 + q_A x_A + q_B x_B + q_{AB} x_A x_B \qquad (1)$$
Since the four vectors of Table 1 are orthogonal, the four coefficients are easily computed as: 1) the average response is $q_0 = \frac{1}{4}(I \cdot y)$, 2) the effect of factor A is $q_A = \frac{1}{4}(A \cdot y)$, 3) the effect of factor B is $q_B = \frac{1}{4}(B \cdot y)$, and 4) the interaction between A and B is $q_{AB} = \frac{1}{4}(AB \cdot y)$.
And generally, for $k$ factors $x_1, \ldots, x_k$, the following model is used.
$$y = q_0 + \sum_{j=1}^{k} q_j x_j + \sum_{j<l} q_{jl} x_j x_l + \cdots \qquad (2)$$
This model has $2^k$ terms: the average response, $k$ factor effects, $\binom{k}{2}$ two-factor interactions, $\binom{k}{3}$ three-factor interactions, etc. The coefficients can be similarly computed, e.g., the average response is $q_0 = \frac{1}{2^k}(I \cdot y)$, and the interaction between $x_j$ and $x_l$ is $q_{jl} = \frac{1}{2^k}(x_j x_l \cdot y)$.
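The sign-table computation described above can be sketched as follows. This is an illustrative implementation, not the authors' code; the function names are hypothetical and the four responses are made-up numbers. Each coefficient is the dot product of a sign column with the response vector divided by the number of experiments.

```python
from itertools import combinations, product
from math import prod

def sign_table(k):
    """Return the 2^k rows of factor levels (-1 or +1), first factor varying fastest."""
    return [tuple(reversed(row)) for row in product((-1, 1), repeat=k)]

def effects(rows, y):
    """Estimate every coefficient as (sign column . y) / 2^k.

    Keys are tuples of factor indices: () is the average response,
    (0,) the effect of the first factor, (0, 1) a two-factor interaction.
    """
    k, n = len(rows[0]), len(rows)
    q = {}
    for size in range(k + 1):
        for subset in combinations(range(k), size):
            # The column of an interaction is the product of its factor columns.
            column = [prod(row[j] for j in subset) for row in rows]
            q[subset] = sum(c * yi for c, yi in zip(column, y)) / n
    return q

rows = sign_table(2)   # factors A (index 0) and B (index 1)
y = [15, 45, 25, 75]   # hypothetical responses, one per row
q = effects(rows, y)
print(q)
```

For these hypothetical responses the fitted model is y = 40 + 20 x_A + 10 x_B + 5 x_A x_B, which reproduces all four responses exactly because a two-level design with all interaction terms is saturated.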

Factorial Design with Replication $2^k r$
Many measurements have experimental error or involve some randomness. For example, the initial weights used in training a neural network are randomly selected. Consequently, the performance of a neural network changes from one experiment to another. The $2^k$ factorial design does not estimate such errors. The alternative is using the $2^k r$ factorial design with replication. Here each factor level combination is repeated $r$ times and a total of $2^k r$ experiments is carried out.
The mean response $\bar{y}_i$ of every $r$ replications is calculated and is used in place of $y_i$ to calculate the model coefficients, as described above. Thus, as $r$ increases, the effect of the random behavior is averaged out.

Fractional Factorial Design with Replication $2^{k-p} r$
Table 2 shows the $2^{k-p}$ design for $k = 5$ factors and $p = 2$. The design starts from the full factorial design of three factors, which are initially labeled A, B, and C. Note that this table includes the sign vectors of four two- and three-factor interactions. For the case when we have five factors, e.g., L1, L2, L3, S1, and S2, three factors are mapped to A, B, and C, and the remaining two factors are mapped to high-degree interactions. In this example, S1 and S2 are mapped to the interactions BC and ABC, respectively.
For $r$ replications, the mean response of the $r$ experiments of each configuration is used in estimating the model coefficients as described in the previous subsection. The model of Table 2 has $2^{k-p} = 8$ coefficients. These eight coefficients estimate the average response, five factor effects, and two interactions specified in the following model.
$$y = q_0 + q_{L1} x_{L1} + q_{L2} x_{L2} + q_{L3} x_{L3} + q_{S1} x_{S1} + q_{S2} x_{S2} + q_{L1L2} x_{L1} x_{L2} + q_{L1L3} x_{L1} x_{L3} \qquad (3)$$
When compared with a $2^k$ model, this model has one fourth the number of coefficients. This model confounds four effects or interactions in one coefficient. The confounding groups can be found through the algebra of confounding [38]. For example, the coefficient $q_{S2}$ includes the effect of factor S2 and the interactions L1L2L3, L2L3S1S2, and L1S1. This problem of reduced information is often tolerated as the factor effects are usually larger than the interactions and the value of a coefficient is dominated by its factor effect.
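The confounding group quoted above can be reproduced with a short sketch of the algebra of confounding. The code assumes that L1, L2, and L3 take the roles of A, B, and C, so that the generators are S1 = L2L3 and S2 = L1L2L3; this mapping is inferred from the example group and is not stated explicitly in the text. The function names are hypothetical.

```python
from itertools import combinations

def multiply(*effects):
    """Multiply effects; a factor that appears twice cancels (x * x = I)."""
    result = frozenset()
    for effect in effects:
        result = result ^ frozenset(effect)
    return result

def defining_relation(generators):
    """All products of the generator words, including the identity I."""
    words = set()
    for size in range(len(generators) + 1):
        for subset in combinations(generators, size):
            words.add(multiply(*subset))
    return words

def aliases(effect, generators):
    """Effects and interactions confounded with the given effect."""
    return {multiply(effect, word) for word in defining_relation(generators)}

# Assumed generators: S1 = L2*L3 gives I = L2L3S1, S2 = L1*L2*L3 gives I = L1L2L3S2.
generators = [("L2", "L3", "S1"), ("L1", "L2", "L3", "S2")]
group = aliases(("S2",), generators)
for word in sorted(group, key=lambda w: (len(w), sorted(w))):
    print("".join(sorted(word)))
```

Under these assumed generators the group of S2 contains S2, L1S1, L1L2L3, and L2L3S1S2, which agrees with the example given in the text.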

Allocation of Variation
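The fraction of variation explained by each factor or interaction is found relative to the total variation of the response. The total variation or total sum of squares is found by
$$SST = \sum_{i=1}^{2^{k-p}} \sum_{j=1}^{r} (y_{ij} - \bar{y})^2$$
where $y_{ij}$ is the response of replication $j$ of configuration $i$ and $\bar{y}$ is the mean of all responses. The variation explained by $x$ is $SS_x = 2^{k-p} r q_x^2$.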
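The allocation of variation can be sketched in a few lines. The example below uses a small two-factor design with three replications and made-up responses, not the measurements of this paper; the function name and data are hypothetical. It computes the total sum of squares, the sum of squares of each model term, and the sum of squared errors of the replications around their configuration means.

```python
def allocate_variation(columns, y):
    """Split the total variation of replicated responses among model terms.

    columns: dict mapping a term name to its -1/+1 sign column.
    y: one list of r replicated responses per configuration.
    """
    n, r = len(y), len(y[0])
    means = [sum(row) / r for row in y]
    grand_mean = sum(means) / n
    sst = sum((v - grand_mean) ** 2 for row in y for v in row)
    # Coefficient of each term: dot product of sign column and mean responses.
    q = {name: sum(s * m for s, m in zip(col, means)) / n
         for name, col in columns.items()}
    ss = {name: n * r * q[name] ** 2 for name in q}
    # Experimental error: spread of the replications around their own mean.
    sse = sum((v - m) ** 2 for row, m in zip(y, means) for v in row)
    return sst, ss, sse

columns = {"A": [-1, 1, -1, 1], "B": [-1, -1, 1, 1], "AB": [1, -1, -1, 1]}
y = [[15, 18, 12], [45, 48, 51], [25, 28, 19], [75, 75, 81]]  # hypothetical data
sst, ss, sse = allocate_variation(columns, y)
for name, value in ss.items():
    print(name, round(100 * value / sst, 1), "%")
print("error", round(100 * sse / sst, 1), "%")
```

When the model contains one term per sign column, as in this sketch, the term sums of squares and the error sum of squares add up to the total variation.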

Experiments and Results
This section describes the experiments carried out to tune the topology of the RNN sequence transcriber for efficient results. First, we describe the database of handwritten Arabic words used. Then we describe the two sets of conducted experiments and present and analyze their results. The first set of experiments was carried out to select the best number of layers and the second set to select the size of each layer.

Samples
This work uses the IfN/ENIT database of handwritten Arabic words [39]. This database is used by more than 110 research groups in about 35 countries [28]. The database version used is v2.0p1e and consists of 32,492 Arabic words handwritten by more than 1000 writers. This database is organized in five training sets and two test sets summarized in Table 3. The table shows the number of samples, the number of sub-words (parts of Arabic words), and the number of characters that each set has.
The two test sets are not publicly available and are used in competitions. Therefore, we use the five training sets for training, validation, and testing. Set e is the hardest set and has the largest variety of writers. Recognition systems often score worst on this set. Therefore, in all the experiments described in this paper, we use set e as the test set and use the first four sets for training and validation. We have randomly selected 90% of the samples of the first four sets for training and the remaining 10% for validation.
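The random 90%/10% split can be sketched as follows. The function name, the fixed seed, and the placeholder sample identifiers are hypothetical; the paper does not describe how the random selection was implemented.

```python
import random

def split_train_validation(samples, train_fraction=0.9, seed=0):
    """Shuffle the samples and split them into training and validation parts."""
    shuffled = list(samples)
    # A fixed seed (an assumption here) makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample identifiers standing in for the words of the first four sets.
samples = [f"word_{i:04d}" for i in range(1000)]
train, validation = split_train_validation(samples)
print(len(train), len(validation))  # 900 100
```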

Selecting the Number of Layers
To select the number of layers of the RNN transcriber, we have carried out six experiments of varying numbers of layers. The configurations used in these six experiments are:
1) One hidden layer of size 100.
2) Two hidden layers of sizes 60 and 180.
2s) Two hidden layers of sizes 60 and 180 with a subsampling layer of size 60.
3) Three hidden layers of sizes 40, 80, and 180.
3s) Three hidden layers of sizes 40, 80, and 180 with two subsampling layers of sizes 40 and 80.
4) Four hidden layers of sizes 40, 80, 120, and 180.
These layer sizes are the default sizes found in the RNNLIB library's configuration files.
Figure 4 shows the label error of these six configurations. The label error rate is the ratio of insertions, deletions, and substitutions on the output to match the target labels of the test set e. These results show that the accuracy improves with more layers and with using subsampling layers. However, the accuracy does not increase when increasing the number of layers from three to four. Therefore, we adopt the topology of three layers with two subsampling layers.
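The label error rate can be illustrated with a minimal edit-distance sketch. This is not RNNLIB's implementation; the function names are hypothetical, and normalizing by the total number of target labels is an assumption about how the ratio is formed.

```python
def edit_distance(output, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the output label sequence into the target sequence."""
    prev = list(range(len(target) + 1))
    for i, o in enumerate(output, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            curr.append(min(prev[j] + 1,              # delete o from the output
                            curr[j - 1] + 1,          # insert t into the output
                            prev[j - 1] + (o != t)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]

def label_error_rate(outputs, targets):
    """Total edit distance divided by the total number of target labels (assumed)."""
    errors = sum(edit_distance(o, t) for o, t in zip(outputs, targets))
    total = sum(len(t) for t in targets)
    return errors / total

# Hypothetical label sequences written as strings of single-character labels.
print(edit_distance("kitten", "sitting"))             # 3
print(label_error_rate(["abc", "abcd"], ["abd", "abcd"]))
```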

Selecting the Layer Sizes
After concluding that it is best to use three hidden layers with two subsampling layers, we wanted to find the sizes of these five layers. We have noticed that increasing [...] Selecting the sizes of the five layers is a DoE problem of five factors. As each factor may take many levels, we considered the $2^k$ design. This consideration is justified because the RNN response is generally monotonic with the layer sizes.
However, as the neural network training involves some randomness, the neural network response varies from one experiment to another. Therefore, each configuration should be repeated $r$ times to get average values. This is a $2^k r$ design. With $k = 5$ and $r = 4$, we need 128 experiments, which would take too long.
Therefore, we decided to use the $2^{k-p} r$ design with $k = 5$, $p = 2$, and $r = 4$. This design reduces the number of experiments to 32. The selected design is shown in Table 2 where the three hidden layers are referred to as L1, L2, and L3, and the two subsampling layers are S1 and S2. Table 4 shows the levels used in the eight configurations. Note that the minimum level (−1) is selected as one half the default value in the 3s configuration described in Subsection 4.2 above and the maximum level (+1) is twice the default value.
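The eight configurations of this design can be generated with the following sketch. It assumes that L1, L2, and L3 take the roles of A, B, and C, so that S1 follows the sign of L2·L3 and S2 the sign of L1·L2·L3; the row order and the function names are hypothetical and may differ from Table 4.

```python
from itertools import product

# Default sizes of configuration 3s: hidden layers 40, 80, 180; subsampling 40, 80.
DEFAULTS = {"L1": 40, "L2": 80, "L3": 180, "S1": 40, "S2": 80}

def fractional_design():
    """Sign rows of a 2^(5-2) design with S1 = L2*L3 and S2 = L1*L2*L3 (assumed)."""
    rows = []
    for c, b, a in product((-1, 1), repeat=3):  # the first factor varies fastest
        rows.append({"L1": a, "L2": b, "L3": c, "S1": b * c, "S2": a * b * c})
    return rows

def to_sizes(signs):
    """Map -1 to half the default size and +1 to twice the default size."""
    return {name: DEFAULTS[name] * 2 if sign > 0 else DEFAULTS[name] // 2
            for name, sign in signs.items()}

design = fractional_design()
configurations = [to_sizes(row) for row in design]
for config in configurations:
    print(config)
```

Training each of these eight configurations four times gives the 32 experiments of the design.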
Table 5 shows the label error for the eight configurations on four replications. The table also shows the average label error of each four replications. Note that the label error decreases from 23.9% for the smallest layer sizes to 20.1% for the largest sizes. The fraction of variation due to experimental error (SSE/SST) = 2.0/43.0 = 4.5%.
Table 6 shows the time of each experiment in hours. Note that this time includes the training and testing times. These experiments were carried out on Ubuntu 10.10 computers with Intel Core i7-2600 quad-core processors running at 3.4 GHz and equipped with 4 GB memory. Note that this time is highly affected by the neural network size and ranges from 13.8 hours to 6 days and 19 hours. Moreover, due to the randomness in training the neural networks, the training time varies considerably from one replication to another. The fraction of variation due to experimental error in experiment time (SSE/SST) = 4230/87,700 = 4.8%.

Analysis
We used the model of Equation (3) on the results shown in Tables 5 and 6. Table 7 shows the computed eight model coefficients for the label error and for the experiment time. This table also shows the fraction of variation explained by each factor.
The contribution of layer L3 to the label error is the largest among all factors at 36.5%. The two subsampling layers S1 and S2 come next and have almost equal contributions at 22.0% and 23.1%, respectively.
Layer L3 also has the largest effect on the experiment time at 69.3%. Next comes the effect of the subsampling layer S2 at 14.4%.
As L3 has the largest contribution, increasing it greatly lowers the label error but increases the execution time. Also, increasing the sizes of S1 and S2 decreases the label error and increases the execution time. However, increasing L1 also improves the label error with little increase in execution time, similar to S1. On the other hand, L2 has a minor effect; increasing its value does not give a measurable improvement.
To explore the design space of accuracy and time, we use Figure 5. This figure shows the results of the eight configurations of Table 4 (drawn with "+" sign) and the base, default configuration 3s described in Subsection 4.2 (square sign at 45 hrs and 21.0%). Moreover, the figure shows the estimated label error and experiment time for 24 additional configurations using the model of Equation (3) and the coefficients shown in Table 7 ("×" sign). These 24 configurations are the 32 possible configurations of five binary levels minus the eight configurations of Table 4.
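The estimation of the 24 additional configurations can be sketched as below. The coefficient values are hypothetical placeholders, not the values of Table 7, and the sign assignment of the eight measured configurations (S1 = L2·L3, S2 = L1·L2·L3) is an assumption inferred from the confounding example.

```python
from itertools import product

FACTORS = ("L1", "L2", "L3", "S1", "S2")
# Terms of the eight-coefficient model: average, five effects, two interactions.
TERMS = [(), ("L1",), ("L2",), ("L3",), ("S1",), ("S2",), ("L1", "L2"), ("L1", "L3")]

def predict(q, x):
    """Evaluate the model for a configuration x of -1/+1 levels."""
    total = 0.0
    for term in TERMS:
        value = q[term]
        for name in term:
            value *= x[name]
        total += value
    return total

# Hypothetical coefficients for the label error (percent); NOT the paper's Table 7.
q_error = {(): 21.6, ("L1",): -0.4, ("L2",): -0.1, ("L3",): -0.8,
           ("S1",): -0.6, ("S2",): -0.6, ("L1", "L2"): 0.1, ("L1", "L3"): 0.1}

all_configs = [dict(zip(FACTORS, signs)) for signs in product((-1, 1), repeat=5)]
measured = [x for x in all_configs
            if x["S1"] == x["L2"] * x["L3"] and x["S2"] == x["L1"] * x["L2"] * x["L3"]]
extra = [x for x in all_configs if x not in measured]

print(len(measured), len(extra))  # 8 24
for x in extra[:3]:
    print(x, round(predict(q_error, x), 2))
```

A second coefficient set of the same shape would give the estimated experiment time, so every configuration maps to one point in the design space of time and label error.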
The designer should select configurations that are on the Pareto optimal frontier. This frontier consists here of the points of low label error and low experiment time. One such configuration, L1 = 80, L2 = 40, L3 = 360, S1 = 80, and S2 = 160, achieves 20.2% label error and takes 89 hours. This configuration was adopted for its excellent accuracy and time trade-off. A slightly higher accuracy can be achieved using a much larger configuration. We have experimented with a large configuration of L1 = 100, L2 = 100, L3 = 360, S1 = 120, and S2 = 180. This configuration achieves 19.8% label error and takes 281 hours.
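Selecting the points on the Pareto optimal frontier can be sketched as follows. The function name is hypothetical. Three of the points are the (time, label error) pairs reported in the text for the default, adopted, and large configurations; the other two are made-up points added only to show dominated cases.

```python
def pareto_front(points):
    """Keep the (time, error) points that no other point dominates.

    A point is dominated when another point is no worse in both
    objectives and differs from it; lower is better for both.
    """
    front = []
    for p in points:
        dominated = any(o[0] <= p[0] and o[1] <= p[1] and o != p for o in points)
        if not dominated:
            front.append(p)
    return sorted(front)

points = [
    (45, 21.0),   # default configuration 3s, as reported in the text
    (89, 20.2),   # adopted configuration, as reported in the text
    (281, 19.8),  # large configuration, as reported in the text
    (60, 22.0),   # hypothetical point, slower and less accurate than 3s
    (150, 20.5),  # hypothetical point, slower and less accurate than the adopted one
]
print(pareto_front(points))
```

Among these five points, only the three reported configurations remain on the frontier; each of them trades a longer experiment time for a lower label error.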

Conclusions
In this paper, we have presented our approach and results for tuning a recurrent neural network sequence transcriber. This transcriber is used in the recognition stage of our system for recognizing Arabic handwritten words (JU-OCR2).
We have used design of experiments techniques to find an RNN topology that gives good recognition accuracy and experiment time. The experimental results presented in this paper show that it is best to construct the RNN with three hidden layers and two subsampling layers.
To select the sizes of these five layers, we designed a set of experiments using the $2^{k-p} r$ design. Our analysis of the label error and experiment time of the 32 experiments shows that the third hidden layer has the largest contribution to label error and experiment time, whereas the first hidden layer has the smallest contribution.
Two models were constructed from these experiments to find the label error and experiment time as functions of the sizes of the five layers. These models were able to predict a configuration that lies on the Pareto optimal frontier. This configuration is L1 = 80, L2 = 40, L3 = 360, S1 = 80, and S2 = 160. We have experimentally verified that this is an excellent design point that achieves 20.2% label error and takes 89 hours.

Acknowledgements
This work was partially supported by the Deanship of Scientific Research at the University of Jordan. We would like to thank Alex Graves for making RNNLIB publicly available [33] and for his help in using it.

Figure 1. Processing stages of our Arabic handwriting recognition system.

Each coefficient of the model of Table 2 is found as one eighth of the dot product of its sign vector and the mean response vector. The fraction of variation explained by $x$ is $SS_x/SST$. Similarly, the fraction of variation due to the experimental error can be found from the sum of squared errors by $SSE/SST$, where SSE is found by $SSE = SST - \sum_x SS_x$.

Figure 4. The label error for set e on neural networks of six topologies.

Figure 5. Design space of the label error and experiment time.