With the rise of artificial intelligence, the study of neural networks has intrigued scientists. The artificial neural network, first proposed theoretically in 1943 based on an understanding of the human brain, has demonstrated impressive computational and learning capabilities. In this paper, we investigate the neural network's learning capability by using a feed-forward neural network to recognize handwritten digits. Controlled experiments were executed by changing the values of different parameters, such as the learning rate and the number of hidden layer units. After investigating the effect of each parameter on the overall learning performance of the neural network, we conclude that the network achieves its highest learning efficiency when a given parameter takes an intermediate value, and that potential problems such as over-fitting are thereby prevented.

The human brain has always intrigued scientists, for it is remarkably powerful and efficient [

The basic structure of a neural network consists of a large number of artificial neurons, which perform a function similar to that of biological neurons, but in a more abstract form [

In this paper, we investigate the effects of these parameters on the neural network. We use a feed-forward neural network to recognize handwritten digits. The dataset is the USPS collection of handwritten digits, which contains images of digits written by people [
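A minimal sketch of such a feed-forward network is shown below: one logistic hidden layer and a softmax output, with random data standing in for the USPS images. All sizes and weight scales here are illustrative assumptions, not the paper's exact configuration. Note that for 10 classes, an untrained network gives a cross-entropy near ln(10) ≈ 2.3026, the plateau value visible in the tables below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the 16x16 USPS digit images (256 inputs, 10 classes).
n_in, n_hid, n_out = 256, 30, 10
X = rng.normal(size=(100, n_in))
y = rng.integers(0, n_out, size=100)

# Small random initial weights; one logistic hidden layer, softmax output.
W1 = rng.normal(scale=0.01, size=(n_in, n_hid))
b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.01, size=(n_hid, n_out))
b2 = np.zeros(n_out)

def forward(X):
    h = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))       # hidden activations
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)               # softmax probabilities
    return h, p

def cross_entropy(p, y):
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

h, p = forward(X)
print("initial training loss:", cross_entropy(p, y))  # ≈ ln(10) ≈ 2.3026
```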

In this section, we applied different learning rates to the neural network and explored their effects on training performance by comparing the loss on the training data [

The results show that, for each momentum, the training loss is lowest at a learning rate of either 0.2 or 1.0, and larger at very small or very large learning rates.

| Learning rate | Training loss (momentum = 0) | Learning rate | Training loss (momentum = 0.1) |
|---|---|---|---|
| 0.002 | 2.30428 | 0.002 | 2.30422 |
| 0.01 | 2.30212 | 0.01 | 2.30183 |
| 0.05 | 2.29297 | 0.05 | 2.29170 |
| 0.2 | 2.22897 | 0.2 | 2.21025 |
| 1.0 | 1.59884 | 1.0 | 1.52411 |
| 5.0 | 2.30132 | 5.0 | 2.30185 |
| 20.0 | 2.30259 | 20.0 | 2.30259 |

| Learning rate | Training loss (momentum = 0.5) | Learning rate | Training loss (momentum = 0.9) |
|---|---|---|---|
| 0.002 | 2.30372 | 0.002 | 2.30014 |
| 0.01 | 2.29971 | 0.01 | 2.28402 |
| 0.05 | 2.28010 | 0.05 | 2.00861 |
| 0.2 | 1.99455 | 0.2 | 1.08343 |
| 1.0 | 1.13944 | 1.0 | 2.01872 |
| 5.0 | 2.30255 | 5.0 | 2.30259 |
| 20.0 | 2.30259 | 20.0 | 2.30259 |

momentum. We can see that the training loss is smaller with momentum, and in our case it is smallest at the largest momentum tested (momentum = 0.9). Momentum is therefore a good way to accelerate the training process. However, since the largest momentum in our experiment is 0.9, we cannot conclude that the neural network would behave better with a very large momentum. In practice, we expect that too large a momentum would also deteriorate learning performance, a problem known as "overshooting".
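The classical momentum update can be sketched as follows; a toy one-dimensional quadratic stands in for the real loss surface, and the learning rate and step count are illustrative choices, not the experiment's settings:

```python
def sgd_momentum_step(w, v, grad, lr, momentum):
    """One classical momentum step: the velocity accumulates past
    gradients, so persistent descent directions are accelerated."""
    v = momentum * v - lr * grad
    return w + v, v

# Minimize f(w) = 0.5 * w**2 (gradient is w itself), starting from w = 1.0.
for m in (0.0, 0.5, 0.9):
    w, v = 1.0, 0.0
    for _ in range(20):
        w, v = sgd_momentum_step(w, v, grad=w, lr=0.05, momentum=m)
    print(f"momentum={m}: |w| after 20 steps = {abs(w):.4f}")
```

With a large momentum the iterate can oscillate around the minimum rather than approach it monotonically, which is the "overshooting" behavior described above.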

The neural network converges faster when the loss on the training data is smaller after a given number of iterations. From the data above, we conclude that the best learning rate falls within a specific range, in our case between 0.2 and 1.0, varying with the momentum. With an extremely small learning rate (i.e., very close to 0), training the neural network would take a very long time. On the other hand, with an extremely large learning rate, the updates overshoot the minimum and the loss stays near its initial value, which also decreases the efficiency of training.

In this step, we try to find a good generalization for the neural network by examining the classification loss on the validation data. We first investigate early stopping, the simplest way to improve network generalization [
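A minimal sketch of the early-stopping rule is given below; the `patience` threshold and the loss values are illustrative assumptions:

```python
def early_stopping(val_losses, patience=3):
    """Return (index, value) of the best validation loss, stopping once
    it has not improved for `patience` consecutive checks."""
    best, best_i, waited = float("inf"), -1, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i, best

# Typical shape: validation loss falls, then rises as the network over-fits.
losses = [0.9, 0.6, 0.45, 0.40, 0.43, 0.50, 0.61, 0.75]
print(early_stopping(losses))  # stops at the minimum, index 3
```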

We then ran the training with different weight decay (wd) coefficients [

The pattern is similar to that of the learning rates. The lowest validation loss occurs within a specific range, in our case between wd coefficients of 0.0001 and 0.01. As the wd coefficient approaches extreme values (either close to 0 or very large), the efficiency of the neural network decreases. This is reasonable: if the weights are penalized too little, the network is free to over-fit, while if they are penalized too much, the weights are driven toward zero and the network can no longer fit the data.
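The effect of the wd coefficient can be sketched in isolation. L2 weight decay adds 0.5·wd·‖w‖² to the loss, which contributes wd·w to the gradient and shrinks every weight each step; the values below are illustrative:

```python
import numpy as np

def sgd_step_with_wd(w, grad, lr, wd):
    """Gradient step with L2 weight decay: the penalty 0.5*wd*||w||^2
    adds wd*w to the gradient, pulling weights toward zero."""
    return w - lr * (grad + wd * w)

w = np.array([2.0, -3.0])
# With a zero data gradient, weight decay alone shrinks the weights:
for _ in range(10):
    w = sgd_step_with_wd(w, grad=np.zeros(2), lr=0.1, wd=0.5)
print(w)  # each weight scaled by 0.95**10 ≈ 0.60
```

A very large wd coefficient would shrink the weights toward zero faster than the data gradient can grow them, which matches the 2.30259 (no-learning) losses at wd = 1 and 5 below.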

Another possible solution to over-fitting is to regularize the number of hidden units [

The data and graph show a result similar to the above: the best generalization occurs within a specific range, in this case between 10 and 100 hidden units. It is thus reasonable to conclude that each means of regularization works most efficiently at an intermediate value of its parameter.
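One reason more hidden units over-fit more easily is simply the growth in free parameters. For a single-hidden-layer network with 256 inputs (the 16×16 USPS images) and 10 outputs, the count is straightforward to tabulate (the sizes are assumptions based on the dataset, not figures from the experiment):

```python
def n_params(n_hid, n_in=256, n_out=10):
    # weights + biases for input->hidden, then hidden->output
    return (n_in * n_hid + n_hid) + (n_hid * n_out + n_out)

for h in (10, 30, 100, 200):
    print(h, "hidden units ->", n_params(h), "parameters")
```

Going from 10 to 200 hidden units multiplies the parameter count by roughly twenty, giving the network far more capacity to memorize the training set.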

What if we combine different regularization methods? Would that generate even better results [

| Wd coefficient | Validation loss |
|---|---|
| 0 | 0.430185 |
| 0.0001 | 0.34829 |
| 0.001 | 0.28791 |
| 0.01 | 0.50976 |
| 1 | 2.30259 |
| 5 | 2.30259 |

| Hidden layer units | Validation loss |
|---|---|
| 10 | 0.42171 |
| 30 | 0.31708 |
| 100 | 0.36859 |
| 130 | 0.3976 |
| 200 | 0.430185 |

which achieves the best results in the above experiments.

In this section, we first implemented a Restricted Boltzmann Machine (RBM) to learn the features of the USPS data in an unsupervised fashion [
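A sketch of one contrastive-divergence (CD-1) weight update for a binary RBM is shown below; biases are omitted for brevity, toy binary data stands in for the USPS images, and all sizes and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cd1_update(W, v0, lr=0.1):
    """One CD-1 step: raise the weights on the data (positive phase)
    and lower them on a one-step reconstruction (negative phase)."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    def sample(p):
        return (rng.random(p.shape) < p).astype(float)

    h0_p = sigmoid(v0 @ W)        # hidden probabilities given the data
    h0 = sample(h0_p)             # binary hidden sample
    v1_p = sigmoid(h0 @ W.T)      # one-step reconstruction of the visibles
    h1_p = sigmoid(v1_p @ W)      # hidden probabilities given the reconstruction
    grad = (v0.T @ h0_p - v1_p.T @ h1_p) / len(v0)  # positive - negative phase
    return W + lr * grad

# Toy binary "images" standing in for the USPS digits.
V = (rng.random((20, 16)) < 0.3).astype(float)
W = rng.normal(scale=0.01, size=(16, 8))
for _ in range(5):
    W = cd1_update(W, V)
```

The trained `W` can then initialize the input-to-hidden weights of the feed-forward network before supervised training, which is the "good weight initialization" discussed below.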

| Run | Validation loss |
|---|---|
| a3_rbm_w | 0.056 |
| a3 | 0.091 |

can reduce the possibility of over-fitting to a great degree. When implemented, the neural network works to discover relevant information in the distribution of the input images. This means the network does not focus only on the digit class labels, but also analyzes other structure in the input data. Compared with early-stopping the model after only a few iterations, this method gives the neural network something else valuable to work on.

In the process above, we turned off early stopping in order to investigate the effects of good weight initialization in isolation. Now we turn early stopping back on to see whether another regularization method affects the results. The new validation loss with careful weight initialization is 0.058, which is larger than the loss without early stopping, but not significantly so. Therefore, implementing early stopping is not necessary in our case when we have a good weight initialization.

As we explored the effects of different parameters on the feed-forward neural network, we discovered a pattern: the best model is generated when each parameter falls at an intermediate value. This is surprisingly similar to the learning pattern of the human brain: learning too slowly or having too few neurons to process the learned information harms learning efficiency, but learning too fast or thinking too much about a simple topic also decreases productivity. As we dig deeper into the workings of neural networks, we may also decipher more of the secrets of the human brain.

Fu, Y.X. (2018) An Optimization of Neural Network Hyper-Parameter to Increase Its Performance. Intelligent Information Management, 10, 99-107. https://doi.org/10.4236/iim.2018.104008