A Study on the Convergence of Gradient Method with Momentum for Sigma-Pi-Sigma Neural Networks

Xun Zhang; Naimin Zhang

doi:10.4236/jamp.2018.64075

Journal of Applied Mathematics and Physics > Vol.6 No.4, April 2018

A Study on the Convergence of Gradient Method with Momentum for Sigma-Pi-Sigma Neural Networks

Xun Zhang, Naimin Zhang^*
School of Mathematics and Information Science, Wenzhou University, Wenzhou, China.
DOI: 10.4236/jamp.2018.64075 PDF HTML XML 670 Downloads 1,466 Views Citations

Abstract

In this paper, a gradient method with momentum for sigma-pi-sigma neural networks (SPSNN) is considered in order to accelerate the convergence of the learning procedure for the network weights. The momentum coefficient is chosen in an adaptive manner, and the corresponding weak convergence and strong convergence results are proved.

Keywords

Sigma-Pi-Sigma Neural Network, Momentum Term, Gradient Method, Convergence

Share and Cite:

Zhang, X. and Zhang, N. (2018) A Study on the Convergence of Gradient Method with Momentum for Sigma-Pi-Sigma Neural Networks. Journal of Applied Mathematics and Physics, 6, 880-887. doi: 10.4236/jamp.2018.64075.

1. Introduction

Pi-sigma network (PSN) is a kind of high order feedforward neural network which is characterized by the fast convergence rate of the single-layer network, and the unique high order network nonlinear mapping capability [1] . In order to further improve the application capacity of the network, Li introduces more complex network structures based on PSN called sigma-pi-sigma neural network (SPSNN) [2] . SPSNN can be learned to implement static mapping in the similar manner to that of multilayer neural networks and the radial basis function networks.

The gradient method is often used for training neural networks, and the main disadvantages of this method are the slow convergence and the local minimum problem. To speed up and stabilize the training iteration procedure for the gradient method, a momentum term is often added to the increment formula for the weights, in which the present weight updating increment is a combination of the present gradient of the error function and the previous weight updating increment [3] . Many researchers have developed the theory about momentum and extended its applications. For the back-propagation algorithm, Phansalkar and Sastry give a stability analysis with adding the momentum term [4] . Torii and Bhaya discuss the convergence of the gradient method with momentum under the restriction that the error function is quadratic [5] [6] . Shao et al. study the adaptive momentum for both batch gradient method and online gradient method, and compare the efficiency of momentum with penalty [7] [8] [9] [10] [11] . The key for the convergence analysis for momentum algorithms is the monotonicity of the error function during the learning procedure, which is generally proved under the uniformly boundedness assumption of the activation function and its derivatives. In [8] [10] [12] [13] , for the gradient method with momentum, some convergence results are given for both two-layer and multi-layer feedforward neural networks. In this paper, we will consider the gradient method with momentum for sigma-pi-sigma neural networks and discuss its convergence.

The rest of the paper is organized as follows. In Section 2 we introduce the neural network model of SPSNN and the gradient method with momentum. In Section 3 we give the convergence analysis of the gradient method with momentum for training SPSNN. Numerical experiments are given in Section 4. Finally, in Section 5, we end the paper with some conclusions.

2. The Neural Network Model of SPsnn and Gradient Method with Momentum

In this section we introduce the sigma-pi-sigma neural network that is composed of multilayer neural network. The output of SPSNN has the form $\sum_{n = 1}^{K} \prod_{i = 1}^{n} \sum_{j = 1}^{N_{v}} f_{n i j} (x_{j})$ , where $x_{j}$ is an input, $N_{v}$ is the number of inputs, $f_{n i j} ()$ is a function to be generated through the network training, and K is the number of pi-sigma network(PSN) that is the basic building block for SPSNN. The expression of the function $f_{n i j} (x_{j})$ is $\sum_{k = 1}^{N_{q} + N_{e} - 1} w_{n i j k} B_{i j k} (x_{j})$ , where the function $B_{i j k} ()$ is either 0 or 1, and $w_{n i j k}$ is weight values stored in memory. $N_{q}$ and $N_{e}$ are information numbers stored in $x_{j}$ . For a K-th order SPSNN, the total weight value will be

$\frac{1}{2} \times K \times (K + 1) \times N_{v} \times (N_{q} + N_{e} - 1)$ .

For a set of training examples ${(S_{t}, O_{t}) \in R^{N_{v}} \times R}$ , where $O_{t}$ is the ideal output, $t = 1, 2, \dots, T$ , we have the following actual output:

$y_{t} = \sum_{n = 1}^{K} \prod_{i = 1}^{n} \sum_{j = 1}^{N_{v}} \sum_{k = 1}^{N_{q} + N_{e} - 1} w_{n i j k} B_{i j k} (x_{j}^{(S_{t})})$ ,

where $x_{j}^{(S_{t})}$ denotes the jth element of a given input vector $S_{t}$ .

In order to train the SPSNN, we choose a quadratic error function $E (W)$ :

$E (W) = \frac{1}{2} \sum_{t = 1}^{T} {(O_{t} - y_{t})}^{2} \equiv \sum_{t = 1}^{T} g_{t} (W), g_{t} (W) = \frac{1}{2} {(O_{t} - y_{t})}^{2}$

where $W = {(w_{1111}, w_{1112}, \dots, w_{K, K, N_{v}, N_{q} + N_{e} - 1})}^{T}$ . For convenience we denote $w_{α} = w_{K, K, N_{v}, N_{q} + N_{e} - 1}$ .

The gradient method with momentum is used to train weights. The gradients of $E (W)$ and $g_{t} (W)$ are denoted by

$\nabla E (W) = {(\frac{\partial E (W)}{\partial w_{1111}}, \frac{\partial E (W)}{\partial w_{1112}}, \dots, \frac{\partial E (W)}{\partial w_{n i j k}}, \dots, \frac{\partial E (W)}{\partial w_{α}})}^{T}$ ,

$\nabla g_{t} (W) = {(\frac{\partial g_{t} (W)}{\partial w_{1111}}, \frac{\partial g_{t} (W)}{\partial w_{1112}}, \dots, \frac{\partial g_{t} (W)}{\partial w_{n i j k}}, \dots, \frac{\partial g_{t} (W)}{\partial w_{α}})}^{T}$ ,

and the Hessian matrices of $g_{t} (W^{m})$ and $E (W^{m})$ at $W^{m}$ are denoted by

$\nabla^{2} g_{t} (W^{m}) = (\begin{matrix} \frac{\partial^{2} g_{t} (W^{m})}{\partial w_{1111}^{m} \partial w_{1111}^{m}} & \dots & \frac{\partial^{2} g_{t} (W^{m})}{\partial w_{1111}^{m} \partial w_{α}^{m}} \\ ⋮ & ⋱ & ⋮ \\ \frac{\partial^{2} g_{t} (W^{m})}{\partial w_{α}^{m} \partial w_{1111}^{m}} & \dots & \frac{\partial^{2} g_{t} (W^{m})}{\partial w_{α}^{m} \partial w_{α}^{m}} \end{matrix})$ ,

$\nabla^{2} E (W^{m}) = (\begin{matrix} \frac{\partial^{2} E (W^{m})}{\partial w_{1111}^{m} \partial w_{1111}^{m}} & \dots & \frac{\partial^{2} E (W^{m})}{\partial w_{1111}^{m} \partial w_{α}^{m}} \\ ⋮ & ⋱ & ⋮ \\ \frac{\partial^{2} E (W^{m})}{\partial w_{α}^{m} \partial w_{1111}^{m}} & \dots & \frac{\partial^{2} E (W^{m})}{\partial w_{α}^{m} \partial w_{α}^{m}} \end{matrix})$ .

Given any arbitrarily initial weight vectors $W^{0}$ , $W^{1}$ , the gradient method with momentum updates the weight vector W by

$W^{m + 1} = W^{m} - η \nabla E (W^{m}) + τ^{m} (W^{m} - W^{m - 1}), m = 1, 2, \dots$ , (1)

where $η > 0$ is the learning rate, $W^{m} - W^{m - 1}$ is called the momentum term, $τ^{m}$ is the momentum coefficient.

Similar to [12] [14] , in this paper, we choose $τ^{m}$ as follows:

$τ^{m} = {\begin{cases} \frac{μ ‖ \nabla E (W^{m}) ‖}{‖ Δ W^{m} ‖}, if Δ W \neq 0 \\ 0, else \end{cases}$

where m is a positive number and $Δ W^{m} = W^{m} - W^{m - 1}$ , and $‖ \cdot ‖$ is 2-norm in this paper.

Notice the component form of (1) is

$w_{n i j k}^{m + 1} = w_{n i j k}^{m} - η \frac{\partial E (W^{m})}{\partial w_{n i j k}^{m}} + τ^{m} (w_{n i j k}^{m} - w_{n i j k}^{m - 1})$ .

In fact,

$y_{t} = P S N_{1} + P S N_{2} + \dots + P S N_{K}$ ,

where $P S N_{n} = \prod_{i = 1}^{n} U_{n i}$ . Recalling $f_{n i j} (x_{j}) = \sum_{k = 1}^{N_{q} + N_{e} - 1} w_{n i j k} B_{i j k} (x_{j})$ , then

$\begin{matrix} \frac{\partial E (W)}{\partial w_{n i j k}} = \sum_{t = 1}^{T} (y_{t} - O_{t}) \frac{\partial y_{t}}{\partial w_{n i j k}} \\ = \sum_{t = 1}^{T} (y_{t} - O_{t}) \frac{\partial P S N_{n}}{\partial U_{n i}} \frac{\partial U_{n i}}{\partial f_{n i j}} \frac{\partial f_{n i j}}{\partial w_{n i j k}} \\ = \sum_{t = 1}^{T} (y_{t} - O_{t}) {\prod_{p \neq i} U_{n p}} B_{i j k} (xj) \end{matrix}$

3. Convergence Results

Similar to [12] [14] , we need the following assumptions.

(A1): The elements of the Hessian matrix $\nabla^{2} E (W^{m})$ be uniformly bounded for any $W^{m}$ .

(A2): The number of the elements of $Ω = {W | \nabla E (W) = 0}$ be finite.

From (A1), it is easy to see that there exists a constant $M > 0$ such that

$‖ \nabla^{2} E (W^{m}) ‖ \leq M, m = 0, 1, 2, \dots$ .

Lemma 3.1 ( [15] ) Let $f : R^{n} \to R$ be continuously differentiable, the number of the elements of the set $Ω = {x | \nabla f (x) = 0}$ be finite, and the sequence ${x^{k}}$ satisfy

$\lim_{k \to \infty} ‖ x^{k} - x^{k - 1} ‖ = 0$ ,

$\lim_{k \to \infty} ‖ \nabla f (x^{k}) ‖ = 0$ .

Then there exists a $x^{*} \in R^{n}$ such than

$\lim_{k \to \infty} x^{k} = x^{*}, \nabla f (x^{*}) = 0$ .

Theorem 3.2 If Assumption (A1) is satisfied. Then there exists $E^{*} \geq 0$ such that for $η \in (0, \frac{2}{M})$ and $μ \in (0, \frac{- 1 - M η + \sqrt{1 + 4 M η}}{M})$ , it holds the following weak convergence result for the iteration (1):

$E (W^{m + 1}) \leq E (W^{m})$ ,

$\lim_{m \to \infty} E (W^{m}) = E^{*}$ ,

$\lim_{m \to \infty} ‖ \nabla E (W^{m}) ‖ = 0$ .

Furthermore, if Assumption (A2) is also valid, then it holds the strong convergence result, that is there exists $W^{*}$ such that

$\lim_{m \to \infty} W^{m} = W^{*}, \nabla E (W^{*}) = 0$ .

Proof.

Using Taylor’s formula, we expand $g_{t} (W^{m + 1})$ at $W^{m}$ :

$\begin{matrix} g_{t} (W^{m + 1}) = g_{t} (W^{m}) + {(\nabla g_{t} (W^{m}))}^{T} (W^{m + 1} - W^{m}) \\ + \frac{1}{2} {(W^{m + 1} - W^{m})}^{T} \nabla^{2} g_{t} (ξ^{m}) (W^{m + 1} - W^{m}) \end{matrix}$ (2)

where $ξ^{m}$ lies in between $W^{m}$ and $W^{m + 1}$ .

From (2) we have

$\begin{matrix} \sum_{t = 1}^{T} g_{t} (W^{m + 1}) = \sum_{t = 1}^{T} g_{t} (W^{m}) + \sum_{t = 1}^{T} {(\nabla g_{t} (W^{m}))}^{T} (W^{m + 1} - W^{m}) \\ + \sum_{t = 1}^{T} \frac{1}{2} {(W^{m + 1} - W^{m})}^{T} \nabla^{2} g_{t} (ξ^{m}) (W^{m + 1} - W^{m}) \end{matrix}$

The above equation is equivalent to

$E (W^{m + 1}) = E (W^{m}) + δ_{1} + δ_{2}$ (3)

where

$δ_{1} = \sum_{t = 1}^{T} {(\nabla g_{t} (W^{m}))}^{T} (W^{m + 1} - W^{m})$ ,

$δ_{2} = \sum_{t = 1}^{T} \frac{1}{2} {(W^{m + 1} - W^{m})}^{T} \nabla^{2} g_{t} (ξ^{m}) (W^{m + 1} - W^{m})$ .

It is easy to see that

$\begin{matrix} δ_{1} = {(\nabla E (W^{m}))}^{T} Δ W^{m + 1} \\ = {(\nabla E (W^{m}))}^{T} (- η \nabla E (W^{m}) + τ^{m} Δ W^{m}) \\ = - η {(\nabla E (W^{m}))}^{T} \nabla E (W^{m}) + τ^{m} {(\nabla E (W^{m}))}^{T} Δ W^{m} \\ \leq - η {‖ \nabla E (W^{m}) ‖}^{2} + μ ‖ {(\nabla E (W^{m}))}^{T} ‖ \frac{‖ \nabla E (W^{m}) ‖}{‖ Δ W^{m} ‖} ‖ Δ W^{m} ‖ \\ = (- η + μ) {‖ \nabla E (W^{m}) ‖}^{2} \end{matrix}$

$\begin{matrix} δ_{2} = \frac{1}{2} {(Δ W^{m + 1})}^{T} \sum_{t = 1}^{T} \nabla^{2} g_{t} (ξ^{m}) Δ W^{m + 1} \\ = \frac{1}{2} {(Δ W^{m + 1})}^{T} \nabla^{2} E (ξ^{m}) Δ W^{m + 1} \\ \leq \frac{1}{2} | {(Δ W^{m + 1})}^{T} \nabla^{2} E (ξ^{m}) Δ W^{m + 1} | \\ \leq \frac{1}{2} ‖ {(Δ W^{m + 1})}^{T} ‖ ‖ \nabla^{2} E (ξ^{m}) ‖ ‖ Δ W^{m + 1} ‖ \\ \leq \frac{1}{2} M {‖ Δ W^{m + 1} ‖}^{2} \end{matrix}$

$\begin{array}{l} = \frac{1}{2} M {‖ - η \nabla E (W^{m}) + τ^{m} Δ W^{m} ‖}^{2} \\ \leq \frac{1}{2} M {(‖ - η \nabla E (W^{m}) ‖ + ‖ τ^{m} Δ W^{m} ‖)}^{2} \\ = \frac{1}{2} M ({‖ - η \nabla E (W^{m}) ‖}^{2} + {‖ τ^{m} Δ W^{m} ‖}^{2} + 2 ‖ - η \nabla E (W^{m}) ‖ ‖ τ^{m} Δ W^{m} ‖) \\ \leq \frac{1}{2} M (η^{2} {‖ \nabla E (W^{m}) ‖}^{2} + μ^{2} {‖ \nabla E (W^{m}) ‖}^{2} + 2 η μ {‖ \nabla E (W^{m}) ‖}^{2}) \\ = \frac{1}{2} M {(η + μ)}^{2} {‖ \nabla E (W^{m}) ‖}^{2} \end{array}$

Together with (3), we have

$\begin{matrix} E (W^{m + 1}) = E (W^{m}) + δ_{1} + δ_{2} \\ \leq E (W^{m}) - (η - μ - \frac{1}{2} M {(η + μ)}^{2}) {‖ \nabla E (W^{m}) ‖}^{2} \end{matrix}$

Set $β = η - μ - \frac{1}{2} M {(η + μ)}^{2}$ . Then

$E (W^{m + 1}) \leq E (W^{m}) - β {‖ \nabla E (W^{m}) ‖}^{2}$ . (4)

It is easy to see that $β > 0$ when

${\begin{cases} η \in (0, \frac{2}{M}) \\ μ \in (0, \frac{- 1 - M η + \sqrt{1 + 4 M η}}{M}) \end{cases}$ . (5)

If η and μ satisfy (5), then the sequence ${E (W^{m})}$ is monotonically decreasing. Since $E (W^{m})$ is nonnegative, it must converge to some $E^{*} \geq 0$ , that is

$\lim_{m \to \infty} E (W^{m}) = E^{*}$ .

By (4) it is easy to see for any positive integer N, it holds

$β \sum_{m = 0}^{N - 1} {‖ \nabla E (W^{m}) ‖}^{2} \leq E (W^{0}) - E (W^{N})$ .

Let $N \to \infty$ , then we have $\sum_{m = 0}^{\infty} {‖ \nabla E (W^{m}) ‖}^{2} \leq \infty$ , so $\lim_{m \to \infty} ‖ \nabla E (W^{m}) ‖ = 0$ , which finishes the proof for the weak convergence.

By (1), we have

$‖ W^{m + 1} - W^{m} ‖ \leq η ‖ \nabla E (W^{m}) ‖ + τ^{m} ‖ Δ W^{m} ‖ \leq (η + μ) ‖ \nabla E (W^{m}) ‖$ ,

which indicates

$\lim_{m \to \infty} ‖ W^{m + 1} - W^{m} ‖ = 0$ .

From Lemma 3.1, it holds

$\lim_{m \to \infty} W^{m} = W^{*}, \nabla E (W^{*}) = 0$ ,

which finishes the proof for the strong convergence.

4. Numerical Results

In this section, we propose an example to illustrate the convergence behavior of the iteration (1) by comparing the iteration steps (IT), elapsed CPU time in seconds (CPU) and relative residual error (RES). The experiment is terminated when the current iteration satisfies $RES \leq 10^{- 8}$ or the number of the max iteration steps k = 1000 are exceeded. The computations are implemented in MATLAB on a PC computer with Intel (R) Core (R) CPU 1000 M @ 1.80 GHz, and 2.00 GB memory.

Example 4.1 ( [16] ) Four-dimensional parity problem (Table 1)

Table 1. The data samples.

Table 2. Optimal parameters, CPU times, iteration numbers, and residuals.

In this simulation experiment, the initial weights $W^{0}$ is a zero vector of 24 dimensional and $W^{1}$ is a 24 dimensional vector whose elements are all 1. The learning rate $η = 0.00001$ and momentum factor $μ = 0.00005$ . The number of training samples is $T = 16$ . In the above Table 2, we compare the convergence behavior of the gradient method with momentum and the gradient method with no momentum. It can be seen that the network training is improved significantly after added the momentum item.

5. Conclusion

In this paper, we study the gradient method with momentum for training sigma-pi-sigma neural networks. We take the momentum coefficient in an adaptive manner, and the corresponding weak convergence and strong convergence results are proved. The Assumptions A1 and A2 in this paper seem to be a little severe, so how to weaken the one or two assumptions will be our future work.

Support Information

This author is supported by National Natural Science Foundation of China under grant No. 61572018 and Zhejiang Provincial Natural Science Foundation of China under Grant No. LY15A010016.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1]	Shin, Y. and Ghosh, J. (1991) The Pi-Sigma Network: An Efficient Higher-Order Neural Network for Pattern Classification and Function Approximation. International Joint Conference on Neural Networks, 1, 13-18. https://doi.org/10.1109/IJCNN.1991.155142
[2]	Li, C.K. (2003) A Sigma-Pi-Sigma Neural Network (SPSNN). Neural Processing Letters, 17, 1-19. https://doi.org/10.1023/A:1022967523886
[3]	Rumelhart, D.E., Hinton, G.E. and Williams, R.J. (1986) Learning Representations by Back-Propagating Errors. Nature, 323, 533-536. https://doi.org/10.1038/323533a0
[4]	Phansalkar, V.V. and Sastry, P.S. (1994) Analysis of the Back-Propagation Algorithm with Momentum. IEEE Transactions on Neural Networks, 5, 505-506. https://doi.org/10.1109/72.286925
[5]	Torii, M. and Hagan, M.T. (2002) Stability of Steepest Descent with Momentum for Quadratic Functions. IEEE Transactions on Neural Networks, 13, 752-756. https://doi.org/10.1109/TNN.2002.1000143
[6]	Bhaya, A. and Kaszkurewicz, E. (2004) Steepest Descent with Momentum for Quadratic Functions Is a Version of the Conjugate Gradient Method. Neural Networks, 17, 65-71. https://doi.org/10.1016/S0893-6080(03)00170-9
[7]	Qian, N. (1999) On the Momentum Term in Gradient Descent Learning Algorithms. Neural Networks, 12, 145-151. https://doi.org/10.1016/S0893-6080(98)00116-6
[8]	Shao, H. and Zheng, G. (2011) Convergence Analysis of a Back-Propagation Algorithm with Adaptive Momentum. Neurocomputing, 74, 749-752. https://doi.org/10.1016/j.neucom.2010.10.008
[9]	Shao, H. and Zheng, G. (2011) Boundedness and Convergence of Online Gradient Method with Penalty and Momentum. Neurocomputing, 74, 765-770. https://doi.org/10.1016/j.neucom.2010.10.005
[10]	Shao, H., Xu, D., Zheng, G. and Liu, L. (2012) Convergence of an Online Gradient Method with Inner-Product and Adaptive Momentum. Neurocomputing, 747, 243-252. https://doi.org/10.1016/j.neucom.2011.09.003
[11]	Xu, D., Shao, H. and Zhang, H. (2012) A New Adaptive Momentum Algorithm for Split-Complex Recurrent Neural Networks. Neurocomputing, 93, 133-136. https://doi.org/10.1016/j.neucom.2012.03.013
[12]	Zhang, N., Wu, W. and Zheng, G. (2006) Convergence of Gradient Method with Momentum for Two-Layer Feedforward Neural Networks. IEEE Transactions Neural Networks, 17, 522-525. https://doi.org/10.1109/TNN.2005.863460
[13]	Wu, W., Zhang, N., Li, Z., Li, L. and Liu, Y. (2008) Convergence of Gradient Method with Momentum for Back-Propagation Neural Networks. Journal of Computational Mathematics, 4, 613-623.
[14]	Gori, M. and Maggini, M. (1996) Optimal Convergence of On-Line Backpropagation. IEEE Transactions on Neural Networks, 7, 251-254. https://doi.org/10.1109/72.478415
[15]	Yuan, Y. and Sun, W. (1997) Optimization Theory and Methods. Science Press, Beijing.
[16]	Yan, X. and Chao, Z. (2008) Convergence of Asynchronous Batch Gradient Method with Momentum for Pi-Sigma Networks. Mathematica Applicata, 21, 207-212.

Journals Menu

Follow SCIRP

	+1 323-425-8868
	customer@scirp.org
	+86 18163351462(WhatsApp)
	1655362766

	Paper Publishing WeChat

Journals Menu

Home

About SCIRP

Service

Policies