A Study on the Convergence of Gradient Method with Momentum for Sigma-Pi-Sigma Neural Networks
1. Introduction
The pi-sigma network (PSN) is a class of high-order feedforward neural networks characterized by the fast convergence of single-layer networks and the nonlinear mapping capability unique to high-order networks [1]. To further improve the applicability of the network, Li introduced a more complex structure based on the PSN, called the sigma-pi-sigma neural network (SPSNN) [2]. The SPSNN can be trained to implement static mappings in a manner similar to that of multilayer neural networks and radial basis function networks.
The gradient method is often used for training neural networks, but its main disadvantages are slow convergence and the local minimum problem. To speed up and stabilize the training iteration, a momentum term is often added to the weight increment formula, so that the current weight update is a combination of the current gradient of the error function and the previous weight update [3]. Many researchers have developed the theory of momentum and extended its applications. Phansalkar and Sastry gave a stability analysis of the back-propagation algorithm with a momentum term [4]. Torii and Bhaya discussed the convergence of the gradient method with momentum under the restriction that the error function is quadratic [5] [6]. Shao et al. studied adaptive momentum for both the batch gradient method and the online gradient method, and compared the efficiency of momentum with that of a penalty term [7] [8] [9] [10] [11]. The key to the convergence analysis of momentum algorithms is the monotonicity of the error function during the learning procedure, which is generally proved under a uniform boundedness assumption on the activation function and its derivatives. In [8] [10] [12] [13], convergence results for the gradient method with momentum are given for both two-layer and multilayer feedforward neural networks. In this paper, we consider the gradient method with momentum for sigma-pi-sigma neural networks and discuss its convergence.
The rest of the paper is organized as follows. In Section 2 we introduce the SPSNN model and the gradient method with momentum. In Section 3 we present the convergence analysis of the gradient method with momentum for training the SPSNN. Numerical experiments are reported in Section 4. Finally, Section 5 concludes the paper.
2. The Neural Network Model of SPSNN and the Gradient Method with Momentum
In this section we introduce the sigma-pi-sigma neural network, which is composed of multiple layers. The output of the SPSNN has the form

$y = \sum_{k=1}^{K} f_k(\mathbf{x}),$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$ is an input, $n$ is the number of inputs, each $f_k$ is a function to be generated through the network training, and $K$ is the number of pi-sigma networks (PSNs), the basic building blocks of the SPSNN. The expression of the function $f_k$ is

$f_k(\mathbf{x}) = \prod_{j=1}^{n} \Big( \sum_{i=1}^{N} w_{kji} B_i(x_j) \Big),$

where each basis function $B_i(\cdot)$ takes the value either 0 or 1, and the $w_{kji}$ are weight values stored in memory; the indices $j$ and $i$ are the address information stored in memory. For a $K$-th order SPSNN, the total number of stored weights is $KnN$.
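The forward pass described above, a sum of $K$ pi-sigma blocks, each a product over the inputs of weighted sums of 0/1 basis functions, can be sketched in a few lines. This is a minimal illustration, not the exact memory layout of [2]: the bin-indicator `binary_basis`, the bin count `num_bins`, and the weight array shape are all hypothetical stand-ins.

```python
import numpy as np

def binary_basis(x, num_bins=4):
    """Hypothetical 0/1 basis: indicate which bin of [0, 1] the scalar x falls into."""
    idx = min(int(x * num_bins), num_bins - 1)
    b = np.zeros(num_bins)
    b[idx] = 1.0
    return b

def spsnn_output(x, W):
    """SPSNN forward pass: sum over K blocks of products of weighted basis sums.

    x: input vector of length n; W: weights of shape (K, n, num_bins).
    """
    K, n, num_bins = W.shape
    y = 0.0
    for k in range(K):
        prod = 1.0
        for j in range(n):
            # inner sigma: weighted sum of 0/1 basis responses for input j
            prod *= W[k, j] @ binary_basis(x[j], num_bins)
        y += prod  # outer sigma over the K pi-sigma blocks
    return y
```

Because each basis vector is one-hot, every inner sum simply selects one stored weight, which matches the "weights stored in memory" reading of the architecture.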
For a set of training examples $\{ \mathbf{x}^p, O^p \}_{p=1}^{P} \subset \mathbb{R}^n \times \mathbb{R}$, where $O^p$ is the ideal output for the input $\mathbf{x}^p$, we have the following actual output:

$y^p = \sum_{k=1}^{K} \prod_{j=1}^{n} \Big( \sum_{i=1}^{N} w_{kji} B_i(x_j^p) \Big), \quad p = 1, 2, \ldots, P,$

where $x_j^p$ denotes the $j$th element of the given input vector $\mathbf{x}^p$.
In order to train the SPSNN, we choose a quadratic error function $E(W)$:

$E(W) = \frac{1}{2} \sum_{p=1}^{P} \big( y^p - O^p \big)^2,$

where $W$ denotes the vector collecting all the weights $w_{kji}$. For convenience we denote

$E_p(W) = \frac{1}{2} \big( y^p - O^p \big)^2, \quad p = 1, 2, \ldots, P,$

so that $E(W) = \sum_{p=1}^{P} E_p(W)$.
The gradient method with momentum is used to train the weights. The gradients of $E$ and $E_p$ are denoted by

$\nabla E(W), \quad \nabla E_p(W), \quad p = 1, 2, \ldots, P,$

and the Hessian matrices of $E$ and $E_p$ at $W$ are denoted by

$\nabla^2 E(W), \quad \nabla^2 E_p(W).$
Given arbitrary initial weight vectors $W^0$ and $W^1$, the gradient method with momentum updates the weight vector $W$ by

$W^{k+1} = W^k + \Delta W^{k+1}, \quad \Delta W^{k+1} = -\eta \nabla E(W^k) + \tau_k \Delta W^k, \quad k = 1, 2, \ldots, \quad (1)$

where $\Delta W^k = W^k - W^{k-1}$, $\eta > 0$ is the learning rate, $\tau_k \Delta W^k$ is called the momentum term, and $\tau_k \geq 0$ is the momentum coefficient.
Similar to [12] [14], in this paper we choose $\tau_k$ as follows:

$\tau_k = \begin{cases} \dfrac{\mu}{m} \dfrac{\| \nabla E(W^k) \|}{\| \Delta W^k \|}, & \Delta W^k \neq 0, \\ 0, & \Delta W^k = 0, \end{cases}$

where $m \geq 1$ is a positive number, $0 < \mu < 1$, and $\| \cdot \|$ denotes the 2-norm throughout this paper. With this choice, the momentum term satisfies $\| \tau_k \Delta W^k \| \leq \mu \| \nabla E(W^k) \|$.
Notice that the component form of (1) is

$w_i^{k+1} = w_i^k - \eta \frac{\partial E(W^k)}{\partial w_i} + \tau_k \big( w_i^k - w_i^{k-1} \big),$

where $w_i$ denotes the $i$th component of $W$. In fact,

$\frac{\partial E(W)}{\partial w_i} = \sum_{p=1}^{P} \big( y^p - O^p \big) \frac{\partial y^p}{\partial w_i},$

where $y^p$ is the actual output for the input $\mathbf{x}^p$. Recalling $E(W) = \sum_{p=1}^{P} E_p(W)$, then $\nabla E(W) = \sum_{p=1}^{P} \nabla E_p(W)$.
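The iteration (1) with an adaptive momentum coefficient can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the objective `grad_E`, the parameter values, and the normalization `mu * ||g|| / ||dW||` (which caps the momentum term at a multiple of the current gradient norm, the property the convergence analysis relies on) are assumptions.

```python
import numpy as np

def train_momentum(W0, W1, grad_E, eta=0.1, mu=0.05, max_iter=1000, tol=1e-8):
    """Gradient method with adaptive momentum: W^{k+1} = W^k - eta*grad + tau_k*dW^k.

    W0, W1: the two initial weight vectors; grad_E: gradient of the error function.
    """
    W_prev, W = W0.copy(), W1.copy()
    for _ in range(max_iter):
        g = grad_E(W)
        if np.linalg.norm(g) < tol:
            break
        dW = W - W_prev                       # previous increment Delta W^k
        norm_dW = np.linalg.norm(dW)
        # adaptive coefficient: momentum term has norm mu * ||grad E(W^k)||
        tau = mu * np.linalg.norm(g) / norm_dW if norm_dW > 0 else 0.0
        W_prev, W = W, W - eta * g + tau * dW  # iteration (1)
    return W
```

On the toy quadratic $E(W) = \frac{1}{2}\|W\|^2$ (so `grad_E = lambda W: W`), the iterates contract monotonically toward the minimizer, in line with the monotonicity argument of Section 3.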
3. Convergence Results
Similar to [12] [14] , we need the following assumptions.
(A1): The elements of the Hessian matrix
be uniformly bounded for any
.
(A2): The number of the elements of
be finite.
From (A1), it is easy to see that there exists a constant
such that
.
Lemma 3.1 ( [15] ). Let $F$ be continuously differentiable, let the number of the elements of the set $\Omega = \{ W : \nabla F(W) = 0 \}$ be finite, and let the sequence $\{ W^k \}$ satisfy

$\lim_{k \to \infty} \| W^{k+1} - W^k \| = 0, \quad \lim_{k \to \infty} \| \nabla F(W^k) \| = 0.$

Then there exists a $W^* \in \Omega$ such that $\lim_{k \to \infty} W^k = W^*$.
Theorem 3.2. Suppose Assumption (A1) is satisfied. Then there exist positive constants $\eta_0$ and $\mu_0$ such that, for $0 < \eta < \eta_0$ and $0 < \mu < \mu_0$, the following weak convergence results hold for the iteration (1):

$E(W^{k+1}) \leq E(W^k), \quad k = 1, 2, \ldots,$

$\lim_{k \to \infty} E(W^k) = E^* \ \text{for some } E^* \geq 0,$

$\lim_{k \to \infty} \| \nabla E(W^k) \| = 0.$

Furthermore, if Assumption (A2) is also valid, then the strong convergence result holds, that is, there exists $W^* \in \Omega$ such that $\lim_{k \to \infty} W^k = W^*$.
Proof. Using Taylor's formula, we expand $E(W^{k+1})$ at $W^k$:

$E(W^{k+1}) = E(W^k) + \nabla E(W^k)^T \Delta W^{k+1} + \frac{1}{2} (\Delta W^{k+1})^T \nabla^2 E(\xi^k) \Delta W^{k+1}, \quad (2)$

where $\xi^k$ lies in between $W^k$ and $W^{k+1}$.

From (2) and the iteration (1) we have

$E(W^{k+1}) = E(W^k) - \eta \| \nabla E(W^k) \|^2 + \tau_k \nabla E(W^k)^T \Delta W^k + \frac{1}{2} (\Delta W^{k+1})^T \nabla^2 E(\xi^k) \Delta W^{k+1}.$

The above equation is equivalent to

$E(W^{k+1}) = E(W^k) - \eta \| \nabla E(W^k) \|^2 + \delta_1 + \delta_2, \quad (3)$

where

$\delta_1 = \tau_k \nabla E(W^k)^T \Delta W^k, \quad \delta_2 = \frac{1}{2} (\Delta W^{k+1})^T \nabla^2 E(\xi^k) \Delta W^{k+1}.$

It is easy to see that

$|\delta_1| \leq \| \tau_k \Delta W^k \| \, \| \nabla E(W^k) \| \leq \frac{\mu}{m} \| \nabla E(W^k) \|^2,$

$|\delta_2| \leq \frac{C}{2} \| \Delta W^{k+1} \|^2 \leq \frac{C}{2} \Big( \eta + \frac{\mu}{m} \Big)^2 \| \nabla E(W^k) \|^2.$

Together with (3), we have

$E(W^{k+1}) \leq E(W^k) - \Big[ \eta - \frac{\mu}{m} - \frac{C}{2} \Big( \eta + \frac{\mu}{m} \Big)^2 \Big] \| \nabla E(W^k) \|^2.$

Set $\beta = \eta - \frac{\mu}{m} - \frac{C}{2} \big( \eta + \frac{\mu}{m} \big)^2$. Then

$E(W^{k+1}) \leq E(W^k) - \beta \| \nabla E(W^k) \|^2. \quad (4)$

It is easy to see that $\beta > 0$ when

$\frac{\mu}{m} < \eta \quad \text{and} \quad \frac{C}{2} \Big( \eta + \frac{\mu}{m} \Big)^2 < \eta - \frac{\mu}{m}. \quad (5)$
If $\eta$ and $\mu$ satisfy (5), then the sequence $\{ E(W^k) \}$ is monotonically decreasing. Since $E(W^k)$ is nonnegative, it must converge to some $E^* \geq 0$, that is,

$\lim_{k \to \infty} E(W^k) = E^*.$

By (4) it is easy to see that for any positive integer $N$, it holds that

$\beta \sum_{k=1}^{N} \| \nabla E(W^k) \|^2 \leq E(W^1) - E(W^{N+1}) \leq E(W^1).$

Let $N \to \infty$; then $\sum_{k=1}^{\infty} \| \nabla E(W^k) \|^2 < \infty$, so

$\lim_{k \to \infty} \| \nabla E(W^k) \| = 0,$

which finishes the proof for the weak convergence.
By (1), we have

$\| W^{k+1} - W^k \| = \| \Delta W^{k+1} \| \leq \eta \| \nabla E(W^k) \| + \| \tau_k \Delta W^k \| \leq \Big( \eta + \frac{\mu}{m} \Big) \| \nabla E(W^k) \|,$

which indicates

$\lim_{k \to \infty} \| W^{k+1} - W^k \| = 0.$

From Lemma 3.1, it holds that there exists $W^* \in \Omega$ such that

$\lim_{k \to \infty} W^k = W^*,$

which finishes the proof for the strong convergence.
4. Numerical Results
In this section, we propose an example to illustrate the convergence behavior of the iteration (1) by comparing the iteration steps (IT), elapsed CPU time in seconds (CPU) and relative residual error (RES). The experiment is terminated when the current iteration satisfies
or the number of the max iteration steps k = 1000 are exceeded. The computations are implemented in MATLAB on a PC computer with Intel (R) Core (R) CPU 1000 M @ 1.80 GHz, and 2.00 GB memory.
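To see qualitatively why momentum reduces the iteration count IT, the following toy experiment counts the iterations needed to drive the relative residual below a tolerance, with and without a momentum term. It is a stand-in, not the parity problem of Example 4.1: the ill-conditioned quadratic objective, the step size, and the fixed momentum coefficient are all illustrative assumptions.

```python
import numpy as np

def run(eta, mu, tol=1e-6, max_iter=1000):
    """Minimize E(W) = 0.5 * W^T A W; return the iteration count IT."""
    A = np.diag([1.0, 10.0])              # ill-conditioned quadratic
    W_prev = W = np.array([1.0, 1.0])
    g0 = np.linalg.norm(A @ W)            # initial gradient norm
    for it in range(1, max_iter + 1):
        g = A @ W
        if np.linalg.norm(g) / g0 < tol:  # relative residual (RES)
            return it
        # heavy-ball step: gradient descent plus a fixed momentum term
        W_prev, W = W, W - eta * g + mu * (W - W_prev)
    return max_iter

it_plain = run(eta=0.05, mu=0.0)      # gradient method without momentum
it_momentum = run(eta=0.05, mu=0.5)   # gradient method with momentum
```

On this problem the momentum variant needs noticeably fewer iterations, mirroring the IT comparison reported in Table 2.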
Example 4.1 ( [16] ) Four-dimensional parity problem (Table 1)
Table 2. Optimal parameters, CPU times, iteration numbers, and residuals.
In this simulation experiment, the initial weight vector $W^0$ is a 24-dimensional zero vector and $W^1$ is a 24-dimensional vector whose elements are all 1. The learning rate $\eta$ and the momentum factor $\mu$ are taken as the optimal parameters listed in Table 2. The number of training samples is 16. In Table 2, we compare the convergence behavior of the gradient method with momentum and the gradient method without momentum. It can be seen that the network training is improved significantly after the momentum term is added.
5. Conclusion
In this paper, we study the gradient method with momentum for training sigma-pi-sigma neural networks. The momentum coefficient is chosen in an adaptive manner, and the corresponding weak and strong convergence results are proved. Assumptions (A1) and (A2) in this paper are somewhat restrictive, so weakening one or both of them will be our future work.
Funding

This work was supported by the National Natural Science Foundation of China under Grant No. 61572018 and the Zhejiang Provincial Natural Science Foundation of China under Grant No. LY15A010016.