Least Squares Method from the View Point of Deep Learning II: Generalization

The least squares method is one of the most fundamental methods in Statistics to estimate correlations among various data. On the other hand, Deep Learning is the heart of Artificial Intelligence, and it is a learning method based on the least squares method, in which a parameter called the learning rate plays an important role. It is in general very hard to determine its value. In this paper we generalize the preceding paper [K. Fujii: Least squares method from the view point of Deep Learning: Advances in Pure Mathematics, 8, 485-493, 2018] and give an admissible value of the learning rate, which is easily obtained.


Introduction
This paper is a sequel to the preceding paper [1].
The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in huge and complicated data we usually employ the method. See for example [2].
On the other hand, Deep Learning is the heart of Artificial Intelligence and will become a most important field in Data Science in the near future. As to Deep Learning see for example [3]-[10].
Deep Learning may be stated as a successive learning method based on the least squares method. Therefore, to reconsider it from the view point of Deep Learning is natural and instructive. We thoroughly carry out the calculation of the successive approximation called the gradient descent sequence, in which a parameter ε called the learning rate plays an important role. One of the main points is to determine the range of the learning rate, which is a very hard problem [8]. We showed in [1] that a difference in methods between Statistics and Deep Learning leads to different results when the learning rate changes.
We generalize the preceding results to the case of the least squares method by polynomial approximation. Our results may give a new insight to both Statistics and Data Science.

Least Squares Method
Let us explain the least squares method by polynomial approximation [9]. The model function f(x) is a polynomial in x of degree M given by

  f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M.   (1)

For N pieces of two-dimensional real data

  (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)

we assume that their scatter plot is given like Figure 1. The coefficients of (1),

  w = (w_0, w_1, ..., w_M)^T,   (2)

must be determined by the data set (T denotes the transposition of a vector or a matrix). For this set of data the error function is given by

  E(w) = (1/2) Σ_{i=1}^{N} (y_i - f(x_i))^2.   (3)

Figure 1. Scatter plot of the data.

However, in this paper another approach based on the quadratic form is given, which is instructive.

Let us calculate the error function (3). By using the definition of the inner product ⟨a, b⟩ = a^T b and setting

  y = (y_1, y_2, ..., y_N)^T,

  Φ =
  ( 1  x_1  x_1^2  ...  x_1^M )
  ( 1  x_2  x_2^2  ...  x_2^M )
  (            ...            )
  ( 1  x_N  x_N^2  ...  x_N^M )

(an N × (M+1) matrix), we have

  E(w) = (1/2) ⟨y - Φw, y - Φw⟩ = (1/2) (y - Φw)^T (y - Φw).   (4)

Let us deform (4). From (4) we have a general quadratic form

  2E(w) = w^T (Φ^T Φ) w - w^T Φ^T y - y^T Φ w + y^T y.   (5)

On the other hand, the deformation of (5) is well-known.

Formula. For a symmetric and invertible matrix A,

  x^T A x - x^T b - b^T x = (x - A^{-1} b)^T A (x - A^{-1} b) - b^T A^{-1} b.   (6)

The proof is easy: expanding the right-hand side gives

  (x - A^{-1} b)^T A (x - A^{-1} b) = x^T A x - x^T b - b^T x + b^T A^{-1} b,

and this gives (6).

Therefore, with A = Φ^T Φ and b = Φ^T y (Φ^T Φ is assumed to be invertible), our case becomes

  2E(w) = (w - (Φ^T Φ)^{-1} Φ^T y)^T (Φ^T Φ) (w - (Φ^T Φ)^{-1} Φ^T y) + y^T (E_N - Φ (Φ^T Φ)^{-1} Φ^T) y,   (7)

and then the minimum is given by

  ŵ = (Φ^T Φ)^{-1} Φ^T y,   (8)

where E_N is the N-dimensional identity matrix.

Our method is simple and clear ("smart" in our terminology).
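The "smart" solution (8) can be checked numerically. The following sketch uses illustrative data (not from the paper); the polynomial degree, sample size, coefficients and noise level are assumptions made only for the demonstration.

```python
import numpy as np

# Illustrative data: a noisy quadratic (M = 2), not taken from the paper.
rng = np.random.default_rng(0)
M, N = 2, 20
x = np.linspace(0.1, 2.0, N)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.001 * rng.standard_normal(N)

# Design matrix Phi: the i-th row is (1, x_i, x_i^2, ..., x_i^M).
Phi = np.vander(x, M + 1, increasing=True)

# Minimum (8) of the error function E(w) = (1/2)||y - Phi w||^2:
# w_hat = (Phi^T Phi)^{-1} Phi^T y, computed via a linear solve.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse; `np.linalg.lstsq(Phi, y)` returns the same minimizer and is numerically preferable when Φ^T Φ is ill-conditioned.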

Least Squares Method from Deep Learning
In this section we reconsider the least squares method in Section 2 from the view point of Deep Learning.
First we arrange the data in Section 2 like

  Input data     : x_1, x_2, ..., x_N,
  Teacher signal : y_1, y_2, ..., y_N,   (9)

and consider a simple neuron model in [11] (see Figure 2).
Here we use the polynomial (1) instead of the sigmoid function. In this case the square error function becomes

  L(w) = (1/2) Σ_{i=1}^{N} (y_i - f(x_i))^2 = (1/2) ⟨y - Φw, y - Φw⟩.   (10)

We in general use L(w) instead of E(w). Our aim is also to determine the parameters w in order to minimize L(w). However, the procedure is different from the least squares method in Section 2. This is an important and interesting point.
The parameters w are determined successively by the gradient descent method (see for example [12]): for t = 0, 1, 2, ...,

  w(t+1) = w(t) - ε (∂L/∂w)(w(t)),   (11)

where

  (∂L/∂w)(w) = Φ^T (Φw - y)   (12)

and ε (> 0) is a small parameter called the learning rate.
The initial value w(0) is given appropriately. Pay attention that t is discrete time and T is the transposition.
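The update (11)-(12) can be sketched directly. The data, the degree (M = 1), the iteration count, and the choice ε = 1/λ_1 below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative data lying exactly on a line (M = 1).
x = np.linspace(0.1, 1.0, 10)
y = 0.5 + 1.5 * x
Phi = np.vander(x, 2, increasing=True)      # rows (1, x_i)

# A learning rate safely inside 0 < eps < 2/lambda_1 (here eps = 1/lambda_1).
eps = 1.0 / np.linalg.eigvalsh(Phi.T @ Phi).max()

w = np.zeros(2)                             # initial value w(0)
for _ in range(20000):
    w = w - eps * Phi.T @ (Phi @ w - y)     # gradient descent step (11)-(12)
```

With ε inside the admissible range the iterates approach the least squares solution (0.5, 1.5)^T regardless of the initial value w(0).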
Let us calculate (11) by use of (12):

  w(t+1) = (E_{M+1} - ε Φ^T Φ) w(t) + ε Φ^T y,   (13)

where E_{M+1} is the (M+1)-dimensional identity matrix. This equation is easily solved to be

  w(t) = (E_{M+1} - ε Φ^T Φ)^t w(0) + ε Σ_{k=0}^{t-1} (E_{M+1} - ε Φ^T Φ)^k Φ^T y   (14)

for t = 0, 1, 2, .... The proof is left to readers. Since this is not a final form let us continue the calculation. From (14) we have, when (E_{M+1} - ε Φ^T Φ)^t tends to the zero matrix as t → ∞,

  lim_{t→∞} w(t) = (Φ^T Φ)^{-1} Φ^T y.   (15)

The matrix Φ^T Φ is positive definite, so all eigenvalues are positive. Therefore we can arrange all eigenvalues like

  λ_1 ≥ λ_2 ≥ ... ≥ λ_{M+1} > 0   (16)

and obtain the diagonalization

  Φ^T Φ = Q D Q^T,   (17)

where Q is an element in O(M+1) (the orthogonal group of degree M + 1) and D is a diagonal matrix

  D = diag(λ_1, λ_2, ..., λ_{M+1}).

By substituting (17) into (14) and using the equation

  Σ_{k=0}^{t-1} (1 - ελ_j)^k = (1 - (1 - ελ_j)^t) / (ελ_j)   (18)

we finally obtain

Theorem I. A general solution to (11) is

  w(t) = Q (E_{M+1} - εD)^t Q^T w(0) + Q diag( (1 - (1 - ελ_j)^t) / λ_j ) Q^T Φ^T y.

This is our main result.
Next, let us show how to choose the learning rate ε. From (16) and (18), the convergence

  lim_{t→∞} w(t) = (Φ^T Φ)^{-1} Φ^T y   (19)

holds if and only if |1 - ελ_j| < 1 for all j, namely for 0 < ε < 2/λ_1. Hence:

Theorem II. The learning rate ε must satisfy an inequality

  0 < ε < 2/λ_1.   (20)

The greater the value of ε, the sooner goes the gradient descent (11) so long as the convergence (19) is guaranteed. Let us note that the choice of the initial value w(0) is irrelevant when the convergence condition (20) is satisfied.
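The closed form of Theorem I can be checked against the raw iteration: since w(t) - ŵ = (E_{M+1} - εΦ^TΦ)^t (w(0) - ŵ) with ŵ = (Φ^TΦ)^{-1}Φ^Ty, which is an equivalent way of writing the theorem, both must agree at every t. The data, ε, and w(0) below are illustrative assumptions.

```python
import numpy as np

# Illustrative data (M = 1) and an assumed learning rate.
x = np.linspace(0.2, 1.0, 8)
y = 1.0 - 0.7 * x
Phi = np.vander(x, 2, increasing=True)
A = Phi.T @ Phi
w_hat = np.linalg.solve(A, Phi.T @ y)       # least squares minimum (8)

eps, t = 0.05, 15
w0 = np.array([2.0, -2.0])                  # an arbitrary initial value w(0)

# Raw iteration of (11)-(12), t steps.
w = w0.copy()
for _ in range(t):
    w = w - eps * Phi.T @ (Phi @ w - y)

# Closed form: w(t) = (E - eps*A)^t (w(0) - w_hat) + w_hat.
closed = np.linalg.matrix_power(np.eye(2) - eps * A, t) @ (w0 - w_hat) + w_hat
```

The identity holds for any ε, not only for ε in the convergent range (20); convergence only decides whether (E - εA)^t dies out as t grows.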
Comment. For example, if we choose ε like ε ≥ 2/λ_1, then |1 - ελ_1| ≥ 1 and the sequence (14) does not converge.

How to Estimate the Learning Rate
How do we calculate λ_1? Since {λ_j} are the eigenvalues of the matrix Φ^T Φ, they satisfy the equation

  det(Φ^T Φ - λ E_{M+1}) = 0.   (21)

This is abstract, so let us deform (21). For simplicity we write Φ in terms of its column vectors as

  Φ = (φ_0, φ_1, ..., φ_M),  φ_j = (x_1^j, x_2^j, ..., x_N^j)^T.

Then it is easy to see

  Φ^T Φ = ( ⟨φ_i, φ_j⟩ )  (0 ≤ i, j ≤ M),

where ⟨a, b⟩ is the (real) inner product of vectors.
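The Gram-matrix structure Φ^TΦ = (⟨φ_i, φ_j⟩) is easy to verify numerically; the data below are illustrative, not from the paper.

```python
import numpy as np

# Illustrative data, degree M = 2.
x = np.array([0.5, 1.0, 1.5, 2.0])
M = 2
Phi = np.vander(x, M + 1, increasing=True)

# Columns phi_j = (x_1^j, ..., x_N^j)^T and the matrix of their inner products.
phi = [x**j for j in range(M + 1)]
G = np.array([[np.dot(phi[i], phi[j]) for j in range(M + 1)]
              for i in range(M + 1)])
# Entry (i, j) of G equals <phi_i, phi_j> = sum_k x_k^(i+j), i.e. (Phi^T Phi)_ij.
```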
For clarity let us write down (21) explicitly:

  det
  ( ⟨φ_0, φ_0⟩ - λ    ⟨φ_0, φ_1⟩        ...   ⟨φ_0, φ_M⟩      )
  ( ⟨φ_1, φ_0⟩        ⟨φ_1, φ_1⟩ - λ    ...   ⟨φ_1, φ_M⟩      )
  (       ...               ...         ...        ...        )
  ( ⟨φ_M, φ_0⟩        ⟨φ_M, φ_1⟩        ...   ⟨φ_M, φ_M⟩ - λ  ) = 0.

As far as we know there is no viable method to determine the greatest root λ_1 of this equation. Therefore, let us be satisfied with an approximate value which is both greater than λ_1 and easy to calculate. For this purpose the Gerschgorin theorem is very useful.

Theorem (Gerschgorin). For a square matrix A = (a_{ij}) of degree n set

  R_i = Σ_{j ≠ i} |a_{ij}|

for each i. Then every eigenvalue of A lies in the union of the discs

  D(a_{ii}, R_i) = { z ∈ C : |z - a_{ii}| ≤ R_i }.

This is a closed disc centered at a_{ii} with radius R_i called the Gerschgorin disc.
The proof is simple.See for example [7].
Our case Φ^T Φ is real and symmetric, so every eigenvalue is real and lies in the union of closed intervals

  [a_{ii} - R_i, a_{ii} + R_i],  a_{ii} = ⟨φ_i, φ_i⟩,  R_i = Σ_{j ≠ i} |⟨φ_i, φ_j⟩|,

where [A, B] denotes a closed interval. If we set

  F = max_i (a_{ii} + R_i) = max_i Σ_j |⟨φ_i, φ_j⟩|,

then it is easy to see λ_1 ≤ F. Thus we arrive at an admissible value of the learning rate ε which is easily obtained.

Theorem III. An admissible value of ε is

  ε = 2/F,  F = max_i Σ_j |⟨φ_i, φ_j⟩|.

Example. In this case it is easy to see

  a_{ij} = ⟨φ_i, φ_j⟩ = Σ_{k=1}^{N} x_k^{i+j},

and we set s_m = Σ_{k=1}^{N} x_k^m for simplicity. Moreover, we may assume x_k > 0. Then every entry of Φ^T Φ is positive, and we have

  F = max_{0 ≤ i ≤ M} (s_i + s_{i+1} + ... + s_{i+M}),

so the admissible value ε = 2/F is obtained from the data alone.
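The recipe of Theorem III can be sketched numerically: the Gerschgorin bound F (the largest absolute row sum of Φ^TΦ) dominates λ_1, and ε = 2/F drives the gradient descent to the least squares solution. The data below are illustrative assumptions.

```python
import numpy as np

# Illustrative data (M = 1) with x_k > 0, as in the Example.
x = np.linspace(0.1, 1.0, 10)
y = 0.3 + 0.9 * x
Phi = np.vander(x, 2, increasing=True)
A = Phi.T @ Phi

# Gerschgorin upper bound F = max_i sum_j |a_ij| and the admissible rate 2/F.
F = np.abs(A).sum(axis=1).max()
lam1 = np.linalg.eigvalsh(A).max()          # exact lambda_1, for comparison only
eps = 2.0 / F

# Gradient descent with the easily obtained eps converges to (8).
w = np.zeros(2)
for _ in range(50000):
    w = w - eps * Phi.T @ (Phi @ w - y)
```

Computing F needs only the entries of Φ^TΦ, whereas λ_1 needs an eigenvalue solver; this is the practical content of "easily obtained".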

Concluding Remarks
In this paper we have discussed the least squares method by polynomial approximation from the view point of Deep Learning and carried out the calculation of the gradient descent thoroughly. A difference in methods between Statistics and Deep Learning leads to different results when the learning rate ε is changed. Theorem III is, as far as we know, the first result to provide an admissible value of ε.
Deep Learning plays an essential role in Data Science and perhaps in almost all fields of Science. Therefore it is desirable for undergraduates to master it in the early stages. To master it they must study Calculus, Linear Algebra and Statistics from Mathematics. My textbook [7] is recommended.
Notes
1. The matrix Φ^T Φ is symmetric and invertible by the assumption, and it is moreover positive definite, so all eigenvalues are positive. This can be shown as follows. Let us consider the eigenvalue equation Φ^T Φ v = λ v with v ≠ 0. Then λ ⟨v, v⟩ = ⟨v, Φ^T Φ v⟩ = ⟨Φv, Φv⟩ ≥ 0, and λ = 0 is excluded by the invertibility.
2. If we choose ε outside the range (20), then (E_{M+1} - ε Φ^T Φ)^t does not tend to the zero matrix O_{M+1} of degree M + 1 as t → ∞, so we cannot recover (15), which shows a difference in methods between Statistics and Deep Learning. When (20) holds, (15) is just the equation (8) and it is independent of ε.
3. [1] treats an example in the case of M = 1, which is very instructive for non-experts.
4. λ_1 ≈ F if M is very large. To check the inequality λ_1 ≤ F is left to readers; from it the admissible value of Theorem III is obtained.