Least Squares Method from the View Point of Deep Learning

The least squares method is one of the most fundamental methods in Statistics for estimating correlations among various data. Deep Learning, on the other hand, is the heart of Artificial Intelligence, and it is a learning method based on the least squares method. In this paper we reconsider the least squares method from the viewpoint of Deep Learning, and we carry out the computation for the gradient descent sequence thoroughly in a very simple setting. Depending on the value of the learning rate, an essential parameter of Deep Learning, the least squares methods of Statistics and of Deep Learning reveal an interesting difference.


Introduction
The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in huge and complicated data, we usually employ this method. See for example [1].
On the other hand, Deep Learning is the heart of Artificial Intelligence and will become one of the most important fields in Data Science in the near future. For Deep Learning, see for example [2]-[6].
Deep Learning may be stated as a successive learning method based on the least squares method. Therefore, it is very natural to reconsider the least squares method from the viewpoint of Deep Learning, and we carry out the calculation of the successive approximation, called the gradient descent sequence, thoroughly.

For $n$ pieces of two-dimensional real data $\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$ we assume that their scatter plot is like Figure 1. Then a model function is linear:

$$y = ax + b. \tag{1}$$

For this function the error (or loss) function is defined by

$$E(a,b)=\frac{1}{2}\sum_{i=1}^{n}\left\{y_i-(ax_i+b)\right\}^2. \tag{2}$$

The aim of the least squares method is to minimize the error function (2) with respect to the parameters $\{a,b\}$. The stationarity conditions

$$\frac{\partial E}{\partial a}=0,\qquad \frac{\partial E}{\partial b}=0 \tag{3}$$

give a linear equation for $a$ and $b$,

$$\left(\sum_{i=1}^{n}x_i^2\right)a+\left(\sum_{i=1}^{n}x_i\right)b=\sum_{i=1}^{n}x_iy_i,\qquad \left(\sum_{i=1}^{n}x_i\right)a+nb=\sum_{i=1}^{n}y_i, \tag{4}$$

or, in matrix form,

$$\begin{pmatrix}\sum_{i=1}^{n}x_i^2 & \sum_{i=1}^{n}x_i\\ \sum_{i=1}^{n}x_i & n\end{pmatrix}\begin{pmatrix}a\\ b\end{pmatrix}=\begin{pmatrix}\sum_{i=1}^{n}x_iy_i\\ \sum_{i=1}^{n}y_i\end{pmatrix}, \tag{5}$$

and its solution is given by

$$a=\frac{n\sum_{i=1}^{n}x_iy_i-\sum_{i=1}^{n}x_i\sum_{i=1}^{n}y_i}{n\sum_{i=1}^{n}x_i^2-\left(\sum_{i=1}^{n}x_i\right)^2},\qquad b=\frac{\sum_{i=1}^{n}x_i^2\sum_{i=1}^{n}y_i-\sum_{i=1}^{n}x_i\sum_{i=1}^{n}x_iy_i}{n\sum_{i=1}^{n}x_i^2-\left(\sum_{i=1}^{n}x_i\right)^2}. \tag{6}$$

To check that these $a$ and $b$ give the minimum of (2) is a good exercise.
Note. We have an inequality

$$n\sum_{i=1}^{n}x_i^2-\left(\sum_{i=1}^{n}x_i\right)^2\ \ge\ 0,$$

and the equal sign holds if and only if $x_1=x_2=\dots=x_n$. Therefore, for non-degenerate data the denominator in (6) does not vanish.
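As a concrete illustration of (6), the following Python sketch computes the closed-form solution on hypothetical sample data (the arrays `x` and `y` are illustrative, not from the paper) and checks the result against `numpy.polyfit`.

```python
import numpy as np

# Hypothetical sample data; any pairs (x_i, y_i) with non-constant x work.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Closed-form least squares solution, equation (6).
den = n * np.sum(x**2) - np.sum(x)**2      # positive by the Note above
a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / den
b = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / den

print(a, b)
print(np.polyfit(x, y, 1))                 # returns [a, b]; should agree
```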

Least Squares Method from Deep Learning
In this section we reconsider the least squares method of Section 1 from the viewpoint of Deep Learning.
First we arrange the data in Section 1 as

Input data: $\{x_1, x_2, \dots, x_n\}$,
Teacher signal: $\{y_1, y_2, \dots, y_n\}$,

and consider a simple neuron model in [7] (see Figure 2): the neuron receives the input $x$ and outputs $ax+b$ with weight $a$ and bias $b$. For the teacher signal the error (or loss) function is

$$L(a,b)=\frac{1}{2}\sum_{i=1}^{n}\left\{y_i-(ax_i+b)\right\}^2. \tag{7}$$

Our aim is also to determine the parameters $\{a,b\}$ in order to minimize $L(a,b)$. However, the procedure is different from the least squares method in Section 1. This is an important and interesting point.
For later use let us perform a little calculation:

$$\frac{\partial L}{\partial a}=a\sum_{i=1}^{n}x_i^2+b\sum_{i=1}^{n}x_i-\sum_{i=1}^{n}x_iy_i,\qquad \frac{\partial L}{\partial b}=a\sum_{i=1}^{n}x_i+nb-\sum_{i=1}^{n}y_i. \tag{8}$$

We determine the parameters $\{a,b\}$ successively by the gradient descent method (see for example [8]): for a differentiable function $f(w)$ the sequence

$$w_{t+1}=w_t-\epsilon\,\frac{\partial f}{\partial w}(w_t) \tag{9}$$

decreases $f$ step by step if the learning rate $\epsilon>0$ is small enough. In our case, for $t=0,1,2,\dots$,

$$\begin{pmatrix}a_{t+1}\\ b_{t+1}\end{pmatrix}=\begin{pmatrix}a_t\\ b_t\end{pmatrix}-\epsilon\begin{pmatrix}\frac{\partial L}{\partial a}(a_t,b_t)\\ \frac{\partial L}{\partial b}(a_t,b_t)\end{pmatrix}, \tag{10}$$

where the initial value $(a_0,b_0)$ is given appropriately. As will be shown shortly in Theorem I, its explicit value is not important.
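The update rule (10) is easy to run as it stands. The following Python sketch iterates it on the same kind of hypothetical data; the learning rate 0.01 is an assumption that happens to satisfy the bound of Theorem II below.

```python
import numpy as np

# Hypothetical data and settings (epsilon chosen small enough; see Theorem II).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, epsilon = len(x), 0.01
a, b = 0.0, 0.0                       # initial value (a_0, b_0); not important

for t in range(10000):
    dL_da = a * np.sum(x**2) + b * np.sum(x) - np.sum(x * y)   # gradient (8)
    dL_db = a * np.sum(x) + n * b - np.sum(y)
    a, b = a - epsilon * dL_da, b - epsilon * dL_db            # update (10)

print(a, b)                           # approaches the least squares solution (6)
```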
Comment. The parameter $\epsilon$ is called the learning rate, and it is very hard to choose it properly, as emphasized in [7]. In this paper we provide an estimation (see Theorem II).
Let us write down (10) explicitly. Due to (8),

$$\begin{pmatrix}a_{t+1}\\ b_{t+1}\end{pmatrix}=\begin{pmatrix}a_t\\ b_t\end{pmatrix}-\epsilon\begin{pmatrix}a_t\sum_{i=1}^{n}x_i^2+b_t\sum_{i=1}^{n}x_i-\sum_{i=1}^{n}x_iy_i\\ a_t\sum_{i=1}^{n}x_i+nb_t-\sum_{i=1}^{n}y_i\end{pmatrix}. \tag{11}$$

For simplicity, by setting

$$A=\begin{pmatrix}\sum_{i=1}^{n}x_i^2 & \sum_{i=1}^{n}x_i\\ \sum_{i=1}^{n}x_i & n\end{pmatrix},\qquad c=\begin{pmatrix}\sum_{i=1}^{n}x_iy_i\\ \sum_{i=1}^{n}y_i\end{pmatrix},\qquad v_t=\begin{pmatrix}a_t\\ b_t\end{pmatrix},$$

(11) becomes

$$v_{t+1}=(E-\epsilon A)v_t+\epsilon c, \tag{12}$$

where $E$ is a unit matrix. Note that $A^{-1}$ exists because $\det A=n\sum_{i=1}^{n}x_i^2-\left(\sum_{i=1}^{n}x_i\right)^2>0$ by the Note in Section 1 (unless all the $x_i$ coincide). The solution is easy and given by

$$v_t=(E-\epsilon A)^t\left(v_0-A^{-1}c\right)+A^{-1}c. \tag{13}$$

Note. Consider the simple difference equation (12): if $(E-\epsilon A)^t\to O$ as $t\to\infty$, where $O$ is a zero matrix, then (13) shows $v_t\to A^{-1}c$; namely, the limit $v=A^{-1}c$ satisfies

$$Av=c. \tag{14}$$

(14) is just the equation (5).
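The fixed-point statement of the Note can be observed directly. This sketch (hypothetical data again) iterates the matrix recursion (12) and compares the limit with $A^{-1}c$, the solution of (14), equivalently (5).

```python
import numpy as np

# Hypothetical data; A and c as defined above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n       ]])
c = np.array([np.sum(x * y), np.sum(y)])

epsilon = 0.01
v = np.zeros(2)                                       # v_0 = (a_0, b_0)
for t in range(10000):
    v = (np.eye(2) - epsilon * A) @ v + epsilon * c   # recursion (12)

print(v)                                              # gradient descent limit
print(np.linalg.solve(A, c))                          # A^{-1} c, equation (14) = (5)
```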
Let us evaluate (13) further. For the purpose we make some preparations from Linear Algebra [9]. For simplicity we set

$$\alpha=\sum_{i=1}^{n}x_i^2,\qquad \beta=\sum_{i=1}^{n}x_i,\qquad A=\begin{pmatrix}\alpha & \beta\\ \beta & n\end{pmatrix}. \tag{15}$$

Since $A$ is real symmetric, its eigenvalues are real; the characteristic equation $\lambda^2-(\alpha+n)\lambda+(\alpha n-\beta^2)=0$ gives

$$\lambda_{\pm}=\frac{(\alpha+n)\pm\sqrt{(\alpha-n)^2+4\beta^2}}{2}, \tag{16}$$

and both are positive because $\alpha+n>0$ and $\alpha n-\beta^2>0$ (the Note in Section 1). We set the two normalized eigenvectors of the matrix $A$, corresponding to $\lambda_+$ and $\lambda_-$, in a matrix form from (16),

$$Q=\left(u_+\ \ u_-\right),\qquad u_{\pm}=\frac{1}{\sqrt{\beta^2+(\lambda_{\pm}-\alpha)^2}}\begin{pmatrix}\beta\\ \lambda_{\pm}-\alpha\end{pmatrix} \tag{17}$$

(assuming $\beta\neq 0$; otherwise $A$ is already diagonal), and we also set

$$\Lambda=\begin{pmatrix}\lambda_+ & 0\\ 0 & \lambda_-\end{pmatrix}. \tag{18}$$

Then it is easy to see $Q^TQ=QQ^T=E$; namely, $Q$ is an orthogonal matrix. Then the diagonalization of $A$ becomes

$$A=Q\Lambda Q^T. \tag{19}$$

By substituting (19) into (13) and using

$$(E-\epsilon A)^t=Q\begin{pmatrix}(1-\epsilon\lambda_+)^t & 0\\ 0 & (1-\epsilon\lambda_-)^t\end{pmatrix}Q^T, \tag{20}$$

we finally obtain

Theorem I. A general solution to (12) is

$$\begin{pmatrix}a_t\\ b_t\end{pmatrix}=Q\begin{pmatrix}(1-\epsilon\lambda_+)^t & 0\\ 0 & (1-\epsilon\lambda_-)^t\end{pmatrix}Q^T\left\{\begin{pmatrix}a_0\\ b_0\end{pmatrix}-A^{-1}c\right\}+A^{-1}c, \tag{21}$$

where $A^{-1}c$ is the least squares solution (6). This is our main result.
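Theorem I can also be checked numerically. The sketch below (hypothetical data, assumed $\epsilon=0.01$) evaluates the closed form (21) at a finite step $t$ via the eigendecomposition of $A$ and compares it with direct iteration of (12).

```python
import numpy as np

# Hypothetical data; A, c as before.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
A = np.array([[np.sum(x**2), np.sum(x)], [np.sum(x), n]])
c = np.array([np.sum(x * y), np.sum(y)])

epsilon, t, v0 = 0.01, 200, np.zeros(2)
vbar = np.linalg.solve(A, c)                 # A^{-1} c, least squares solution (6)

lam, Q = np.linalg.eigh(A)                   # A = Q diag(lam) Q^T, Q orthogonal
D = np.diag((1 - epsilon * lam) ** t)        # (E - epsilon*A)^t in the eigenbasis
v_closed = Q @ D @ Q.T @ (v0 - vbar) + vbar  # closed form (21)

v = v0.copy()                                # direct iteration of (12)
for _ in range(t):
    v = (np.eye(2) - epsilon * A) @ v + epsilon * c

print(v_closed, v)                           # the two agree up to rounding
```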
Lastly, let us show how to choose the learning rate $\epsilon\ (>0)$. By Theorem I, the sequence $\{(a_t,b_t)\}$ converges for an arbitrary initial value $(a_0,b_0)$ if and only if $|1-\epsilon\lambda_+|<1$ and $|1-\epsilon\lambda_-|<1$. Since $0<\lambda_-\le\lambda_+$, this condition reduces to the following estimation.

Theorem II. The gradient descent sequence (10) converges to the least squares solution (6) for an arbitrary initial value if and only if

$$0<\epsilon<\frac{2}{\lambda_+}.$$

For $\epsilon\ge 2/\lambda_+$ the sequence oscillates or diverges, while the least squares method of Section 1 always gives the solution (6) directly; this is the difference announced in the abstract.
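The bound of Theorem II is easy to see in practice. This sketch (hypothetical data) computes $2/\lambda_+$ and runs the recursion once just below and once just above the threshold; the first run converges, the second blows up.

```python
import numpy as np

# Hypothetical data; same A, c as above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
A = np.array([[np.sum(x**2), np.sum(x)], [np.sum(x), n]])
c = np.array([np.sum(x * y), np.sum(y)])

lam_plus = np.linalg.eigvalsh(A)[-1]         # largest eigenvalue lambda_+
threshold = 2.0 / lam_plus                   # Theorem II bound on epsilon

for epsilon in (0.9 * threshold, 1.1 * threshold):
    v = np.zeros(2)
    for _ in range(200):
        v = (np.eye(2) - epsilon * A) @ v + epsilon * c
    print(epsilon, v)                        # converges below, diverges above
```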

Problem
In this section we present an outline of a simple generalization of the results in Section 2. The actual calculation is left as a problem (exercise) for readers.
For $n$ pieces of three-dimensional real data $\{(x_1,y_1,z_1),(x_2,y_2,z_2),\dots,(x_n,y_n,z_n)\}$ we arrange

Input data: $\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$,
Teacher signal: $\{z_1, z_2, \dots, z_n\}$,

and consider another simple neuron model (see Figure 4): the neuron receives the inputs $x$ and $y$ and outputs $ax+by+c$ with weights $a,b$ and bias $c$, so that the error (or loss) function is

$$L(a,b,c)=\frac{1}{2}\sum_{i=1}^{n}\left\{z_i-(ax_i+by_i+c)\right\}^2. \tag{22}$$

Then we present

Problem. Carry out the calculation corresponding to that given in Section 2. A numerical starting sketch is given below.
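As a numerical starting point for the problem (the analytic calculation is still left to the reader), the following sketch, with hypothetical data, runs gradient descent on the three parameters at once.

```python
import numpy as np

# Hypothetical three-dimensional data (x_i, y_i, z_i) for the exercise.
xy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
z = np.array([8.0, 7.0, 17.0, 16.0, 22.0])

# Design matrix with a column of ones for the bias c: output a*x + b*y + c.
X = np.column_stack([xy, np.ones(len(z))])

epsilon, p = 0.005, np.zeros(3)              # p = (a, b, c)
for t in range(20000):
    p = p - epsilon * X.T @ (X @ p - z)      # gradient of the loss (22)

print(p)
print(np.linalg.lstsq(X, z, rcond=None)[0])  # least squares answer, for comparison
```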

Concluding Remarks
In this paper we discussed the least squares method from the viewpoint of Deep Learning and carried out the calculation of the gradient descent sequence thoroughly. The difference in method between Statistics and Deep Learning delivers different results when the learning rate $\epsilon$ is changed. The result of Theorem II is, as far as we know, the first of its kind.
Deep Learning plays an essential role in Data Science and perhaps in almost all fields of Science. Therefore, it is desirable for undergraduates to master it as soon as possible. To master it they must study Calculus, Linear Algebra and Statistics. I am planning to write a comprehensive textbook in the near future [10].