
The least squares method is one of the most fundamental tools in Statistics for estimating correlations in data. Deep Learning, on the other hand, is the heart of Artificial Intelligence, and it is a learning method built on least squares. In this paper we reconsider the least squares method from the viewpoint of Deep Learning and carry out the computation of the gradient descent sequence thoroughly in a very simple setting. Depending on the value of the learning rate, an essential parameter of Deep Learning, the least squares methods of Statistics and of Deep Learning reveal an interesting difference.

The least squares method in Statistics plays an important role in almost all disciplines, from Natural Science to Social Science. When we want to find properties, tendencies or correlations hidden in large and complicated data, we usually employ this method. See for example [

On the other hand, Deep Learning is the heart of Artificial Intelligence and will become one of the most important fields in Data Science in the near future. For Deep Learning, see for example [

Deep Learning may be described as a successive learning method based on the least squares method. Therefore, it is very natural to reconsider the least squares method from the viewpoint of Deep Learning, and we carry out thoroughly the calculation of the successive approximation called the gradient descent sequence.

As the learning rate changes, the difference in method between Statistics and Deep Learning leads to different results.

Theorems I and II in the text are our main results, and a related problem (exercise) is presented for readers. Our results may give new insight into both Statistics and Data Science.

First of all, let us explain the least squares method for readers in a very simple setting. For n pieces of two-dimensional real data

\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}

we assume that their scatter plot suggests a linear trend.

Then the model function is taken to be linear:

f(x) = a x + b. \qquad (1)

For this function the error (or loss) function is defined by

E(a, b) = \frac{1}{2}\sum_{i=1}^{n}\{y_i - (a x_i + b)\}^2. \qquad (2)

The aim of the least squares method is to minimize the error function (2) with respect to {a, b}. A little calculation gives

E(a, b) = \frac{1}{2}\left\{ \left(\sum_{i=1}^{n} x_i^2\right) a^2 + 2\left(\sum_{i=1}^{n} x_i\right) a b + n b^2 - 2\left(\sum_{i=1}^{n} x_i y_i\right) a - 2\left(\sum_{i=1}^{n} y_i\right) b + \sum_{i=1}^{n} y_i^2 \right\}.

Then the stationarity conditions

\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0 \qquad (3)

give a linear equation for a and b

\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix} \qquad (4)

and its solution is given by

\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}^{-1}
\begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}. \qquad (5)

Explicitly, we have

a = \frac{n\left(\sum_{i=1}^{n} x_i y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\left(\sum_{i=1}^{n} x_i^2\right) - \left(\sum_{i=1}^{n} x_i\right)^2}, \quad
b = \frac{-\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i y_i\right) + \left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i\right)}{n\left(\sum_{i=1}^{n} x_i^2\right) - \left(\sum_{i=1}^{n} x_i\right)^2}. \qquad (6)

Checking that these a and b indeed give the minimum of (2) is a good exercise.
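As a numerical illustration, the following minimal Python sketch (NumPy assumed; the arrays data_x and data_y are hypothetical sample data) evaluates formula (6) and checks it against a direct solution of the linear system (4).

import numpy as np

# hypothetical sample data; any n >= 2 points with non-identical x_i will do
data_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(data_x)
sx, sy = data_x.sum(), data_y.sum()
sxx, sxy = (data_x ** 2).sum(), (data_x * data_y).sum()

# slope and intercept from formula (6)
den = n * sxx - sx ** 2
a = (n * sxy - sx * sy) / den
b = (-sx * sxy + sxx * sy) / den

# the same values from the normal equations (4)
A = np.array([[sxx, sx], [sx, n]])
f = np.array([sxy, sy])
a2, b2 = np.linalg.solve(A, f)

print(a, b)      # from (6)
print(a2, b2)    # should coincide with the values above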

Note We have an inequality

n(x_1^2 + x_2^2 + \cdots + x_n^2) \ge (x_1 + x_2 + \cdots + x_n)^2 \qquad (7)

and equality holds if and only if

x_1 = x_2 = \cdots = x_n.

Since {x_1, x_2, ⋯, x_n} are data, we may assume that x_i ≠ x_j for some i ≠ j. Therefore

n(x_1^2 + x_2^2 + \cdots + x_n^2) - (x_1 + x_2 + \cdots + x_n)^2 > 0

gives

\begin{vmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{vmatrix} > 0. \qquad (8)
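Incidentally, (7) is nothing but the Cauchy–Schwarz inequality applied to the vectors (x_1, x_2, ⋯, x_n) and (1, 1, ⋯, 1):

\left(\sum_{i=1}^{n} x_i \cdot 1\right)^2 \le \left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} 1^2\right) = n \sum_{i=1}^{n} x_i^2,

and equality forces the two vectors to be proportional, that is, x_1 = x_2 = ⋯ = x_n.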

In this section we reconsider the least squares method of Section 1 from the viewpoint of Deep Learning.

First we arrange the data of Section 1 as

Input data: \{(x_1, 1), (x_2, 1), \cdots, (x_n, 1)\}

Teacher signal: \{y_1, y_2, \cdots, y_n\}

and consider a simple neuron model in [

Here we use the linear function (1) instead of the sigmoid function z = σ(x).

In this case the squared error function becomes

L(a, b) = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 = \frac{1}{2}\sum_{i=1}^{n}\{y_i - (a x_i + b)\}^2.

We usually use L(a, b) instead of E(a, b) in (2).

Our aim is also to determine the parameters {a, b} so as to minimize L(a, b). However, the procedure is different from the least squares method in Section 1. This is an important and interesting point.

For later use let us perform a little calculation

\frac{\partial L}{\partial a} = -\sum_{i=1}^{n}(y_i - f(x_i))\frac{\partial f}{\partial a} = -\sum_{i=1}^{n}(y_i - f(x_i)) x_i, \quad
\frac{\partial L}{\partial b} = -\sum_{i=1}^{n}(y_i - f(x_i))\frac{\partial f}{\partial b} = -\sum_{i=1}^{n}(y_i - f(x_i)). \qquad (9)

We determine the parameters {a, b} successively by the gradient descent method; see for example [

\begin{pmatrix} a(0) \\ b(0) \end{pmatrix} \rightarrow \cdots \rightarrow \begin{pmatrix} a(t) \\ b(t) \end{pmatrix} \rightarrow \begin{pmatrix} a(t+1) \\ b(t+1) \end{pmatrix} \rightarrow \cdots

and

a(t+1) = a(t) - \epsilon \frac{\partial L}{\partial a}(t), \quad b(t+1) = b(t) - \epsilon \frac{\partial L}{\partial b}(t), \qquad (10)

where

L = L(a(t), b(t)) = \frac{1}{2}\sum_{i=1}^{n}\{y_i - (a(t) x_i + b(t))\}^2

and ϵ (0 < ϵ < 1) is small enough. The initial value (a(0), b(0))^T is chosen appropriately; as will be shown shortly in Theorem I, its explicit value is not important.
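As a concrete illustration of the gradients (9) and the updates (10), here is a minimal Python sketch (NumPy assumed; data_x and data_y are the same kind of hypothetical sample data as before):

import numpy as np

data_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

eps = 0.01          # learning rate, 0 < eps < 1
a, b = 0.0, 0.0     # initial values; by Theorem I their choice is not important

for t in range(10000):
    residual = data_y - (a * data_x + b)
    grad_a = -np.sum(residual * data_x)   # dL/da from (9)
    grad_b = -np.sum(residual)            # dL/db from (9)
    a = a - eps * grad_a                  # update (10)
    b = b - eps * grad_b

print(a, b)   # approaches the least squares solution (6) when eps is small enough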

Comment The parameter ϵ is called the learning rate, and it is very hard to choose ϵ properly, as emphasized in [

Let us write down (10) explicitly:

a(t+1) = a(t) + \epsilon\sum_{i=1}^{n}\{y_i - (a(t) x_i + b(t))\} x_i = \left(1 - \epsilon\sum_{i=1}^{n} x_i^2\right) a(t) - \epsilon\left(\sum_{i=1}^{n} x_i\right) b(t) + \epsilon\sum_{i=1}^{n} x_i y_i

and

b(t+1) = b(t) + \epsilon\sum_{i=1}^{n}\{y_i - (a(t) x_i + b(t))\} = -\epsilon\left(\sum_{i=1}^{n} x_i\right) a(t) + (1 - n\epsilon) b(t) + \epsilon\sum_{i=1}^{n} y_i.

These are cast in a vector-matrix form

\begin{pmatrix} a(t+1) \\ b(t+1) \end{pmatrix}
= \begin{pmatrix} \left(1 - \epsilon\sum_{i=1}^{n} x_i^2\right) a(t) - \epsilon\left(\sum_{i=1}^{n} x_i\right) b(t) + \epsilon\sum_{i=1}^{n} x_i y_i \\ -\epsilon\left(\sum_{i=1}^{n} x_i\right) a(t) + (1 - n\epsilon) b(t) + \epsilon\sum_{i=1}^{n} y_i \end{pmatrix}
= \begin{pmatrix} 1 - \epsilon\sum_{i=1}^{n} x_i^2 & -\epsilon\sum_{i=1}^{n} x_i \\ -\epsilon\sum_{i=1}^{n} x_i & 1 - n\epsilon \end{pmatrix} \begin{pmatrix} a(t) \\ b(t) \end{pmatrix} + \begin{pmatrix} \epsilon\sum_{i=1}^{n} x_i y_i \\ \epsilon\sum_{i=1}^{n} y_i \end{pmatrix}
= \left\{ \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \epsilon \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix} \right\} \begin{pmatrix} a(t) \\ b(t) \end{pmatrix} + \epsilon \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}. \qquad (11)

For simplicity by setting

x(t) = \begin{pmatrix} a(t) \\ b(t) \end{pmatrix}, \quad A = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}, \quad f = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix}

we have a simple equation

x(t+1) = (E - \epsilon A) x(t) + \epsilon f \qquad (12)

where E is the unit matrix. Due to (8) the matrix A is invertible (∃ A^{-1}); that is, we exclude the trivial and uninteresting case x_1 = x_2 = ⋯ = x_n.

The solution is easy and given by

x(t) = (E - \epsilon A)^t x(0) + \{E - (E - \epsilon A)^t\} A^{-1} f. \qquad (13)
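The closed form (13) is easy to verify numerically; the following minimal sketch (NumPy assumed, hypothetical data as before) iterates (12) and compares the result with (13) at a fixed step t.

import numpy as np

data_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(data_x)

A = np.array([[np.sum(data_x ** 2), np.sum(data_x)],
              [np.sum(data_x),      n]])
f = np.array([np.sum(data_x * data_y), np.sum(data_y)])
E = np.eye(2)

eps = 0.01
x0 = np.array([0.0, 0.0])   # x(0)

T = 50
x = x0.copy()
for _ in range(T):                       # iterate (12) T times
    x = (E - eps * A) @ x + eps * f

M = np.linalg.matrix_power(E - eps * A, T)
x_closed = M @ x0 + (E - M) @ np.linalg.solve(A, f)   # formula (13)

print(x)
print(x_closed)   # the two vectors should coincide up to rounding errors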

Note Let us consider a simple difference equation

x(n+1) = a\,x(n) + b \quad (a \ne 1)

for n = 0, 1, 2, ⋯. Then the solution is given by

x(n) = a^n x(0) + \frac{a^n - 1}{a - 1} b = a^n x(0) + \frac{1 - a^n}{1 - a} b.

Check this.
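A quick numerical check of this closed form (a minimal Python sketch; the particular values of a, b and x(0) below are arbitrary, with a ≠ 1):

a, b = 0.7, 2.0   # any a != 1 will do
x0 = 1.5          # x(0)

x = x0
for n in range(1, 21):
    x = a * x + b                                        # the recursion x(n+1) = a x(n) + b
    closed = a ** n * x0 + (1 - a ** n) / (1 - a) * b    # the claimed closed form
    assert abs(x - closed) < 1e-12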

Comment The solution (13) gives

\lim_{t\to\infty} x(t) = A^{-1} f \qquad (14)

if

\lim_{t\to\infty} (E - \epsilon A)^t = O \qquad (15)

where O is the zero matrix. Equation (14) is just Equation (5).

Let us evaluate (13) further. For this purpose we make some preparations from Linear Algebra [

A = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix} \equiv \begin{pmatrix} \alpha & \beta \\ \beta & \delta \end{pmatrix}, \quad
f = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix} \equiv \begin{pmatrix} e \\ f \end{pmatrix}

and we want to diagonalize A.

The characteristic polynomial of A is

f(\lambda) = |\lambda E - A| = \begin{vmatrix} \lambda - \alpha & -\beta \\ -\beta & \lambda - \delta \end{vmatrix} = (\lambda - \alpha)(\lambda - \delta) - \beta^2 = \lambda^2 - (\alpha + \delta)\lambda + \alpha\delta - \beta^2

and its roots, the eigenvalues of A, are given by

\lambda_{\pm} = \frac{\alpha + \delta \pm \sqrt{(\alpha + \delta)^2 - 4(\alpha\delta - \beta^2)}}{2} = \frac{\alpha + \delta \pm \sqrt{(\alpha - \delta)^2 + 4\beta^2}}{2}. \qquad (16)

It is easy to see

\lambda_+ > \lambda_- > 0,

\lambda_+ > 1 \quad (\text{since } \delta = n \text{ and } n \ge 2). \qquad (17)

We arrange the two eigenvectors of the matrix A, corresponding to λ_+ and λ_−, in matrix form:

\tilde{Q} = \begin{pmatrix} \lambda_+ - \delta & \beta \\ \beta & \lambda_- - \alpha \end{pmatrix}.

It is easy to see

(\lambda_+ - \delta)^2 + \beta^2 = (\lambda_- - \alpha)^2 + \beta^2 \equiv \Lambda^2

from (16) and we also set

Q = \frac{1}{\Lambda}\begin{pmatrix} \lambda_+ - \delta & \beta \\ \beta & \lambda_- - \alpha \end{pmatrix}. \qquad (18)

Then it is easy to see

Q^T Q = Q Q^T = E \ \Longrightarrow\ Q^T = Q^{-1}.

Namely, Q is an orthogonal matrix. Then the diagonalization of A becomes

A = Q \begin{pmatrix} \lambda_+ & 0 \\ 0 & \lambda_- \end{pmatrix} Q^T. \qquad (19)
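A minimal numerical check of (16), (18) and (19), with NumPy and the same hypothetical data as before:

import numpy as np

data_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(data_x)

alpha, beta, delta = np.sum(data_x ** 2), np.sum(data_x), float(n)
A = np.array([[alpha, beta], [beta, delta]])

# eigenvalues from (16)
root = np.sqrt((alpha - delta) ** 2 + 4 * beta ** 2)
lam_p = (alpha + delta + root) / 2
lam_m = (alpha + delta - root) / 2

# orthogonal matrix Q from (18)
Lam = np.sqrt((lam_p - delta) ** 2 + beta ** 2)
Q = np.array([[lam_p - delta, beta], [beta, lam_m - alpha]]) / Lam

print(np.allclose(Q @ Q.T, np.eye(2)))                    # Q is orthogonal
print(np.allclose(Q @ np.diag([lam_p, lam_m]) @ Q.T, A))  # diagonalization (19)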

By substituting (19) into (13) and using

E - \epsilon A = Q \begin{pmatrix} 1 - \epsilon\lambda_+ & 0 \\ 0 & 1 - \epsilon\lambda_- \end{pmatrix} Q^T

we finally obtain

Theorem I The general solution to (12) is

\begin{pmatrix} a(t) \\ b(t) \end{pmatrix}
= Q \begin{pmatrix} (1 - \epsilon\lambda_+)^t & 0 \\ 0 & (1 - \epsilon\lambda_-)^t \end{pmatrix} Q^T \begin{pmatrix} a(0) \\ b(0) \end{pmatrix}
+ Q \begin{pmatrix} \frac{1 - (1 - \epsilon\lambda_+)^t}{\lambda_+} & 0 \\ 0 & \frac{1 - (1 - \epsilon\lambda_-)^t}{\lambda_-} \end{pmatrix} Q^T \begin{pmatrix} e \\ f \end{pmatrix}. \qquad (20)

This is our main result.

Lastly, let us show how to choose the learning rate ϵ (0 < ϵ < 1), which is a very important problem in Deep Learning. Let us recall

\lambda_+ > \lambda_- > 0 \quad \text{and} \quad \lambda_+ > 1

from (17). By (15), the conditions

\lim_{t\to\infty} (E - \epsilon A)^t = O \ \Longleftrightarrow\ \lim_{t\to\infty} (1 - \epsilon\lambda_+)^t = \lim_{t\to\infty} (1 - \epsilon\lambda_-)^t = 0

determine the admissible range of ϵ. Noting

|1 - x| < 1\ (\Longleftrightarrow\ 0 < x < 2)\ \Longrightarrow\ \lim_{n\to\infty} (1 - x)^n = 0

we obtain

Theorem II The learning rate ϵ must satisfy the inequality

0 < \epsilon\lambda_+ < 2 \ \Longleftrightarrow\ 0 < \epsilon < \frac{2}{\lambda_+}. \qquad (21)

From (21), ϵ must become very small when λ_+ is large. It is easy to see that the second condition \lim_{t\to\infty}(1 - \epsilon\lambda_-)^t = 0 is then automatically satisfied.
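The threshold 2/λ_+ in (21) is easy to observe numerically: in the following minimal sketch (NumPy assumed, same hypothetical data as before) the iteration (12) converges to A^{-1} f for ϵ slightly below 2/λ_+ and blows up for ϵ slightly above it.

import numpy as np

data_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(data_x)

A = np.array([[np.sum(data_x ** 2), np.sum(data_x)],
              [np.sum(data_x),      n]])
f = np.array([np.sum(data_x * data_y), np.sum(data_y)])
target = np.linalg.solve(A, f)            # A^{-1} f, the least squares solution (5)

lam_plus = np.linalg.eigvalsh(A).max()    # the larger eigenvalue of A

def iterate(eps, steps):
    x = np.zeros(2)
    for _ in range(steps):
        x = (np.eye(2) - eps * A) @ x + eps * f   # the recursion (12)
    return x

print(iterate(0.9 * 2 / lam_plus, 5000), target)   # converges to A^{-1} f
print(iterate(1.1 * 2 / lam_plus, 50))             # diverges: the entries grow without bound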

Under the condition (21) of Theorem II we can recover (14):

\lim_{t\to\infty} \begin{pmatrix} a(t) \\ b(t) \end{pmatrix} = Q \begin{pmatrix} \frac{1}{\lambda_+} & 0 \\ 0 & \frac{1}{\lambda_-} \end{pmatrix} Q^T \begin{pmatrix} e \\ f \end{pmatrix} = A^{-1} f

by (19).

Comment For example, if we choose ϵ such that

\frac{2}{\lambda_+} < \epsilon < 1

then we cannot recover (14), which shows a difference between Statistics and Deep Learning. Let us emphasize that the choice of the initial values {a(0), b(0)} is irrelevant once the convergence condition (21) is satisfied.

As a result, how to choose ϵ properly becomes a very important problem in Deep Learning when the amount of data is huge. As far as we know, a result like Theorem II has not been obtained before.

In this section we outline a simple generalization of the results in Section 2. The actual calculation is left as a problem (exercise) for readers.

For n pieces of three-dimensional real data

\{(x_1, y_1, z_1), (x_2, y_2, z_2), \cdots, (x_n, y_n, z_n)\}

we assume that their scatter plot suggests a planar trend.

Then the model function is taken to be linear,

f(x, y) = a x + b y + c \qquad (22)

and the error (or loss) function is defined by

E(a, b, c) = \frac{1}{2}\sum_{i=1}^{n}\left(z_i - f(x_i, y_i)\right)^2 = \frac{1}{2}\sum_{i=1}^{n}\{z_i - (a x_i + b y_i + c)\}^2. \qquad (23)

The aim of the least squares method is to minimize the error function (23) with respect to {a, b, c}.

Since we want to treat the least squares method above from the viewpoint of Deep Learning, we again arrange the data as

Input data: \{(x_1, y_1, 1), (x_2, y_2, 1), \cdots, (x_n, y_n, 1)\}

Teacher signal: \{z_1, z_2, \cdots, z_n\}

and consider another simple neuron model (see

Then we present

Problem Carry out the calculation corresponding to that given in Section 2.
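As a starting hint, here is a minimal Python sketch (NumPy assumed; the three-dimensional data are hypothetical) of the analogue of the recursion (12) for the model (22), whose matrix A and vector f are read off from the gradients of (23); carrying out the analogues of Theorems I and II is the exercise.

import numpy as np

# hypothetical three-dimensional data
data_x = np.array([1.0, 2.0, 3.0, 4.0])
data_y = np.array([0.5, 2.5, 1.0, 3.0])
data_z = np.array([2.0, 4.1, 5.2, 8.0])
n = len(data_x)

# 3x3 analogue of A and 3-vector analogue of f, obtained from the gradients of (23)
A = np.array([[np.sum(data_x ** 2),     np.sum(data_x * data_y), np.sum(data_x)],
              [np.sum(data_x * data_y), np.sum(data_y ** 2),     np.sum(data_y)],
              [np.sum(data_x),          np.sum(data_y),          n]])
f = np.array([np.sum(data_x * data_z), np.sum(data_y * data_z), np.sum(data_z)])

eps = 0.01
v = np.zeros(3)                                  # (a, b, c), initial values
for _ in range(20000):
    v = (np.eye(3) - eps * A) @ v + eps * f      # the analogue of (12)

print(v)                      # gradient descent estimate of (a, b, c)
print(np.linalg.solve(A, f))  # direct least squares solution for comparison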

In this paper we discussed the least squares method from the viewpoint of Deep Learning and carried out the calculation of the gradient descent sequence thoroughly. The difference in method between Statistics and Deep Learning gives different results when the learning rate ϵ is changed. As far as we know, the result of Theorem II is the first of its kind.

Deep Learning plays an essential role in Data Science and perhaps in almost all fields of Science. Therefore it is desirable for undergraduates to master it as soon as possible. To master it, they must study Calculus, Linear Algebra and Statistics. However, we do not know of a good and compact textbook leading to Deep Learning.

I am planning to write a comprehensive textbook in the near future [

We wish to thank Ryu Sasaki for useful suggestions and comments.
