
In this paper, we introduce the concepts of square neurons and power neurons, and new learning algorithms based on them. First, we briefly review the basic idea of the Boltzmann Machine, specifically that a Boltzmann Machine generates a Markov chain, which possesses an invariant distribution. We further review the ABM (Attrasoft Boltzmann Machine). Next, we review the θ-transformation and its completeness, i.e. that any function can be expanded by the θ-transformation. The invariant distribution of an ABM is a θ-transformation; therefore, an ABM can simulate any distribution. We review the linear neurons and the associated learning algorithm. We then discuss the problems of the exponential neurons used in the ABM, which are unstable, and the problems of the linear neurons, which do not discriminate the wrong answers from the right answers as sharply as the exponential neurons. Finally, we introduce the concepts of square neurons and power neurons. We also discuss the advantages of the learning algorithms based on them, which have the stability of the linear neurons and the sharp discrimination of the exponential neurons.

Neural networks and deep learning currently provide the best solutions to many supervised learning problems. In 2006, a publication by Hinton, Osindero, and Teh [

After Hinton’s initial attempt at training one layer at a time, Deep Neural Networks train all layers together. Examples include TensorFlow [

In addition to neural network algorithms, there are numerous learning algorithms. We select a few such algorithms below.

Principal Component Analysis [

Sparse coding [

$L_{sc} = \| W H - X \|_2^2 + \lambda \| H \|_1$

where W is the transformation (dictionary) matrix, H is the matrix of codes, and X is the matrix of data. λ implements a trade-off between sparsity and reconstruction.

Autoencoders [

$L_{ae} = \| W \sigma( W^T X ) - X \|_2^2$

where σ is a neural network activation function. Note that $L_{sc}$ looks almost like $L_{ae}$ once we set $H = \sigma(W^T X)$. The differences are: 1) autoencoders do not encourage sparsity in their general form; 2) an autoencoder uses a model to find the codes, while sparse coding does so by means of optimization.
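The two losses above can be compared numerically. Below is a minimal sketch, not from the original paper, using hypothetical random data, a logistic σ, and NumPy as an assumed dependency:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 20))   # data matrix (hypothetical)
W = rng.normal(size=(8, 5))    # transformation (dictionary) matrix
H = rng.normal(size=(5, 20))   # code matrix for sparse coding
lam = 0.1                      # sparsity/reconstruction trade-off

def sigma(z):
    # logistic activation, one possible choice of neural network function
    return 1.0 / (1.0 + np.exp(-z))

# Sparse coding loss: reconstruction error plus L1 penalty on the codes H
L_sc = np.sum((W @ H - X) ** 2) + lam * np.sum(np.abs(H))

# Autoencoder loss: the codes are produced by the model, H = sigma(W^T X)
H_ae = sigma(W.T @ X)
L_ae = np.sum((W @ H_ae - X) ** 2)
```

Setting `H = H_ae` in the first loss (and dropping the L1 term) recovers the second, which is the similarity noted above.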

K-means clustering [

If we limit the learning architecture to one layer, all of these algorithms have advantages for some applications. The deep learning architectures currently provide the best solutions to many supervised learning problems, because two layers, when “properly” constructed, are better than one. A first theoretical question is the existence of a solution for a given problem. This is typically followed by the development of an effective solution, i.e. an algorithm; then by a study of the stability of the algorithm; and finally by a study of the efficiency of solutions. Although these theoretical approaches are not necessary for the empirical development of practical algorithms, the theoretical studies do advance the understanding of the problems, and they prompt the development of new and better algorithms for practical problems. Along the direction of solution existence, Hornik, Stinchcombe, & White [

Hornik, Stinchcombe, & White [

Deep Belief Networks (DBN) are generative neural network models with many layers of hidden explanatory factors, recently introduced by Hinton, Osindero, and Teh, along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a Restricted Boltzmann machine (RBM), used to represent one layer of the model. Restricted Boltzmann machines are interesting because inference is easy in them and because they have been successfully used as building blocks for training deeper models. Roux & Bengio [

As an alternative to the direction of “deep layers”, higher order connections are another direction. In our earlier paper [

Once we accept that the deep learning architectures currently provide the best solutions, the next question is what is in each layer; in this paper, we intend to fill these layers with the square and power neurons.

In [

In Section 2, we briefly review how to use probability distributions in a Supervised Learning Problem. In this approach, given an input A, an output B, and a mapping from A to B, one can convert this problem to a probability distribution [

In Section 3, the Boltzmann machines [

In Section 4, we review the ABM (Attrasoft Boltzmann Machine) [ , in which: 1) the neural connections go up to the n^{th} order; and 2) all of the connections up to the n^{th} order are determined by the ABM algorithm [

In Section 5, we review θ-transformation [

In Section 6, we review the completeness of the θ-transformation [

In Section 7, we discuss how the invariant distribution of an ABM implements a θ-transformation [

In Section 8, we discuss the stability problem of the exponential neurons.

In Section 9, we review linear neurons [

In Section 10, we review the linear neuron learning algorithms.

In Section 11, we will take the middle ground between the exponential neurons and the linear neurons, which has the advantages of both algorithms and avoids the disadvantages of both. The new contribution of this paper is that we introduce the concepts of square neurons and power neurons.

In Section 12, we discuss the advantages of the two new learning algorithms based on square neurons and power neurons, which have the stability of the linear neurons and the sharp discrimination of the exponential neurons.

In Section 13, we introduce a simple example to demonstrate the improvement of the square neurons and power neurons over linear neurons.

The basic supervised learning [

$p = p(a, b), \quad a \in A, \; b \in B.$

If $a_1$ does not match $b_1$, the probability is 0 or close to 0. If $a_1$ matches $b_1$, the probability is higher than 0. This reduces the problem of inferring a mapping from A to B to that of inferring a distribution function.
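As an illustration, the following sketch (with a hypothetical two-element mapping, not from the paper) encodes a mapping from A to B as a joint distribution in which matching pairs receive positive probability and mismatched pairs receive probability 0:

```python
# Hypothetical toy mapping A -> B, encoded as a joint distribution p(a, b).
A = ["a1", "a2"]
B = ["b1", "b2"]
mapping = {"a1": "b1", "a2": "b2"}

# Matching pairs share the probability mass; mismatched pairs get 0.
p = {(a, b): (0.5 if mapping[a] == b else 0.0) for a in A for b in B}
```

Inferring `mapping` is now equivalent to inferring the distribution `p`.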

An irreducible finite Markov chain possesses a stationary distribution [

A Boltzmann machine [

$x = (x_1, x_2, \cdots, x_n)$

of a Boltzmann machine is called a configuration. The configuration transition is mathematically described by a Markov chain with $2^n$ configurations $x \in X$, where X is the set of all points $(x_1, x_2, \cdots, x_n)$. When all of the configurations are connected, it forms a Markov chain. A Markov chain has an invariant distribution [
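The invariant distribution of a small Markov chain can be computed by iterating the transition matrix. The sketch below, with a hypothetical 4-configuration chain (not from the paper), illustrates the idea:

```python
import numpy as np

# A small irreducible Markov chain over 4 configurations; P[i, j] is the
# transition probability from configuration i to configuration j.
P = np.array([
    [0.5, 0.2, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.2, 0.5, 0.1],
    [0.3, 0.1, 0.1, 0.5],
])

pi = np.full(4, 0.25)      # start from the uniform distribution
for _ in range(1000):      # iterate pi <- pi P until convergence
    pi = pi @ P

# pi is now (numerically) the invariant distribution: pi P = pi.
```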

The Boltzmann machine [

More formally, let F be the set of all functions. Let T be the parameter space of a family of Boltzmann machines. Given an unknown f ∈ F , one can find a Boltzmann machine such that the equilibrium distribution of this Boltzmann machine realizes, as faithfully as possible, the unknown function [

Let T be the parameter space of a family of Boltzmann machines, and let F_{T} be the set of all functions that can be inferred by the Boltzmann Machines over T; obviously, F_{T} is a subset of F. It turns out that F_{T} is significantly smaller than F and it is not a good approximation for F. The main contribution of the Boltzmann machine is to establish a framework for inferencing a mapping from A to B.

The invariant distribution of a Boltzmann machine [

$p(x) = b \, e^{\sum_{i<j} M_{ij} x_i x_j}$ (1)

If the threshold vector does not vanish, the distributions are:

$p(x) = b \, e^{\sum_{i<j} M_{ij} x_i x_j - \sum T_i x_i}$ (2)

By rearranging the above distribution, we have:

$p(x) = e^{c - \sum T_i x_i + \sum_{i<j} M_{ij} x_i x_j}$

It turns out that the third order Boltzmann machines have the following type of distributions:

$p(x) = e^{c - \sum T_i x_i + \sum_{i<j} M_{ij} x_i x_j + \sum_{i<j<k} M_{ijk} x_i x_j x_k}$ (3)

An ABM [ has neural connections up to the n^{th} order. All of the connections up to the n^{th} order are determined by the ABM algorithm [

By adding additional terms, the invariant distribution for an ABM is:

$p(x) = e^H,$

$H = \theta_0 + \sum \theta_1^{i_1} x_{i_1} + \sum \theta_2^{i_1 i_2} x_{i_1} x_{i_2} + \sum \theta_3^{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + \cdots + \theta_n^{12 \cdots n} x_1 x_2 \cdots x_n$

An ABM is significantly more powerful in simulating an unknown function. As more and more terms are added, from the second order terms to the n^{th} order terms, the invariant distribution space becomes larger and larger. Like the Boltzmann Machines in the last section, an ABM implements a transformation from F to T. We hope, ultimately, that this ABM transformation is complete, so that given any function $f \in F$, we can find an ABM, $t \in T$, such that the equilibrium distribution of this ABM realizes precisely the unknown function. We show that this is exactly the case.

We first introduce some notations used in this paper [

$p = (p_0, p_1, \cdots, p_{N-1}),$

which is the x-representation of p in the x-coordinate system.

In the x-coordinate system, there are two representations of a vector:

・ $\{p_i\}$ in the x-representation, and

・ $\{p_m^{i_1 i_2 \cdots i_m}\}$ in the θ-representation.

In the θ-coordinate system, there are two representations of a vector:

・ $\{\theta_i\}$ in the x-representation, and

・ $\{\theta_m^{i_1 i_2 \cdots i_m}\}$ in the θ-representation.

The reason for the two different representations is that the x-representation is natural for the x-coordinate system, and the θ-representation is natural for the θ-coordinate system.

The transformations between $\{p_i\}$ and $\{p_m^{i_1 i_2 \cdots i_m}\}$, and those between $\{\theta_i\}$ and $\{\theta_m^{i_1 i_2 \cdots i_m}\}$, are similar. Because of the similarity, in the following, only the transformation between $\{p_i\}$ and $\{p_m^{i_1 i_2 \cdots i_m}\}$ will be introduced. Let n be the number of neurons and $N = 2^n$ the number of components. An N-dimensional vector, p, is:

$p = (p_0, p_1, \cdots, p_{N-1})$ (4)

Consider $p_x$; because $x \in \{0, 1, \cdots, N-1 = 2^n - 1\}$ is the position inside a distribution, x can be rewritten in binary form:

$x = x_n 2^{n-1} + \cdots + x_2 2^1 + x_1 2^0$ (5)

Some of the coefficients x_{i} might be zero. In dropping those coefficients which are zero, we write:

$x = x_{i_1} x_{i_2} \cdots x_{i_m} = 2^{i_m - 1} + \cdots + 2^{i_2 - 1} + 2^{i_1 - 1}.$ (6)

This generates the following transformation:

$p_m^{i_1 i_2 \cdots i_m} = p_x = p_{2^{i_m - 1} + \cdots + 2^{i_2 - 1} + 2^{i_1 - 1}}$ (7)

where

$1 \le i_1 < i_2 < \cdots < i_m \le n$ (8)

In this θ-representation, a vector p looks like:

$\{ p_0, p_1^1, p_1^2, p_1^3, \cdots, p_2^{12}, p_2^{13}, p_2^{23}, \cdots, p_3^{123}, \cdots \}$

The 0th order term is $p_0$; the first order terms are $p_1^1, p_1^2, p_1^3, \cdots$. The first few terms in the transformation between $\{p_i\}$ and $\{p_m^{i_1 i_2 \cdots i_m}\}$ are:

$p_0 = p_0, \; p_1^1 = p_1, \; p_1^2 = p_2, \; p_2^{12} = p_3, \; p_1^3 = p_4, \; p_2^{13} = p_5, \; p_2^{23} = p_6, \; p_3^{123} = p_7, \; p_1^4 = p_8, \cdots$ (9)
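The index bookkeeping in Equations (5)-(9) can be sketched in a few lines; the helper names below are hypothetical:

```python
def x_to_indices(x):
    """Convert a position x into the index set (i1, ..., im) with
    x = 2**(i1-1) + ... + 2**(im-1), per Equations (5)-(7)."""
    return tuple(i + 1 for i in range(x.bit_length()) if (x >> i) & 1)

def indices_to_x(indices):
    """Inverse: recover the position x from the index set."""
    return sum(2 ** (i - 1) for i in indices)

# Reproduce the first few entries of Equation (9):
assert x_to_indices(3) == (1, 2)   # p_3 = p_2^{12}
assert x_to_indices(5) == (1, 3)   # p_5 = p_2^{13}
assert x_to_indices(8) == (4,)     # p_8 = p_1^{4}
assert indices_to_x((1, 2, 3)) == 7
```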

The x-representation is the normal representation, and the θ-representation is a form of binary representation.

Denote a distribution by p, which has a x-representation in the x-coordinate system, p(x), and a θ-representation in the θ-coordinate system, p(θ). When a distribution function, p(x) is transformed from one coordinate system to another, the vectors in both coordinates represent the same abstract vector. When a vector q is transformed from the x-representation q(x) to the θ-representation q(θ), then q(θ) is transformed back to q ′ ( x ) , q ′ ( x ) = q ( x ) .

The θ-transformation uses a function F, called a generating function. The function F is required to have the inverse:

$FG = GF = I, \quad G = F^{-1}.$ (10)

Let p be a vector in the x-coordinate system. As already discussed above, it can be written either as:

$p(x) = (p_0, p_1, \cdots, p_{N-1})$ (11)

or

$p(x) = (p_0; p_1^1, \cdots, p_1^n; p_2^{12}, \cdots, p_2^{n-1,n}; p_3^{123}, \cdots, p_n^{12 \cdots n}).$ (12)

The θ-transformation transforms a vector from the x-coordinate to the θ-coordinate via a generating function. The components of the vector p in the x-coordinate, p(x), can be converted into components of a vector p(θ) in the θ-coordinate:

$p(\theta) = (\theta_0; \theta_1^1, \cdots, \theta_1^n; \theta_2^{12}, \cdots, \theta_2^{n-1,n}; \theta_3^{123}, \cdots, \theta_n^{12 \cdots n}),$ (13)

or

$p(\theta) = (\theta_0, \theta_1, \cdots, \theta_{N-1}).$ (14)

Let F be a generating function, which transforms the x-representation of p in the x-coordinate to a θ-representation of p in the θ-coordinate system. The θ-components are determined by the vector F[p(x)] as follows:

$F[p(x)] = \theta_0 + \sum \theta_1^{i_1} x_{i_1} + \sum \theta_2^{i_1 i_2} x_{i_1} x_{i_2} + \sum \theta_3^{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + \cdots + \theta_n^{12 \cdots n} x_1 x_2 \cdots x_n$ (15)

where

$1 \le i_1 < i_2 < \cdots < i_m \le n$ (16)

Prior to the transformation, p(x) is the x-representation of p in the x-coordinate; after transformation, F[p(x)] is a θ-representation of p in the θ-coordinate system.

There are N components in the x-coordinate and N components in the θ-coordinate. By introducing a new notation X:

$X_0 = X_0 = 1, \; X_1^1 = X_1 = x_1, \; X_1^2 = X_2 = x_2, \; X_2^{12} = X_3 = x_1 x_2, \; X_1^3 = X_4 = x_3, \; X_2^{13} = X_5 = x_1 x_3, \; X_2^{23} = X_6 = x_2 x_3, \; X_3^{123} = X_7 = x_1 x_2 x_3, \; X_1^4 = X_8 = x_4, \cdots$ (17)

then the vector can be written as:

$F[p(x)] = \sum \theta_J X_J$ (18)

By using the assumption GF = I, we have:

$p(x) = G\left\{ \sum \theta_J X_J \right\}$ (19)

where J denotes the index in either of the two representations in the θ-coordinate system.

The transformation of a vector p from the x-representation, p(x), in the x-coordinate system to a θ-representation, p(θ), in the θ-coordinate system is called θ-transformation [

The θ-transformation is determined by [

$\theta_m^{i_1 i_2 \cdots i_m} = F[p_m^{i_1 i_2 \cdots i_m}] + F[p_{m-2}^{i_1 \cdots i_{m-2}}] + \cdots + F[p_{m-2}^{i_3 \cdots i_m}] + F[p_{m-4}^{\cdots}] + \cdots - F[p_{m-1}^{i_1 \cdots i_{m-1}}] - \cdots - F[p_{m-1}^{i_2 \cdots i_m}] - F[p_{m-3}^{i_1 \cdots i_{m-3}}] - \cdots$ (20)

The inverse of the θ-transformation [

$p_m^{i_1 i_2 \cdots i_m} = G\left( \theta_0 + \theta_1^{i_1} + \theta_1^{i_2} + \cdots + \theta_1^{i_m} + \theta_2^{i_1 i_2} + \theta_2^{i_1 i_3} + \cdots + \theta_2^{i_{m-1} i_m} + \cdots + \theta_m^{i_1 i_2 \cdots i_m} \right)$ (21)

Because the θ-transformation is implemented by an ordinary function satisfying FG = GF = I, as long as there are no singular points in the transformation, any distribution function can be expanded. If we require $p_i \ge \varepsilon$ for a predefined small number ε, then there will be no singular points in the transformation.

An ABM with n neurons has neural connections up to the n^{th} order. The invariant distribution is:

$p(x) = e^H,$

$H = \theta_0 + \sum \theta_1^{i_1} x_{i_1} + \sum \theta_2^{i_1 i_2} x_{i_1} x_{i_2} + \sum \theta_3^{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + \cdots + \theta_n^{12 \cdots n} x_1 x_2 \cdots x_n.$

An ABM implements a θ-transformation [

$F(y) = \log(y), \quad G(y) = \exp(y).$

We call the neurons in the ABM algorithm the exponential neurons, because of the exponential generating function. Furthermore, the “connection matrix” element can be calculated as follows [

$\theta_m^{i_1 i_2 \cdots i_m} = \log \dfrac{p_m^{i_1 i_2 \cdots i_m} \, p_{m-2}^{i_1 \cdots i_{m-2}} \cdots p_{m-2}^{i_3 \cdots i_m} \, p_{m-4}^{\cdots} \cdots}{p_{m-1}^{i_1 \cdots i_{m-1}} \cdots p_{m-1}^{i_2 \cdots i_m} \, p_{m-3}^{i_1 \cdots i_{m-3}} \cdots}$ (22)

The reverse problem is as follows: given an ABM, the invariant distribution can be calculated as follows [

$p_m^{i_1 i_2 \cdots i_m} = \exp\left( \theta_0 + \theta_1^{i_1} + \theta_1^{i_2} + \cdots + \theta_1^{i_m} + \theta_2^{i_1 i_2} + \theta_2^{i_1 i_3} + \cdots + \theta_2^{i_{m-1} i_m} + \cdots + \theta_m^{i_1 i_2 \cdots i_m} \right)$ (23)
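Equation (23) can be checked numerically. The sketch below, with hypothetical coefficients for two indices, sums θ over all sub-index sets (including the empty set for $\theta_0$) and exponentiates:

```python
from itertools import combinations
from math import exp, log

def p_from_theta(indices, theta):
    """Equation (23): p_m^{i1...im} is the exponential of the sum of theta
    over every sub-index set of (i1, ..., im), including theta_0 (empty set).
    `theta` maps index tuples such as (), (1,), (1, 2) to coefficients."""
    s = sum(theta.get(tuple(sub), 0.0)
            for k in range(len(indices) + 1)
            for sub in combinations(indices, k))
    return exp(s)

# Toy check with hypothetical coefficients for two neurons, indices (1, 2):
theta = {(): log(0.1), (1,): log(2.0), (2,): log(3.0), (1, 2): log(0.5)}
p = p_from_theta((1, 2), theta)   # exp of the sum of logs = 0.1*2.0*3.0*0.5
```

Because the θ's are logarithms, the sum inside the exponential is exactly the product form shown further below.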

Therefore, an ABM can realize a θ-expansion, which in turn can approximate any distribution. The starting point of the algorithm is a complete expansion; thus, it has the advantage of accuracy [

$p_m^{i_1 i_2 \cdots i_m} = \exp(\theta_0) \exp(\theta_1^{i_1}) \exp(\theta_1^{i_2}) \cdots \exp(\theta_1^{i_m}) \cdot \exp(\theta_2^{i_1 i_2}) \exp(\theta_2^{i_1 i_3}) \cdots \exp(\theta_2^{i_{m-1} i_m}) \cdots \exp(\theta_m^{i_1 i_2 \cdots i_m})$

The ABM algorithm uses a multiplicative expansion, which raises the question of stability. Therefore, we expect to improve this algorithm.

If we take the derivative of Equation (22), and

let

$p_a \in \{ p_m^{i_1 i_2 \cdots i_m} \}, \quad \theta_b \in \{ \theta_m^{i_1 i_2 \cdots i_m} \},$

then, depending on whether $p_a$ appears in the numerator, is absent, or appears in the denominator, the partial derivative is

$\dfrac{\partial \theta_b}{\partial p_a} = \dfrac{1}{p_a}, \quad 0, \quad -\dfrac{1}{p_a}.$

Because of $1/p_a$, a small change in $p = (p_0, p_1, \cdots, p_{N-1})$ can cause a large change in

$p(\theta) = (\theta_0; \theta_1^1, \cdots, \theta_1^n; \theta_2^{12}, \cdots, \theta_2^{n-1,n}; \theta_3^{123}, \cdots, \theta_n^{12 \cdots n}).$

Expand:

$e^y = 1 + y + \dfrac{y^2}{2!} + \cdots$

As we will argue in the next few sections,

・ If we replace $G(y) = e^y$ with $G(y) = y$, the θ-transformation will be stable, i.e. a small $\partial p_a$ will cause a small $\partial \theta_b$;

・ If a generating function $G(y) = y$ can classify a problem correctly, the generating function $G(y) = y^2$ can discriminate the wrong answers from the right answers more sharply than $G(y) = y$;

・ The generating function $G(y) = y^{n+1}$ can discriminate the wrong answers from the right answers more sharply than $G(y) = y^n$.

In the next section, we will replace $G(y) = e^y$ with $G(y) = y$. On one hand, this replacement will stabilize the θ-transformation. On the other hand, the linear term does not discriminate the wrong answers from the right answers as sharply as the exponential neurons, because we can view the exponential neurons as consisting of contributions from the linear term, the square term, the cubic term, …

If we can convert the multiplication expansion to an addition expansion, then the performance will be more stable.

Let:

$G(y) = y, \quad F(y) = y.$

From section 5, we have:

$\theta_m^{i_1 i_2 \cdots i_m} = p_m^{i_1 i_2 \cdots i_m} + p_{m-2}^{i_1 \cdots i_{m-2}} + \cdots + p_{m-2}^{i_3 \cdots i_m} + p_{m-4}^{\cdots} + \cdots - p_{m-1}^{i_1 \cdots i_{m-1}} - \cdots - p_{m-1}^{i_2 \cdots i_m} - p_{m-3}^{i_1 \cdots i_{m-3}} - \cdots$

$p_m^{i_1 i_2 \cdots i_m} = \theta_0 + \theta_1^{i_1} + \theta_1^{i_2} + \cdots + \theta_1^{i_m} + \theta_2^{i_1 i_2} + \theta_2^{i_1 i_3} + \cdots + \theta_2^{i_{m-1} i_m} + \cdots + \theta_m^{i_1 i_2 \cdots i_m}$

We call these neurons linear neurons. The new algorithm uses summation in expansion, thus it is more stable compared to exponential neurons. The partial derivatives do not have singular points.

Example: let an ANN have 3 neurons,

$(x_1, x_2, x_3),$

and let a distribution be:

$\{ p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7 \}.$

Then,

$p_0 = \theta_0$, $p_1 = \theta_0 + \theta_1$, $p_2 = \theta_0 + \theta_2$, $p_3 = \theta_0 + \theta_1 + \theta_2 + \theta_3$, $p_4 = \theta_0 + \theta_4$, $p_5 = \theta_0 + \theta_1 + \theta_4 + \theta_5$, $p_6 = \theta_0 + \theta_2 + \theta_4 + \theta_6$, $p_7 = \theta_0 + \theta_1 + \theta_2 + \theta_3 + \theta_4 + \theta_5 + \theta_6 + \theta_7$.
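The example above can be verified mechanically: $p_x$ is the sum of $\theta_J$ over every index J whose binary representation is a sub-mask of x. A minimal sketch, with hypothetical unit coefficients:

```python
def p_linear(x, theta):
    """Linear neurons: p_x is the sum of theta_J over every index J whose
    binary representation is a sub-mask of x (the Section 9 expansion)."""
    return sum(t for j, t in enumerate(theta) if (j & x) == j)

theta = [1.0] * 8                 # hypothetical coefficients, n = 3
assert p_linear(0, theta) == 1.0  # p_0 = theta_0
assert p_linear(3, theta) == 4.0  # p_3 = theta_0+theta_1+theta_2+theta_3
assert p_linear(5, theta) == 4.0  # p_5 = theta_0+theta_1+theta_4+theta_5
assert p_linear(7, theta) == 8.0  # p_7 = sum of all eight thetas
```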

When the expansion uses addition, it has the advantage of stability. As we will show below, it also has a third advantage of fast training (low time complexity).

In [ , the $L_1$-distance between two configurations is:

$d(x', x) = |x'_1 - x_1| + |x'_2 - x_2| + \cdots$

For example, d ( 111 , 111 ) = 0 ; d ( 111 , 110 ) = 1 .

The linear neuron learning algorithm can be summarized into a single formula:

$\theta_m^{i_1 i_2 \cdots i_m} = 2^{D - d(x_m^{i_1 i_2 \cdots i_m}, x)}, \quad \text{if } 0 \le d(x_m^{i_1 i_2 \cdots i_m}, x) \le D$

$\theta_m^{i_1 i_2 \cdots i_m} = 0, \quad \text{if } d(x_m^{i_1 i_2 \cdots i_m}, x) > D$

where $d(x_m^{i_1 i_2 \cdots i_m}, x)$ is the distance between a neuron configuration, x, and a training neuron configuration, $x_m^{i_1 i_2 \cdots i_m}$, and D is called the connection radius. Beyond this radius, all connections are 0.
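This formula can be sketched directly; the function names below are hypothetical:

```python
def hamming(u, v):
    """L1 distance between two binary configurations."""
    return sum(abs(a - b) for a, b in zip(u, v))

def weight(config, training, D=2):
    """Linear neuron learning rule: theta = 2^(D - d) within the
    connection radius D, and 0 beyond it."""
    d = hamming(config, training)
    return 2 ** (D - d) if d <= D else 0

t = (1, 1, 1)
assert weight(t, t) == 4          # d = 0 -> 2^D
assert weight((1, 1, 0), t) == 2  # d = 1
assert weight((1, 0, 0), t) == 1  # d = 2
assert weight((0, 0, 0), t) == 0  # d = 3 > D
```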

The linear neuron learning algorithm is [

Step 1. The First Assignment (d = 0)

The first step is to assign the first connection matrix element for a training vector, $x = x_m^{i_1 i_2 \cdots i_m}$. We will assign:

$\theta_x = \theta_m^{i_1 i_2 \cdots i_m} = 2^D,$

where D is the radius of the connection space.

Step 2. The Rest of the Assignment

The next step is to assign the rest of the weight:

$\theta_m^{i_1 i_2 \cdots i_m} = 2^{D - d(x_m^{i_1 i_2 \cdots i_m}, x)}, \quad \text{if } 0 \le d(x_m^{i_1 i_2 \cdots i_m}, x) \le D$

$\theta_m^{i_1 i_2 \cdots i_m} = 0, \quad \text{if } d(x_m^{i_1 i_2 \cdots i_m}, x) > D$

Step 3. Modification

The algorithm uses bit “1” to represent an input pattern or an output class; so, the input vectors or the output vectors cannot be all 0’s; otherwise, these coefficients are 0.

Step 4. Retraining

Repeat the last three steps for all training patterns; if there is an overlap, take the maximum values:

$\theta_m^{i_1 i_2 \cdots i_m}(t+1) = \max\left\{ \theta_m^{i_1 i_2 \cdots i_m}(t), \theta_m^{i_1 i_2 \cdots i_m} \right\}.$

The linear neurons do not discriminate the wrong answers from the right answers as sharply as the exponential neurons. We will use a numerical example to demonstrate this in the later section.

To improve the accuracy of the linear neurons, we define the square neurons using the following generating function:

$F(y) = y^{1/2}, \quad G(y) = y^2.$

We define the power neurons using the following generating function:

$F(y) = y^{1/L}, \quad G(y) = y^L.$

For square neurons, we have:

$\theta_m^{i_1 i_2 \cdots i_m} = \left( p_m^{i_1 i_2 \cdots i_m} + p_{m-2}^{i_1 \cdots i_{m-2}} + \cdots + p_{m-2}^{i_3 \cdots i_m} + p_{m-4}^{\cdots} + \cdots - p_{m-1}^{i_1 \cdots i_{m-1}} - \cdots - p_{m-1}^{i_2 \cdots i_m} - p_{m-3}^{i_1 \cdots i_{m-3}} - \cdots \right)^{1/2}$

$p_m^{i_1 i_2 \cdots i_m} = \left( \theta_0 + \theta_1^{i_1} + \theta_1^{i_2} + \cdots + \theta_1^{i_m} + \theta_2^{i_1 i_2} + \theta_2^{i_1 i_3} + \cdots + \theta_2^{i_{m-1} i_m} + \cdots + \theta_m^{i_1 i_2 \cdots i_m} \right)^2$

For power neurons, we have:

$\theta_m^{i_1 i_2 \cdots i_m} = \left( p_m^{i_1 i_2 \cdots i_m} + p_{m-2}^{i_1 \cdots i_{m-2}} + \cdots + p_{m-2}^{i_3 \cdots i_m} + p_{m-4}^{\cdots} + \cdots - p_{m-1}^{i_1 \cdots i_{m-1}} - \cdots - p_{m-1}^{i_2 \cdots i_m} - p_{m-3}^{i_1 \cdots i_{m-3}} - \cdots \right)^{1/L}$

$p_m^{i_1 i_2 \cdots i_m} = \left( \theta_0 + \theta_1^{i_1} + \theta_1^{i_2} + \cdots + \theta_1^{i_m} + \theta_2^{i_1 i_2} + \theta_2^{i_1 i_3} + \cdots + \theta_2^{i_{m-1} i_m} + \cdots + \theta_m^{i_1 i_2 \cdots i_m} \right)^L$

The square neuron learning algorithm is similar to the linear neuron learning algorithm, except that the final probabilities differ. For the linear neurons:

$p_m^{i_1 i_2 \cdots i_m} = \left( \theta_0 + \theta_1^{i_1} + \theta_1^{i_2} + \cdots + \theta_1^{i_m} + \theta_2^{i_1 i_2} + \theta_2^{i_1 i_3} + \cdots + \theta_2^{i_{m-1} i_m} + \cdots + \theta_m^{i_1 i_2 \cdots i_m} \right)^1$

And for the square neurons:

$p_m^{i_1 i_2 \cdots i_m} = \left( \theta_0 + \theta_1^{i_1} + \theta_1^{i_2} + \cdots + \theta_1^{i_m} + \theta_2^{i_1 i_2} + \theta_2^{i_1 i_3} + \cdots + \theta_2^{i_{m-1} i_m} + \cdots + \theta_m^{i_1 i_2 \cdots i_m} \right)^2$

If the linear neuron can classify a problem correctly, then the square neurons will do better. We will not formally prove this, but we will use a simple example to show the point.

Example. Assume a linear neuron distribution is (1, 2, 3, 4)/10; then, based on the above expressions, the square neuron distribution is (1^{2}, 2^{2}, 3^{2}, 4^{2})/30. The largest probability is increased from 4/10 to 16/30.
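This example can be verified in a few lines:

```python
# Sharpening by squaring: square the linear scores, then renormalize.
linear = [1, 2, 3, 4]
p_lin = [v / sum(linear) for v in linear]    # (1, 2, 3, 4)/10
square = [v ** 2 for v in linear]
p_sq = [v / sum(square) for v in square]     # (1, 4, 9, 16)/30

# The winning probability rises from 4/10 = 0.4 to 16/30 ~ 0.533.
assert max(p_sq) > max(p_lin)
```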

In this section, we will first use the linear neuron algorithm [

The 10 input vectors for digits in

I 0 = ( 1 , 1 , 1 , 0 , 1 , 1 , 1 ) ,

I 1 = ( 0 , 0 , 1 , 0 , 0 , 1 , 0 ) ,

I 2 = ( 1 , 0 , 1 , 0 , 1 , 0 , 1 ) ,

⋯

where I_{0} is an image of “0”. The 10 output vectors for digits in

O 0 = ( 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 ) ,

O 1 = ( 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 ) ,

O 2 = ( 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 ) ,

⋯

where O_{0} is a classification. The 10 training vectors have 17 bits:

T 0 = ( I 0 , O 0 ) = ( ( 1 , 1 , 1 , 0 , 1 , 1 , 1 ) , ( 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 ) ) ,

T 1 = ( I 1 , O 1 ) = ( ( 0 , 0 , 1 , 0 , 0 , 1 , 0 ) , ( 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 ) ) ,

⋯

We will set the radius D = 2; the possible elements are $2^{D-d}$ = 4, 2, 1, and 0.

We will work out a few examples. As the first example, we rewrite T_{1} as:

$T_1 = (I_1, O_1) = x_3^{2,5,9} = ((0,0,1,0,0,1,0), (0,1,0,0,0,0,0,0,0,0)).$

The first connection element (0-distance) is $\theta_3^{2,5,9} = 4$. There are two coefficients for d = 1: $\theta_2^{5,9} = \theta_2^{2,9} = 2$. T_1 generates 3 coefficients.

As the second example, we rewrite T_{7} as:

$x_4^{1,2,5,14} = ((1,0,1,0,0,1,0), (0,0,0,0,0,0,0,1,0,0)).$

The first connection element (0-distance) is $\theta_4^{1,2,5,14} = 4$. There are 3 coefficients for d = 1: $\theta_3^{1,2,14} = \theta_3^{1,5,14} = \theta_3^{2,5,14} = 2$. There are 3 coefficients for d = 2: $\theta_2^{1,14} = \theta_2^{2,14} = \theta_2^{5,14} = 1$. T_7 generates 7 coefficients.

As the last example, we rewrite T_{4} as:

$x_5^{1,2,3,5,11} = ((0,1,1,1,0,1,0), (0,0,0,0,1,0,0,0,0,0)),$

the first connection element (0-distance) is $\theta_5^{1,2,3,5,11} = 4$. There are 4 coefficients for d = 1: $\theta_4^{1,2,3,11} = \theta_4^{1,2,5,11} = \theta_4^{1,3,5,11} = \theta_4^{2,3,5,11} = 2$. There are 6 coefficients for d = 2: $\theta_3^{1,2,11} = \cdots = \theta_3^{3,5,11} = 1$. T_4 generates 11 coefficients.
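The coefficient counts in the three examples follow a simple combinatorial pattern, sketched below under the assumption (from Step 3) that a coefficient whose input part is all 0's is set to 0:

```python
from math import comb

def coefficient_count(input_ones, D=2):
    """Number of nonzero connection coefficients a training vector
    generates: at distance d we drop d of the input's '1' bits
    (C(input_ones, d) ways), except that dropping every input bit
    would leave an all-zero input, which Step 3 sets to 0."""
    return sum(comb(input_ones, d)
               for d in range(0, D + 1) if d < input_ones)

assert coefficient_count(2) == 3    # T_1: 1 + 2 (d = 2 excluded)
assert coefficient_count(3) == 7    # T_7: 1 + 3 + 3
assert coefficient_count(4) == 11   # T_4: 1 + 4 + 6
```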

After training the linear neuron algorithm with $\{T_0, T_1, \cdots, T_9\}$, all of the connection coefficients, $\theta_m^{i_1 i_2 \cdots i_m}$, are calculated. Section 9 provides the formula to calculate the probability of each (input, output) pair. For example, the probability is $p_3^{2,5,9}$ if the input is “1” and the output is in class 1; the probability is $p_3^{2,5,8}$ if the input is “1” and the output is in class 0; the probability is $p_3^{2,5,10}$ if the input is “1” and the output is in class 2; …

The character recognition results [

The worst case is input = 8, see

This is the main reason for the square neurons and power neurons, which will improve all digits in

For the square neuron algorithm, the results are in

For the power neuron algorithm with L = 4, the results are in

To summarize, the square and power neurons add leverage to the linear neurons’ performance: they sharply increase the linear neurons’ discrimination between the correct answer and the wrong answers.

Input | p_{0} | p_{1} | p_{2} | p_{3} | p_{4} | p_{5} | p_{6} | p_{7} | p_{8} | p_{9}
---|---|---|---|---|---|---|---|---|---|---
0 | 31 | 8 | 6 | 6 | 5 | 6 | 7 | 13 | 8 | 7
1 | 0 | 8 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | 0
2 | 1 | 2 | 24 | 6 | 1 | 1 | 1 | 4 | 1 | 1
3 | 1 | 8 | 6 | 24 | 5 | 6 | 1 | 13 | 1 | 7
4 | 0 | 8 | 0 | 1 | 18 | 1 | 0 | 4 | 0 | 1
5 | 1 | 2 | 1 | 6 | 5 | 24 | 7 | 4 | 1 | 7
6 | 7 | 2 | 6 | 6 | 5 | 24 | 31 | 4 | 8 | 7
7 | 0 | 8 | 0 | 1 | 1 | 0 | 0 | 13 | 0 | 0
8 | 31 | 8 | 24 | 24 | 18 | 24 | 31 | 13 | 39 | 31
9 | 7 | 8 | 6 | 24 | 18 | 24 | 7 | 13 | 8 | 31

Input | p_{0} | p_{1} | p_{2} | p_{3} | p_{4} | p_{5} | p_{6} | p_{7} | p_{8} | p_{9}
---|---|---|---|---|---|---|---|---|---|---
8 | 31 | 8 | 24 | 24 | 18 | 24 | 31 | 13 | 39 | 31
8 | 0.1276 | 0.03292 | 0.09877 | 0.09877 | 0.07407 | 0.09877 | 0.12757 | 0.0535 | 0.16049 | 0.1276
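The normalized row is obtained by dividing each raw score by the row sum; a minimal check:

```python
# Raw scores for input "8" from the table above, normalized into a
# probability distribution; the values match the second table row.
raw = [31, 8, 24, 24, 18, 24, 31, 13, 39, 31]
probs = [v / sum(raw) for v in raw]

assert sum(raw) == 243
assert max(probs) == probs[8]   # class 8 wins, but not by much
```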

Input | p_{0} | p_{1} | p_{2} | p_{3} | p_{4} | p_{5} | p_{6} | p_{7} | p_{8} | p_{9}
---|---|---|---|---|---|---|---|---|---|---
0 | 961 | 64 | 36 | 36 | 25 | 36 | 49 | 169 | 64 | 49
1 | 0 | 64 | 0 | 0 | 1 | 0 | 0 | 16 | 0 | 0
2 | 1 | 4 | 576 | 36 | 1 | 1 | 1 | 16 | 1 | 1
3 | 1 | 64 | 36 | 576 | 25 | 36 | 1 | 169 | 1 | 49
4 | 0 | 64 | 0 | 1 | 324 | 1 | 0 | 16 | 0 | 1
5 | 1 | 4 | 1 | 36 | 25 | 576 | 49 | 16 | 1 | 49
6 | 49 | 4 | 36 | 36 | 25 | 576 | 961 | 16 | 64 | 49
7 | 0 | 64 | 0 | 1 | 1 | 0 | 0 | 169 | 0 | 0
8 | 961 | 64 | 576 | 576 | 324 | 576 | 961 | 169 | 1521 | 961
9 | 49 | 64 | 36 | 576 | 324 | 576 | 49 | 169 | 64 | 961

Input | p_{0} | p_{1} | p_{2} | p_{3} | p_{4} | p_{5} | p_{6} | p_{7} | p_{8} | p_{9}
---|---|---|---|---|---|---|---|---|---|---
8 | 961 | 64 | 576 | 576 | 324 | 576 | 961 | 169 | 1521 | 961
8 | 0.1437 | 0.00957 | 0.08611 | 0.08611 | 0.04844 | 0.08611 | 0.14367 | 0.02527 | 0.22739 | 0.1437

Input | p_{0} | p_{1} | p_{2} | p_{3} | p_{4} | p_{5} | p_{6} | p_{7} | p_{8} | p_{9}
---|---|---|---|---|---|---|---|---|---|---
8 | 0.14 | 0.00 | 0.05 | 0.05 | 0.01 | 0.05 | 0.14 | 0.00 | 0.37 | 0.14

In conclusion, we have introduced two new learning algorithms: the square neuron learning algorithm and the power neuron learning algorithm, which are superior to the earlier ABM algorithm [

I would like to thank Gina Porter for proofreading this paper.

The author declares no conflicts of interest regarding the publication of this paper.

Liu, Y. (2018) Square Neurons, Power Neurons, and Their Learning Algorithms. American Journal of Computational Mathematics, 8, 296-313. https://doi.org/10.4236/ajcm.2018.84024