Square Neurons, Power Neurons, and Their Learning Algorithms

In this paper, we introduce the concepts of square neurons and power neurons, and new learning algorithms based on them. First, we briefly review the basic idea of the Boltzmann Machine, specifically that a Boltzmann Machine generates a Markov chain whose invariant distribution it realizes. We further review the ABM (Attrasoft Boltzmann Machine). Next, we review the θ-transformation and its completeness, i.e. any function can be expanded by the θ-transformation. The invariant distribution of an ABM is a θ-transformation; therefore, an ABM can simulate any distribution. We review the linear neurons and their associated learning algorithm. We then discuss the problems of the exponential neurons used in the ABM, which are unstable, and the problems of the linear neurons, which do not discriminate the wrong answers from the right answers as sharply as the exponential neurons. Finally, we introduce the concepts of square neurons and power neurons. We also discuss the advantages of the learning algorithms based on them, which have the stability of the linear neurons and the sharp discrimination of the exponential neurons.


Introduction
Neural networks and deep learning currently provide the best solutions to many supervised learning problems. In 2006, a publication by Hinton, Osindero, and Teh [1] introduced the idea of a "deep" neural network, which first trains a simple supervised model and then adds a new layer on top, training the parameters of the new layer alone; layers are added and trained in this fashion until the network is deep. Later, this restriction of training one layer at a time was removed, and Deep Neural Networks now train all layers together. Examples of supporting software include TensorFlow [2], Torch [3], and Theano [4]. Google's TensorFlow is an open-source software library for dataflow programming across a range of tasks; it is a symbolic math library, also used for machine learning applications such as neural networks [5], and it is used for both research and production at Google. Torch is an open-source machine learning library and a scientific computing framework. Theano is a numerical computation library for Python. The approach of training multiple layers together gives neural networks an advantage over other learning algorithms.
In addition to neural network algorithms, there are numerous other learning algorithms. We select a few such algorithms below.
Principal Component Analysis [6] [7] is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.
Sparse coding [8] [9] minimizes the objective L_sc = ||WH − X||²₂ + λ||H||₁, where W is a matrix of weights, H is a matrix of codes, and X is a matrix of data. λ implements a trade-off between sparsity and reconstruction.
Autoencoders [10]-[15] minimize the objective L_ae = ||σ(W σ(Wᵀ X)) − X||²₂, where σ is some neural network function. Note that L_sc looks almost like L_ae once we set H = σ(Wᵀ X). The difference is that: 1) autoencoders do not encourage sparsity in their general form; 2) an autoencoder uses a model for finding the codes, while sparse coding does so by means of optimization.

K-means clustering [16] [17] [18] [19] is a method of vector quantization that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters; each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into k clusters.
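To make the clustering step concrete, here is a minimal k-means sketch in pure Python. The 1-D data points, the fixed initial centers, and the helper name `kmeans` are illustrative assumptions, not taken from the paper.

```python
# A minimal k-means sketch (pure Python, 1-D points, fixed initial centers).
# Data and the helper name `kmeans` are illustrative, not from the paper.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 10.0, 11.0], centers=[0.0, 5.0])
print(centers)   # the two cluster means
```

With the obvious two-group data above, the centers settle at the means of {1, 2} and {10, 11}.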
If we limit the learning architecture to one layer, all of these algorithms have some advantages for some applications. The deep learning architectures currently provide the best solutions to many supervised learning problems, because two layers, when "properly" constructed, are better than one layer. One question is the existence of a solution for a given problem. This is often followed by effective solution development, i.e. an algorithm for a solution; then by a study of the stability of the algorithm; then by an efficiency study of the solutions. Although these theoretical approaches are not necessary for the empirical development of practical algorithms, the theoretical study in [20] establishes that standard multilayer feedforward networks with hidden layers using arbitrary squashing functions are capable of approximating any measurable function from one finite-dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.
Deep Belief Networks (DBN) are generative neural network models with many layers of hidden explanatory factors, recently introduced by Hinton, Osindero, and Teh, along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a Restricted Boltzmann Machine (RBM), used to represent one layer of the model. Restricted Boltzmann Machines are interesting because inference is easy in them and because they have been successfully used as building blocks for training deeper models. Roux & Bengio [21] proved that adding hidden units yields a strictly improved modeling power, and that RBMs are universal approximators of discrete distributions.
An alternative to the direction of "deep layers" is the direction of "higher orders". In our earlier paper [22], we provided yet another proof that Deep Neural Networks are universal approximators. The advantage of this proof is that it leads to multiple new learning algorithms. In our approach, Deep Neural Networks implement an expansion, and this expansion is complete. These two directions are equivalent [22] [23]. There are several learning algorithms characterized by the θ-transformation, in the direction of higher orders, which form a new family of learning algorithms [22] [23]. The conversion between the two directions of deep layers and higher orders is beyond the scope of this paper. The first learning algorithm characterized by higher orders and the θ-transformation [24] [25] [26] [27] is the ABM [28], which has a problem of stability.
Once we accept that the deep learning architectures currently provide the best solutions, the next question is what is in each layer; in this paper, we intend to fill these layers with the square and power neurons.
In [23], by identifying that the ABM algorithm uses exponential neurons, a second learning algorithm was developed to replace the exponential neurons with linear neurons, which solved the stability problem. However, the linear neurons do not discriminate the wrong answers from the right answers as sharply as the exponential neurons. In this paper, we present a third algorithm after [28] and [23]. We take the middle ground between the exponential neurons [28] and the linear neurons [23], which has the advantages of both algorithms [23] [28] and avoids the disadvantages of both.
In Section 2, we briefly review how to use probability distributions in a supervised learning problem. In this approach, given an input A, an output B, and a mapping from A to B, one can convert the problem to a probability distribution p(a, b), a ∈ A, b ∈ B; when a matches b, the probability p(a, b) is higher than 0. One can then find a Markov chain [34] such that the equilibrium distribution of this Markov chain, p(a, b), realizes, as faithfully as possible, the given supervised training set.
In Section 3, we review the Boltzmann machines [29] [30] [31] [32] [33]. In Section 4, we review the ABM (Attrasoft Boltzmann Machine) [28], which has an invariant distribution. An ABM is defined by two features: 1) an ABM with n neurons has neural connections up to the n-th order; and 2) all of the connections up to the n-th order are determined by the ABM algorithm [28]. By adding more terms to the invariant distribution compared with the second-order Boltzmann Machine, the ABM is significantly more powerful in simulating an unknown function. Unlike the Boltzmann Machine, ABMs emphasize higher-order connections rather than lower-order connections. The Boltzmann Machine (orders 0, 1, 2) and the ABM (orders n, n − 1, n − 2) are at opposite ends of the neuron orders.
In Section 6, we review the completeness of the θ-transformation [24] [25] [26] [27]. The θ-transformation is complete, i.e. given a function, one can find a θ-transformation by converting the function from the x-coordinate system to the θ-coordinate system.
In Section 7, we discuss how the invariant distribution of an ABM implements a θ-transformation [24] [25] [26] [27], i.e. given an unknown function, one can find an ABM such that the equilibrium distribution of this ABM realizes precisely the unknown function. We introduce the exponential neurons; if we keep only the lower orders, this reduces to the standard Boltzmann machine.
In Section 8, we discuss the stability problem of the exponential neurons.
In Section 9, we review the linear neurons [23], which solve the stability problem.
However, the linear neurons do not discriminate the wrong answers from the right answers as sharply as the exponential neurons.
In Section 10, we review the linear neuron learning algorithms.
In Section 11, we take the middle ground between the exponential neurons and the linear neurons, which has the advantages of both algorithms and avoids the disadvantages of both. The new contribution of this paper is the introduction of the concepts of square neurons and power neurons.
In Section 12, we discuss the advantages of the two new learning algorithms based on square neurons and power neurons, which have the stability of the linear neurons and the sharp discrimination of the exponential neurons.
In Section 13, we introduce a simple example to demonstrate the improvement of the square neurons and power neurons over linear neurons.

Basic Approach
The basic supervised learning [29] problem is: given a training set {A, B}, find a mapping from A to B. It turns out that if we can reduce this from a discrete problem to a continuous problem, it will be very helpful. The first step is to convert the problem to a probability distribution [29] [30] [32] [33] p(a, b), a ∈ A, b ∈ B. If a₁ does not match b₁, the probability p(a₁, b₁) is 0 or close to 0; if a₁ matches b₁, the probability is higher than 0. This reduces the problem of inferring a mapping from A to B to inferring a distribution function.
An irreducible finite Markov chain possesses a stationary distribution [34]. This invariant distribution can be used to simulate an unknown function. It is the invariant distribution of a Markov chain that eventually allows us to prove that the DNN is complete.
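The stationary distribution of a small irreducible chain can be found by repeatedly applying the transition matrix. The 2-state chain below is a made-up illustration, not from the paper.

```python
# Sketch: the stationary distribution of a small irreducible Markov chain,
# found by repeatedly applying the transition matrix (power iteration).
# The 2-state chain below is an illustrative example.

P = [[0.9, 0.1],
     [0.5, 0.5]]          # P[i][j] = probability of moving from state i to j

pi = [0.5, 0.5]           # any initial distribution works
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

# The chain converges to pi = (5/6, 1/6) regardless of the starting point.
print(pi)
```

Solving π = πP directly gives the same answer: π₂ · 0.5 = π₁ · 0.1, so π = (5/6, 1/6).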

Boltzmann Machine
A Boltzmann machine [29] [30] [31] [32] [33] is a stochastic neural network in which each neuron has a certain probability of being 1. The probability of a neuron being 1 is determined by the so-called Boltzmann distribution. The collection of the neuron states, x = (x₁, x₂, …, xₙ), of a Boltzmann machine is called a configuration. The configuration transitions are mathematically described by a Markov chain with 2ⁿ configurations x ∈ X, where X is the set of all configurations. When all of the configurations are connected, they form a Markov chain. A Markov chain has an invariant distribution [34]. Whatever initial configuration a Boltzmann machine starts from, the probability distribution converges over time to the invariant distribution, p(x). A configuration x ∈ X appears with relative frequency p(x) over a long period of time.
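The convergence of configuration frequencies to the invariant distribution can be demonstrated on a tiny network. The sketch below runs Gibbs sampling on a 2-neuron machine with a single illustrative weight w₁₂ (an assumed value, not from the paper) and compares the empirical frequencies with the exact Boltzmann distribution p(x) ∝ exp(w₁₂ x₁ x₂).

```python
# Sketch: a 2-neuron Boltzmann machine as a Markov chain. Over many steps, the
# relative frequency of each configuration approaches the invariant (Boltzmann)
# distribution p(x) ∝ exp(w12 * x1 * x2). The weight value is illustrative.
import math, random

random.seed(0)
w12 = 1.0

def exact():
    weights = {(x1, x2): math.exp(w12 * x1 * x2)
               for x1 in (0, 1) for x2 in (0, 1)}
    Z = sum(weights.values())
    return {x: v / Z for x, v in weights.items()}

def gibbs(steps=20000):
    x = [0, 0]
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for _ in range(steps):
        i = random.randrange(2)            # pick a neuron at random
        field = w12 * x[1 - i]             # input from the other neuron
        p_on = 1 / (1 + math.exp(-field))  # probability of this neuron being 1
        x[i] = 1 if random.random() < p_on else 0
        counts[tuple(x)] += 1
    return {k: v / steps for k, v in counts.items()}

p, freq = exact(), gibbs()
```

After 20000 steps the empirical frequencies agree with the exact distribution to within a few percent.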
The Boltzmann machine [29] [30] [31] [32] [33] defines a Markov chain: each configuration of the Boltzmann machine is a state of the Markov chain, and the Boltzmann machine has a stable distribution. An unknown function can be considered as a stable distribution of a Boltzmann machine. More formally, let F be the set of all functions and let T be the parameter space of a family of Boltzmann machines. Given an unknown f ∈ F, one can find a Boltzmann machine such that the equilibrium distribution of this Boltzmann machine realizes, as faithfully as possible, the unknown function [29] [30] [31] [32] [33]. Therefore, the unknown f is encoded into a specification of a Boltzmann machine, t ∈ T. We call the mapping from F to T a Boltzmann Machine transformation. Let F_T be the set of all functions that can be inferred by the Boltzmann Machines over T; obviously, F_T is a subset of F. It turns out that F_T is significantly smaller than F, and it is not a good approximation of F. The main contribution of the Boltzmann machine is to establish a framework for inferring a mapping from A to B.

Attrasoft Boltzmann Machines (ABM)
The invariant distribution of a Boltzmann machine [29] [30] [31] [32] [33] is
p(x) = (1/Z) exp( Σ_{ij} w_{ij} x_i x_j ).
If the threshold vector does not vanish, the distributions are
p(x) = (1/Z) exp( Σ_{ij} w_{ij} x_i x_j − Σ_i t_i x_i ).
By rearranging the above distribution, we have
p(x) = (1/Z) exp( Σ_i θ_i x_i + Σ_{ij} θ_{ij} x_i x_j ).
It turns out that the third-order Boltzmann machines have the following type of distributions:
p(x) = (1/Z) exp( Σ_i θ_i x_i + Σ_{ij} θ_{ij} x_i x_j + Σ_{ijk} θ_{ijk} x_i x_j x_k ).
An ABM [24] [25] [26] [27] is an extension of the higher-order Boltzmann Machine to the maximum order. An ABM with n neurons has neural connections up to the n-th order. All of the connections up to the n-th order are determined by the ABM algorithm [28]. By adding additional higher-order terms to the invariant distribution, the ABM is significantly more powerful in simulating an unknown function.
By adding additional terms, the invariant distribution of an ABM is
p(x) = (1/Z) exp( θ₀ + Σ_i θ_i x_i + Σ_{ij} θ_{ij} x_i x_j + … + θ_{12…n} x₁ x₂ … xₙ ),
which makes the ABM significantly more powerful in simulating an unknown function. As more and more terms are added, from the second-order terms to the n-th order terms, the invariant distribution space becomes larger and larger. Like the Boltzmann Machines in the last section, an ABM implements a transformation, B: F → T. We hope ultimately that this ABM transformation is complete, so that given any function f ∈ F, we can find an ABM, t ∈ T, such that the equilibrium distribution of this ABM realizes precisely the unknown function. We show that this is exactly the case.
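The all-orders invariant distribution can be evaluated directly for a small network. The sketch below computes p(x) ∝ exp(θ₀ + Σθᵢxᵢ + Σθᵢⱼxᵢxⱼ + θ₁₂₃x₁x₂x₃) for n = 3; the θ values are arbitrary illustrative numbers, not from the paper.

```python
# Sketch: the invariant distribution of a 3-neuron ABM with connections of
# every order. The θ coefficients below are arbitrary illustrative values.
import math
from itertools import product

theta = {(): 0.1, (1,): 0.2, (2,): -0.1, (3,): 0.3,
         (1, 2): 0.5, (1, 3): -0.2, (2, 3): 0.1, (1, 2, 3): 1.0}

def p_unnormalized(x):            # x = (x1, x2, x3)
    # Sum every θ term whose subscript neurons are all "on" in x.
    s = sum(t for subset, t in theta.items()
            if all(x[i - 1] == 1 for i in subset))
    return math.exp(s)

states = list(product((0, 1), repeat=3))
Z = sum(p_unnormalized(x) for x in states)
p = {x: p_unnormalized(x) / Z for x in states}

print(sum(p.values()))   # sums to 1 (up to float rounding)
```

With these θ values, the all-ones configuration carries the largest exponent and hence the largest probability.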

Basic Notations
We first introduce some notations used in this paper [24] [25] [26] [27]. There are two different types of coordinate systems: the x-coordinate system and the θ-coordinate system [24] [25] [26] [27]. Each of these two coordinate systems has two representations, the x-representation and the θ-representation. An N-dimensional vector, p, is
p = (p₀, p₁, …, p_{N−1}),
which is the x-representation of p in the x-coordinate system.
In the x-coordinate system, there are two representations of a vector:
• {p_i} in the x-representation, and {p_{i₁ i₂ … i_m}} in the θ-representation.
In the θ-coordinate system, there are two representations of a vector:
• {θ_i} in the x-representation, and {θ_{i₁ i₂ … i_m}} in the θ-representation.
The reason for the two different representations is that the x-representation is natural for the x-coordinate system, and the θ-representation is natural for the θ-coordinate system.
The transformations between {p_i} and {p_{i₁ i₂ … i_m}} and between {θ_i} and {θ_{i₁ i₂ … i_m}} are similar; because of the similarity, in the following, only the transformation between {p_i} and {p_{i₁ i₂ … i_m}} will be introduced. Let n be the number of neurons and N = 2ⁿ. An N-dimensional vector, p, is (p₀, p₁, …, p_{N−1}). If x is the position inside a distribution, then x can be rewritten in the binary form
x = xₙ 2ⁿ⁻¹ + … + x₂ 2¹ + x₁ 2⁰.
Some of the coefficients x_i might be zero. Dropping those coefficients which are zero, we write
x = 2^{i₁−1} + 2^{i₂−1} + … + 2^{i_m−1}, i₁ < i₂ < … < i_m.
This generates the following transformation: p_x → p_{i₁ i₂ … i_m}. In this θ-representation, a vector p looks like
{p₀, p₁, p₂, …, pₙ, p₁₂, p₁₃, …, p_{12…n}}.
The 0-th order term is p₀; the first-order terms are p₁, p₂, …, pₙ. The x-representation is the normal representation, and the θ-representation is a form of binary representation.
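The index transformation above is just the binary expansion of the position x: the nonzero bits of x give the subscripts i₁ < i₂ < … < i_m. A small sketch:

```python
# Sketch of the index transformation: a position x in the distribution is
# written in binary, and the nonzero bits give the subscripts i1 < i2 < ... < im,
# i.e. x = 2**(i1-1) + 2**(i2-1) + ... + 2**(im-1).

def subscripts(x):
    subs = []
    i = 1
    while x:
        if x & 1:              # bit i is on -> subscript i appears
            subs.append(i)
        x >>= 1
        i += 1
    return tuple(subs)

print(subscripts(5))   # 5 = 2**0 + 2**2 -> (1, 3), i.e. p_5 -> p_13
print(subscripts(0))   # () : the 0-th order term p_0
```

So p₅ becomes p₁₃ in the θ-representation, and p₀ is the 0-th order term.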

θ-Transformation
Denote a distribution by p, which has an x-representation in the x-coordinate system, p(x), and a θ-representation in the θ-coordinate system, p(θ). When a distribution function p(x) is transformed from one coordinate system to another, the vectors in both coordinate systems represent the same abstract vector. When a vector q is transformed from the x-representation q(x) to the θ-representation q(θ), and q(θ) is then transformed back, the original q(x) is recovered.
The θ-transformation uses a function F, called a generating function. The function F is required to have an inverse G: GF = FG = I. Let p be a vector in the x-coordinate system. As discussed above, it can be written either as p(x) = {p_x} or as p(θ) = {p_{i₁ i₂ … i_m}}. The θ-transformation transforms a vector from the x-coordinate system to the θ-coordinate system via a generating function: the components of the vector p in the x-coordinate system, p(x), are converted into components of a vector p(θ) in the θ-coordinate system. Let F be a generating function, which transforms the x-representation of p in the x-coordinate system to a θ-representation of p in the θ-coordinate system. The θ-components are determined by the vector F[p(x)] as follows:
F[p(x)] = Σ_J θ_J X^J,
where X^J = x_{i₁} x_{i₂} … x_{i_m} and J runs over the subscript sets (i₁ i₂ … i_m). Prior to the transformation, p(x) is the x-representation of p in the x-coordinate system; after the transformation, F[p(x)] is a θ-representation of p in the θ-coordinate system. There are N components in the x-coordinate system and N components in the θ-coordinate system. By using the assumption GF = I, we have
p(x) = G( Σ_J θ_J X^J ),
where J denotes the index in either of the two representations in the θ-coordinate system.
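One concrete way to realize the pair F, G with GF = I is subset-sum (Möbius) inversion over the binary indices: F(p_x) equals the sum of the θ_J over all subsets J of x, and the θ_J are recovered by inclusion-exclusion. This sketch is our reading of the construction, not verbatim from [24]-[27]; the test values are illustrative.

```python
# Sketch: a θ-transformation realized by subset-sum (Möbius) inversion.
# F(p_x) = Σ_{J ⊆ x} θ_J, so θ_J = Σ_{K ⊆ J} (−1)^|J\K| F(p_K).
import math

n = 2                      # number of neurons; N = 2**n components
N = 2 ** n

def theta_from_p(p, F):
    q = [F(v) for v in p]
    theta = []
    for J in range(N):
        t, K = 0.0, J
        while True:                        # enumerate all subsets K of J
            t += (-1) ** bin(J ^ K).count("1") * q[K]
            if K == 0:
                break
            K = (K - 1) & J
        theta.append(t)
    return theta

def p_from_theta(theta, G):
    # p_x = G( Σ_{J ⊆ x} θ_J )
    return [G(sum(theta[J] for J in range(N) if J & x == J))
            for x in range(N)]

p = [0.1, 0.2, 0.3, 0.4]
theta = theta_from_p(p, F=math.log)        # exponential neurons: F = ln
back = p_from_theta(theta, G=math.exp)     # G = exp, so GF = I
print(back)                                # recovers p (up to rounding)
```

With F = ln and G = exp this is exactly the exponential-neuron case of the later sections; swapping in the identity gives the linear-neuron case.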

θ-Transformation Is Complete
Because the θ-transformation is implemented by a normal function with FG = GF = I, as long as there are no singular points in the transformation, any distribution function can be expanded. If we require p_i ≥ ε, where ε is a predefined small number, then there will be no singular points in the transformation.

Exponential Neurons
An ABM with n neurons has neural connections up to the n-th order. The invariant distribution is
p(x) = (1/Z) exp( θ₀ + Σ_i θ_i x_i + Σ_{ij} θ_{ij} x_i x_j + … + θ_{12…n} x₁ x₂ … xₙ ).
An ABM implements a θ-transformation [24] [25] [26] [27] with the generating function
G(y) = e^y, F(y) = ln y.
We call the neurons in the ABM algorithm the exponential neurons, because of the exponential generating function. Furthermore, the connection matrix elements can be calculated from the distribution, and, for the reverse problem, given an ABM, the invariant distribution can be calculated from the connection matrix elements [24] [25] [26] [27]. The ABM algorithm uses a multiplicative expansion, which raises the question of stability. Therefore, we expect to improve this algorithm.

Stability of Exponential Neurons
If we take the derivative of the expression for the θ-components with respect to a component p_a, then p_a appears in the numerator, is absent, or appears in the denominator; the denominator terms are the source of the instability. As we will argue in the next few sections:
• If we replace G(y) = e^y with G(y) = y, the θ-transformation will be stable, i.e. a small ∂p_a will cause a small ∂θ_b.
• The generating function G(y) = e^y can discriminate the wrong answers from the right answers more sharply than G(y) = y.
In the next section, we will replace G(y) = e^y with G(y) = y. On one hand, this replacement stabilizes the θ-transformation. On the other hand, the linear term does not discriminate the wrong answers from the right answers as sharply as the exponential neurons, because we can view the exponential neurons as consisting of contributions from the linear term, the square term, the cubic term, …

Linear Neuron
If we can convert the multiplicative expansion to an additive expansion, then the performance will be more stable.

If we choose the generating function
G(y) = y, F(y) = y,
then, from Section 5, we have
p(x) = θ₀ + Σ_i θ_i x_i + Σ_{ij} θ_{ij} x_i x_j + … + θ_{12…n} x₁ x₂ … xₙ.
We call these neurons linear neurons. The new algorithm uses summation in the expansion; thus, it is more stable than the exponential neurons. The partial derivatives do not have singular points.
Example: let an ANN have 3 neurons, (x₁, x₂, x₃), and let a distribution be
(p₀, p₁, p₂, p₃, p₄, p₅, p₆, p₇).
When the expansion uses addition, it has the advantage of stability. As we will show below, it also has the further advantage of fast training (low time complexity).
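The 3-neuron additive expansion can be made concrete: each p_x is the sum of the θ coefficients whose subscripts are all "on" in x. The θ values below are illustrative, not from the paper.

```python
# Additive (linear neuron) expansion for 3 neurons: p_x is the sum of the
# θ coefficients whose subscript neurons are all "on" in x.
# The θ values are illustrative.

theta = {(): 0.05, (1,): 0.1, (2,): 0.05, (3,): 0.0,
         (1, 2): 0.2, (1, 3): 0.0, (2, 3): 0.1, (1, 2, 3): 0.5}

def p(x):                      # x = (x1, x2, x3)
    return sum(t for subset, t in theta.items()
               if all(x[i - 1] == 1 for i in subset))

# p for the all-ones configuration is the sum of every coefficient:
print(p((1, 1, 1)))
# p for the all-zeros configuration is just the 0-th order term θ0:
print(p((0, 0, 0)))
```

Because the expansion is a plain sum, perturbing any single θ changes each p_x by at most that perturbation, which is the stability property claimed above.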

A Linear Neuron Learning Algorithm
In [23], we introduced the linear neuron learning algorithm. The L₁-distance between two configurations is the number of bits in which they differ. The linear neuron learning algorithm assigns each connection coefficient according to the distance d between a configuration and a training vector; D is called the connection radius, and beyond this radius all connections are 0. The linear neuron learning algorithm is [23] [28]:
Step 1. The First Assignment (d = 0). The first step is to assign the connection matrix element for the training vector itself, where D is the radius of the connection space.
Step 2. The Rest of the Assignment. The next step is to assign the rest of the weights, for configurations within distance 0 < d ≤ D of the training vector.
Step 3. Modification. The algorithm uses bit "1" to represent an input pattern or an output class; so the input vectors or the output vectors cannot be all 0's; otherwise, the corresponding coefficients are 0.
Step 4. Retraining. Repeat the last three steps for all training patterns; if there is an overlap, take the maximum value.

Square Neurons and Power Neurons
The linear neurons do not discriminate the wrong answers from the right answers as sharply as the exponential neurons. We will use a numerical example to demonstrate this in a later section.
To improve the accuracy of the linear neurons, we define the square neurons using the following generating function:
G(y) = y².
We define the power neurons using the following generating function:
G(y) = y^L, L > 2.
For square neurons, we have
p(x) = ( θ₀ + Σ_i θ_i x_i + Σ_{ij} θ_{ij} x_i x_j + … )².
For power neurons, we have
p(x) = ( θ₀ + Σ_i θ_i x_i + Σ_{ij} θ_{ij} x_i x_j + … )^L.
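The effect of the square and power generating functions is to sharpen the distribution after normalization: raising each linear-neuron output y to a power L > 1 grows the largest output's share. A small sketch with illustrative outputs:

```python
# Sketch: square neurons apply G(y) = y**2 and power neurons G(y) = y**L to
# the linear-neuron output y, which sharpens the normalized distribution.

def sharpen(ys, L):
    powered = [y ** L for y in ys]
    Z = sum(powered)
    return [v / Z for v in powered]

ys = [0.4, 0.3, 0.3]                 # illustrative linear-neuron outputs
print(sharpen(ys, 1)[0])             # linear:  ~0.40 share for the top answer
print(sharpen(ys, 2)[0])             # square:  ~0.47, a sharper winner
print(sharpen(ys, 4)[0])             # L = 4:   ~0.61, sharper still
```

The ordering of the answers never changes under G(y) = y^L; only the gap between the best answer and the rest widens.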

Square Neurons and Power Neurons Algorithms Are Better
The square neuron learning algorithm is similar to the linear neuron learning algorithm, except that the output of the linear neurons is squared. If the linear neurons can classify a problem correctly, then the square neurons will do better. We will not formally prove this, but we will use a simple example to show the point.

An Example
In this section, we will first use the linear neuron algorithm [23]; then we will use the square and power neuron algorithms. The example is to identify the simple digits in Figure 1 [29]. Each digit is converted into 7 bits: 0, 1, …, 6; Figure 2 shows the bit locations. Each input vector has 7 bits, representing an image; the 10 output vectors for the digits have 10 bits, representing a classification; so the 10 training vectors have 17 bits.
After training the linear neuron algorithm with T₀, T₁, …, T₉, all of the connection coefficients, θ_{i₁ i₂ … i_m}, are calculated. Section 9 provides the formula to calculate the probability of each (input, output) pair. For example, the probability is p_{2,5,9} if the input is "1" and the output is in class 1; the probability is p_{2,5,8} if the input is "1" and the output is in class 0; the probability is p_{2,5,10} if the input is "1" and the output is in class 2; … The character recognition results [23] are given in Table 1, where the first column is the input and the next 10 columns are the outputs. The output probabilities are not normalized in Table 1. The relative probability for (input = 0, output = 0) is 31; that for (input = 0, output = 1) is 8; that for (input = 0, output = 2) is 6; … So if the input is digit 0, the output is identified as 0. In this problem, the output is a single identification, so the largest weight determines the digit classification. In each case, all input digits are classified correctly.
The worst case is input = 8; see Table 2. Using the largest probability as the classification, if input = 8, then output = 8, which is a correct classification. But the 16% probability for (input = 8, output = 8) is too low compared with the next ones, 12.7% each for (input = 8, output = 9), (input = 8, output = 6), and (input = 8, output = 0). Some improvement must be made. This is the main reason for the square neurons and power neurons, which improve all digits in Table 1. The character recognition results for the square neurons are obtained by simply squaring every number in Table 1; see Table 3. In the following, we will only study the worst case of Table 2.
For the square neuron algorithm, the results are in Table 4. Now the probability of the correct output in the worst case is increased from 16% to 22.7%.
For the power neuron algorithm with L = 4, the results are in Table 5. Now the probability of the correct output in the worst case is increased from 16% to 37%.
This can be further improved by increasing L.
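The mechanism behind the worst-case improvement can be sketched numerically. The weight profile below is assumed for illustration only (a 16% top share with three runners-up at 12.7%, roughly as described in the text), not copied from Table 2.

```python
# Sketch of the worst case (input = 8): squaring the relative weights, or
# raising them to the 4th power, sharpens the winner's share.
# The weight profile below is an assumed illustration, not Table 2's values.

shares = [0.16] + [0.127] * 3 + [0.459 / 6] * 6   # 10 classes, sums to 1

def top_share(shares, L):
    powered = [s ** L for s in shares]
    return powered[0] / sum(powered)

print(round(top_share(shares, 1), 3))   # linear neurons
print(round(top_share(shares, 2), 3))   # square neurons: winner's share grows
print(round(top_share(shares, 4), 3))   # power neurons (L = 4): sharper still
```

With this assumed profile, squaring lifts the winner's share from 16% into the low twenties, and L = 4 lifts it close to 40%, consistent in spirit with the improvements reported for Tables 4 and 5.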
To summarize, the square and power neurons add leverage to the linear neurons' performance: they sharply increase the linear neurons' discrimination between the correct answer and the wrong answers.

Conclusion
In conclusion, we have introduced two new learning algorithms, the square neuron learning algorithm and the power neuron learning algorithm, which are superior to the earlier ABM algorithm [28] and the linear neuron algorithm [23]. The reason for this improvement is that the ABM has a problem of stability, while the linear neuron algorithm has a problem of discriminating the wrong answers from the right answers sharply. In this paper, we have introduced the concepts of square neurons and power neurons, and we have shown that the two new learning algorithms based on them have advantages over both the ABM learning algorithm and the linear neuron algorithm.



Table 1. The classifications from the linear neuron algorithm without normalization.

Table 2. The classifications of "8" from the linear neuron algorithm.

Table 4. The square neuron algorithm.