Completeness Problem of Deep Neural Networks

Hornik, Stinchcombe & White have shown that multilayer feedforward networks with enough hidden layers are universal approximators. Roux & Bengio have proved that adding hidden units yields a strictly improved modeling power, and that Restricted Boltzmann Machines (RBM) are universal approximators of discrete distributions. In this paper, we provide yet another proof. The advantage of this new proof is that it will lead to several new learning algorithms. We prove that Deep Neural Networks implement an expansion and that the expansion is complete. First, we briefly review the basic Boltzmann Machine and the fact that its dynamics define a Markov chain with an invariant distribution. We then review the θ-transformation and its completeness, i.e. any function can be expanded by the θ-transformation. We further review the ABM (Attrasoft Boltzmann Machine). The invariant distribution of the ABM is a θ-transformation; therefore, an ABM can simulate any distribution. We discuss how to convert an ABM into a Deep Neural Network. Finally, by establishing the equivalence between an ABM and the Deep Neural Network, we prove that the Deep Neural Network is complete.


Introduction
Neural networks and deep learning currently provide the best solutions to many supervised learning problems. In 2006, a publication by Hinton, Osindero, and Teh [1] introduced the idea of a "deep" neural network, which first trains a simple supervised model, then adds a new layer on top and trains the parameters for the new layer alone. Layers are added and trained in this fashion until the network is deep. Later, this restriction of training one layer at a time was removed [2] [3] [4] [5].

After Hinton's initial attempt of training one layer at a time, Deep Neural Networks train all layers together. Examples include TensorFlow [6], Torch [7], and Theano [8]. Google's TensorFlow is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library, also used for machine learning applications such as neural networks [3]. It is used for both research and production at Google. Torch is an open-source machine learning library and a scientific computing framework. Theano is a numerical computation library for Python. Training multiple layers together gives the neural network advantages over other learning algorithms.
One question is the existence of a solution for a given problem. This is often followed by the development of an effective solution, i.e. an algorithm; then by a study of the stability of the algorithm; then by a study of the efficiency of solutions. Although these theoretical approaches are not necessary for the empirical development of practical algorithms, theoretical studies do advance the understanding of the problems, and they prompt the development of new and better practical algorithms. Along the direction of solution existence, Hornik, Stinchcombe, & White [9] have shown that multilayer feedforward networks with enough hidden layers are universal approximators. Roux & Bengio [10] have shown the analogous result that Restricted Boltzmann Machines are universal approximators of discrete distributions.
Hornik, Stinchcombe, & White [9] establish that standard multilayer feedforward networks with hidden layers using arbitrary squashing functions are capable of approximating any measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.
Deep Belief Networks (DBN) are generative neural network models with many layers of hidden explanatory factors, introduced by Hinton, Osindero, and Teh along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a Restricted Boltzmann Machine (RBM), used to represent one layer of the model. Restricted Boltzmann Machines are interesting because inference is easy in them and because they have been successfully used as building blocks for training deeper models. Roux & Bengio [10] proved that adding hidden units yields a strictly improved modeling power, and that RBMs are universal approximators of discrete distributions.
In this paper, we provide yet another proof. The advantage of this proof is that it will lead to several new learning algorithms. We once again prove that Deep Neural Networks are universal approximators. In our approach, Deep Neural Networks implement an expansion, and this expansion is complete.
In this paper, a Deep Neural Network (DNN) is an Artificial Neural Network (ANN) with multiple hidden layers between the input and output layers. The organization of this paper is as follows.
In Section 2, we briefly review how to study the completeness problem of Deep Neural Networks (DNN). In this approach, given an input A, an output B, and a mapping from A to B, one can convert this problem to a probability distribution [3] [4] of (A, B): p(a, b), a ∈ A, b ∈ B. If an input a ∈ A maps to an output b ∈ B, then the probability p(a, b) will be close to 1. One can find a Markov chain [11] such that the equilibrium distribution of this Markov chain, p(a, b), realizes, as faithfully as possible, the given supervised training set.
In Section 3, the Boltzmann machines [3] [4] are briefly reviewed. All possible distributions together form a distribution space. All of the distributions implemented by Boltzmann machines define a Boltzmann Distribution Space, which is a subset of the distribution space [12] [13] [14]. Given an unknown function, one can find a Boltzmann machine such that the equilibrium distribution of this Boltzmann machine realizes, as faithfully as possible, the unknown function.
In Section 4, we review the ABM (Attrasoft Boltzmann Machine) [15], which has an invariant distribution. An ABM is defined by two features: 1) an ABM with n neurons has neural connections up to the n-th order; and 2) all of the connections up to the n-th order are determined by the ABM algorithm [15]. By adding more terms to the invariant distribution compared with the second order Boltzmann Machine, the ABM is significantly more powerful in simulating an unknown function. Unlike Boltzmann Machines, ABMs emphasize higher order connections rather than lower order connections. Later, we will discuss the relationship between higher order connections and DNNs.
In Section 7, we discuss how the invariant distribution of an ABM implements a θ-transformation [12] [13] [14], i.e. given an unknown function, one can find an ABM such that the equilibrium distribution of this ABM realizes precisely the unknown function. Therefore, an ABM is complete.
The next two sections are the new contributions of this paper. In Section 8, we show that we can reduce an ABM to a DNN, i.e. we show that a higher order ANN can be replaced by a lower order ANN with more layers. We do not seek an efficient conversion from a higher order ANN to a lower order ANN with more layers; we merely prove that such a conversion is possible.
In Section 9, we prove that the DNN is complete, i.e. given an unknown function, one can find a Deep Neural Network that can simulate the unknown function.

Basic Approach for Completeness Problem
The goal of this paper is to prove that given any unknown function from A to B, one can find a DNN that simulates this unknown function. It turns out to be very helpful to recast this discrete problem as a continuous one. In this section, we introduce the basic idea of how to study the completeness problem.
The basic supervised learning [2] problem is: given a training set {A, B}, where A = {a1, a2, …} and B = {b1, b2, …}, find a mapping from A to B. The first step is to convert this problem to a probability [3] [4]: if a1 matches with b1, the probability is 1 or close to 1; if a1 does not match with b1, the probability is 0 or close to 0. This reduces the problem of inferring a mapping from A to B to inferring a distribution function.
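This conversion can be sketched in a few lines. The sketch below is illustrative only: the function name `target_distribution` and the small smoothing constant `eps` are assumptions, not part of the paper; trained pairs receive weight close to 1 and all other pairs weight close to 0, after which the table is normalized into a distribution.

```python
# Hypothetical sketch: encode a supervised training set {A, B} as a target
# distribution p(a, b). Pairs present in the training set get weight ~1
# (before normalization); absent pairs get weight ~0 (eps).
from itertools import product

def target_distribution(pairs, A, B, eps=1e-6):
    """Assign weight ~1 to trained (a, b) pairs and ~0 elsewhere, then normalize."""
    trained = set(pairs)
    weights = {(a, b): (1.0 if (a, b) in trained else eps) for a, b in product(A, B)}
    z = sum(weights.values())
    return {ab: w / z for ab, w in weights.items()}

A = [0, 1]
B = ["x", "y"]
p = target_distribution([(0, "x"), (1, "y")], A, B)
best = max(p, key=p.get)  # the trained pairs dominate the distribution
```

A Markov chain whose equilibrium distribution matches such a table then realizes the training set, which is the strategy pursued in the following sections.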
An irreducible finite Markov chain possesses a stationary distribution [11].
This invariant distribution can be used to simulate an unknown function. It is the invariant distribution of the Markov chain which eventually allows us to prove that the DNN is complete.

Boltzmann Machine
A Boltzmann machine [3] [4] is a stochastic neural network in which each neuron has a certain probability to be 1. The probability of a neuron to be 1 is determined by the so-called Boltzmann distribution. The collection of the neuron states, x = (x1, x2, …, xn), of a Boltzmann machine is called a configuration. The configuration transitions are mathematically described by a Markov chain with 2^n configurations x ∈ X, where X is the set of all configurations; each configuration of the Boltzmann machine is a state of the Markov chain. When all of the configurations are connected, they form an irreducible Markov chain, which has an invariant distribution [11]. Whatever initial configuration a Boltzmann machine starts from, the probability distribution converges over time to the invariant distribution, p(x), and the configuration x ∈ X appears with a relative frequency p(x) over a long period of time. An unknown function can be considered as a stable distribution of a Boltzmann machine: given an unknown distribution, a Boltzmann machine can be inferred such that its invariant distribution realizes, as faithfully as possible, the given function. Therefore, an unknown function is transformed into a specification of a Boltzmann machine.
More formally, let F be the set of all functions and let T be the parameter space of a family of Boltzmann machines. Given an unknown f ∈ F, one can find a Boltzmann machine such that the equilibrium distribution of this Boltzmann machine realizes, as faithfully as possible, the unknown function [3] [4]. Therefore, the unknown f is encoded into a specification of a Boltzmann machine, t ∈ T. We call the mapping from F to T a Boltzmann Machine Transformation, B: F → T. Let F_T be the set of all functions that can be inferred by the Boltzmann Machines over T; obviously, F_T is a subset of F. It turns out that F_T is significantly smaller than F, so it is not a good approximation of F. The main contribution of the Boltzmann Machine is to establish a framework for inferring a mapping from A to B.
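The convergence to the invariant distribution can be checked numerically. The sketch below uses assumed weights `W` and thresholds `theta` (not from the paper) and single-site Gibbs updates; after many steps, the empirical frequency of each configuration approaches p(x) ∝ exp(H(x)), computed exactly by brute-force normalization.

```python
# Minimal sketch (parameters assumed): a 3-neuron Boltzmann machine sampled by
# Gibbs updates. Over a long run, each configuration x appears with a relative
# frequency approaching the invariant distribution p(x) = exp(H(x)) / Z.
import itertools, math, random

n = 3
W = [[0.0, 0.5, -0.3], [0.5, 0.0, 0.2], [-0.3, 0.2, 0.0]]  # symmetric weights
theta = [0.1, -0.2, 0.3]                                    # thresholds

def H(x):
    return sum(theta[i] * x[i] for i in range(n)) + \
           sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

def gibbs(steps, seed=0):
    rng = random.Random(seed)
    x = [0] * n
    counts = {}
    for _ in range(steps):
        i = rng.randrange(n)
        # net input to neuron i; P(x_i = 1) is the logistic of the net input
        net = theta[i] + sum(W[i][j] * x[j] for j in range(n) if j != i)
        x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-net)) else 0
        counts[tuple(x)] = counts.get(tuple(x), 0) + 1
    return {k: v / steps for k, v in counts.items()}

empirical = gibbs(200000)
Z = sum(math.exp(H(x)) for x in itertools.product([0, 1], repeat=n))
exact = {x: math.exp(H(x)) / Z for x in itertools.product([0, 1], repeat=n)}
```

The single-site logistic update is the standard Boltzmann machine dynamics; the brute-force partition sum Z is feasible here only because n is tiny.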

Attrasoft Boltzmann Machines (ABM)
The invariant distribution of a Boltzmann machine [3] [4] is p(x) = e^{H(x)} / Z with H(x) = Σ_{i<j} θ_ij x_i x_j. If the threshold vector does not vanish, the distributions are given by H(x) = Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j. It turns out that third order Boltzmann machines have the following type of distributions: H(x) = Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j + Σ_{i<j<k} θ_ijk x_i x_j x_k. An ABM [12] [13] [14] is an extension of the higher order Boltzmann Machine to the maximum order. An ABM with n neurons has neural connections up to the n-th order. All of the connections up to the n-th order are determined by the ABM algorithm [15]. By adding additional higher order terms to the invariant distribution, the ABM is significantly more powerful in simulating an unknown function.
By adding additional terms, the invariant distribution for an ABM is p(x) = e^{H(x)} / Z with H(x) = θ_0 + Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j + … + θ_{12…n} x_1 x_2 … x_n. As more and more terms are added, from the second order terms to the n-th order terms, the invariant distribution space becomes larger and larger. Like the Boltzmann Machines of the last section, the ABM implements a transformation, B: F → T. Our ultimate goal is to show that this ABM transformation is complete, so that given any function f ∈ F, we can find an ABM, t ∈ T, such that the equilibrium distribution of this ABM realizes precisely the unknown function. We show that this is exactly the case.
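The full ABM energy can be written as a sum of one θ-coefficient per subset of neurons. The sketch below (function name and example coefficients are assumptions for illustration) evaluates p(x) for every configuration of a small network with terms up to the third order.

```python
# Illustrative sketch: an ABM over n neurons carries a θ-coefficient for every
# subset of neurons, up to the single n-th order term. H(x) sums θ_S over every
# subset S whose neurons are all active in x; p(x) = exp(H(x)) / Z.
import itertools, math

def abm_distribution(thetas, n):
    """thetas: dict mapping a frozenset of neuron indices (frozenset() for θ_0)
    to a coefficient. Returns p(x) for all 2^n configurations x."""
    def H(x):
        active = {i for i, xi in enumerate(x) if xi == 1}
        return sum(t for S, t in thetas.items() if S <= active)
    states = list(itertools.product([0, 1], repeat=n))
    w = [math.exp(H(x)) for x in states]
    Z = sum(w)
    return {x: wi / Z for x, wi in zip(states, w)}

# assumed example coefficients: a third order term θ_123 alongside lower orders
thetas = {frozenset(): 0.0, frozenset({0}): 0.4, frozenset({0, 1}): -0.7,
          frozenset({0, 1, 2}): 1.5}
p = abm_distribution(thetas, 3)
```

With 2^n − 1 free coefficients plus the normalization, this family has exactly as many parameters as there are independent probabilities, which is the counting behind the completeness claim.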

Basic Notations
We first introduce some notations used in this paper [12] [13] [14]. There are two different types of coordinate systems: the x-coordinate system and the θ-coordinate system [12] [13] [14]. Each of these two coordinate systems has two representations: the x-representation and the θ-representation. An N-dimensional vector, p, is p = {p_0, p_1, …, p_{N−1}}, which is the x-representation of p in the x-coordinate system.
In the x-coordinate system, there are two representations of a vector: {p_i} in the x-representation and its relabeled form in the θ-representation. In the θ-coordinate system, there are likewise two representations of a vector. The reason for two different representations is that the x-representation is natural for the x-coordinate system, and the θ-representation is natural for the θ-coordinate system.
The transformations between these representations will be introduced below. Let n be the number of neurons and N = 2^n the number of configurations. An N-dimensional vector, p, is p = {p_0, p_1, …, p_{N−1}}. If x is the position inside a distribution, then x can be rewritten in binary form:

x = x_{n−1} 2^{n−1} + … + x_1 2^1 + x_0 2^0.

Some of the coefficients x_i might be zero. Dropping those coefficients which are zero and labeling a position by its nonzero bits generates the following transformation: the 0-th order term is p_0; the first order terms are p_1, p_2, p_4, …, p_{2^{n−1}}; and so on for the higher order terms. The x-representation is the normal representation, and the θ-representation is a form of binary representation.
Example. Let n = 3, N = 2^n = 8, and consider an invariant distribution {p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7}, where p_0 is the probability of state x = 0, p_1 of state x = 1, and so on. There are 8 probabilities for 8 different states. The first vector, {p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7}, is in the x-representation, and the second vector, {p_0, p_1, p_2, p_12, p_3, p_13, p_23, p_123}, is in the θ-representation. These two representations are two different expressions of the same vector.
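The relabeling from state index to θ-label is purely the binary expansion of the index. A small sketch (the helper name `theta_label` is an assumption for illustration):

```python
# Map a state index x (x-representation) to its θ-representation label:
# the indices of the nonzero bits of x, i.e. which neurons fire in that state.
def theta_label(x, n):
    """Return the θ-label of state x: subscripts of its nonzero binary digits."""
    bits = [i + 1 for i in range(n) if (x >> i) & 1]
    return "p_0" if not bits else "p_" + "".join(str(i) for i in bits)

n = 3
labels = [theta_label(x, n) for x in range(2 ** n)]
# x = 0..7 maps to p_0, p_1, p_2, p_12, p_3, p_13, p_23, p_123
```

This reproduces exactly the ordering of the example above: state x = 3 (binary 011) fires neurons 1 and 2, hence the label p_12.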

θ-Transformation
Denote a distribution by p, which has an x-representation in the x-coordinate system, p(x), and a θ-representation in the θ-coordinate system, p(θ). When a distribution function p(x) is transformed from one coordinate system to another, the vectors in both coordinates represent the same abstract vector. When a vector q is transformed from the x-representation q(x) to the θ-representation q(θ), and then q(θ) is transformed back to q'(x), we have q'(x) = q(x).
The θ-transformation uses a function F, called a generating function. The function F is required to have an inverse G, with GF = I. Let p be a vector in the x-coordinate system. As already discussed above, it can be written either as p(x) or as p(θ). The θ-transformation transforms a vector from the x-coordinate to the θ-coordinate via the generating function: the components of the vector p in the x-coordinate, p(x), are converted into components of the vector p(θ) in the θ-coordinate. The θ-components are determined by the vector F[p(x)]. Prior to the transformation, p(x) is the x-representation of p in the x-coordinate; after the transformation, F[p(x)] is a θ-representation of p in the θ-coordinate system.
There are N components in the x-coordinate and N components in the θ-coordinate. By using the assumption GF = I, a vector can be transformed from the x-representation to the θ-representation and back without loss, where J denotes the index in either of the two representations in the θ-coordinate system. The transformation of a vector p from the x-representation, p(x), in the x-coordinate system to a θ-representation, p(θ), in the θ-coordinate system is called the θ-transformation [12] [13] [14].
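The round trip GF = I can be demonstrated concretely. The sketch below assumes F = ln (so G = exp), which is a natural choice here since p = e^H; the coefficient formula by inclusion–exclusion over subsets is my illustration of one consistent way to realize the transformation, not necessarily the paper's exact construction. Transforming a positive distribution to θ-coefficients and back reproduces it exactly.

```python
# A sketch of the θ-transformation under the assumption F = ln, G = exp:
# θ_S is recovered from p by inclusion–exclusion over subsets of S, and
# summing θ_S back over the active bits of x reproduces p(x) exactly (GF = I).
import itertools, math

def subsets(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in itertools.combinations(s, r)]

def to_theta(p, n):
    """F-step: θ_S = Σ_{T ⊆ S} (−1)^{|S|−|T|} ln p(state with bits T set)."""
    def state(T):
        return sum(1 << i for i in T)
    return {S: sum((-1) ** (len(S) - len(T)) * math.log(p[state(T)])
                   for T in subsets(S))
            for S in subsets(range(n))}

def from_theta(theta, n):
    """G-step: p(x) = exp(Σ_{S ⊆ bits(x)} θ_S)."""
    out = []
    for x in range(2 ** n):
        bits = frozenset(i for i in range(n) if (x >> i) & 1)
        out.append(math.exp(sum(t for S, t in theta.items() if S <= bits)))
    return out

n = 3
p = [0.30, 0.05, 0.10, 0.15, 0.05, 0.10, 0.05, 0.20]  # any positive distribution
theta = to_theta(p, n)
q = from_theta(theta, n)  # q reconstructs p component by component
```

Note that θ_0 = ln p_0 absorbs the normalization, so no separate partition function is needed in the round trip.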

An Example
Let an ANN have 3 neurons, (x_1, x_2, x_3), and let a distribution over its 8 configurations be {p_0, p_1, …, p_7}. The θ-transformation converts this x-representation into the θ-coefficients {θ_0, θ_1, θ_2, θ_12, θ_3, θ_13, θ_23, θ_123}.

Reduction of ANN from Higher Order to Lower Order
In this section, we show that we can reduce a higher order ANN to a lower order ANN by introducing more layers. We start with the base case of three neurons; then we proceed to the inductive step of the mathematical induction.

Third Order ABM
Assume that we have a first ANN with three neurons in one layer, x = {x_1, x_2, x_3}; the higher order distribution is:

p(x) = e^{H(x)}, H(x) = θ_0 + Σ_i θ_i x_i + Σ_{i<j} θ_ij x_i x_j + θ_123 x_1 x_2 x_3.

There is only one third order term in the above distribution for 3 neurons.

We simulate the first network with a second ANN with two layers, y = {y_1, y_2, y_3, y_4}, where the first three y-neurons copy the x-neurons, y_i = x_i, and the transition from the first layer to the second layer is y_4 = y_1 y_2. The distribution of the second network is:

p(y) = e^{H(y)}, H(y) = θ'_0 + Σ_i θ'_i y_i + Σ_{i<j} θ'_ij y_i y_j.

There are only connections up to the second order. Separating y_4, we have:

H(y) = θ'_0 + Σ_{i≤3} θ'_i y_i + Σ_{i<j≤3} θ'_ij y_i y_j + θ'_4 y_4 + Σ_{i≤3} θ'_i4 y_i y_4.

Substituting y_4 = x_1 x_2 and y_i = x_i, the term θ'_34 y_3 y_4 = θ'_34 x_1 x_2 x_3 reproduces the third order term, so the two-layer, second order network realizes the third order distribution.

For the inductive step, assume a network of n neurons with terms up to order n − 1 in its distribution. The (n − 1) order distribution of y = {y_1, …, y_n, y_{n+1}} is p(y) = e^{H(y)}, where H(y) contains connections up to the (n − 1) order. Separating y_{n+1} from the other first order terms, substituting y-neurons by x-neurons, and using the condition θ_{n+1} = 0, the lower order terms are unchanged. Separating y_{n+1} from the (n − 1) order term, substituting y-neurons by x-neurons, and using the transition y_{n+1} = x_1 x_2, the (n − 1) order term in y becomes an n-th order term in x. Therefore, an n-th order distribution over n neurons can be realized by an (n − 1) order distribution over n + 1 neurons with one additional layer; repeating this reduction, any higher order ANN reduces to a second order ANN with more layers.
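The key identity in this reduction can be verified exhaustively. The sketch below (coefficient value assumed for illustration) checks that with an auxiliary second-layer neuron y4 = x1 x2, a second order term in y equals the third order term in x on every binary configuration:

```python
# Verify the reduction step: with the deterministic layer transition
# y4 = x1 * x2, the second order term θ * y4 * x3 equals the third order
# term θ * x1 * x2 * x3 for every binary configuration of (x1, x2, x3).
import itertools

theta_123 = 1.7  # arbitrary third order coefficient (assumed for illustration)
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    y4 = x1 * x2                       # layer-2 neuron computes the AND of x1, x2
    third_order = theta_123 * x1 * x2 * x3
    second_order = theta_123 * y4 * x3
    assert third_order == second_order
```

Since binary neurons satisfy x^2 = x, the product y4 = x1 x2 is itself a valid binary neuron value, which is what makes the layer-by-layer order reduction exact rather than approximate.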

The transformations for the remaining higher order terms are similar. Therefore, an ABM can realize a θ-expansion, which in turn can approximate any distribution; the ABM is complete [12] [13] [14].