Stochastic Binary Neural Networks for Qualitatively Robust Predictive Model Mapping

We consider qualitatively robust predictive mappings of stochastic environmental models, where protection against outlier data is incorporated. We utilize digital representations of the models and deploy stochastic binary neural networks that are pre-trained to produce such mappings. The pre-training is implemented by a back-propagating supervised learning algorithm which converges almost surely to the probabilities induced by the environment, under general ergodicity conditions.


Introduction
We consider the case where the statistical behavior of environmental models must be learned in real time. In particular, we focus on learning such behavior predictively, as may be applicable in data compression, hypothesis testing or model identification, while statistical qualitative robustness for protection against outlier data is sought as well. In this paper, we promote the deployment of stochastic binary neural networks which implement predictive model mappings in real time, in interaction with the environment (i.e., via supervised learning), while they also offer sound protection against data outliers. Our approach uses results from stochastic approximation and statistical qualitative robustness [1][2][3][4][5][6][7][8][9]. While such powerful results have been in existence for a long time, they have not been considered synergistically, in the light of neural network implementations. In this paper, our objective is to stimulate interest in the application of the existing theories in such implementations, especially those addressing environmental models.
In the domain of stochastic neural networks, some more recent results address time-delay issues (Liu et al. [35] and Wang et al. [36]), while the book by Ling [37] discusses some general aspects in this area.
The organization of this paper is as follows. In Section 2, we introduce digital finite memory qualitatively robust predictive mappings, as well as the neural network layers needed for their implementation. In Section 3, we describe the operations performed at the predictive neural network layer. In Section 4, we present the supervised learning algorithm used at the predictive layer. In Section 5, we draw some conclusions.

Digital Finite Memory Qualitatively Robust Mapping
We consider digital environmental representations. Let $x_1, \ldots, x_n$ denote the sequence of observations up to time n. The objective of the digital mapping is to predict which one of M distinct regions the next observation $x_{n+1}$ is going to lie in. Denoting these regions $A_j$; $j = 1, \ldots, M$, let us define the probabilities $P(x_{n+1} \in A_j \mid x_1, \ldots, x_n)$; $j = 1, \ldots, M$, which are used to map stochastically an observed sequence $x_1, \ldots, x_n$ onto each of the regions $\{A_j\}$, with corresponding probabilities $P(x_{n+1} \in A_j \mid x_1, \ldots, x_n)$. Two problems arise immediately:
1) Exploding computational load, due to the increasing memory represented by the sequences $x_1, \ldots, x_n$.
2) Statistical information on the sequences $\{x_1, \ldots, x_n\}$ needed for the computation of the probabilities $P(x_{n+1} \in A_j \mid x_1, \ldots, x_n)$.
The first problem is resolved if the increasing memory is approximated by finite, say size-m, memory. That is, the increasing computational load is, instead, bounded if the process that generates the observations is approximated by an m-order Markov process. Then, the information loss is minimized when the process is Gaussian (see Blahut [26]). Thus, to reduce the exploding computational load due to increasing data memory, we may initially model the process that generates the environmental data or observations by an m-order Gaussian Markov process, whose m × m auto-covariance matrix Q has components identical to those of the original process. We name this initial (Gaussian and Markov) process the nominal process.
Starting with our nominal process, and then incorporating statistical uncertainties in the form of unknown data outliers, we are led to a powerful qualitatively robust formalization, which results in a stochastic mapping (see Papantoni-Kazakos et al. [38]), as follows: Given observations $x_1, \ldots, x_n$, use the m most recent observations, collected in the vector $y^m$, for the prediction of the next datum $x_{n+1}$, and map onto the regions $\{A_j\}$ stochastically, with the probabilities $q(A_j \mid y^m)$ defined by expression (1). In (1), $p(A_j \mid y^m)$ denotes the probability induced by the Gaussian and Markov nominal process, and the function $r(y^m)$ is defined by expression (2), for some positive constant.

The value of the constant in (2) represents the level of confidence in the "purity" of the data vector $y^m$, in terms of it being generated by the nominal Gaussian process: the higher the value of the constant, the higher the level of confidence, whereas, as the constant decreases, increased weight on purely random mappings (represented by the probability $1/M$ per region) is induced.

Robust estimation of the auto-covariance matrix Q may also be required. The components of the auto-covariance matrix Q should emerge from the statistics of the nominal Gaussian process. A scheme for the robust estimation of the matrix Q may arise from robust parameter estimation techniques (see Kazakos et al. [3]).
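The robust prediction expression in (1) is based on a Gaussian assumption regarding the nominal process which generates the data in the environment, where the latter assumption is the result of an information-theoretic approach to the reduction of the computational load caused by increased past memory. The important robust effects induced by the mapping in (1) remain unaltered, however, when the probability in (1) arises instead from an arbitrary non-Gaussian process, and when its conditioning on $y^m$ is substituted by conditioning on quantized values of a scalar quadratic expression in $y^m$. When quantized values are involved, the implementation of the mapping in (1) requires the following stages:
1) Preprocessing. This stage corresponds to long-term memory and involves the robust pre-estimation (see Kazakos et al. [3]) and storage of the auto-covariance matrix Q.
2) Processing. This stage corresponds to short-term memory. It uses the matrix from the preprocessing step and the observation vector $y^m$ to: a) first compute the quadratic expression; b) then represent the latter in a quantized form comprised of N distinct values; and c) finally, use the quantized values in b) to compute the corresponding value of the function $r(y^m)$ in (2).
3) Predictive Mapping. This stage involves the estimation of the probabilities in (1) using inputs from the processing stage, and the subsequent implementation of the prediction mappings.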
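As a rough illustration of the processing stage, the following Python sketch computes a quadratic expression in the observation vector and maps it to one of N quantized states. The specific quadratic form $y^T Q^{-1} y$, the threshold values, and the names `quadratic_form`, `quantize`, `Q_INV`, and `THRESHOLDS` are assumptions made for illustration only; they are not specified in the text.

```python
from bisect import bisect_right


def quadratic_form(y, q_inv):
    """Return y^T * q_inv * y for a vector y and a square matrix q_inv (nested lists)."""
    m = len(y)
    return sum(y[i] * q_inv[i][j] * y[j] for i in range(m) for j in range(m))


def quantize(value, thresholds):
    """Map a scalar to one of N = len(thresholds) + 1 states (0 .. N-1).

    thresholds must be sorted in increasing order.
    """
    return bisect_right(thresholds, value)


# Hypothetical stored matrix for memory m = 2 (identity: uncorrelated, unit variance).
Q_INV = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical thresholds yielding N = 4 quantized states.
THRESHOLDS = [1.0, 4.0, 9.0]

state_typical = quantize(quadratic_form([0.5, -0.5], Q_INV), THRESHOLDS)
state_outlier = quantize(quadratic_form([3.0, 4.0], Q_INV), THRESHOLDS)
```

Under these assumed values, a vector close to the origin falls in the lowest state, while a distant vector (a candidate outlier) falls in the highest one; the state would then select the corresponding value of the function in (2).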
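The three different stages above are performed sequentially by separate but connected neural structures, named preprocessing layer, processing layer, and predictive mapping layer, respectively. Our focus in this paper is on the latter layer: its structure and its operations. Towards that direction, we first note that, due to the quantization operations at the processing layer, the expression in (1) takes the following form:

$$q_{\ell j} = (1 - r_\ell)\, p_{\ell j} + r_\ell\, M^{-1}; \quad j = 1, \ldots, M \qquad (3)$$

where $\ell$ denotes the quantized observation state produced by the processing layer, and where $q_{\ell j}$, $p_{\ell j}$, and $r_\ell$ denote, respectively, the probabilities $q(A_j \mid \ell)$ and $p(A_j \mid \ell)$, and the value of the function in (2) at the state $\ell$.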
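The effect of the robust mixing can be sketched numerically. The snippet below assumes the mixture form suggested by the tree-search description in the next section, in which the uniform (purely random) mapping receives weight r and the nominal probabilities receive weight 1 - r; the function name `robust_probabilities` and the numerical values are hypothetical.

```python
def robust_probabilities(p, r):
    """Mix nominal region probabilities p with the uniform distribution.

    p : list of M nominal probabilities (assumed to sum to 1)
    r : weight in [0, 1] placed on the uniform mapping (assumption, see lead-in)
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    m = len(p)
    return [(1.0 - r) * pj + r / m for pj in p]


# Hypothetical nominal probabilities over M = 4 regions.
nominal = [0.7, 0.2, 0.1, 0.0]
mixed = robust_probabilities(nominal, 0.2)
```

With r = 0 the nominal probabilities are returned unchanged, and with r = 1 every region receives probability 1/M, which is the sense in which low confidence in the data induces purely random mappings.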

The Neural Predictive Layer
Consider the integer M in (3), and let s be the unique positive integer such that $2^{s-1} < M \le 2^s$. Then, in modulo-2 arithmetic, each state j can be represented by an s-length 0-1 binary sequence $x_1 \ldots x_s$. The state $\ell$ is provided as an input to the prediction layer by the processing layer, and the former produces a binary sequence $x_1 \ldots x_s$ as a prediction mapping. Given the state $\ell$, the operations of the prediction layer must be such that a given prediction sequence $x_1 \ldots x_s$ is produced stochastically with probability

$$q(x_1 \ldots x_s \mid \ell) = (1 - r_\ell)\, p(x_1 \ldots x_s \mid \ell) + r_\ell\, M^{-1} \qquad (4)$$

where expression (4) is the same as expression (3) when the binary representation of the positive integer j in the latter is $x_1 \ldots x_s$, and where $p(x_1 \ldots x_s \mid \ell)$ is the prediction mapping generated by the nominal process that represents the actual data generating environment. Due to the stochastic nature of the rule in (4), such is also the nature of the predictive mapping layer, whose neural representation then corresponds to a stochastic neural network, first developed by Kogiantis et al. [17], when the response of each neuron is limited to binary. We proceed with the description of the latter representation.
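The binary representation of the region index can be illustrated with a short sketch. It assumes zero-based indices 0, ..., M - 1 (the text counts regions from 1) and most-significant-bit-first ordering; the helper names `sequence_length`, `to_bits`, and `from_bits` are hypothetical.

```python
def sequence_length(m):
    """Smallest positive integer s with m <= 2**s."""
    return max(1, (m - 1).bit_length())


def to_bits(j, s):
    """Represent the zero-based index j as an s-length list of 0/1 values (MSB first)."""
    return [(j >> (s - 1 - k)) & 1 for k in range(s)]


def from_bits(bits):
    """Inverse of to_bits: recover the zero-based index from its bit list."""
    j = 0
    for b in bits:
        j = (j << 1) | b
    return j


s = sequence_length(8)   # M = 8 regions
bits = to_bits(5, s)     # binary sequence x_1 ... x_s for index 5
```

When M is not a power of two, some of the $2^s$ binary sequences do not correspond to any region, which is one reason the exposition later assumes M = 2^s.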
Let us temporarily assume that the probabilities $p(x_1 \ldots x_s \mid \ell)$ have been "learned" and are known. Without loss of generality, let us also assume that $M = 2^s$. The original constraint of binary firing per neuron in the layer leads us to the digital representation of the future states $x_1 \ldots x_s$, and to the sequential representation

$$p(x_1 \ldots x_s \mid \ell) = p(x_1 \mid \ell) \prod_{k=2}^{s} p(x_k \mid x_1 \ldots x_{k-1}, \ell) \qquad (5)$$

The design can be accommodated easily in a binary tree structure. In detail, given the observed state $\ell$ and the resulting $r_\ell$ value, the mapping $x_1 \ldots x_s$ can be obtained via a stochastic binary tree search, on the $2^s$-leaves tree, as follows:
1) With probability $r_\ell$ a fair tree-search is activated, where the tree-node $x_1$; $x_1 = 0, 1$ is visited with probability 0.5, and each of the two tree-nodes branching off a visited tree-node $x_1 \ldots x_k$; $1 \le k \le s - 1$ is also visited with probability 0.5;
2) With probability $1 - r_\ell$ a generally biased tree-search is activated, where the tree-node $x_1$ is visited with probability $p(x_1 \mid \ell)$, and the tree-node $x_1 \ldots x_k$ branching off a visited tree-node $x_1 \ldots x_{k-1}$ is visited with probability $p(x_k \mid x_1 \ldots x_{k-1}, \ell)$.

Thus, the predictive mapping layer may be viewed as being comprised of a fair-search binary tree and a number of biased-search binary trees, each of the latter corresponding to a specific observation state $\ell$. Given $\ell$, the common fair-search binary tree is activated with probability $r_\ell$, while, with probability $1 - r_\ell$, the biased-search binary tree that corresponds to the state $\ell$ is activated instead; we name the latter tree the $\ell$-tree. The nodes of each of the above binary trees are neurons that "fire" if the corresponding tree-nodes are "visited." Given $\ell$, a specific mapping $x_1 \ldots x_s$ is generated either equiprobably from the fair-search binary tree with probability $r_\ell$, or from the $\ell$-tree via the sequential stochastic representation in (5) with probability $1 - r_\ell$. It is thus in the $\ell$-tree that the probabilities which generate the data of the environment must be "learned" and then used to generate prediction mappings.

Given the observation state $\ell$, consider the $\ell$-tree in conjunction with the sequential stochastic representation in (5) of the corresponding prediction mappings, as generated by the process representing the actual environmental data. Let $u_{x_1 \ldots x_k}$ represent the binary random output of the neuron that corresponds to the node $x_1 \ldots x_k$. Thus, the output $u_{x_1 \ldots x_s}$ may be viewed as generated by a product,

$$u_{x_1 \ldots x_s} = \prod_{k=1}^{s} W_{x_1 \ldots x_k} \qquad (6)$$

of mutually independent binary random variables $\{W_{x_1 \ldots x_k}\}$, whose distributions at the operational stage of the $\ell$-tree must be as follows (in view of (5)):

$$P(W_{x_1 \ldots x_k} = 1) = p(x_k \mid x_1 \ldots x_{k-1}, \ell); \quad k = 1, \ldots, s \qquad (7)$$

where, for k = 1, the conditioning is on the state $\ell$ only.

The above logical arguments and expressions lead to the following neural structure of the $\ell$-tree:
1) The neuron corresponding to the tree-node $x_1$; $x_1 = 0, 1$ has a binary random variable $W_{x_1}$ built in, and fires if and only if the latter variable takes the value 1. At the operational stage, the neuron must then be activated with probability $P(W_{x_1} = 1) = p(x_1 \mid \ell)$.
2) For $k \ge 2$, the neuron corresponding to the tree-node $x_1 \ldots x_k$ has a binary random variable $W_{x_1 \ldots x_k}$ built in and fires if and only if the latter variable takes the value 1 and simultaneously the neuron corresponding to the tree-node $x_1 \ldots x_{k-1}$ fires as well. Thus, the binary neural output is $u_{x_1 \ldots x_k} = W_{x_1 \ldots x_k}\, u_{x_1 \ldots x_{k-1}}$, where, at the operational stage of the $\ell$-tree, the probability $P(W_{x_1 \ldots x_k} = 1)$ must be as in (7).

We note that $u_{x_1 \ldots x_s} = 1$ if and only if every variable $W_{x_1 \ldots x_k}$ along the path equals 1, and thus $P(u_{x_1 \ldots x_s} = 1) = \prod_{k=1}^{s} P(W_{x_1 \ldots x_k} = 1) = p(x_1 \ldots x_s \mid \ell)$. As is clear from the derivations and arguments in this section, the parameters of interest in the $\ell$-tree neural network consist of the distributions of the independent binary random variables $\{W_{x_1 \ldots x_k}\}$, and must be "learned" in advance, via interaction with the environment.
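The stochastic tree search described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callable `cond_prob_one`, which stands in for the learned conditional probabilities of the biased tree, the illustrative rule `example_cond_prob_one`, and the numerical values are all hypothetical.

```python
import random


def sample_prediction(r, cond_prob_one, s, rng):
    """Draw one s-bit prediction sequence by a stochastic binary tree search.

    r             : probability of activating the fair-search tree
    cond_prob_one : hypothetical callable mapping the tuple of bits visited so
                    far to the probability that the next bit equals 1
    s             : sequence length (the tree has 2**s leaves)
    rng           : a random.Random instance
    """
    use_fair_tree = rng.random() < r
    bits = []
    for _ in range(s):
        # Fair tree: each branch has probability 0.5; biased tree: learned probability.
        p_one = 0.5 if use_fair_tree else cond_prob_one(tuple(bits))
        bits.append(1 if rng.random() < p_one else 0)
    return bits


def example_cond_prob_one(prefix):
    """Illustrative rule: repeat the previous bit with probability 0.9."""
    if not prefix:
        return 0.5
    return 0.9 if prefix[-1] == 1 else 0.1


rng = random.Random(0)
samples = [sample_prediction(0.2, example_cond_prob_one, 3, rng) for _ in range(5)]
```

Each bit drawn corresponds to one neuron firing along a root-to-leaf path, so a sequence is produced with the product of the branch probabilities along that path.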

Learning at the Predictive Layer
Given the $\ell$-tree, we observe that, due to (6), any adaptations of the probability $P(u_{x_1 \ldots x_s} = 1)$ back-propagate to adaptations of each of the other involved probabilities. It thus suffices to focus on the learning of the probabilities for the various binary sequences $x_1 \ldots x_s$, which correspond to the responses of the output or "visible" neurons in the $\ell$-tree network. For ease of presentation, let us now consider a fixed sequence $x_1 \ldots x_s$ (in conjunction with the fixed observed state $\ell$ that represents the $\ell$-tree). Let then p denote the value of the probability $p(x_1 \ldots x_s \mid \ell)$, as induced by the environment, and let q denote the value of the corresponding probability $P(u_{x_1 \ldots x_s} = 1)$ generated by the network. Let the natural number n denote discrete observation time from the beginning of the learning stage, and let $\hat{p}_n$ and $\hat{q}_n$ denote estimates at time n of the probability values p and q, respectively. Finally, let the random variable $V_n$ be defined as equal to 1 if the environmental event occurs at time n, and as equal to 0 otherwise. In Kogiantis et al. [17], a Kullback-Leibler matching criterion between p and q was used, in conjunction with Newton's iterative numerical method, to develop the supervised learning algorithm stated below.

Algorithm
Initial Values: Select initial values for the estimates $\hat{p}_n$ and $\hat{q}_n$.
Computational Steps:
1) Given the computed values $\hat{p}_n$ and $\hat{q}_n$, compute the updated estimate $\hat{p}_{n+1}$ via the recursion in expression (10). For some small positive value $\varepsilon$, a value of $\hat{p}_{n+1}$ below $\varepsilon$ is corrected to $\varepsilon$, and a value above $1 - \varepsilon$ is corrected to $1 - \varepsilon$.
2) Compute the updated estimate $\hat{q}_{n+1}$ via expression (11). For some small positive value $\delta$, a value of $\hat{q}_{n+1}$ below $\delta$ is corrected to $\delta$, and a value above $1 - \delta$ is corrected to $1 - \delta$.
Remarks:
1) Expression (10) is the computationally efficient recursive estimation format of the probabilities that represent the environment.
2) The expression in (11) includes correction terms induced by Newton's iterative numerical method, when the latter is applied to the Kullback-Leibler matching criterion between environmental probabilities and the probabilistic adaptations used in the supervised learning process. The last term in (11) converges to zero as the estimate in (10) converges to its true value. The second term in (11) converges to zero as the estimate of q converges to the estimate of p.
3) The small $\varepsilon$ and $\delta$ correction biases are used to prevent the corresponding probability estimates from diverging to the 0 or 1 degenerate values.
We now proceed with the statement of a theorem first proved in Kogiantis et al. [34].
Theorem. Let the process which generates the observed data in the environment be ergodic. Let then s denote the probability of the event $x_1 \ldots x_s$, given the state $\ell$, as induced by the latter process. Then, the supervised learning algorithm converges to the probability s, almost surely, with rate inversely proportional to the sample/iteration size n.
Proof Outline. Here, we present an outline of the Theorem's proof.
1) If the observed data are generated by an ergodic process, then the recursive sequential estimate in (10) converges almost surely to the probability s.
2) The expected drift of the estimate $\hat{q}_n$ from time n to time n + 1, conditioned on $\hat{q}_n = s$, equals zero, as deduced from expression (11).
3) In view of the result in 2), it is then shown that the supremum of the conditional expected drift in 2) multiplied by n converges to negative values, for all values of the absolute difference $|\hat{q}_n - s|$ that are larger than some given small positive value.
4) Finally, using Blum's condition [11], the results in 2) and 3) above guarantee almost sure convergence.
We note that, in the Theorem, if the process that generates the observed data in the environment is ergodic, and if $s(x_1 \ldots x_s \mid \ell)$ denote the prediction mappings induced by the latter process, then, via the learning algorithm and with almost sure convergence, the prediction mappings produced by the predictive mapping layer are governed by the probabilities $(1 - r_\ell)\, s(x_1 \ldots x_s \mid \ell) + r_\ell\, 2^{-s}$, where $r_\ell$ is the probability of activating the fair-search tree at the state $\ell$. In Kogiantis et al. [34], it was found that the learning algorithm converges rapidly to predictive probability mappings that are close to those induced by the environment, even under mismatched network conditions. Specifically, when past dependence decays fast with distance, then, even when the network order is less than the order of the Markovian environmental model, convergence to almost the true process is attained in fewer than fifty iterations in most cases.
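The recursive estimation of an environmental probability, together with the small correction bias that keeps the estimate away from 0 and 1, can be sketched as below. The sketch assumes that the recursion takes the standard running-mean form; it does not reproduce the Newton-type update in (11). The names `clamp` and `recursive_estimate`, the value of `eps`, and the simulated Bernoulli events are hypothetical.

```python
import random


def clamp(x, eps):
    """Keep a probability estimate inside [eps, 1 - eps]."""
    return min(max(x, eps), 1.0 - eps)


def recursive_estimate(events, eps=1e-3):
    """Running-mean estimate of P(event = 1), corrected after every step.

    events : iterable of 0/1 indicators (1 if the environmental event occurred)
    eps    : small correction bias (assumed value)
    """
    p_hat = 0.5  # placeholder; the first update (weight 1/1) overwrites it
    for n, v in enumerate(events, start=1):
        p_hat = p_hat + (v - p_hat) / n   # recursive form of the sample mean
        p_hat = clamp(p_hat, eps)         # avoid the degenerate values 0 and 1
    return p_hat


# Simulated i.i.d. (hence ergodic) environment with true probability 0.3.
rng = random.Random(42)
events = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]
estimate = recursive_estimate(events)
```

For an i.i.d. source such as the simulated one, the estimate approaches the true probability as the number of observations grows; the correction step only matters when the running mean would otherwise reach 0 or 1.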

Conclusion
We presented a neural network implementation for a digital qualitatively robust predictive mapping of environmental models. The mapping uses synergistically results from statistical qualitative robustness and stochastic binary neural representation to realize digital real-time predictive operations which identify the environmental model, while simultaneously protecting the operations against data outliers. The supervised learning algorithm recommended for the training of the neural network is based on stochastic approximation principles applied to the Kullback-Leibler matching criterion, in conjunction with Newton's iterative numerical method, and converges almost surely for models generated by ergodic processes. The considered predictive mappings have numerous applications, ranging from data compression to model identification and sequential model hypothesis testing.
