Analysis of Soft Decision Trees for Passive-Expert Reinforcement Learning
1. Introduction
Artificial neural networks (ANNs) have historically been hard to dissect, and they are therefore conventionally regarded as “black boxes” with respect to their internal workings [2]. The lack of “transparency” about how they arrive at a final recommendation or conclusion has kept ANNs from being widely adopted in various industries [3] [4]. The black-box stigma persists because an ANN is generally a trend-based approximation of the available data, meant to represent cause-and-effect relationships in that data regardless of the application [5]. This leaves uncertainty in its predictions, which could have unexpected negative effects in commercial applications [6].
Even though complete certainty is unachievable, there are various methods for observing the decision-making process in ANNs. One such method is to dissect the filters of an image recognition model to give context on how the network processes and transforms data to make inferences. Applying such a method to reinforcement learning problems is no different: instead of sorting inputs into categories, the model classifies actions within a discrete action space [7] [8]. One problem is that most networks used for Deep Q-Learning [9], a subset of reinforcement learning, share a lumped-layer network model from which extracting individual weights to determine learned inferences is not feasible. Soft decision trees address this problem because they have a filter at each node of a binary tree that uses a softmax or sigmoid activation to determine the next filter to use, until a leaf node is reached. The leaf node returned contains the probability distribution for the classification of the analyzed data. A soft decision tree can therefore route an input along any of 2^n different possible paths, where n is the number of layers in the tree network. In effect, a single soft decision tree can mimic 2^n different ANNs, reflecting a case-by-case analysis of the data being processed [10] [11] [12]. Applying these passive-expert-like networks to control problems allows the inferences made by the model to be analyzed, which can help prevent the unexpected negative consequences of bad ANN predictions that would be detrimental to commercial applications.
2. Soft Decision Tree Neural Networks
Soft decision trees are essentially a binary tree of nodes, each consisting of a linear layer followed by an activation function that generates a probability distribution, specifically sigmoid or softmax. At inner nodes the output size is limited to 2; at leaf nodes it is limited to the desired output space. This is shown in Figure 1. The network uses a loss function, L(x), that minimizes the cross entropy between each leaf, weighted by its path probability, and the target distribution (Equation (1)) [1]. To prevent stagnation of the network and dependence on only one path, a “penalty” is introduced that promotes the use of various paths to evaluate data in the network. This penalty is the cross entropy between the desired average distribution (0.5, 0.5) for the two sub-trees and the actual average distribution (α, 1 − α) (Equations (2) and (3)) [1]. The sum of all the penalties is then multiplied by λ, a hyperparameter of the model that determines the strength of the penalty applied to each layer. The strength decays in proportion to 2^(−N) for each layer N. This technique was used to achieve better accuracy [1].
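The routing described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the heap-ordered node layout, the convention that the sigmoid gives the probability of branching right, and all function and parameter names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def leaf_path_probabilities(x, inner_weights, inner_biases, depth):
    """Return the path probability of each leaf for input x.

    Assumed layout: nodes are stored in heap order, so inner node i has
    children 2i + 1 (left) and 2i + 2 (right); `depth` counts inner layers.
    """
    n_inner = 2 ** depth - 1
    path = [0.0] * (2 ** (depth + 1) - 1)
    path[0] = 1.0  # every input reaches the root
    for i in range(n_inner):
        z = sum(w * v for w, v in zip(inner_weights[i], x)) + inner_biases[i]
        p_right = sigmoid(z)  # assumed convention: probability of going right
        path[2 * i + 1] = path[i] * (1.0 - p_right)
        path[2 * i + 2] = path[i] * p_right
    return path[n_inner:]  # the last 2**depth entries are the leaves


def predict(x, inner_weights, inner_biases, leaf_logits, depth):
    """Return the distribution of the most probable leaf and all leaf path probabilities."""
    leaf_probs = leaf_path_probabilities(x, inner_weights, inner_biases, depth)
    best = max(range(len(leaf_probs)), key=leaf_probs.__getitem__)
    return softmax(leaf_logits[best]), leaf_probs
```

Because every leaf carries an explicit path probability, the sequence of gates that led to a prediction can be read off directly, which is the property that makes the tree easier to inspect than a lumped-layer network.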
(1)
(2)
where,
(3)
and P^i(x) is the path probability from the root node to node i.
Figure 1. An overview of the structure of a soft decision tree [1].
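Equations (1) to (3) can be written out as follows. This is a sketch based on the prose description in this section and on the soft decision tree formulation it attributes to [1], so the exact typeset form may differ (for example, by an outer logarithm in the loss). The symbols T_k, Q_k^ℓ, and p_i(x) are assumed notation that the text does not define.

```latex
% Assumed notation (not defined in the text):
%   T_k        target probability of class k
%   Q_k^{\ell} probability that leaf \ell assigns to class k
%   p_i(x)     probability that inner node i routes x to one of its two sub-trees
\begin{align}
L(x) &= -\sum_{\ell \in \mathrm{Leaves}} P^{\ell}(x) \sum_{k} T_k \log Q_k^{\ell} \tag{1} \\
C &= -\lambda \sum_{i \in \mathrm{InnerNodes}}
     \bigl[\, 0.5 \log \alpha_i + 0.5 \log (1 - \alpha_i) \,\bigr] \tag{2} \\
\alpha_i &= \frac{\sum_{x} P^{i}(x)\, p_i(x)}{\sum_{x} P^{i}(x)} \tag{3}
\end{align}
```

Read this way, the penalty C is smallest when each α_i equals 0.5, that is, when an inner node sends the data it receives to its two sub-trees in equal measure on average.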
3. Deep Q-Learning Reinforcement Learning
In this study, only deep Q-learning methods were evaluated for soft decision trees on the classical control problems implemented in OpenAI's Gym package. Deep Q-learning approximates the value function Q(s, a) by summing the reward at the current state with a discounted reward, which is estimated by evaluating the next state of the environment with the model and applying the discount factor γ (Equation (4)) [9] [13].
(4)
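The update described above matches the standard Q-learning target, Q(s, a) = r + γ · max over a' of Q(s', a'), which is assumed here to be what Equation (4) expresses. A minimal sketch follows; the function name, the default value of gamma, and the handling of terminal states are assumptions, not details taken from the paper.

```python
def q_target(reward, next_q_values, gamma=0.99, done=False):
    """Return the Q-learning target for one transition.

    reward        : reward observed at the current state
    next_q_values : the model's Q estimates for every action in the next state
    gamma         : discount factor (0.99 is an assumed default)
    done          : True if the next state is terminal (assumed handling)
    """
    if done:
        # No future reward can follow a terminal state.
        return reward
    return reward + gamma * max(next_q_values)
```

For example, with a reward of 1.0, next-state estimates of 0.5 and 2.0, and gamma set to 0.9, the target is 1.0 + 0.9 × 2.0 = 2.8.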
To fit the network to the highest rewarded state-action pairs, a replay buffer is created: a queue of length n that is randomly sampled to generate a dataset of states and reward-weighted action output values. Each episode, which is a collection of the state, action, reward, and next state, is recorded into the replay buffer. In this implementation, only the highest rewarded actions were stored in the replay buffer. After sampling and generating a minibatch of state and rewarded-action pairs, the network was fitted to the minibatch. This was done repeatedly until the network converged. This process is demonstrated graphically in Figure 2. This attribute was arbitrary for the various control problems analyzed in this paper.
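A replay buffer of this kind can be sketched with a bounded queue and uniform random sampling. The class and method names are hypothetical, and the sketch deliberately omits the filter that keeps only the highest rewarded actions, since the text does not say how that selection was made.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded queue of transitions with uniform random minibatch sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self.transitions = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def record(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Never request more items than the buffer currently holds.
        k = min(batch_size, len(self.transitions))
        return self.rng.sample(list(self.transitions), k)

    def __len__(self):
        return len(self.transitions)
```

Sampling at random rather than replaying transitions in order breaks up the correlation between consecutive steps of the same episode, which is the usual motivation for a replay buffer in deep Q-learning.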
Figure 2. Reinforcement learning flowchart diagram.
4. Evaluated Models
A basic DQN (Figure 3) was evaluated as a baseline. It used ReLU activations after every layer except the last, which used a softmax activation. Only layers 3 and 6 were followed by dropout layers, both with a rate of 0.2. The input size matched the state space of the cartpole environment, which has 4 dimensions. The output size matched the action space of the cartpole environment, which has 2 actions. The loss function used for the DQN was the mean squared error (MSE) loss.
The soft decision tree configured for reinforcement learning was built using 6 layers and a penalty of 5.00E−4 (Figure 4 and Figure 5). The loss function used for the soft decision tree is depicted in Equation (1).
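To make the configuration concrete, the sketch below computes the penalty strength for each of the 6 layers, assuming the 2^(−N) decay described in Section 2 and assuming that the root layer has depth 0. The function name and the depth convention are illustrative assumptions.

```python
def layer_penalty_strengths(base_penalty=5.0e-4, n_layers=6):
    """Return the penalty strength for each layer, halving with every level.

    Assumes the root layer is depth 0, so it receives the full base penalty.
    """
    return [base_penalty * 2.0 ** (-depth) for depth in range(n_layers)]
```

Under these assumptions the deepest of the 6 layers is penalized 32 times more weakly than the root, so the pressure to split data evenly is concentrated near the top of the tree.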
5. Computational Implementation and Results
To help the models converge quicker, a variable reward function (Equation (5)) was used specifically for cartpole.
(5)
The variable reward function presented is a demonstration of the pre-processing methods used to make pattern recognition in neural networks easier. Specifically, the function takes the accumulated rewards of the current episode and subtracts the squared position and theta, each divided by its episode termination threshold, because those thresholds are the maximum values before an environment reset is triggered. A target network was not used because of the difficulty of swapping weights with soft decision trees. Therefore, the convergence of the implemented Q-learning algorithm is sporadic and random for the random seed used, 1.
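One possible reading of this reward function is sketched below. The exact form of Equation (5) is not recoverable from the prose, so the choice to square both terms before dividing by the thresholds is an assumption, as are the function and constant names. The threshold values are the termination limits of the Gym cartpole environment.

```python
import math

# Termination limits of the Gym cartpole environment.
X_THRESHOLD = 2.4                          # cart position limit
THETA_THRESHOLD = 12 * 2 * math.pi / 360   # pole angle limit (12 degrees, in radians)


def shaped_reward(accumulated_reward, x, theta):
    """Accumulated episode reward minus position and angle penalties.

    Assumed form: each state variable is squared and then divided by its
    termination threshold; the paper's Equation (5) may differ.
    """
    position_penalty = (x ** 2) / X_THRESHOLD
    angle_penalty = (theta ** 2) / THETA_THRESHOLD
    return accumulated_reward - position_penalty - angle_penalty
```

With this form the reward is unchanged when the cart is centered and the pole is upright, and it shrinks as either variable approaches the value that would end the episode.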
Usually, a target network and an actor network are used to handle the randomness of the various states and the constant updating of weights in deep Q-learning. The effect of omitting them is illustrated in Figure 6, where there are significant jumps in rewards between episodes. These jumps demonstrate how unstable and unpredictable this algorithm is for reinforcement learning without safeguards such as a target-actor network training scheme.
Overall, the results attained during this study are promising. The authors were able to show that the use of soft decision trees in basic reinforcement learning with DQN networks can help in understanding the inner workings of optimal Q-value learning in ANN models. However, further runs with the addition of carefully designed safeguarding paradigms are needed to fully assess the efficacy of such an approach.
Figure 4. Soft decision tree configuration for deep Q-learning.
Figure 5. Parameters for determining discounted rewards, replay, and model training.
Figure 6. PyTorch TensorBoard summary writer output graph for the baseline DQN model, reporting the accumulated and current state rewards versus time until convergence.