Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning

Double Q-learning has been shown to be effective in reinforcement learning scenarios where the reward system is stochastic. We apply the idea of double learning used by this algorithm to Sarsa and Expected Sarsa, producing two new algorithms, Double Sarsa and Double Expected Sarsa, that are shown to be more robust than their single counterparts when rewards are stochastic. We find that these algorithms add a significant amount of stability to the learning process at only a minor computational cost, which leads to higher returns when using an on-policy algorithm. We then use shallow and deep neural networks to approximate the action-value, and show that Double Sarsa and Double Expected Sarsa are much more stable after convergence and can collect larger rewards than the single versions.

There are multiple approaches to finding the optimal policy π*(a|s), which gives the probability of taking an action a given a state s. One set of techniques, known as Policy Gradient methods, directly search the space of available policies for one that maximizes the accumulated discounted reward per episode, g = Σ_t γ^t r_t, where γ is the discount rate [2]. Another set of approaches, called Temporal Difference (TD) methods [3], estimate the value of a particular state, V(s), or state-action pair, Q(s, a), and use these values to derive a policy π(a|s) that maximizes these value functions at each step instead of maximizing g. There are other techniques that combine the ideas of Policy Gradient and Temporal Difference methods, most notably a class of algorithms called Actor-Critic [4] [5], but in this paper we only consider algorithms that fall under the Temporal Difference category. Within this category, there are two main types of algorithms: on-policy and off-policy [4]. With off-policy algorithms, the target policy being learned is different from the behavior policy, which is the policy that the agent uses to explore the environment. For example, the behavior policy might be to choose completely random actions, while the target policy might be to always take the action with the largest expected return. In contrast to off-policy algorithms, the target and behavior policies are the same with on-policy algorithms.
One of the most popular Temporal Difference algorithms is Q-learning, first proposed in [6]. Q-learning is an off-policy algorithm that learns the greedy action-value function. However, in many scenarios an off-policy algorithm is not realistic, as it does not account for possible rewards and penalties that might result from an exploratory behavior policy. For example, receiving immediate returns might be more important than a truly optimal policy, and while an on-policy algorithm may not converge to the optimal policy, it may still converge in fewer time steps than an off-policy algorithm to a policy which may be considered "sufficient" for the problem domain. Often, these on-policy algorithms have stochastic policies that encourage exploration of the environment, which can also be a beneficial quality when the environment is subject to change. One such policy is called ε-greedy [6], which uses the parameter ε to control the probability that a random action will be taken instead of the greedy one.
A simple on-policy algorithm that is similar to Q-learning is called Sarsa [8]. Like Q-learning, it learns the action-values at each step, but unlike Q-learning, it depends solely on the states visited and actions taken. Because of this, Sarsa's action-value estimate Q(s, a) never converges when the learning rate α is constant, although for a sufficiently small α, the policy can converge to one that balances exploration and exploitation. For example, if an ε-greedy policy is used, the policy that Sarsa converges to will avoid states that are adjacent to other states with a large negative reward. In other words, the policy will account for the possibility of random actions and take a path which figuratively does not come "too close to the edge". A similar on-policy algorithm is called Expected Sarsa [9]; like Sarsa, this algorithm converges to a policy that balances exploration and exploitation. However, unlike Sarsa, the action-value estimate also converges, which allows for much higher learning rates to be utilized. Notably, the policies of both of these algorithms can only converge if the reward is deterministic; if it is stochastic, the policies are much less likely to converge (unless a sufficiently small learning rate is used).
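To make the distinction concrete, the two tabular update rules can be sketched as follows. This is a minimal Python illustration under an ε-greedy policy; the function names and default parameters are our own choices, not the implementation used in the experiments:

```python
import numpy as np

def epsilon_greedy_probs(Q, s, eps=0.1):
    """Action probabilities for an eps-greedy policy over the table row Q[s]."""
    n = Q.shape[1]
    probs = np.full(n, eps / n)
    probs[int(np.argmax(Q[s]))] += 1.0 - eps
    return probs

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95, terminal=False):
    """Sarsa: the target uses the action a_next actually taken at s_next."""
    target = r if terminal else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95,
                          eps=0.1, terminal=False):
    """Expected Sarsa: the target averages Q[s_next] over the policy,
    removing the randomness of the sampled next action."""
    expected = float(np.dot(epsilon_greedy_probs(Q, s_next, eps), Q[s_next]))
    target = r if terminal else r + gamma * expected
    Q[s, a] += alpha * (target - Q[s, a])
```

Because Expected Sarsa's target does not depend on the sampled next action, its per-step target has lower variance, which is what permits the larger learning rates discussed above.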
The algorithms presented above use a tabular format to store the action-values, i.e. there is a single entry in the table for every (s, a) pair; as such, they are limited to simple problems where the state-action space is small. For many real problems, this is not the case, especially when the state-space is continuous; function approximation must be used instead. In application to Temporal Difference algorithms, it is the value functions that are approximated [10], and a variety of techniques from supervised learning are used. A more recent development is the application of deep learning [11] to Q-learning, termed a Deep Q-Network [12]. Deep learning refers to function approximation with neural networks that have many layers, and has been shown to be effective in reinforcement learning problems with large state-action spaces, such as those encountered in Atari games. Double learning has also been applied to Deep Q-Networks, which is referred to as Deep Double Q-learning [13], and has shown success in the same domain. However, recent work has shown that shallow networks can achieve similar results [14], so the advantage of deep learning over shallow learning appears to be highly domain-dependent. Additionally, deep learning has been applied to Actor-Critic methods, combining Deep Q-Networks with recent developments in deterministic policy gradients [15] to produce a robust learning algorithm [16].
The current state-of-the-art in reinforcement learning can be seen in [17], which combined many techniques in order to learn the game of Go. This study used supervised learning to initialize a policy network, and then improved this network through self-play, generating new data. This data was then used to train a value network.
During game play, a Monte Carlo Tree Search algorithm was used to simulate future moves and choose the best action. This efficient use of data to solve a problem with about 2.08 × 10^170 states [18] represents a significant achievement in the field of reinforcement learning, and shows the power of combining multiple different techniques.
The study used a combination of supervised learning and reinforcement learning to train both a policy and a value network, and combined real data with simulated data to improve training. Additionally, although many CPUs and GPUs were used during training, the final rollout action selection was very efficient and ran on a single machine in a short period of time.

Algorithms
In this paper, we present two new algorithms that extend the Sarsa and Expected Sarsa algorithms, which we refer to as Double Sarsa and Double Expected Sarsa. The concept of doubling the algorithms comes from Double Q-learning, where two estimates of the action-value Q(s, a) are decoupled and updated against each other in order to improve the rate of learning in an environment with a stochastic reward system. Although Q-learning and Double Q-learning are off-policy, this concept extends naturally to the on-policy algorithms of Sarsa and Expected Sarsa, producing a variation of each algorithm that is less susceptible to variations in the reward system. In addition, the ideas of Double Sarsa and Double Expected Sarsa can be extended with function approximation of the action-values, in the same way that Q-learning can be extended to Deep Q-Networks through function approximation.

Double Sarsa
The update rule for Double Q-learning is what makes it unique from standard Q-learning. In Q-learning, the action-value is updated according to

Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)],

where s is the initial state, a is the action taken from that state, r is the reward observed from taking action a, and s′ is the next state the agent reaches as a result of (s, a). In Double Q-learning, the update is decoupled using two tables, Q_A and Q_B:

Q_A(s, a) ← Q_A(s, a) + α[r + γ Q_B(s′, a*) − Q_A(s, a)], where a* = argmax_a′ Q_A(s′, a′),

with the symmetric update applied to Q_B. The key idea is the replacement of the maximum action-value, max_a′ Q(s′, a′), with the other table's estimate of the action that the updated table considers best. The update rule for Double Sarsa is very similar to that used for Double Q-learning. However, because it is on-policy, a few modifications are necessary. First, we use an ε-greedy policy that uses the average of the two tables to determine the greedy action:

π(a|s) = 1 − ε + ε/N_a if a = argmax_a′ [Q_A(s, a′) + Q_B(s, a′)]/2, and ε/N_a otherwise,

where π(a|s) is the probability of taking action a from state s, and N_a is the number of actions that can be taken from state s. In general, any policy derived from the average of Q_A and Q_B can be used, such as ε-greedy or softmax [19]. The update rule then becomes

Q_A(s, a) ← Q_A(s, a) + α[r + γ Q_B(s′, a′) − Q_A(s, a)].

Because Sarsa does not take the maximum action-value in its update rule, but does so instead during the computation of the greedy policy, the decoupling of the two tables is weaker. However, Q_A is still updated using the value from Q_B for the state-action pair (s′, a′), which helps to reduce the variation in the action-value.
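The swap-based Double Sarsa update can be sketched in Python as follows. The coin flip, function names, and default parameters are illustrative assumptions; the algorithm itself is the one described above (each table is updated on roughly half of the experiences, using the other table's estimate of (s′, a′)):

```python
import numpy as np

def double_sarsa_update(QA, QB, s, a, r, s_next, a_next,
                        alpha=0.1, gamma=0.95, terminal=False,
                        rng=np.random):
    """With probability 1/2, update QA toward QB's estimate of (s', a');
    otherwise the roles are swapped and QB is updated toward QA's estimate."""
    if rng.random() < 0.5:
        QA, QB = QB, QA  # each table sees roughly half the experiences
    target = r if terminal else r + gamma * QB[s_next, a_next]
    QA[s, a] += alpha * (target - QA[s, a])

def greedy_from_average(QA, QB, s):
    """The greedy action is derived from the average of the two tables."""
    return int(np.argmax((QA[s] + QB[s]) / 2.0))
```

An ε-greedy or softmax behavior policy would then wrap `greedy_from_average`, as any policy derived from the averaged tables is admissible.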

Double Expected Sarsa
Expected Sarsa is a more recently developed algorithm that improves on the on-policy nature of Sarsa. Because Sarsa has an update rule that requires the next action a′, its action-value estimate cannot converge unless the learning rate is reduced (α → 0) or exploration is annealed, as a′ always has a degree of randomness. Expected Sarsa changes this with an update rule that takes the expected action-value instead of the action-value of (s′, a′):

Q(s, a) ← Q(s, a) + α[r + γ Σ_a′ π(a′|s′) Q(s′, a′) − Q(s, a)].

Because the update no longer depends on the next action taken, but instead on the expected action-value, Expected Sarsa can indeed converge; [9] notes that for the case of a greedy policy π, Expected Sarsa is the same as Q-learning. In order to adapt this to Double Expected Sarsa, we change the summation to be over Q_B instead of Q_A:

Q_A(s, a) ← Q_A(s, a) + α[r + γ Σ_a′ π(a′|s′) Q_B(s′, a′) − Q_A(s, a)].

The Double Sarsa algorithm, with tabular representation of the action-values, is shown in Figure 1; lines 10 and 11 swap the references to Q_A and Q_B, meaning each table is updated using half of the experiences. Note that γ = 0 if the next state s′ is terminal, otherwise it is the discount rate. Although Expected Sarsa can be both on-policy and off-policy, here we discuss only the on-policy version, as it often has more utility; in Expected Sarsa, Q(s, a) represents the estimated action-value under the target policy π, which is the same as the behavior policy when the algorithm is on-policy. If the behavior policy and target policy are different (i.e. it is off-policy), it is usually more desirable for the target policy to be greedy, and not stochastic, in which case Expected Sarsa degenerates to Q-learning. The on-policy Double Expected Sarsa algorithm is shown in Figure 2, with lines 8 and 9 being the only differences from Double Sarsa. The two tables are again decoupled, this time in calculating the expected value V_B(s′) under the current policy. Although the action a′ is chosen in line 7, it is not needed until the next iteration (it is shown as such in order to be consistent with the Double Sarsa algorithm in Figure 1).
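The decoupled expected-value computation can be sketched in Python as follows. This is a hypothetical fragment with an ε-greedy policy assumed; names and defaults are illustrative:

```python
import numpy as np

def double_expected_sarsa_update(QA, QB, s, a, r, s_next,
                                 alpha=0.1, gamma=0.95, eps=0.1,
                                 terminal=False, rng=np.random):
    """The expected value at s' is taken over the *other* table, under the
    eps-greedy policy derived from the average of the two tables."""
    if rng.random() < 0.5:
        QA, QB = QB, QA  # swap roles, as in Double Sarsa
    n = QA.shape[1]
    # eps-greedy probabilities computed from the averaged tables
    probs = np.full(n, eps / n)
    probs[int(np.argmax((QA[s_next] + QB[s_next]) / 2.0))] += 1.0 - eps
    V_next = float(np.dot(probs, QB[s_next]))  # expectation over Q_B
    target = r if terminal else r + gamma * V_next
    QA[s, a] += alpha * (target - QA[s, a])
```

Unlike Double Sarsa, the next action never enters the target; it is only needed by the behavior policy when acting.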

Neural Network Approximation of Q(s, a)
Often, it is advantageous to represent the action-value function Q(s, a) with a function approximator instead of a table; when the input is a one-hot encoding for each (s, a) pair, this degenerates to the tabular form discussed above. However, it is often beneficial to introduce non-linearities into the function approximator; one set of functions that do so are known as neural networks. The action-value function can be written more generally to accommodate this change of form:

Q(s, a) ≈ f_a(φ(s); θ),

where φ(s) is a feature vector that represents the state s, θ is a vector that represents the parameters of the network, and f_a is the component of the vector-valued function f that corresponds to action a. It is important to note that this function approximation allows for a continuous state-space, but a discrete action-space; the approximation can be extended further to continuous action-spaces as well, especially in actor-critic algorithms [16], but in this paper we only discuss the former approximation.
In order to update the Sarsa network, we use a target similar to the one used in the tabular form,

y = r + γ Q(s′, a′; θ),

and for Expected Sarsa,

y = r + γ Σ_a′ π(a′|s′) Q(s′, a′; θ).

For the doubled versions, the target for the network with parameters θ_A is computed from the other network: for Deep Double Sarsa, y_A = r + γ Q(s′, a′; θ_B), and the target for Deep Double Expected Sarsa is y_A = r + γ Σ_a′ π(a′|s′) Q(s′, a′; θ_B). As in the tabular algorithms, the policy is derived from the average of the two networks' outputs. Note that γ = 0 if s′ is terminal, otherwise it is the discount rate.
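The four targets can be sketched with a simple linear approximator standing in for the actual networks (a minimal sketch; the paper's experiments used Keras networks, and all names and defaults here are illustrative assumptions):

```python
import numpy as np

def q_values(theta, phi):
    """Linear action-value approximator: theta has one weight row per action."""
    return theta @ phi  # shape: (n_actions,)

def sarsa_target(theta, r, phi_next, a_next, gamma=0.95, terminal=False):
    """Target using the next action actually taken."""
    return r if terminal else r + gamma * q_values(theta, phi_next)[a_next]

def expected_sarsa_target(theta, r, phi_next, probs, gamma=0.95, terminal=False):
    """Target using the expectation over the policy probabilities."""
    return r if terminal else r + gamma * float(np.dot(probs, q_values(theta, phi_next)))

# Doubled versions: the same targets, but evaluated with the *other*
# network's parameters (theta_other plays the role of theta_B).
def double_sarsa_target(theta_other, r, phi_next, a_next, gamma=0.95, terminal=False):
    return sarsa_target(theta_other, r, phi_next, a_next, gamma, terminal)

def td_update(theta, phi, a, target, alpha=0.01):
    """Semi-gradient step moving Q(s, a; theta) toward the TD target."""
    theta[a] += alpha * (target - q_values(theta, phi)[a]) * phi
```

With a one-hot φ(s) this reduces to the tabular updates; a neural network would replace `q_values` and `td_update` with a forward pass and a backpropagation step toward the same targets.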

Results
The experiment used to test the difference between Sarsa, Expected Sarsa, and their respective doubled versions was a simple grid world (see Figure 5) with two terminal states, one with a positive reward of 10 and the other with a negative reward of −10. Additionally, a blocking "wall" was placed between the terminal states. Every time the agent moves a step in the environment, it receives a reward r drawn from a distribution with mean µ and standard deviation σ. The state feature vector was represented by the concatenation of four one-hot encodings of the positions of each of the objects, where A is the agent's starting position, W is the "wall", P is the terminal state with a reward of −10 (the "pit"), and G is the second terminal state with a reward of +10 (the "goal"). A is the only position allowed to change throughout the course of an episode.
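An environment of this kind can be sketched as follows. The grid size, object positions, and the handling of terminal rewards here are illustrative assumptions, not the exact layout of Figure 5:

```python
import numpy as np

class GridWorld:
    """Minimal grid world sketch: a goal (+10), a pit (-10), a blocking wall,
    and a per-step reward drawn from N(mu, sigma). Layout is illustrative."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=4, goal=(0, 3), pit=(1, 3), wall=(1, 2),
                 mu=-1.0, sigma=0.0, rng=None):
        self.size, self.goal, self.pit, self.wall = size, goal, pit, wall
        self.mu, self.sigma = mu, sigma  # step-reward distribution parameters
        self.rng = rng or np.random.default_rng(0)
        self.pos = (3, 0)  # agent start position

    def step(self, a):
        dr, dc = self.ACTIONS[a]
        nr, nc = self.pos[0] + dr, self.pos[1] + dc
        # stay in place if the move hits the wall or leaves the grid
        if (nr, nc) != self.wall and 0 <= nr < self.size and 0 <= nc < self.size:
            self.pos = (nr, nc)
        reward = self.rng.normal(self.mu, self.sigma)  # stochastic step reward
        if self.pos == self.goal:
            return self.pos, reward + 10.0, True
        if self.pos == self.pit:
            return self.pos, reward - 10.0, True
        return self.pos, reward, False
```

Setting σ > 0 reproduces the stochastic-reward condition studied below, with σ = 0 giving the deterministic baseline.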
In this paper, we first compare Sarsa, Expected Sarsa, Double Sarsa, and Double Expected Sarsa in tabular form, where Q(s, a) is represented by a single table entry for each (s, a) pair, varying different parameters of exploration, learning, and rewards. Then, we discuss the extension of these algorithms to Q-Networks and Deep Q-Networks, using neural networks to approximate Q(s, a) in a few scenarios that highlight the advantage of applying double learning to Sarsa and Expected Sarsa.

Tabular Representation of Q(s, a)
A comparison of Sarsa, Expected Sarsa, Double Sarsa, and Double Expected Sarsa under a deterministic reward system can be seen in Figure 6(a), showing the average return over 100,000 episodes. Expected Sarsa and Double Expected Sarsa appear to have almost identical performance, although for small learning rates Expected Sarsa tends to perform marginally better; presumably, this is because the doubled version must train two tables and consequently takes longer to converge than the single version. In the first 1000 episodes under the same reward system, the average return collected by Double Expected Sarsa was about 6.4% less than the reward received by Expected Sarsa (not shown), which supports this hypothesis.
However, unlike the Expected algorithms, there is a clear performance difference between Sarsa and Double Sarsa for a deterministic reward. Like Expected Sarsa, Sarsa performs marginally better than Double Sarsa when the learning rate is small, although this is difficult to see in Figure 6. At larger learning rates, however, Sarsa's action-value estimates vary considerably, especially if α is not annealed over time. Double Sarsa reduces this variation by decoupling the two tables, preventing large changes in Q(s, a), which tends to produce a more stable policy and increase the amount of reward collected. Figure 6(b) shows the same comparison between the four algorithms with a stochastic reward, where µ = −1 and σ = 7. For most of the learning rates tested, the doubled versions of the algorithms performed better than their respective single versions. Unlike the deterministic case, Expected Sarsa does not have a clear advantage over Double Expected Sarsa in the first 1000 episodes (not shown), and as in the deterministic case both exhibit the same trend over 100,000 episodes (Figure 6(b)), although the trend is significantly different. Figure 7 compares, for each algorithm, the learning rate below which returns are positive and above which they are negative, i.e. the learning rate that produces a zero average return. These results indicate that the double estimators employed by Double Sarsa and Double Expected Sarsa allow for faster learning rates under the same stochastic reward conditions. As shown in Figure 7, around a 40% increase in the learning rate can be applied before Double Sarsa and Double Expected Sarsa collect rewards equivalent to Sarsa and Expected Sarsa, respectively. In real-world applications, this can be a significant advantage, allowing greater returns to be collected earlier in the learning process.
A comparison of the path length distributions between the four algorithms in the stochastic case is shown in Figure 8. The path length L is the number of steps that it took the algorithm to reach a terminal state in a given episode. Although all four algorithms reach the negative terminal state in L = 1 steps with approximately equal probability, it is apparent that Double Sarsa and Double Expected Sarsa tend to reach the positive terminal state in fewer steps than Sarsa and Expected Sarsa. This is likely due to the double versions having a more stable policy as a result of having decoupled action-value estimates, preventing large changes in the action-value, as well as the policy.
Also shown in the table is the average computation time for 100,000 episodes, with α = 0.1. As can be seen, the extra computational expense of Double Sarsa, Expected Sarsa, and Double Expected Sarsa is marginal, with all three algorithms taking less than 10% more time than Sarsa. This is in contrast to the increase in the zero-crossing learning rate (the α at which the average return g = 0), which in all cases is significantly greater than that of the original algorithm. This indicates that there is a significant advantage to using the doubled versions of Sarsa and Expected Sarsa when the reward is stochastic.
As shown with the Double Q-learning algorithm, Double Sarsa and Double Expected Sarsa initially tend to have a lower estimate of the action-value than Sarsa and Expected Sarsa, respectively. Figure 9 shows that the doubled algorithms maintain a more stable action-value table, which improves the quality of the policy and consequently increases the total reward. This stability is especially important for on-policy algorithms, as a more stable behavior policy tends to reduce variation in the distribution of states visited by the agent, as well as the actions taken, making them significantly more predictable. Notably, the convergence of the returns occurs much faster than that of the estimated maximum action-value of the initial state.
For comparison, in the experiment shown in Figure 9(a), the average variance σ² of the maximum action-value over all T = 1000 episodes (the variance of max_a Q(s_0, a) at episode t over 1000 runs) was computed. For Sarsa, σ² was 4.51; for Double Sarsa, 2.44; for Expected Sarsa, 4.36; and for Double Expected Sarsa, 2.32. This is a significant reduction in variation, given the small difference in the average return curves shown in Figure 9(b). Figure 10(a) and Figure 10(b) show similar results from an experiment with the same parameters, except that the number of episodes was increased to 100,000 and the number of runs decreased to 100. As can be seen, the average return collected by Double Sarsa and Double Expected Sarsa quickly surpasses that of Sarsa and Expected Sarsa, and the maximum action-value increases accordingly. Once again, this is likely due to the reduction in variation provided by double learning. It is also interesting to note that this is different from what was shown in [7] for Double Q-learning. That study found that the double estimator should, on average, underestimate the single estimator; this is clearly not the case in Figure 9(a). Likely, this is due to the fact that Q-learning is off-policy and takes the max in its update rule, while Sarsa is on-policy and often has a stochastic behavior (and target) policy.
The degree of effectiveness of Double Sarsa and Double Expected Sarsa is highly dependent on the distribution of rewards, as illustrated in Figure 11. The advantage of doubling Sarsa and Expected Sarsa can also be seen in Figure 11 in how quickly exploration can be annealed: it can be reduced more aggressively with Double Sarsa and Double Expected Sarsa than with Sarsa and Expected Sarsa before the returns collapse, meaning a greedy policy can be more quickly achieved and greater returns can be collected; in situations where exploration is highly undesirable (e.g. it is expensive), this can be a significant advantage.

Neural Network Representation of Q(s, a)
In order to test the robustness of each algorithm, we tested each of them with neural network function approximation of Q(s, a). All neural networks were implemented using the Keras library [22], and backpropagation was performed using the RMSProp technique [23]. A comparison of different neural network architectures applied to each algorithm can be seen in Figure 12. The parameter n represents trials from a range of values; typically, n ∈ {5, 10, …, 100}. The returns were averaged over 16 runs in order to reduce natural variations in performance from random initialization of the network parameters θ, and the maximum average return for each architecture was taken over n according to g_max = max_n g_n, where g_n is the average return for a given architecture with parameter n. As can be seen in the figure, a variety of trends are apparent.
First, as the network architecture transitions from shallow to deep, the average return collected generally decreases. For a random policy, the average return g was determined experimentally to be about −16.03, indicating that any network architecture with an average return g > −16.03 has learned a policy better than random, and any network architecture with an average return g > −10 has learned a policy that must reach the positive terminal state at least part of the time.
Although it is apparent that doubling Sarsa and Expected Sarsa generally improves the performance of the algorithms when neural networks are used to approximate the action-value function, the advantage of deep learning over shallow learning is contradicted by our experiments. Presumably, this is because there are comparatively few states in our simple grid world environment; it is likely that, as the size of the grid increases, the benefit of neural network approximation would increase. However, even in this case, the advantage of deep learning over shallow learning might not fully become apparent without increasing the complexity of the environment; deep neural networks might not be beneficial until the environment reaches a certain level of complexity and non-linearity.
Even so, the experiments summarized in Figure 12 show the effect of using function approximation with an on-policy algorithm: in this case, it decreased the average return significantly, something that was not observed with off-policy algorithms.
Likely, this is a product of the increased feedback present in on-policy algorithms: the choice of action a affects the update of θ, which changes the action-values and the policy, and consequently affects the choice of the next action a′. In off-policy algorithms, θ does not affect the behavior policy, meaning that there is a greater degree of stability when training the neural network approximator.

Conclusion
Current on-policy reinforcement learning algorithms are less effective when rewards are stochastic, requiring a reduction in the learning rate in order to maintain a stable policy.
Two new on-policy reinforcement learning algorithms, Double Sarsa and Double Expected Sarsa, were proposed in this paper to address this issue. Similar to what was found with Double Q-learning, Double Sarsa and Double Expected Sarsa were found to be more robust to random rewards. For a constant learning rate α, these algorithms are more stable to large variations in rewards, allowing them to still achieve significant returns when the standard deviation σ is significantly larger than the magnitude of the rewards received in the terminal states. We found that the estimated action-values of Double Sarsa and Double Expected Sarsa were much more stable than those of both Sarsa and Expected Sarsa, which resulted in a better policy. However, unlike Double Q-learning, we showed that the double estimators of the proposed algorithms could overestimate the single estimators of the original algorithms. In addition, we found that, for the same average return, a more aggressive learning rate could be used with the doubled versions of the algorithms.