Double Q-learning has been shown to be effective in reinforcement learning scenarios when the reward system is stochastic. We apply the idea of double learning that this algorithm uses to Sarsa and Expected Sarsa, producing two new algorithms called Double Sarsa and Double Expected Sarsa that are shown to be more robust than their single counterparts when rewards are stochastic. We find that these algorithms add a significant amount of stability to the learning process at only a minor computational cost, which leads to higher returns when using an on-policy algorithm. We then use shallow and deep neural networks to approximate the action-value, and show that Double Sarsa and Double Expected Sarsa are much more stable after convergence and can collect larger rewards than the single versions.

Reinforcement learning is concerned with finding optimal solutions to the class of problems that can be described as agent-environment interactions. The agent explores and takes actions in an environment, which gives the agent a reward, r, for each state, s, that it visits and each action, a, that it takes.

There are multiple approaches to finding the optimal policy, $\pi^*$, which maximizes the accumulated discounted reward per episode,

$$R = \sum_{t=0}^{T} \gamma^{t} r_{t},$$

where $\gamma \in [0,1]$ is the discount rate [Sutton and Barto, 1998].

One of the most popular Temporal Difference algorithms is Q-learning, first proposed in [Watkins, 1989]. Q-learning is an off-policy algorithm: the action-value estimate it learns approximates the optimal action-value function independently of the policy the agent actually follows.

However, in many scenarios an off-policy algorithm is not realistic as it does not account for possible rewards and penalties that might result from an exploratory behavior policy. For example, receiving immediate returns might be more important than a true optimal policy, and while an on-policy algorithm may not converge to the optimal policy, it may still converge in fewer time steps than an off-policy algorithm to a policy which may be considered “sufficient” according to the problem domain. Often, these on-policy algorithms have stochastic policies that encourage exploration of the environment, which can also be a beneficial quality when the environment is subject to change. One such policy is called ε-greedy, in which the agent selects an action uniformly at random with probability ε and otherwise selects the action with the highest estimated action-value.
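As a concrete illustration, the following is a minimal sketch of ε-greedy action selection over a tabular action-value estimate; the array `Q` and the helper name are our own, not taken from the paper.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, else the greedy one.

    Q is a 2-D array of action-value estimates indexed by [state, action].
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit
```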

A simple on-policy algorithm that is similar to Q-learning is called Sarsa [Rummery and Niranjan, 1994]. Rather than bootstrapping from the maximum action-value in the next state, Sarsa bootstraps from the value of the next action actually taken, updating the estimate according to

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\, Q(s',a') - Q(s,a)\right],$$

where $\alpha$ is the learning rate.

The algorithms presented above use a tabular format to store the action-values, i.e., there is a single entry in the table for every $s, a$ pair; as such, they are limited to simple problems where the state-action space is small. For many real problems this is not the case, especially when the state space is continuous; function approximation must be used instead. In application to Temporal Difference algorithms, it is the value functions that are approximated [Sutton and Barto, 1998].

The current state-of-the-art in reinforcement learning combines Temporal Difference methods with deep neural networks. Deep Q-networks have achieved human-level performance on Atari games [Mnih et al., 2015], and related techniques have been applied to the game of Go, which has approximately $10^{170}$ states [Silver et al., 2016].

In this paper, we present two new algorithms that extend from the Sarsa and Expected Sarsa algorithms, which we refer to as Double Sarsa and Double Expected Sarsa. The concept of doubling the algorithms comes from Double Q-learning [van Hasselt, 2010], where two estimates of the action-value, $Q^A$ and $Q^B$, are maintained; one estimate is used to select an action while the other is used to evaluate it, which counteracts the overestimation that a single maximization step introduces under stochastic rewards.

The update rule for Double Q-learning is what makes it unique from standard Q-learning. In Q-learning, the action-value is updated according to

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right],$$

where $s$ is the initial state, $a$ is the action taken from that state, $r$ is the reward observed from taking action $a$, $s'$ is the resulting state, $\alpha$ is the learning rate, and $\gamma$ is the discount rate.

The key idea is the replacement of the maximum action-value, $\max_{a'} Q(s',a')$, with an evaluation by the second table: one table selects the maximizing action and the other evaluates it. On each step, one of the two tables is chosen at random to be updated; when $Q^A$ is updated, the rule is

$$Q^A(s,a) \leftarrow Q^A(s,a) + \alpha\left[r + \gamma\, Q^B\!\left(s', \arg\max_{a'} Q^A(s',a')\right) - Q^A(s,a)\right],$$

and symmetrically for $Q^B$.
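The following is a minimal sketch of this tabular update; the array names and coin-flip convention are ours, but the select-with-one-table, evaluate-with-the-other structure follows Double Q-learning as described above.

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha, gamma,
                    rng=np.random.default_rng()):
    """One Double Q-learning step on tabular estimates QA and QB."""
    if rng.random() < 0.5:
        QA, QB = QB, QA                       # swap roles half the time
    a_star = int(np.argmax(QA[s_next]))       # action selected by one table...
    target = r + gamma * QB[s_next, a_star]   # ...evaluated by the other
    QA[s, a] += alpha * (target - QA[s, a])
```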

The update rule for Double Sarsa is very similar to that used for Double Q-learning. However, because it is on-policy, a few modifications are necessary. First, we use an ε-greedy behavior policy derived from the average of the two tables, $\left[Q^A(s,a) + Q^B(s,a)\right]/2$, to select every action. Then, on each step one table is chosen at random and updated using the other table's estimate of the next state-action pair,

$$Q^A(s,a) \leftarrow Q^A(s,a) + \alpha\left[r + \gamma\, Q^B(s',a') - Q^A(s,a)\right],$$

where $a'$ is the next action chosen by the behavior policy.

Because Sarsa does not take the maximum action-value during the update rule, but does so instead during the computation of the greedy policy, there is a weaker decoupling of the two tables. However, the decoupling is still sufficient to dampen the effect of stochastic rewards, since a single noisy reward can only perturb one of the two estimates at a time.
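Putting the pieces together, here is a sketch of one Double Sarsa step in the tabular setting; the function names are ours, and the behavior policy uses the average of the two tables as described above.

```python
import numpy as np

def behavior_policy(QA, QB, state, epsilon, rng=np.random.default_rng()):
    """epsilon-greedy over the average of the two action-value tables."""
    q_avg = (QA[state] + QB[state]) / 2.0
    if rng.random() < epsilon:
        return int(rng.integers(len(q_avg)))
    return int(np.argmax(q_avg))

def double_sarsa_update(QA, QB, s, a, r, s_next, a_next, alpha, gamma,
                        rng=np.random.default_rng()):
    """One Double Sarsa step: update one table toward the other's estimate."""
    if rng.random() < 0.5:
        QA, QB = QB, QA
    target = r + gamma * QB[s_next, a_next]   # a_next came from behavior_policy
    QA[s, a] += alpha * (target - QA[s, a])
```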

Expected Sarsa is a more recently developed algorithm that improves on the on-policy nature of Sarsa. Because Sarsa has an update rule that requires the next action $a'$ to be sampled before the update can be computed, its target inherits the variance of the stochastic policy. Expected Sarsa removes this source of variance by replacing $Q(s',a')$ with the expectation of the next action-value under the current policy,

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s',a') - Q(s,a)\right].$$

Because the update no longer depends on the next action taken, but instead depends on the expected action-value, Expected Sarsa can indeed converge; van Seijen et al. [2009] prove that it converges under the same conditions as Sarsa, with equal or lower variance in the update target.

Although Expected Sarsa can be both on-policy and off-policy, here we discuss only the on-policy version as it often has more utility; in Expected Sarsa, the expectation in the update is taken with respect to the same ε-greedy policy that the agent uses to select its actions.
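For an ε-greedy policy, this expectation has a simple closed form; below is a sketch of how it can be computed over a tabular estimate (the helper name is ours).

```python
import numpy as np

def expected_value(Q, state, epsilon):
    """E[Q(s', a')] under an epsilon-greedy policy derived from Q."""
    q = Q[state]
    n_actions = len(q)
    # Every action receives probability epsilon / n_actions; the greedy
    # action additionally receives the remaining (1 - epsilon) mass.
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q))] += 1.0 - epsilon
    return float(np.dot(probs, q))
```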

Often, it is advantageous to represent the action-value function with a function approximator rather than a table, particularly when the state space is large or continuous. If $Q(s,a;\theta)$ denotes the approximation produced by a network with parameters $\theta$, the parameters are adjusted by gradient descent on the squared error between the current estimate and a bootstrapped target $y$,

$$\theta \leftarrow \theta + \alpha\left[y - Q(s,a;\theta)\right]\nabla_{\theta} Q(s,a;\theta),$$

where, for Sarsa, the target is

$$y = r + \gamma\, Q(s',a';\theta),$$

and for Expected Sarsa,

$$y = r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s',a';\theta).$$
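As a concrete instance, the following sketches this semi-gradient update for a linear approximator, where the gradient of $Q$ with respect to $\theta$ reduces to the feature vector; the feature map `phi` is a placeholder of our own.

```python
import numpy as np

def semi_gradient_sarsa_step(theta, phi, s, a, r, s_next, a_next, alpha, gamma):
    """Semi-gradient Sarsa for a linear model Q(s, a) = theta . phi(s, a)."""
    q = theta @ phi(s, a)
    target = r + gamma * (theta @ phi(s_next, a_next))
    # For a linear model, grad_theta Q(s, a) is simply phi(s, a).
    theta += alpha * (target - q) * phi(s, a)
    return theta
```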

Deep Double Sarsa and Deep Double Expected Sarsa use two different neural networks that have the same structure; we represent these two networks by their parameters $\theta^A$ and $\theta^B$. On each step, one of the two networks is chosen at random to be updated. When $\theta^A$ is updated, the target for Deep Double Sarsa is

$$y^A = r + \gamma\, Q(s',a';\theta^B),$$

and the target for Deep Double Expected Sarsa is

$$y^A = r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s',a';\theta^B).$$

As in the tabular algorithms, the policy is derived from the average of the two estimates, $\left[Q(s,a;\theta^A) + Q(s,a;\theta^B)\right]/2$.
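A sketch of how the Deep Double Sarsa loss might be assembled with a modern autodiff library (PyTorch here); the function and variable names are our own illustration, not the paper's implementation.

```python
import torch

def deep_double_sarsa_loss(net_a, net_b, s, a, r, s_next, a_next, gamma):
    """Squared-error loss for updating network A toward network B's estimate.

    net_a and net_b each map a batch of states to a row of action-values;
    a and a_next are integer action indices of shape (batch,).
    """
    q = net_a(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta_A)
    with torch.no_grad():                                   # target uses theta_B
        q_next = net_b(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    target = r + gamma * q_next
    return torch.nn.functional.mse_loss(q, target)
```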

The experiment used to test the difference between Sarsa, Expected Sarsa, and their respective doubled versions was a simple grid world in which the agent must navigate from a start state to a positive terminal state while avoiding a negative terminal state. The reward for each transition is drawn from a Gaussian distribution,

$$r \sim \mathcal{N}(\mu, \sigma^2),$$

where $\mu$ is the mean reward associated with the transition and $\sigma$ controls how stochastic the reward signal is. For comparison, we show the difference between the algorithms for rewards with both a deterministic distribution, where $\sigma = 0$, and a stochastic distribution, where $\sigma > 0$.
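Under this reward model, the two regimes differ only in σ; a trivial sketch of the sampling (assuming the Gaussian form given above):

```python
import numpy as np

def sample_reward(mu, sigma, rng=np.random.default_rng()):
    """Deterministic reward when sigma == 0; Gaussian noise otherwise."""
    return mu if sigma == 0 else float(rng.normal(mu, sigma))
```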

In this paper, we first compare Sarsa, Expected Sarsa, Double Sarsa, and Double Expected Sarsa in tabular form, where every state-action pair has its own entry in the action-value table; we then repeat the comparison using shallow and deep neural networks to approximate the action-value function.

A comparison of Sarsa, Expected Sarsa, Double Sarsa, and Double Expected Sarsa under a deterministic reward system can be seen in the corresponding figure. With deterministic rewards, Expected Sarsa and Double Expected Sarsa perform nearly identically.

However, unlike the Expected algorithms, there is a clear performance difference between Sarsa and Double Sarsa for a deterministic reward. Like Expected Sarsa, Sarsa performs marginally better than Double Sarsa when the learning rate is small, although this is difficult to see in the figure. At larger learning rates, Sarsa's action-value estimates can change sharply from step to step because its target depends on the particular next action sampled, especially if α is not annealed over time. Double Sarsa reduces this variation by decoupling the two tables, preventing large changes in the action-value estimates and, consequently, in the policy.

A comparison of the path length distributions between the four algorithms in the stochastic case is shown in the corresponding figure and table; on average, Double Sarsa and Double Expected Sarsa reach the positive terminal state in fewer steps than Sarsa and Expected Sarsa. This is likely due to the double versions having a more stable policy as a result of having decoupled action-value estimates, preventing large changes in the action-value as well as the policy.

Also shown in the table is the average computation time for 100,000 episodes; maintaining and updating two tables adds only a minor computational cost over the single versions.

As shown with the Double Q-learning algorithm, Double Sarsa and Double Expected Sarsa initially tend to have a lower estimate of the action-value than Sarsa and Expected Sarsa, respectively.

This stability is especially important for on-policy algorithms, as a more stable behavior policy tends to reduce variation in the distribution of states visited by the agent, as well as the actions taken, making them significantly more predictable. For comparison, in the experiment described above, the double algorithms exhibited a markedly smaller change between episodes in the action-value over all state-action pairs than the single versions.

The degree of effectiveness of Double Sarsa and Double Expected Sarsa is highly dependent on the distribution of rewards: as the standard deviation σ of the reward distribution increases, so does the advantage of the double algorithms over their single counterparts.

The advantage of doubling Sarsa and Expected Sarsa can also be seen when the action-value function is approximated with a neural network, as described next.

In order to test the robustness of each algorithm, we repeated the experiments using neural network function approximation of the action-value function, with both shallow and deep architectures.
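For illustration, here is a sketch of how shallow and deep action-value networks of the same family might be constructed (PyTorch; the specific widths and depths are our own illustrative choices, not the paper's architectures):

```python
import torch.nn as nn

def make_q_network(n_features, n_actions, hidden_layers):
    """Build an MLP mapping state features to one value per action."""
    layers, width = [], n_features
    for h in hidden_layers:
        layers += [nn.Linear(width, h), nn.ReLU()]
        width = h
    layers.append(nn.Linear(width, n_actions))
    return nn.Sequential(*layers)

shallow_net = make_q_network(n_features=25, n_actions=4, hidden_layers=[64])
deep_net = make_q_network(n_features=25, n_actions=4, hidden_layers=[64, 64, 64])
```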

For the architectures shown, the average increase in return of Double Sarsa (DS) over Sarsa (S), and of Double Expected Sarsa (DES) over Expected Sarsa (ES), was positive, indicating that the benefit of doubling carries over to function approximation.

Although it is apparent that doubling Sarsa and Expected Sarsa generally improves the performance of the algorithms when neural networks are used to approximate the action-value function, our experiments did not show an advantage of deep learning over shallow learning. Presumably, this is because there are comparatively few states in our simple grid world environment; it is likely that, as the size of the grid increases, the benefit of neural network approximation will increase. However, even then, the advantage of deep learning over shallow learning might not become apparent without increasing the complexity of the environment; deep neural networks might not be beneficial until the environment reaches a certain level of complexity and non-linearity.

Even so, the experiments summarized above show that Double Sarsa and Double Expected Sarsa remain much more stable after convergence and collect larger rewards than the single versions when the action-value is approximated with a neural network.

Current on-policy reinforcement learning algorithms are less effective when rewards are stochastic, requiring a reduction in the learning rate in order to maintain a stable policy. Two new on-policy reinforcement learning algorithms, Double Sarsa and Double Expected Sarsa, were proposed in this paper to address this issue. Similar to what was found with Double Q-learning, Double Sarsa and Double Expected Sarsa were found to be more robust to random rewards. For a constant learning rate α, they maintained a more stable policy and collected larger returns than Sarsa and Expected Sarsa, at only a minor computational cost.

We would like to thank the Summer Research Institute at Houghton College for providing financial support for this study.
