^{1}

^{*}

^{1}

^{*}

^{1}

^{*}

Q-learning is a popular temporal-difference reinforcement learning algorithm which often explicitly stores state values using lookup tables. This implementation has been proven to converge to the optimal solution, but it is often beneficial to use a function-approximation system, such as deep neural networks, to estimate state values. It has been previously observed that Q-learning can be unstable when using value function approximation or when operating in a stochastic environment. This instability can adversely affect the algorithm’s ability to maximize its returns. In this paper, we present a new algorithm called Multi Q-learning to attempt to overcome the instability seen in Q-learning. We test our algorithm on a 4 × 4 grid-world with different stochastic reward functions using various deep neural networks and convolutional networks. Our results show that in most cases, Multi Q-learning outperforms Q-learning, achieving average returns up to 2.5 times higher than Q-learning and having a standard deviation of state values as low as 0.58.

Reinforcement learning is a form of machine learning in which an agent attempts to learn a policy that maximizes a numeric reward signal [

Q-learning is one of the most popular TD algorithms [

where r is the observed reward after performing a in s,

where t is the current time-step during training. Thus, the update for the Q-value of the state-action pair

where

Q-learning is a critical algorithm in reinforcement learning and has been successfully applied to a large number of tasks, but it has also been observed that the algorithm will sometimes overestimate the optimal Q-values. This overestimation occurs when the reward function of the MDP is stochastic. Double Q-learning is a variation to Q- learning that addresses this problem [

The roles of

Often times, the Q-values of Q-learning and Double Q-learning are stored in a lookup table. It has been proven that Q-learning will converge to the optimal value when the Q-values are stored in this manner [

Artificial neural networks are one effective method of function approximation. These neural networks are mathematical models made up of parameters that are tuned using back-propagation. Deep learning is a variety of artificial neural networks and has seen great success in learning from high-dimensional data, specifically image recognition [

AlphaGo is a computer program developed by Deep Mind that utilized deep learning along with a Monte Carlo search tree to play Go [

Deep reinforcement learning has also been successfully applied to continuous control problems. By combining deep neural networks with the actor-critic algorithm [

As mention earlier, the instability of value estimates is one of the biggest challenges to overcome when using linear/nonlinear value function approximation. To combat this instability issue, algorithms that use neural networks as their Q functions often train using

Instead of using any of the previously mentioned techniques, we attempt to achieve robust value estimates by introducing a new TD algorithm called Multi Q-learning. Our algorithm outperforms Q-learning and Double Q-learning in our tests and provides more stable Q-values throughout the training process. Multi Q-learning also works well with a diverse range of neural network implementation, making the algorithms beneficial when the ideal network structure is unknown. We find the Multi Q-learning algorithm to be an effective alternative to Q-learning due to its robustness and stability.

In this paper we discuss a new TD algorithm that extends upon Q-learning and Double Q-learning called Multi Q-learning. The main idea behind this new algorithm comes from Double Q-learning’s use of two estimators to find the maximum expected value for each action. Instead of simply using two Q functions: Q^{A} and Q^{B}, Multi Q-learning uses an arbitrary number of Q functions to estimate the action values. The algorithm updates one Q function at each time-step using the average value of all the other Q functions. So when Multi Q-learning contains a list of n Q-functions, and ^{A} becomes

The purpose of using multiple Q functions is to better stabilize the target value. Double Q-learning uncouples the selection and evaluation of an action, creating an unbiased estimate for the value of that action. But any drastic change in one Q function will greatly affect the other. Multi Q-learning addresses this problem by using the average action-value of many Q functions to update a specific Q function. If there is a drastic change in an action-value of one Q function, its effect on the other Q functions will be minimal. This helps when a substantial amount of noise exists in the environment, specifically in the environment’s reward function. Our Multi Q-learning algorithm is presented in Algorithm 1 below.

Algorithm 1. Multi Q-learning.

The algorithm for Multi Q-learning is quite similar to Double Q-learning. The key difference is the use of n Q functions opposed to only two. At each step in a training

episode, we choose a single Q function to update based on the probability

each Q function is chosen to be updated with an equal probability. After the training is finished, the Q-value for each state-action pair becomes the average Q-value of all the Q functions.

Similarly to Q-learning and Double Q-learning’s extension to DQN and Double DQN [

In the original Double DQN algorithm, the weights of a target network

For the Deep Multi Q-learning algorithm, we created an independent neural network for each Q. This fully decouples the selection and evaluation of an action, unlike Double DQN which only was partially decoupled. The target used by the Deep Multi Q-learning algorithm for a network characterized by weights

For our Deep Multi Q-learning algorithm, show in Algorithm 2 below:

Algorithm 2. Deep Multi Q-learning.

In this section, we analyze the performance of Q-learning, Double Q-learning, and Multi Q-learning. The metrics we focus on are the value estimates, the average returns, and the success rates. Our results show the robustness of Multi Q-learning when faced with stochastic rewards and that this robustness increases the performance of the algorithm. We also show the instability of Q-learning and Double Q-learning and how that instability can negatively affect the overall performance of the algorithm.

A 4 × 4 grid world (

The grid world was initialized to the same state at the start of every game. A training episode ends when the agent reaches a terminal state, whether it is the goal or the pit. The algorithms trained for 100,000 episodes and at the end of each episode, we ran the learned policy on the task and recorded the total return and the number of steps taken. If the agent fell into the pit, or took more than 10 steps, a 0 was record for the number of steps taken. The results were averaged over every 100 episodes in order to smooth the graphs.

In the following section, we evaluate our algorithm using a neural network with two hidden layers for the value function. One-hot encoding was used to represent the state as a 64 vector where the position of each object corresponded to a 16 element slice of

that vector. The vector representation of the state was used as the input to our neural network. Both hidden layers were fully connected and consisted of 150 rectifier units. The output layer was a fully connected linear layer with an output for each move. We used the RMSProp gradient descent algorithm [

duced as the training progressed (

When the standard deviation of the reward function was increased, the deviation in the value estimate increased for each algorithm. When using an -greedy behavior policy, Q-learning and Double Q-learning eventually converged to the true value, but with more oscillation compared to Multi Q, especially within the early stages of learning. When actions were completely exploratory, the variance in the value estimates of Q- learning and Double Q-learning greatly increased which caused both algorithms to struggle in approximating the true value. Multi Q-learning’s estimates steadily converged toward the true value and eventually stabilized even when faced with a greater range of rewards (

It is interesting to compare the graphs of Multi Q-learning with those of Double Q-learning and Q-learning. The graphs in

The oscillation in the value estimates had quite an effect on Q-learning and Double Q-learning’s performance.

There was no noticeable difference between Double Q and Multi Q while Q-learning clearly performed the worst. Multi Q-learning performed noticeably better when a random action was always taken. Q-learning had an average return of 1.1, Double Q’s average return was 2.4, and all the Multi Q algorithms had average returns above 3.5 when the behavior policy always choose a random action. This demonstrates Multi Q’s stability regardless its behavior policy.

The high returns are due to Multi Q’s ability to consistently reach the goal throughout its training. The success rates shown in

It is also interesting to note the smoothness of the return curve as the number of estimators increase. Using 4 or more estimators creates a much smoother curve than using 1, 2, or even 3. This smoothness can again be attributed to the stability of Multi Q’s value function approximation. The algorithm steadily converges to the optimal value estimate and has very little fluctuation.

In addition to the neural networks used above, we used three other network architectures to approximate each algorithm’s value function. We demonstrate the benefit of adding convolutional layers to improve value approximation. The first architecture we used had two fully connected hidden layers, both of size ten. The next network replaced the first hidden layer with a convolutional layer. This convolutional layer had ten filters

Algorithm | ||||||||
---|---|---|---|---|---|---|---|---|

1 Q | 2 Q | 3 Q | 4 Q | 5 Q | 6 Q | 7 Q | 8 Q | |

82.4% | 93.3% | 94.6% | 95.4% | 95.3% | 93.2% | 96.3% | 94.6% | |

9 | 75.2% | 88.3% | 94.9% | 94.8% | 92.7% | 93.2% | 92.7% | 92.6% |

11 | 70.7% | 85.9% | 92.3% | 90.1% | 89.1% | 92.5% | 90.0% | 91.4% |

13 | 68.4% | 81.3% | 87.6% | 93.4% | 92.6% | 93.1% | 89.8% | 91.0% |

15 | 57.9% | 81.7% | 85.6% | 87.3% | 87.3% | 90.9% | 87.5% | 84.7% |

17 | 65.8% | 77.9% | 87.0% | 83.7% | 85.6% | 87.5% | 85.8% | 81.2% |

19 | 60.8% | 75.8% | 88.1% | 86.0% | 84.0% | 80.5% | 80.1% | 88.0% |

all with a length of five. The addition of a convolutional layer slightly improved upon the network’s value estimates, leading to higher average returns. The final neural network had three hidden layers. These hidden layers were all fully connected and had a size of ten. Having three hidden layers did not improve upon any of the algorithms’ average returns. In fact, the addition of another layer actually performed worse than having only two hidden layers.

The average returns of the algorithms demonstrate an important advantage Multi Q-learning has over both Double Q-learning and Q-learning. The average returns of both Double Q-learning and Q-learning decreased steadily with the different neural network structures used. In our grid world environment, these two algorithms performed the best with two fully connected layers. Using this structure, Q-learning had an average return of 0.59 and Double Q-learning had an average return of 3.19. The algorithms’ performances dropped when a network structure of one convolutional layer and one fully connected layer was used. In this case, Q-learning had an average return of 0.52 and Double Q-learning had an average return of 2.79. The algorithms perform the worst when the network structure was made up of three fully connected layers. With this structure, Q-learning’s average return was 0.02 and Double Q-learning’s average return was 1.86. The decrease in average returns of Q-learning and Double Q-learning is in contrast Multi Q-learning. The Multi Q-learning algorithms did not see any major decrease in average returns when faced with different neural networks. This is a clear advantage of Multi Q-learning compared to Double Q-learning and Q- learning. The Multi Q algorithm is not as easily effected by the type of network structure or function approximator used to estimate the value function.

To further demonstrate the robustness of the Multi Q-learning algorithm, we show the average return of the algorithms with neural networks of increasing depth.

shows the average returns of each algorithm as the number of hidden layers in the network increased. The behavior policy during these tests was exclusively random, so the agent always took a random action throughout training. The performance of both Q-learning and Double Q-learning declined as the number of layers in the network increased. Contrarily, the performance of 6 Q, 7 Q and 8 Q remained relatively stable regardless of the number of layers. The returns for 6 Q, 7 Q, and 8 Q began to decrease once the network reached a depth of 5 hidden layers. This poor performance can likely be attributed to the vanishing gradient problem with neural networks. This again shows the stability and robustness of the Multi Q-learning algorithm; despite the depth of the network, Multi Q-learning is still able to solve the problem and achieve high returns.

We also ran the Multi Q-learning algorithm using Q-tables in addition to neural networks. The results of using Q-tables were less drastic then those found with the neural networks, but we did find a slight advantage of Multi Q-learning. In the early stages of training, Multi Q-learning appeared to have developed better policies than Q-learning and Double Q-learning.

The paper presents and evaluates a new temporal-difference learning algorithm, Multi Q-learning. Multi Q-learning attempts to increase value estimate stability by using multiple action-value function approximations. We tested our algorithm on a grid world environment and have shown the stability of Multi Q-learning compared to

Q-learning. In our tests, Multi Q-learning achieved better average returns compared to Q-learning, reaching returns up to 2.6 times higher than Q-learning. We also ran Multi Q-learning on various neural network structures, including deep, shallow, and convolutional networks, demonstrating the algorithms ability to remain effective despite its implementation. While Double Q-learning and Q-learning performance decreased with ineffective implementation, Multi Q-learning was still able to retain good returns. We see Multi Q-learning being most useful when the dynamics of a reinforcement learning task are not well known. Multi Q-learning’s improved stability over Q-learning makes it a more generalize algorithm, being able to solve a variety of tasks without the need to tune its parameters.

We would like to thank the Summer Research Institute at Houghton College for providing financial support for this study.

Duryea, E., Ganger, M. and Hu, W. (2016) Exploring Deep Reinforcement Learning with Multi Q-Learning. Intelligent Control and Automation, 7, 129- 144. http://dx.doi.org/10.4236/ica.2016.74012