Quantum Multiple Q-Learning

In this paper, a collection of value-based quantum reinforcement learning algorithms is introduced and their parameters are explored. The algorithms use Grover's algorithm to update the policy, which is stored as a superposition of qubits associated with each possible action. They may be grouped into two classes: one which uses value functions, V(s), and a new class which uses action value functions, Q(s, a). The new Q(s, a)-based quantum algorithms are found to converge faster than the V(s)-based algorithms, and in general the quantum algorithms are found to converge in fewer iterations than their classical counterparts, netting larger returns during training. This is due to the fact that the Q(s, a) algorithms are more precise than those based on V(s), meaning that updates are incorporated into the value function more efficiently. This effect is enhanced by the observation that the Q(s, a)-based algorithms may be trained with higher learning rates. These algorithms are then extended by adding multiple value functions, which are observed to allow larger learning rates and to improve convergence in environments with stochastic rewards; the latter is further improved by the probabilistic nature of the quantum algorithms. Finally, the quantum algorithms were found to use less CPU time than their classical counterparts overall, meaning that their benefits may be realized even without a full quantum computer.

Reinforcement learning algorithms are a subset of machine learning algorithms which find an optimal sequence of actions to achieve a goal; unlike supervised learning algorithms, reinforcement learning algorithms solve an implicit problem. Because of this, these algorithms may be applied to a wide range of problem domains, from robotics [2] to buying and selling stocks [3].
The main goal of reinforcement learning is to maximize a signal, known as the reward, over a sequence of time steps, known as an episode, by finding a policy which describes what action to take from each state. The combination of the reward signal with the environment with which the agent interacts implicitly defines an optimal policy; the goal of all reinforcement learning algorithms is to find this policy or a good approximation of it.
While many algorithms for reinforcement learning work well in environments with a reasonable number of states, they become ineffective in large state spaces (such as a continuous state space). To address this, algorithms must use function approximation and train with a relatively small number of experiences. One recent success was the application of reinforcement learning to classic Atari games [4], which used a neural network approximation to Q-learning, a well-known algorithm. This success was repeated a few years later using double Q-networks [5], with even better results. Another recent demonstration of the power of reinforcement learning was its application to the game of Go. AlphaGo, a reinforcement learning algorithm which combines deep neural networks and tree search to learn the game of Go [6], was able to defeat the European champion of Go in multiple rounds. This represents a significant milestone in the field of reinforcement learning because there are about $2 \times 10^{170}$ unique, legal board configurations [7], rendering exact reinforcement learning methods intractable.

Value-Based Reinforcement Learning Algorithms
Reinforcement learning algorithms consider the problem of finding an optimal policy in a Markov Decision Process (MDP) with respect to a reward signal; the nature of this is discussed in detail by Sutton and Barto [1]. While algorithms exist which directly search the policy space for the optimal policy (Williams' REINFORCE [8], for instance), value-based algorithms search instead for a value function satisfying the Bellman equation using dynamic programming. This optimality equation can be expressed in two forms, one for the value function V(s) and one for the action value function Q(s, a), where r(·) is the reward obtained from the transition.

In Double Q-learning, two estimates of Q(s, a) are maintained, and at each time step one estimate is updated using the other, with the index i of the updated estimate chosen uniformly at random. In Multiple Q-learning, N estimates of Q(s, a) are maintained, and at each time step a single estimate is updated using the average of all other N − 1 estimates, where i is chosen uniformly over [1, N].
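The Multiple Q-learning step described above can be sketched in a few lines of Python. This is a minimal tabular illustration under stated assumptions, not the authors' implementation: the function name, the dictionary-based tables, the default values of `alpha` and `gamma`, and in particular the choice to select the greedy next action with estimate i and evaluate it with the mean of the other N − 1 estimates are all assumptions made for illustration. Terminal-state handling is omitted.

```python
import random

def multiple_q_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.9, rng=random):
    """Update one randomly chosen estimate using the mean of the others.

    Q is a list of N >= 2 dicts mapping (state, action) -> value.
    Returns the index of the estimate that was updated.
    """
    n = len(Q)
    i = rng.randrange(n)  # estimate to update, uniform over the N estimates
    # Assumption: the greedy next action is selected with estimate i ...
    a_star = max(actions, key=lambda b: Q[i].get((s_next, b), 0.0))
    # ... and evaluated with the average of the other N - 1 estimates.
    others = [Q[j].get((s_next, a_star), 0.0) for j in range(n) if j != i]
    target = r + gamma * sum(others) / (n - 1)
    old = Q[i].get((s, a), 0.0)
    Q[i][(s, a)] = old + alpha * (target - old)
    return i
```

With N = 2 this reduces to a doubled update in which each estimate is corrected using the other one.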

Introduction
Recently, there has been increased interest in developing quantum computing algorithms. In the quantum computing paradigm [12], which differs significantly from classical computing, an algorithm can simultaneously process a large number of inputs, expressed as superimposed quantum states, through entanglement and interference. Current quantum computers are relatively small, such as those produced by IBM Q [13] with 16 or 17 qubits, but the size of these computers is expected to increase over time as the technology matures. Potentially, these future quantum computers would be able to solve certain problems which are intractable for classical computers.
Two important algorithms which are the building blocks of more complicated algorithms are Shor's algorithm [14] and Grover's algorithm [15]. Shor's algorithm finds the prime factors of large integers in polynomial time and is an important algorithm in cryptography. However, in this paper we only consider the application of Grover's algorithm to reinforcement learning.
Quantum computing has shown promise in the field of machine learning [16], most importantly offering a reduction in computational complexity when compared to classical algorithms. Quantum versions have been proposed of principal component analysis (PCA), support vector machines (SVM), neural networks [17] [18], and Boltzmann machines.

Grover's Algorithm
Grover's algorithm [15] is a well-known search algorithm in quantum computing that can find an item in an unsorted collection of N items in $O(\sqrt{N})$ steps rather than $O(N)$ (which is the run time of classical algorithms). The basic concept of Grover's algorithm is to increase the probability that a given quantum-mechanical system, when measured, will yield the correct answer, which is determined by an oracle function. While originally proposed as a search algorithm, Grover's algorithm may be used more generally as a method to increase the probability of measuring any state, making it a useful process that may be incorporated in other algorithms.
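The amplitude amplification at the heart of Grover's algorithm can be simulated classically for small N. The sketch below is an illustration only: it tracks real amplitudes in a plain Python list, and the function name and the choice N = 64 are assumptions made for the example. It shows the probability of measuring the marked item rising over roughly (π/4)√N iterations and falling again if the iteration is continued.

```python
import math

def grover_success_probability(n_items, marked, iterations):
    """Simulate Grover's algorithm on real amplitudes; return P(marked)."""
    amp = [1.0 / math.sqrt(n_items)] * n_items  # uniform superposition
    for _ in range(iterations):
        amp[marked] = -amp[marked]              # oracle: flip the marked amplitude
        mean = sum(amp) / n_items
        amp = [2.0 * mean - x for x in amp]     # diffusion: inversion about the mean
    return amp[marked] ** 2

n = 64
best = math.floor(math.pi / 4 * math.sqrt(n))   # about (pi/4) * sqrt(N) iterations
for k in (0, 1, best, 2 * best):
    print(k, round(grover_success_probability(n, 5, k), 4))
```

Running the loop past `best` iterations lowers the success probability again, which is why the number of Grover iterations has to be bounded.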

Quantum Reinforcement Learning
With the increase in interest in quantum computing has come interest in applying it to reinforcement learning. As the application of reinforcement learning to real-world problems generally requires a very large state space, the hope is that the application of quantum computing would significantly reduce the amount of time for the algorithm to reach convergence. A generalized framework for quantum reinforcement learning is described in detail by Cárdenas-López, et al. [19]. In this framework, the goal is to maximize the overlap between quantum states stored in registers and the environment through a rewarding system. Other efforts have been directed toward evaluating the adaptability of quantum reinforcement learning agents, as one key component of a reinforcement learning algorithm is its ability to adapt to a changing environment [20] [21]. Initial evidence indicates that quantum computing can improve the agent's decision making in a changing environment.
Other work in quantum reinforcement learning has focused on adapting more traditional reinforcement learning algorithms to quantum computing.
Using the free energy of a restricted Boltzmann machine (RBM) to approximate the Q-function was first proposed by Sallans and Hinton [22]. Their method was later extended to a general Boltzmann machine (GBM) [23] [24], which was shown to provide a drastic improvement over the original method in the early stages of learning.
A specific algorithm known as Quantum Reinforcement Learning (VQRL) [25] utilizes quantum computing to update the policy of a value-based reinforcement learning agent. The basic idea is to store the policy as a superposition of actions so that quantum computing algorithms may be applied to gain the quantum speedup. The policy that the agent follows in state s is represented by $|a_s\rangle$, where a is an action taken from state s, $|a\rangle$ is its associated eigenstate, and $|a_s\rangle$ is the state of a quantum system that represents the policy for state s. The states of these systems are updated according to the rule

$$|a_s\rangle \leftarrow \hat{U}_g^{L} |a_s\rangle,$$

where $\hat{U}_g$ is a unitary operator that represents one Grover iteration and L is the number of iterations applied. This may be expressed as a combination of a reflection and a diffusion, $\hat{U}_a$ and $\hat{U}_{a_s}$ [26], respectively, given by

$$\hat{U}_g = \hat{U}_{a_s} \hat{U}_a.$$

The effect of $\hat{U}_a$ is to invert the amplitude of $|a\rangle$, and the effect of $\hat{U}_{a_s}$ is to invert all the amplitudes of the state about their mean. When applied to an equiprobable state, $\hat{U}_g$ increases the probability of obtaining a from a measurement of the state. However, $\hat{U}_g$ may not be applied indefinitely; after a certain number of iterations, Grover's algorithm tends to decrease the probability of measuring a. The maximum number of iterations that may be applied before the probability decreases, $L_{\max}$, depends on the number of possible actions through an angle θ determined by $|a\rangle$ and $|a_s\rangle$.
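The two reflections that make up one Grover iteration on a policy can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the VQRL implementation: amplitudes are kept real, the reference state for the diffusion step is taken to be the policy as it stood before the update, and the function names and the four-action example are hypothetical.

```python
def reflect_action(state, a):
    """U_a: invert the amplitude of action a."""
    return [-x if i == a else x for i, x in enumerate(state)]

def reflect_about(state, ref):
    """U_{a_s}: reflect `state` about the normalized reference state `ref`.

    For a uniform `ref` this is inversion of the amplitudes about their mean.
    """
    overlap = sum(r * x for r, x in zip(ref, state))
    return [2.0 * overlap * r - x for r, x in zip(ref, state)]

def grover_update(policy, a, L):
    """Apply U_g = U_{a_s} U_a to the policy L times (U_a acts first)."""
    ref = list(policy)    # assumption: reflect about the pre-update policy
    state = list(policy)
    for _ in range(L):
        state = reflect_about(reflect_action(state, a), ref)
    return state

policy = [0.5, 0.5, 0.5, 0.5]   # equiprobable over four actions
for L in range(4):
    probs = [x * x for x in grover_update(policy, 3, L)]
    print(L, [round(p, 3) for p in probs])
```

For four equiprobable actions a single iteration already concentrates all of the probability on the chosen action, and a second iteration returns it to 1/4, which is the overshoot that the bound on the number of iterations guards against.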

Quantum Q-Learning and Multiple V(s) or Q(s, a) Functions
In this paper, a collection of new quantum reinforcement learning algorithms is introduced, based on Quantum Reinforcement Learning (VQRL), which was first described by Dong, et al. [25]. These quantum algorithms store the policy as a superposition of qubits, and use Grover's algorithm to update the probability amplitudes corresponding to different actions in a given state. The novelty of these quantum algorithms comes from replacing the value function V(s) with the action value function Q(s, a) in the quantum reinforcement learning algorithm VQRL. This new algorithm is called Quantum Q-learning (QQRL). The advantage of Q(s, a) is that it is more precise than V(s), which is an advantage during training because it makes more efficient use of the updates applied after training steps. This is easily contrasted with the V(s) functions, which can suffer from "contradicting" updates, that is, the averaging of updates resulting from taking different actions from the same state with widely varying rewards. In experiments, QQRL was found to converge faster than VQRL, which is likely due to this additional precision. Both VQRL and QQRL exhibit much faster convergence than their classical counterpart, Q-learning, which is likely due to the balance of exploration and exploitation provided by the quantum nature of the policy.
Another alteration made to both VQRL and QQRL is to increase the number of V(s) and Q(s, a) functions; this alteration is inspired by the Multiple Q-learning algorithm described by Duryea, et al. [11]. These algorithms are known as Multiple Quantum Reinforcement Learning (MVQRL) and Multiple Quantum Q-learning (MQQRL). In general, it was found that increasing the number of V(s) or Q(s, a) functions increased the performance of the algorithms in a stochastic environment, where the reward is sampled from a distribution of values instead of being deterministic.
The structure of this paper is as follows. Section 2 introduces the quantum reinforcement learning algorithms discussed in this paper. Section 3 first presents the test environment used on the algorithms, and then shows the results of testing the new algorithms in the environment. Finally, Section 4 concludes by interpreting the results and discussing their implications.

Algorithms
In DVQRL, one of the two value functions is chosen at random for each experience and its update is applied. The expected value of each state is then the average of the two value functions. This is used to compute L, which is the number of times to apply Grover's operator to the policy, $|a_s\rangle$. First, an angle θ is computed.
Then, after solving for θ, L may be computed, where L is the number of times to apply Grover's operator and k is a parameter which controls the rate at which the policy is updated. The Grover operator $\hat{U}_g$ may then be applied L times to the current policy. The algorithm for DVQRL can be seen in Algorithm 1.

Multiple Quantum Reinforcement Learning (MVQRL)
Multiple Quantum Reinforcement Learning (MVQRL) is similar to Double Quantum Reinforcement Learning, but allows for N value functions instead of only 2. Similar to Multiple Q-learning, the effect of increasing the number of functions is an improvement in learning in stochastic environments. The estimated value of each state is stored in the N value functions, which are updated at each step. The algorithm for Multiple VQRL may be seen in Algorithm 2.

Quantum Q-Learning (QQRL)
The idea of Quantum Reinforcement Learning may also be adapted to utilize an action value function, where Q(s, a) is the expected value of taking action a from state s, and V(s′) is the expected value of the agent being in state s′ under the current policy. Like VQRL, in QQRL the policy $|a_s\rangle$ is updated by applying the Grover operator L times. The full algorithm describing QQRL is shown in Algorithm 3.
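The value update in QQRL can be sketched as follows. The exact update equation did not survive in the text, so this sketch rests on two labeled assumptions: that V(s′) is the policy-weighted mean of Q(s′, a), with action probabilities given by the squared (real) amplitudes of the stored policy, and that Q(s, a) is moved toward r + γV(s′) by a standard temporal-difference step. The function names, the table layout, and the value of `gamma` are hypothetical.

```python
def expected_state_value(q_row, amplitudes):
    """V(s') as the policy-weighted mean of Q(s', a), with P(a) = amplitude^2."""
    return sum((amp * amp) * q for amp, q in zip(amplitudes, q_row))

def qqrl_value_update(Q, policy, s, a, r, s_next,
                      alpha=0.05, gamma=0.9, terminal=False):
    """One temporal-difference step on Q(s, a) using V(s') under the policy.

    Q maps state -> list of action values; policy maps state -> amplitudes.
    """
    # Assumption: terminal states contribute no future value.
    v_next = 0.0 if terminal else expected_state_value(Q[s_next], policy[s_next])
    Q[s][a] += alpha * (r + gamma * v_next - Q[s][a])
    return Q[s][a]
```

Because V(s′) averages over the whole policy rather than taking a maximum, the target follows the agent's current stochastic behavior in state s′.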

Double Quantum Q-Learning (DQQRL)
The number of Grover iterations L to be performed on the policy is then computed, and the policy is updated in the same way as in QQRL.

Multiple Quantum Q-Learning (MQQRL)
QQRL may also be extended to have any number of action value functions; this is done in a similar way to Multiple Q-learning and MVQRL. The algorithm for Multiple Quantum Q-learning (MQQRL) may be seen in Algorithm 5. At each time step, a single function is chosen to be updated; this is done by sampling an index uniformly at random. The number of Grover iterations, L, is then computed, and the probability amplitudes of the policy are updated in the same way as in QQRL.

Test Environment and Optimal Paths
The grid environment used to test the algorithms can be seen in Figure 1. At each time step, the agent may move between adjacent states through the actions up, down, left, and right. The two optimal paths through the environment are shown in Figure 2. This is the environment used by Brown [26], and is similar to grid environments used in recent studies of quantum computing-based reinforcement learning [23] [24]. While the state and action spaces of these environments are small, they simplify the process of analysis while the field is still in early stages of development. Furthermore, these small state and action spaces lend themselves to implementation on current quantum computing hardware; for example, Sriarunothai, et al. [20] were able to implement an ion-trap reinforcement learning agent with only 2 qubits as a proof of concept.
In order to denote paths through the environment, an action sequence is used. This is a string of action numbers; the mapping from number to action may be seen in Table 1. For example, a path might be denoted by "31210"; this represents the movements right, down, left, down, and up, in sequence. Note that an action may not change the state; if the agent is in its initial position and follows the path "000", it will remain in the same position.
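The action-sequence notation can be decoded mechanically. The mapping below (0 = up, 1 = down, 2 = left, 3 = right) is inferred from the "31210" example in the text rather than copied from Table 1, and the function name is hypothetical.

```python
# Mapping inferred from the example "31210" = right, down, left, down, up.
ACTIONS = {"0": "up", "1": "down", "2": "left", "3": "right"}

def decode_path(sequence):
    """Translate a string of action numbers into a list of movements."""
    return [ACTIONS[c] for c in sequence]

print(decode_path("31210"))
```

Under the same mapping, the path 33111 reads as two moves right followed by three moves down.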
The "pit" and the "goal" are terminal states; that is, when the agent enters these states, the episode is finished. When the agent enters the "pit", it receives a reward of r = −10, and when it enters the "goal", it receives a reward of r = +10.
Figure 2. Optimal and sub-optimal paths through the environment. The optimal path (left) corresponds to an action sequence of 33111, while the sub-optimal path (right) corresponds to an action sequence of 31311. Although both paths have the same number of steps, the sub-optimal path is closer to the pit; for stochastic policies, this means that there is an increased probability that the agent will take an action that moves into this state.

Due to the stochastic nature of the quantum algorithms, the results for convergence were averaged over multiple runs for each experiment. The number of repeated runs was different for each experiment; these are given in the text and figure captions. In each repeated run, the only difference between the algorithms was the seed value for random number generation. In other words, the position of each feature was the same as in Figure 1 in every run.

Convergence Properties
One of the most important differences between the single value function algorithms (Q-learning, VQRL, and QQRL) and the corresponding multiple value function algorithms (Multiple Q-learning, MVQRL, and MQQRL) is an increase in stability with the addition of value functions. This is most clearly seen in Figure 3, which plots the breakdown learning rate against the number of value functions for each algorithm. As the number of functions increases, all three algorithms become more robust, allowing higher learning rates to be used. One reason for this is that each function is constructed using only a fraction of the overall experiences. Combining these functions then results in an estimate of the value or the action value which is less sensitive than when only a single function is used.
The breakdown learning rate was defined in the same manner as above for the single value function algorithms; in other words, the value of α past which N > 1000. This was computed for Multiple Q-learning, MVQRL, and MQQRL and is shown in Figure 3. From this, it can be seen that for each of the three algorithms, increasing the number of value functions (either V(s) or Q(s, a)) increases the learning rate at which the algorithms break down; in other words, additional value functions increase the stability of the policy.
Another feature of Figure 3 to note is that MQQRL consistently has a higher breakdown learning rate than Multiple Q-learning and MVQRL. This indicates that, for the same N, MQQRL is more robust to a higher learning rate. One advantage of this is that a higher learning rate may be set with MQQRL than with MVQRL, which generally increases the speed of learning. Furthermore, both Multiple Q-learning and MQQRL have higher breakdown learning rates for N > 1, which indicates that Q(s, a) is more robust against higher learning rates than V(s).

Similar to the learning rate, the parameter k also has a significant effect on the convergence of the quantum algorithms; effectively, it determines how quickly the policy changes. A larger k results in a faster change in the policy, which generally means that the learning rate of the algorithm increases. This may be seen in Figure 4.
The relationship between k and the speed of convergence of VQRL and QQRL is interesting because, above a certain threshold, the number of episodes until convergence remains steady at around 12.5.This indicates that the algorithms are not sensitive to the particular value of k, given that it is sufficiently high, which suggests that a reasonable strategy for the selection of a particular k value may be to increase the value until the number of iterations to convergence does not change anymore.
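The selection strategy suggested above (increase k until the number of iterations to convergence stops changing) can be written as a simple search loop. Everything in this sketch is hypothetical: `episodes_to_convergence` stands in for whatever routine trains the agent at a given k and reports the averaged number of episodes, and the doubling schedule, tolerance, and upper limit are illustrative choices, not values from the experiments.

```python
def select_k(episodes_to_convergence, k_start=1.0, factor=2.0,
             tol=0.5, k_max=1e6):
    """Increase k geometrically until convergence time stops changing.

    Returns the smallest tried k whose successor changes the measured
    convergence time by at most `tol` episodes.
    """
    k = k_start
    prev = episodes_to_convergence(k)
    while k * factor <= k_max:
        nxt = episodes_to_convergence(k * factor)
        if abs(nxt - prev) <= tol:
            return k
        k, prev = k * factor, nxt
    return k

# Toy stand-in for a training run: a curve that flattens out at 12.5 episodes.
toy_curve = lambda k: 12.5 + 100.0 / k
print(select_k(toy_curve))
```

In practice each call would average many training runs, so a coarser schedule or a looser tolerance may be preferable to keep the search cheap.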

Branching Ratios of Optimal and Sub-Optimal Paths
While the speed of convergence is an important characteristic when comparing any computer algorithms, it is also important to consider the quality of the policies the algorithms generate and the paths through the environment that they produce. An interesting comparison between Q-learning, VQRL and QQRL is the relative path distribution of each of the algorithms. Ideally, each of the algorithms will converge over time to the optimal path, which is often the shortest path. The branching ratio, which is the fraction of times that the algorithm converged to a certain path, can be seen as a function of episode number in Figure 5.

While Q-learning converges more slowly than VQRL and QQRL, over time it converges to the optimal path instead of the sub-optimal path. However, when VQRL and QQRL converge to the sub-optimal path, their policies have become essentially deterministic; consequently, there is little difference between the optimal path and the sub-optimal path, as both are the same length. The only reason why the sub-optimal path is less ideal than the optimal one is that it is closer to the "pit", so an agent with a stochastic policy has a higher probability of entering the "pit".
The reason why the policies of VQRL and QQRL become deterministic over time may be attributed to repeated applications of Grover's algorithm to the policies. In these algorithms, Grover iterations have the tendency to increase the probability of actions with higher expected returns and decrease the probability of actions with lower ones. As the value functions of these algorithms converge, repeated Grover iterations tend to increase the probability of the most favorable action (the action with the highest expected return) and decrease the probability of all of the other actions. In the limit, the probability of the action with the maximum expected return approaches 1, and the probabilities of all other actions approach 0.
In effect, the policy becomes deterministic in the limit.
While it is clear in Figure 5 that VQRL and QQRL converge much faster than Q-learning, it is also apparent that Q-learning tends to converge to the optimal path over time, while the quantum algorithms converge to a distribution between the optimal and sub-optimal paths. This highlights one trade-off between the quantum algorithms and their classical counterparts: while the quantum algorithms converge much faster to a less optimal path distribution, Q-learning eventually converges to the optimal path.

Comparison of Value Functions for VQRL, QQRL, and Q-Learning
An important difference between Q-learning, VQRL, and QQRL is the estimate of the value of each of the states; these are shown in Figure 6 for the initial state. Figure 6 indicates that the policies of VQRL and QQRL converge much more quickly than that of Q-learning. An interesting distinction between VQRL and QQRL is that VQRL converges to a much higher value than QQRL. While the cause of this is uncertain, it is likely related to the path distribution that the algorithm converges to (see Figure 5).
In addition to the maximum value of the state, Q-learning and QQRL exhibit further differences in the value of each action for the state. This is shown in Figure 7 for the initial state. An interesting observation is that for the first 100 episodes, Q(s, a) for Q-learning and QQRL follow the same downward trend for each a. After episode 100, however, the values diverge; QQRL converges much faster than Q-learning, which does not converge in the first 1000 episodes.

Comparison of Value Functions for Double Q-Learning, DVQRL, and DQQRL
One of the effects of doubling the value functions of reinforcement learning algorithms is that the estimated value of each state differs from the estimate of the single function algorithm. Generally, this difference is an underestimate in comparison with the estimate of the value in the single function algorithm; this may be observed in Figure 8. However, despite this underestimate at any given episode, the value curves of the double function algorithms generally follow the shape of those of the single function algorithms, but at a slower rate.

Comparison of Value Functions for Multiple Q-Learning, MVQRL, and MQQRL
Like Double Q-learning, DVQRL, and DQQRL, Multiple Q-learning, MVQRL, [...] efficiency per episode in Table 2; however, Table 3 shows that the quantum algorithms show much better performance overall, converging in significantly less time because they require fewer iterations to converge. The time measurements were made on an Intel i5-3210M processor with 8 GB of DDR3 RAM. The algorithms were implemented with multiple threads in C++ and compiled with GCC 6.3.1-1; however, CPU times were normalized to give the required CPU time in a single thread.

DOI: 10.4236/ijis.2019.91001, International Journal of Intelligence Science

Double Quantum Reinforcement Learning (DVQRL)
Double Quantum Reinforcement Learning (DVQRL) combines the idea of doubled learning (such as that used in Double Q-learning) with Quantum Reinforcement Learning. The main idea of DVQRL is to use two separate value functions and, for each experience, to randomly choose one function to update using the value of the other.

Quantum Q Reinforcement Learning (QQRL) uses an action value function instead of a value function. The advantage of this is that the action value function holds more specific information than the value function, potentially leading to faster convergence. In this context, an algorithm is considered to have converged if subsequent updates do not change the policy. QQRL, which adapts VQRL, has an action value function that replaces the value function in VQRL. This action value function is used to compute the expected value of the next state.

In Double Quantum Q-Learning (DQQRL), QQRL is doubled to use two different action value functions, as is done in DVQRL and Double Q-learning. This algorithm is described in Algorithm 4. At each time step, only one of the two functions is updated; the function to update is chosen by random sampling.

The reward for entering the "goal" is r = +10. When the agent enters any of the other states, it receives a reward with mean r = −1 and standard deviation σ. In a deterministic environment, σ = 0; in a stochastic environment, σ ≠ 0, and the agent receives the rewards −1 + σ and −1 − σ with equal probability.
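The reward scheme can be expressed as a small sampling function. This is a sketch that assumes each cell is labeled with a string; the labels "pit" and "goal", the function name, and the `rng` parameter are hypothetical. The reward values and the two-point distribution follow the description above.

```python
import random

def sample_reward(cell, sigma=0.0, rng=random):
    """Reward for entering `cell`; sigma = 0 gives the deterministic case."""
    if cell == "pit":
        return -10.0
    if cell == "goal":
        return 10.0
    # Any other state: -1 + sigma or -1 - sigma with equal probability,
    # which has mean -1 and standard deviation sigma.
    return -1.0 + (sigma if rng.random() < 0.5 else -sigma)

rng = random.Random(1)
print([sample_reward("empty", sigma=0.5, rng=rng) for _ in range(5)])
```

Because the two outcomes are symmetric about −1, the expected step reward is the same in the deterministic and stochastic environments; only its variance changes.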

Figure 1. Environment used to test each algorithm. A represents the starting location of the agent in the environment, P represents the pit, where the agent receives a large negative reward, W represents the wall, which is a disallowed state, and G represents the goal, where the agent receives a large positive reward. At each time step, the agent receives a small negative reward so that the optimal policy is the shortest path through the environment from the initial state to the goal.

Figure 3. Breakdown learning rate as a function of the number of Q(s, a) functions. The breakdown learning rate is the minimum value of α where the algorithm fails to find the goal. Note that N = 1 corresponds to Q-learning, QQRL, and VQRL, and that N = 2 corresponds to Double Q-learning, DQQRL, and DVQRL.

Figure 4. Iterations to convergence as a function of k in a deterministic environment. The results were averaged over 100,000 runs. For this experiment, α = 0.05.

Figure 5. Branching ratio of Q-learning, VQRL, and QQRL. The optimal path 33111 and the sub-optimal path 31311 are shown in Figure 2. For this experiment, α = 0.05. The results were averaged over 1000 runs.

Figure 6. Expected value of initial state for Q-learning, VQRL, and QQRL. For VQRL, V(s) is shown, but for QQRL and Q-learning

Table 1. Numerical labels of each action. A string of these actions such as 32132 indicates a path through the environment.

Table 2. CPU time per episode (μs/episode) for each algorithm.

Table 3. CPU time to reach convergence (μs) for each algorithm.