
In this paper, a collection of value-based quantum reinforcement learning algorithms is introduced. These algorithms use Grover’s algorithm to update the policy, which is stored as a superposition of qubits associated with each possible action, and their parameters are explored. The algorithms may be grouped into two classes: one that uses value functions (V(s)) and a new class that uses action value functions (Q(s,a)). The new Q(s,a)-based quantum algorithms are found to converge faster than the V(s)-based algorithms, and in general the quantum algorithms are found to converge in fewer iterations than their classical counterparts, netting larger returns during training. This is due to the fact that the Q(s,a)-based algorithms are more precise than those based on V(s), meaning that updates are incorporated into the value function more efficiently. This effect is enhanced by the observation that the Q(s,a)-based algorithms may be trained with higher learning rates. These algorithms are then extended by adding multiple value functions, which are observed to allow larger learning rates and to improve convergence in environments with stochastic rewards; the latter is further improved by the probabilistic nature of the quantum algorithms. Finally, the quantum algorithms were found to use less CPU time than their classical counterparts overall, meaning that their benefits may be realized even without a full quantum computer.

In recent years, the field of reinforcement learning [

The main goal of reinforcement learning is to maximize a signal, known as the reward, over a sequence of time steps, known as an episode, by finding a policy which describes what action to take from each state. The combination of the reward signal with the environment with which the agent interacts implicitly defines an optimal policy; the goal of all reinforcement learning algorithms is to find this policy or a good approximation of it.

While many algorithms for reinforcement learning work well in environments with a reasonable number of states, they become ineffective in large state spaces (such as a continuous state space). To address this, algorithms must use function approximation and train with a relatively small number of experiences. One recent success was the application of reinforcement learning to classic Atari games [

Reinforcement learning algorithms consider the problem of finding an optimal policy in a Markov Decision Process (MDP) with respect to a reward signal; the nature of this is discussed in detail by Sutton and Barto [

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ r(s', a, s) + \gamma V^*(s') \right]$$

and

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s', a, s) + \gamma \max_{a'} Q^*(s', a') \right],$$

where r(·) is the reward obtained from the transition (s, a, s′), γ ∈ (0, 1] is the discount applied to future rewards, and P(· | s, a) gives the transition probabilities of the Markov chain. The optimal value functions are related according to

$$V^*(s) = \max_a Q^*(s, a),$$

where implicitly the optimal policy in the state s is to take the action a which maximizes the expected (discounted) return, V * ( s ) .

A number of methods have been developed to approximate the fixed points V * ( s ) and Q * ( s , a ) satisfying the two forms of the Bellman Equation. One such technique is known as Value Iteration [

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right],$$

where at each time step t the estimate of Q * ( s , a ) is updated based on the observed reward r t and next state s t + 1 .
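As a minimal sketch of this tabular update, assuming the conventional temporal-difference form in which the current estimate is subtracted inside the bracket: the dictionary-based table and the names `q_learning_update`, `alpha`, `gamma`, and `terminal` are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the new estimate."""
    # Bootstrapped target: observed reward plus the discounted best next-state value.
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]

Q = defaultdict(float)      # hypothetical table; every estimate starts at zero
actions = [0, 1, 2, 3]      # four discrete actions
new_value = q_learning_update(Q, s=(0, 0), a=3, r=-1.0, s_next=(0, 1), actions=actions)
```

With all estimates initialised to zero, a single step with reward −1 moves the updated entry a fraction `alpha` of the way toward the target.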

Interesting extensions of Q-learning are those of Double Q-learning [

$$Q_{i,t+1}(s_t, a_t) = Q_{i,t}(s_t, a_t) + \alpha \left[ r_t + \gamma Q_{j,t}(s_{t+1}, a') - Q_{i,t}(s_t, a_t) \right],$$

$$a' = \arg\max_a Q_{i,t}(s_{t+1}, a),$$

where i, j ∈ {1, 2}, i ≠ j, with i chosen uniformly at random. In Multiple Q-learning, N estimates of Q(s, a) are maintained, and at each time step a single estimate is updated using the average of the other N − 1 estimates:

$$Q_{i,t+1}(s_t, a_t) = Q_{i,t}(s_t, a_t) + \alpha \left[ r_t + \frac{\gamma}{N-1} \sum_{j=1, j \neq i}^{N} Q_{j,t}(s_{t+1}, a') - Q_{i,t}(s_t, a_t) \right],$$

$$a' = \arg\max_a Q_{i,t}(s_{t+1}, a),$$

where i is chosen uniformly over [ 1 , N ] .
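The Multiple Q-learning update may be sketched as follows; Double Q-learning is the special case N = 2. This is an illustration only: it assumes the greedy action is an arg max evaluated at the next state and that the current estimate is subtracted in the temporal-difference error, and the nested-list tables and the name `multiple_q_update` are hypothetical.

```python
import random

def multiple_q_update(Qs, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.9, rng=random):
    """Update one randomly chosen estimate using the mean of the other N-1; return its index."""
    n = len(Qs)                                  # requires n >= 2
    i = rng.randrange(n)                         # i chosen uniformly (zero-based here)
    # Greedy next action according to the estimate being updated.
    a_star = max(range(n_actions), key=lambda b: Qs[i][s_next][b])
    # Average value of that action under all other estimates.
    others = sum(Qs[j][s_next][a_star] for j in range(n) if j != i) / (n - 1)
    Qs[i][s][a] += alpha * (r + gamma * others - Qs[i][s][a])
    return i

n_states, n_actions, N = 3, 4, 3
Qs = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(N)]
updated = multiple_q_update(Qs, s=0, a=1, r=-1.0, s_next=1, n_actions=n_actions)
```

Selecting the action with one estimate and evaluating it with the others is what decouples action selection from value estimation.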

Recently, there has been increased interest in developing quantum computing algorithms. In the quantum computing paradigm [

Two important algorithms which are the building blocks of more complicated algorithms are known as Shor’s algorithm [

Quantum computing has shown promise in the field of machine learning [

Grover’s algorithm [

With the increase in interest in quantum computing has come interest in applying it to reinforcement learning. As the application of reinforcement learning to real-world problems generally requires a very large state-space, the hope is that the application of quantum computing would significantly reduce the amount of time for the algorithm to reach convergence. A generalized framework for quantum reinforcement learning is described in detail by Cárdenas-López, et al. [

Other work in quantum reinforcement learning has focused on adapting more traditional reinforcement learning algorithms to quantum computing. Using the free energy of a restricted Boltzmann machine (RBM) to approximate the Q-function was first proposed by Sallans and Hinton [

A specific algorithm known as Quantum Reinforcement Learning (VQRL) [

$$\pi(a \mid s) = \left| \langle a \mid a_s \rangle \right|^2,$$

where a is an action taken from state s, $|a\rangle$ is its associated eigenstate, and $|a_s\rangle$ is the state of a quantum system that represents the policy for state s. The states of these systems are updated according to the rule

$$| a_s(t+1) \rangle = \hat{U}_g \, | a_s(t) \rangle,$$

where U ^ g is a unitary operator that represents one Grover iteration. This may be expressed as a combination of a reflection and diffusion, U ^ a and U ^ a s [

$$\hat{U}_g = \hat{U}_{a_s} \hat{U}_a .$$

The effect of $\hat{U}_a$ is to invert the amplitude of a, and the effect of $\hat{U}_{a_s}$ is to invert all the amplitudes of the state about their mean. When applied to an equiprobable state, $\hat{U}_g$ increases the probability of obtaining a from a measurement of the state. However, $\hat{U}_g$ may not be applied indefinitely; after a certain number of iterations, further Grover iterations decrease the probability of measuring a. The maximum number of iterations depends on the number of possible actions, and is given by

$$L_{\max} = \left\lfloor \frac{\pi}{4\theta} - \frac{1}{2} \right\rfloor,$$

where L max is the maximum number of iterations that may be applied until the probability decreases and tan θ = | 〈 a s | a 〉 | .
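The effect of a Grover iteration on a small policy can be illustrated with a classical simulation of the amplitudes. The sketch below assumes real amplitudes, implements the diffusion step as inversion about the mean as described above, and takes tan θ from the formula as stated; the function names are hypothetical.

```python
import math

def grover_iteration(amps, a):
    """One Grover iteration on real amplitudes: flip the sign of action a, then invert about the mean."""
    flipped = [-x if i == a else x for i, x in enumerate(amps)]
    mean = sum(flipped) / len(flipped)
    return [2.0 * mean - x for x in flipped]

def max_grover_iterations(amp_a):
    """L_max = floor(pi / (4 * theta) - 1/2), with tan(theta) = |<a_s|a>|; amp_a must be nonzero."""
    theta = math.atan(abs(amp_a))
    return math.floor(math.pi / (4.0 * theta) - 0.5)

amps = [0.5, 0.5, 0.5, 0.5]               # equiprobable policy over four actions
l_max = max_grover_iterations(amps[0])    # iteration cap for amplitude 0.5
for _ in range(l_max):
    amps = grover_iteration(amps, a=0)
prob_a = amps[0] ** 2                     # pi(a|s) = |<a|a_s>|^2
```

For four equiprobable actions the cap evaluates to a single iteration, after which all of the probability sits on the chosen action; applying the operator again would reduce it.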

In this paper, a collection of new quantum reinforcement learning algorithms are introduced which are based on Quantum Reinforcement Learning (VQRL), which was first described by Dong, et al. [

Another alteration done to VQRL, and also done to QQRL, is to increase the number of V ( s ) and Q ( s , a ) functions; this alteration is inspired by Multiple Q-learning algorithm described by Duryea, et al. [

The structure of this paper is as follows. Section 2 introduces the quantum reinforcement learning algorithms discussed in this paper. Section 3 first presents the test environment used on the algorithms, and then shows the results of testing the new algorithms in the environment. Finally, Section 4 concludes by interpreting the results and discussing their implications.

Double Quantum Reinforcement Learning (DVQRL) combines the idea of doubled learning (as used in Double Q-learning) with Quantum Reinforcement Learning. The main idea of DVQRL is to use two separate value functions, $V_1(s)$ and $V_2(s)$, and for each experience to randomly choose one function to update using the value of the other. Explicitly, if i ∼ U{1, 2} (uniform distribution) and j = 3 − i, then for each experience the update

$$V_i(s) \leftarrow V_i(s) + \alpha \left( r + V_j(s') - V_i(s) \right)$$

is applied. The expected value of each state is then the average of the two value functions, $V(s) = \frac{1}{2}\left( V_1(s) + V_2(s) \right)$. This is used to compute L, which is the number of times to apply Grover’s operator to the policy, $|a_s\rangle$. First, θ is computed by solving

$$\tan\theta = \left| \langle a_s \mid a \rangle \right| .$$

Then, after solving for θ , L may be computed using

$$L = \left\lfloor \min\left\{ k \left( r + V(s') \right), \; \frac{\pi}{4\theta} - \frac{1}{2} \right\} \right\rfloor,$$

where L is the number of times to apply Grover’s operator and k is a parameter which controls the rate at which the policy is updated. The Grover operator U ^ g may then be applied L times to the current policy, given by

$$| a_s(t+1) \rangle = \hat{U}_g^{L} \, | a_s(t) \rangle .$$

The algorithm for DVQRL can be seen in Algorithm 1.
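The computation of L from the averaged value function may be sketched as below. This is not a reproduction of Algorithm 1; the helper names are hypothetical, and clamping a negative count to zero is an assumption made here, since the text does not say how a negative r + V(s′) is handled.

```python
import math

def averaged_value(v1, v2, s):
    """V(s) = (V1(s) + V2(s)) / 2 for the doubled value function."""
    return 0.5 * (v1[s] + v2[s])

def grover_steps(k, r, v_next, amp_a):
    """L = floor(min(k * (r + V(s')), pi / (4 * theta) - 1/2)), clamped at zero (assumption)."""
    theta = math.atan(abs(amp_a))             # tan(theta) = |<a_s|a>|
    cap = math.pi / (4.0 * theta) - 0.5       # beyond this the probability of a falls
    steps = math.floor(min(k * (r + v_next), cap))
    return max(steps, 0)                      # assumption: a negative count means no update

v1 = {"s_next": 4.0}                          # hypothetical value tables
v2 = {"s_next": 6.0}
v = averaged_value(v1, v2, "s_next")
steps_small_k = grover_steps(k=0.1, r=-1.0, v_next=v, amp_a=0.5)
steps_large_k = grover_steps(k=10.0, r=-1.0, v_next=v, amp_a=0.5)
```

A small k leaves the policy unchanged for this experience, while a large k saturates at the cap set by θ, so k only matters up to that ceiling.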

Multiple Quantum Reinforcement Learning (MVQRL) is similar to Double Quantum Reinforcement Learning, but allows for N value functions instead of only 2. As in Multiple Q-learning, the effect of increasing the number of functions is improved learning in stochastic environments. The estimated value of each state is stored in the functions $V_1(s), V_2(s), \cdots, V_N(s)$. The value functions are updated at each step according to

$$V_i(s) \leftarrow V_i(s) + \alpha \left( r + \frac{\gamma}{N-1} \sum_{j=1, j \neq i}^{N} V_j(s') - V_i(s) \right),$$

where i ∼ U { 1, N } . The algorithm for Multiple VQRL may be seen in Algorithm 2.
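A minimal sketch of this value update, using list-based tables and zero-based indices as illustrative assumptions (it is not a transcription of Algorithm 2, and the name `mvqrl_value_update` is hypothetical):

```python
import random

def mvqrl_value_update(Vs, s, r, s_next, alpha=0.1, gamma=0.9, rng=random):
    """Move one randomly chosen V_i toward r + gamma * mean of the other V_j(s'); return i."""
    n = len(Vs)                      # requires n >= 2
    i = rng.randrange(n)             # i ~ U{1, N}, zero-based here
    others = sum(Vs[j][s_next] for j in range(n) if j != i) / (n - 1)
    Vs[i][s] += alpha * (r + gamma * others - Vs[i][s])
    return i

# Three value functions over two states; state 1 is already valued at 10.
Vs = [[0.0, 10.0] for _ in range(3)]
i = mvqrl_value_update(Vs, s=0, r=-1.0, s_next=1)
```

Only the sampled function changes on each experience, so each function is trained on roughly one N-th of the data.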

The idea of Quantum Reinforcement Learning may also be adapted to utilize an action value function instead of a value function. The advantage of this is that the action value function holds more specific information than the value function, potentially leading to faster convergence. In this context, an algorithm is considered to have converged if subsequent updates do not change the policy. Quantum Q Reinforcement Learning (QQRL), which adapts VQRL, has an action value function that replaces the value function in VQRL. This action value function is used to compute the expected value of the next state according to

$$V_{s'} = \max_{a'} Q(s', a'),$$

where Q ( s , a ) is the expected value of taking action a from state s, and V s ′ is the expected value of the agent being in state s ′ under the current policy. Like VQRL, in QQRL the policy | a s 〉 is updated according to

$$| a_s(t+1) \rangle = \hat{U}_g^{L} \, | a_s(t) \rangle .$$

The full algorithm describing QQRL is shown in Algorithm 3.

As with VQRL and Q-learning, QQRL may be doubled to use two different action value functions, $Q_1(s, a)$ and $Q_2(s, a)$. This algorithm is described in Algorithm 4. At each time step, only one of the two functions is updated; this is done by sampling i ∼ U{1, 2}, setting j = 3 − i, and applying the update

$$Q_i(s, a) \leftarrow Q_i(s, a) + \alpha \left( r + \gamma Q_j\left( s', \arg\max_{a'} Q_i(s', a') \right) - Q_i(s, a) \right).$$

When computing L to determine the number of Grover iterations to be performed on the policy, V ( s ) is computed according to

$$V(s) = \max_a \frac{1}{2} \left( Q_1(s, a) + Q_2(s, a) \right).$$

The policy is then updated in the same way as in QQRL.

QQRL may also be extended to have any number of action value functions; this is done in a similar way to Multiple Q-learning and MVQRL. The algorithm for Multiple Quantum Q-learning (MQQRL) may be seen in Algorithm 5. At each time step, a single function is chosen to be updated; this is done by sampling i ∼ U{1, N}, and the chosen function is updated according to

$$Q_i(s, a) \leftarrow Q_i(s, a) + \alpha \left( r + \frac{\gamma}{N-1} \sum_{j=1, j \neq i}^{N} Q_j(s', b) - Q_i(s, a) \right),$$

where b = arg max a ′ Q i ( s ′ , a ′ ) . In order to compute the number of Grover iterations, L, V ( s ) is computed according to

$$V(s) = \max_a \frac{1}{N} \sum_{i=1}^{N} Q_i(s, a).$$

The probability amplitudes of the policy are then updated in the same way as in QQRL.
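The state value used for the Grover iteration count takes the maximum of the averaged action values, which is not the same as averaging each function's own maximum. The following sketch makes the distinction concrete; the nested-list layout and the function names are illustrative assumptions.

```python
def ensemble_state_value(Qs, s):
    """V(s) = max_a (1/N) * sum_i Q_i(s, a)."""
    n = len(Qs)
    n_actions = len(Qs[0][s])
    return max(sum(Q[s][a] for Q in Qs) / n for a in range(n_actions))

def mean_of_maxima(Qs, s):
    """Average of each function's own maximum, shown only for comparison."""
    return sum(max(Q[s]) for Q in Qs) / len(Qs)

Qs = [
    [[1.0, 4.0]],   # Q_1 for a single state with two actions
    [[3.0, 0.0]],   # Q_2 disagrees about which action is best
]
v = ensemble_state_value(Qs, s=0)
v_alt = mean_of_maxima(Qs, s=0)
```

When the individual functions disagree about the best action, the max of the average is never larger than the average of the maxima, so optimistic errors in a single function are damped.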

The grid environment used to test the algorithms can be seen in

In order to denote paths through the environment, an action sequence is used. This is a string of action numbers; the mapping from number to action may be seen in

The “pit” and the “goal” are terminal states; that is, when the agent enters these states, the episode is finished. When the agent enters the “pit”, it receives a reward of r = − 10 , and when it enters the “goal”, it receives a reward of r = + 10 . When it enters any of the other states, it receives a reward with mean r ¯ = − 1 and standard deviation σ . In a deterministic environment, σ = 0 ; in a stochastic environment, σ ≠ 0 , and the agent receives the rewards − 1 + σ and − 1 − σ with equal probability.

Number | Action |
---|---|
0 | Up |
1 | Down |
2 | Left |
3 | Right |
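The action numbering and the reward scheme described above can be sketched as follows. The state labels `"pit"`, `"goal"`, and `"floor"`, the function names, and the example sequence `"3300"` are hypothetical; the grid layout itself is not reproduced here.

```python
import random

ACTIONS = {0: "Up", 1: "Down", 2: "Left", 3: "Right"}

def decode_action_sequence(sequence):
    """Translate a string of action numbers into action names."""
    return [ACTIONS[int(ch)] for ch in sequence]

def step_reward(kind, sigma=0.0, rng=random):
    """Reward on entering a state: -10 for the pit, +10 for the goal, otherwise -1 +/- sigma."""
    if kind == "pit":
        return -10.0
    if kind == "goal":
        return 10.0
    # Two-point distribution with mean -1 and standard deviation sigma.
    return -1.0 + sigma * rng.choice((-1.0, 1.0))

path = decode_action_sequence("3300")
noisy = step_reward("floor", sigma=0.5)
```

Setting `sigma=0.0` recovers the deterministic environment, in which every non-terminal step costs exactly 1.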

Due to the stochastic nature of the quantum algorithms, the results for convergence were averaged over multiple runs for each experiment. The number of repeated runs was different for each experiment; these are given in the text and figure captions. In each repeated run, the only difference between the algorithms was the seed value for random number generation. In other words, the position of each feature was the same as in

One of the most important differences between the single value function algorithms (Q-learning, VQRL, and QQRL) with the corresponding multiple value function algorithms (Multiple Q-learning, MVQRL, and MQQRL) is an increase in stability with the addition of value functions. This is most clearly seen in

The breakdown learning rate was defined in the same manner as above for the single value function algorithms; in other words, the value of α past which N > 1000 . This was computed for Multiple Q-learning, MVQRL, and MQQRL for each value of N ∈ [ 1 , 10 ] and is shown in

Another feature of

Similar to the learning rate, the parameter k also has a significant effect on the convergence of the quantum algorithms; effectively, it determines how quickly the policy changes. A larger k results in a faster change in the policy, which generally means that the learning rate of the algorithm increases. This may be seen in

The relationship between k and the speed of convergence of VQRL and QQRL is interesting because, above a certain threshold, the number of episodes until convergence remains steady at around 12.5. This indicates that the algorithms are not sensitive to the particular value of k, given that it is sufficiently high, which suggests that a reasonable strategy for the selection of a particular k value may be to increase the value until the number of iterations to convergence does not change anymore.
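The selection strategy suggested here can be written as a short search. In the sketch below, `episodes_to_converge` stands for any routine that trains the agent with a given k and reports the number of episodes to convergence; the `toy_curve` function is a synthetic stand-in with a plateau at 12.5 episodes, not measured data, and the tolerance is an arbitrary assumption.

```python
def select_k(episodes_to_converge, k_values, tolerance=0.5):
    """Return the first k beyond which the episode count improves by no more than tolerance."""
    previous = None
    for k in k_values:               # k_values must be non-empty and increasing
        current = episodes_to_converge(k)
        if previous is not None and previous[1] - current <= tolerance:
            return previous[0]       # the previous k was already on the plateau
        previous = (k, current)
    return previous[0]               # no plateau detected; fall back to the largest k tried

def toy_curve(k):
    """Synthetic stand-in: fewer episodes as k grows, flattening at 12.5."""
    return max(12.5, 50.0 / k)

chosen_k = select_k(toy_curve, [1, 2, 4, 8, 16])
```

Because the quantum algorithms are stochastic, a practical version would average the episode count over several runs for each k before comparing.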

While the speed of convergence is an important characteristic when comparing any computer algorithms, it is also important to consider the quality of the policies the algorithms generate and the paths through the environment that they produce. An interesting comparison between Q-learning, VQRL and QQRL is the relative path distribution between each of the algorithms. Ideally, each of the algorithms will converge over time to the optimal path, which is often the shortest path. The branching ratio, which is the fraction of times that the algorithm converged to a certain path, can be seen as a function of episode number in

While Q-learning converges more slowly than VQRL and QQRL, over time it converges to the optimal path instead of the sub-optimal path. However, when VQRL and QQRL converge to the sub-optimal path, their policies have become essentially deterministic; consequently, there is little difference between the optimal path and the sub-optimal path, as both are the same length. The only reason the sub-optimal path is less ideal than the optimal one is that it is closer to the “pit”, so an agent with a stochastic policy has a higher probability of entering the “pit”.

The policies of VQRL and QQRL become deterministic over time because Grover’s algorithm is applied repeatedly to the policies. In these algorithms, Grover iterations tend to increase the probability of actions with higher expected returns and decrease the probability of actions with lower ones. As the value functions of these algorithms converge, repeated Grover iterations tend to increase the probability of the most favorable action (the action with the highest expected return) and decrease the probability of all of the other actions. In the limit, the probability of the action with the maximum expected return approaches 1, and the probabilities of all other actions approach 0. In effect, the policy becomes deterministic in the limit.

While it is clear in

An important difference between Q-learning, VQRL, and QQRL is the estimate of the value of each of the states; these are shown in

of the agent as a function of the number of episodes. In the experiment shown, VQRL and QQRL converge much faster to the expected value of the state. As the stability of the value function is highly related to the stability of the policy, this indicates that the policies of VQRL and QQRL converge much quicker than Q-learning. An interesting distinction between VQRL and QQRL is that VQRL converges to a much higher value than QQRL. While the cause of this is uncertain, it is likely related to the path distribution that the algorithm converges to (see

In addition to the maximum value of the state, Q-learning and QQRL exhibit further differences in the value of each action for the state. This is shown in

One of the effects of doubling the value functions of reinforcement learning algorithms is that the estimated value of each state differs from the estimate produced by the single function algorithm. Generally, the doubled estimate is an underestimate in comparison with the estimate of the value in the single function algorithm; this may be observed in

Like Double Q-learning, DVQRL, and DQQRL, the algorithms Multiple Q-learning, MVQRL, and MQQRL also underestimate Q(s, a) and V(s) relative to the single versions of the algorithms (Q-learning, VQRL, and QQRL). This is shown for N = 3, 6, 10 in

An important metric when comparing reinforcement learning algorithms is the relative computational efficiency of each. The average amount of computation time per episode for each algorithm implemented is shown in

N | Multiple Q-Learning | MVQRL | MQQRL |
---|---|---|---|
1 | 1.5 | 8.3 | 8.1 |
2 | 1.8 | 8.0 | 12.2 |
3 | 1.7 | 7.9 | 14.7 |
4 | 1.7 | 8.0 | 18.0 |
5 | 1.7 | 7.8 | 20.4 |
6 | 1.7 | 7.7 | 22.8 |
7 | 1.7 | 7.7 | 24.9 |
8 | 1.6 | 7.6 | 27.7 |
9 | 1.6 | 7.7 | 30.4 |
10 | 1.6 | 7.8 | 34.3 |

N | Multiple Q-Learning | MVQRL | MQQRL |
---|---|---|---|
1 | 137.5 | 102.2 | 97.7 |
2 | 368.7 | 108.5 | 167.2 |
3 | 397.9 | 112.2 | 208.0 |
4 | 449.8 | 114.9 | 259.7 |
5 | 475.8 | 114.0 | 301.7 |
6 | 534.0 | 114.7 | 346.2 |
7 | 576.6 | 116.8 | 376.1 |
8 | 584.9 | 116.2 | 422.4 |
9 | 630.8 | 119.3 | 466.2 |
10 | 673.0 | 123.8 | 536.5 |

Another interesting feature highlighted by

As may be seen in

$V_{s'} = \max_{a'} \frac{1}{N} \sum_{i=1}^{N} Q_i(s', a')$ on every iteration, scaling linearly with the number of Q(s, a) functions. Additionally, the memory fragmentation of the particular implementation may contribute substantially to the additional CPU time; each Q(s, a) was stored as a vector of rows, where each row was stored separately. On the other hand, in MVQRL each V(s) function was stored as a vector, meaning that the entire function was stored contiguously in memory. Thus, it is likely that MQQRL causes significantly more cache misses than MVQRL in the current implementation, increasing the CPU time for the algorithm. A more robust solution might store the entire Q(s, a) function in a contiguous array to reduce cache misses.
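The contiguous layout suggested above amounts to row-major indexing into a single buffer. The sketch below shows the idea with Python's standard `array` module; the class name `FlatQ` is hypothetical, the implementation language used in the experiments is not stated in the text, and no cache behaviour is measured here.

```python
from array import array

class FlatQ:
    """Q(s, a) stored row-major in one contiguous buffer of doubles."""

    def __init__(self, n_states, n_actions):
        self.n_states = n_states
        self.n_actions = n_actions
        self.data = array("d", [0.0]) * (n_states * n_actions)

    def get(self, s, a):
        return self.data[s * self.n_actions + a]

    def set(self, s, a, value):
        self.data[s * self.n_actions + a] = value

    def row_max(self, s):
        """max_a Q(s, a), read from one contiguous slice."""
        start = s * self.n_actions
        return max(self.data[start:start + self.n_actions])

q = FlatQ(n_states=3, n_actions=4)
q.set(1, 2, 5.0)
best = q.row_max(1)
```

All action values for a state sit next to each other in memory, so the maximisation over actions touches a single contiguous run rather than a separately allocated row.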

Although MQQRL requires more CPU time per episode than MVQRL, this is balanced to a certain degree by the fact that MQQRL requires fewer iterations to converge. Moreover, there are many situations in which reducing the number of iterations matters more than reducing overall computation time, and in these scenarios MQQRL may be a better choice than MVQRL. An example scenario might be one where the physical cost of exploration is high but computational expense is low, in which case it may be advantageous to decrease the number of iterations required to converge and use a more powerful computer.

The increase in CPU time to reach convergence with N for Multiple Q-learning may be attributed to an increase in the number of episodes required to reach convergence, as the CPU time per episode is effectively constant. However, for MQQRL the increase in CPU time to reach convergence may be attributed to both an increase in the number of episodes and an increase in the CPU time per episode. For all N, however, MVQRL and MQQRL completed in less CPU time than Multiple Q-learning, meaning that the benefits of the quantum algorithms over the classical ones may be realized with less computational expense, not more.

The results in Section 3.2 demonstrate that QQRL reaches convergence in fewer episodes than VQRL. QQRL and VQRL differ only in how the value of the current state is stored; that is, QQRL uses Q(s, a) while VQRL uses only V(s). The most likely reason for this improvement in convergence speed is the extra precision provided by Q(s, a) when computing the value of a certain action and then choosing the best action; in contrast, V(s) is the expectation of the value over all possible actions, and as a consequence is much less precise. The extra precision is especially important when updating the policy in QQRL, as it uses $\max_a Q(s, a)$ to compute the number of Grover iterations to perform.

Another important observation is that the quantum algorithms, VQRL, QQRL, MVQRL, and MQQRL, have much faster convergence than their classical counterparts, Q-learning and Multiple Q-learning. As discussed by Dong, et al. [

Finally, it was found that adding extra V ( s ) and Q ( s , a ) functions improved the learning in a stochastic environment. In other words, when the reward was drawn from some distribution, MVQRL and MQQRL exhibited better performance than VQRL and QQRL, respectively, in the same way that Multiple Q-learning exhibited better performance than Q-learning [

each $V_i(s)$ or $Q_i(s, a)$ function is constructed using only $\frac{1}{N}$ of the total number of experiences, and the variations among the functions are reduced when computing the average V(s) or Q(s, a).

One comparison which may be drawn between the quantum and classical algorithms is the number of episodes to convergence. Generally, it was observed that the value functions V ( s ) and Q ( s , a ) in VQRL and QQRL, respectively, converge much faster than the Q ( s , a ) in Q-learning. This result leads to significantly faster convergence times for the same learning rate; for smaller learning rates, the quantum algorithms converge about 10 times faster. Additionally, the quantum algorithms were found to be less CPU intensive than their classical counterparts for a given number of Q ( s , a ) or V ( s ) functions, meaning that although the quantum algorithms presented in this paper were not realized on a quantum computer, their use on a classical computer can still provide benefits over the classical algorithms.

A particular feature of QQRL is that it has a higher breakdown learning rate than either Q-learning or VQRL; this generalizes to Multiple Q-learning, MVQRL, and MQQRL. QQRL not only converges faster than Q-learning for a given learning rate, but also allows for higher learning rates than Q-learning does. This is an interesting phenomenon which indicates that QQRL gives faster convergence in general than Q-learning.

One of the main differences between VQRL and QQRL is the convergent path distribution for each algorithm (see

This paper introduced a novel algorithm, Quantum Q-learning (QQRL), which replaces the value function V(s) in Quantum Reinforcement Learning (VQRL) with the action-value function Q(s, a). It was found that QQRL converged faster than VQRL, due to the added precision of Q(s, a), which incorporates updates more efficiently than V(s). It was also found that QQRL converged much faster than classical Q-learning. Furthermore, QQRL was found to be more robust to higher learning rates than VQRL, allowing even faster convergence.

This paper also introduced the algorithms Multiple VQRL (MVQRL) and Multiple QQRL (MQQRL) by introducing multiple estimates of V ( s ) and Q ( s , a ) into VQRL and QQRL, respectively. These algorithms are shown to be more tolerant of higher learning rates, and are found to more effectively handle stochastic rewards, with the effect increasing with the number of value functions. This is similar to the fashion in which additional functions in Multiple Q-learning tended to improve the stability. Each additional function decreases the number of experiences used to train each individual function. Averaging over these then produces a better estimate of the value or action value than each individual function.

The results of this paper demonstrate that a qubit-based policy works well with action-value ( Q ( s , a ) ) reinforcement learning techniques, even when the reward signal is noisy. This is a significant step toward more complex quantum reinforcement learning algorithms, especially those based on the classical concepts of value-based reinforcement learning. The algorithms discussed in this paper demonstrate the viability of research in this direction.

Finally, it was found that the CPU time required for the quantum algorithms to converge was significantly less than the amount of time required for the corresponding classical algorithms with the same number of Q(s, a) or V(s) functions. This is an interesting result, as it indicates that utilization of the quantum algorithms on a classical computer yields an improvement in computational speed over the classical algorithms. Furthermore, this means that more V(s) or Q(s, a) functions may be used, improving the ability of the quantum algorithms to handle noisy signals.

We would like to thank Houghton College for providing financial support for this study. We would also like to thank the anonymous reviewer for their helpful comments.

The authors declare no conflicts of interest regarding the publication of this paper.

Ganger, M. and Hu, W. (2019) Quantum Multiple Q-Learning. International Journal of Intelligence Science, 9, 1-22. https://doi.org/10.4236/ijis.2019.91001