Traditional reinforcement learning (RL) uses the expected return, that is, the expected value of the cumulative random rewards, to train an agent to learn an optimal policy. However, recent research indicates that learning the distribution over returns has distinct advantages over learning their expected value, as seen in different RL tasks. The shift from using the expectation of returns in traditional RL to the distribution over returns in distributional RL has provided new insights into the dynamics of RL. This paper builds on our recent work investigating the quantum approach towards RL. Our work implements quantile regression (QR) distributional Q learning with a quantum neural network. This quantum network is evaluated in a grid world environment with different numbers of quantiles, illustrating how the number of quantiles influences the learning of the algorithm. It is also compared to standard quantum Q learning in a Markov Decision Process (MDP) chain, which demonstrates that quantum QR distributional Q learning can explore the environment more efficiently than standard quantum Q learning. Efficient exploration and the balancing of exploitation and exploration are major challenges in RL. Previous work has shown that more informative actions can be taken with a distributional perspective. Our findings suggest another cause for its success: the enhanced performance of distributional RL can be partially attributed to its superior ability to explore the environment efficiently.

Machine learning is teaching computer models how to learn from data. As a subfield of machine learning, reinforcement learning (RL) aims to learn sequential decision making from data [

A distributional RL algorithm has to make two choices: 1) how to parameterize the value distributions, and 2) how to select a distance metric or loss function for the two distributions: target distribution and predicted distribution. Categorical DQN, also commonly called C51 [

C51 represents the value distribution as a categorical distribution over a fixed set of equidistant points and uses it to approximate the projected distributional Bellman target by minimizing their KL divergence. Furthermore, the work on C51 proves that the distributional Bellman operator is a contraction in a Wasserstein metric between probability distributions. However, this contraction result does not carry over to the C51 algorithm itself, because it minimizes a cross-entropy loss rather than a Wasserstein distance. C51 also uses a predefined range for the categorical supports; therefore, it may not work well in some RL tasks.

QR distributional Q learning [

With the recent advances in quantum computing, machine learning has a new paradigm of computation to design and test different algorithms [

One advantage of quantum computing is its capability to process data of high dimensions. For example, a classical register of 64 bits holds one 64-bit value at a time, whereas the state of a quantum register of 64 qubits is described by $2^{64}$ complex amplitudes, an exponentially larger state space. It is hoped that quantum computing will be useful to find new patterns in big data and solve problems that are currently intractable for classical computers. These intractable problems include quantum system simulations and molecular and atomic dynamics simulations. Finding the ground state energy of a complex molecule is a challenge for classical computers. Also, as more transistors are packed in a single CPU chip, quantum phenomena will occur. Therefore, quantum computing will become more relevant even to classical computing in this regard.

Xanadu is a company that makes photonic quantum computers, which can process information stored in quantum states of light [

A research team at Xanadu created light-based quantum neural networks using photonic gate circuits [

In our previous work, we used these quantum neural networks to study the contextual bandit problem, and to implement Q learning and actor-critic algorithms [

We describe the test environments, QR distributional Q learning, and QR quantum networks in this section. Traditionally, RL focuses on the mean of the return; however, distributional RL aims to model the distribution over the returns in order to gain a more complete picture of the environment. The output of the QR network in distributional RL is a quantile distribution, represented by a discrete set of supports for a given set of quantiles. The final decision on actions is still based on the Q values, which are the expected returns. The key difference is that, in distributional RL, the goal is to learn full distributions rather than only their expectations.

Two environments are employed in this study to examine the performance of quantum QR distributional Q learning. A grid world is used to check the rewards that the algorithm can collect for various numbers of quantiles, which is a key parameter of the algorithm. An MDP chain is used to see how this new algorithm explores the environment when compared with the more familiar Q learning algorithm.

A grid world environment is used to evaluate the performance of quantum QR distributional Q learning, which is also used in our previous work [

An MDP chain is used to investigate the ability of the quantum QR distributional Q learning algorithm to explore the environment. It is a simple deterministic chain $\{s_{-3}, s_{-2}, s_{-1}, s_0, s_1, s_2, s_3\}$ of size 7 with the start state $s_0$ and two terminal states $s_{-3}$ and $s_3$ (

Linear and logistic regressions are commonly used in machine learning, while quantile regression is widely employed in economics. Standard linear regression focuses on the conditional mean function, so quantile regression is useful when other relationships among the variables are of interest. We might be interested in the full distribution of the data rather than just its mean. We may also want to study the conditional median function, where the median is the 50th percentile, or the quantile at level $\tau = 0.5$, of a data distribution. The quantile level $\tau \in (0, 1)$ splits the data into a proportion $\tau$ below and a proportion $1 - \tau$ above. Assume $X$ is a random variable, $F_X(x)$ is its cumulative distribution function (CDF), and $F_X^{-1}(\tau)$ is its inverse; then $\tau = F_X(x)$ and $x = F_X^{-1}(\tau)$. $F_X^{-1}(\tau)$ is called the quantile function, with the median $F_X^{-1}(0.5)$ as a special case. The real number $x_\tau = F_X^{-1}(\tau)$, equivalently $F_X(x_\tau) = \tau$, defines the $\tau$th quantile of $F_X$ or $X$, which means the probability that an observation is less than $x_\tau$ is $\tau$. In linear regression, a straight line is selected by minimizing the distance between the line and the data points. In contrast, quantile regression searches for a line based on the selected quantile using the quantile loss function as shown in

The goal of the RL agent is to learn a policy that gains the maximum expected return, so it is natural to work directly with these expectations. However, this approach cannot render the whole picture of the randomness, as seen from the possibly multimodal distribution over returns. When an agent interacts with the environment in an RL problem, the state transitions, rewards, and actions can all carry certain intrinsic randomness. Distributional RL explicitly models the future random rewards as a full distribution, allowing better-informed actions to be learned. In order to introduce QR distributional Q learning, which utilizes quantile regression to approximate the quantile function for the state-action return distribution, we need to introduce several concepts as background materials [

For a given policy $\pi$, the return $Z^{\pi}$ is a random variable that represents the sum of discounted rewards.

$$Z^{\pi} = \sum_{t=0}^{\infty} \gamma^{t} R_{t} \tag{1}$$

where $\gamma$ is the discount rate. In Equation (1), $Z^{\pi}$ is interpreted as a distribution. Let $Z^{\pi}(s_0, a_0)$ be the return obtained by starting from state $s_0$, performing action $a_0$, and then following the current policy $\pi$; the well-known Q value can then be obtained as follows:

$$Q^{\pi}(s_0, a_0) := \mathbb{E}\left[ Z^{\pi}(s_0, a_0) \right] = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right] \tag{2}$$
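As a rough illustration of Equations (1) and (2), the sketch below computes the discounted return of a few finite reward sequences and averages them to estimate a Q value. The reward sequences, the discount rate of 0.9, and the function name `discounted_return` are made up for this example and are not taken from the experiments in this paper.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma**t * R_t over one finite reward sequence (Equation (1), truncated)."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Hypothetical reward sequences observed after taking the same (s0, a0);
# they differ because transitions or rewards may be random.
episodes = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
gamma = 0.9  # assumed discount rate

# Each entry is one sample of the random return Z(s0, a0).
returns = np.array([discounted_return(r, gamma) for r in episodes])

# Equation (2): the Q value is the expectation of the return,
# estimated here by the sample mean.
q_estimate = float(returns.mean())
print(returns, q_estimate)
```

The array `returns` is a small sample from the return distribution, while `q_estimate` collapses it to a single number; distributional RL keeps the former instead of only the latter.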

$$T Z^{\pi}(s_t, a_t) := R(s_t, a_t) + \gamma Z^{\pi}(s_{t+1}, a_{t+1}) \tag{3}$$

Equation (3) defines the so called target distribution or distributional Bellman target [

Distributional RL algorithms need to measure the distance between two distributions: the target and predicted distributions. The KL divergence is a commonly used distance between two distributions, but it is not defined when these distributions are discrete and have different supports. This is why C51 uses a projection step to give the two distributions the same support before applying the KL divergence. The alternative is the Wasserstein metric, which is described as follows.

For $p \in [1, \infty]$, the p-Wasserstein metric $W_p$ between distributions $U$ and $Y$ is defined as,

$$W_p(U, Y) = \left( \int_0^1 \left| F_Y^{-1}(\tau) - F_U^{-1}(\tau) \right|^{p} d\tau \right)^{1/p} \tag{4}$$

where $Y$ is a random variable, $F_Y$ is its cumulative distribution function, and $F_Y^{-1}(\tau)$ is the inverse. The p-Wasserstein distance is the $L^{p}$ metric on the inverse CDFs, which is an extension of the Euclidean distance from point data to distribution data. When $U$ and $Y$ are two Dirac delta distributions located at $X_1$ and $X_2$ in $\mathbb{R}^{N}$, the p-Wasserstein distance becomes the Euclidean distance between $X_1$ and $X_2$. Let $\theta = (\theta_1, \theta_2, \theta_3, \cdots, \theta_N) \in \mathbb{R}^{N}$; a quantile distribution $Z_\theta$ is a uniform probability distribution supported on $\{\theta_i(s, a)\}$, which can be defined as,

$$Z_\theta(s, a) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s, a)} \tag{5}$$

where $\delta_z$ is the Dirac delta function at $z$.
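To make Equations (4) and (5) concrete, the following sketch evaluates the p-Wasserstein distance between two one-dimensional quantile distributions that each place mass 1/N on N atoms. In that special case both inverse CDFs are step functions with N equal-width steps, so the integral in Equation (4) reduces to an average over the sorted atoms. The function name `wasserstein_p` is ours, and the sketch assumes a finite p and equal numbers of atoms.

```python
import numpy as np

def wasserstein_p(theta, phi, p=1.0):
    """p-Wasserstein distance between two 1-D distributions that each put
    mass 1/N on N atoms (assumes finite p and equally many atoms)."""
    theta = np.sort(np.asarray(theta, dtype=float))
    phi = np.sort(np.asarray(phi, dtype=float))
    if theta.shape != phi.shape:
        raise ValueError("both distributions need the same number of atoms")
    # The inverse CDFs are piecewise constant on intervals of width 1/N,
    # so the integral over tau becomes a mean over sorted atoms.
    return float(np.mean(np.abs(theta - phi) ** p) ** (1.0 / p))

# Two single Dirac deltas: the distance is the distance between the points.
print(wasserstein_p([2.0], [5.0], p=2))

# Shifting every atom by 1 moves the whole distribution by 1.
print(wasserstein_p([0.0, 1.0], [1.0, 2.0], p=1))
```

Unlike the KL divergence, this quantity stays finite when the two distributions have different supports, which is the property the quantile regression approach relies on.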

The QR distributional Q learning algorithm uses $Z_\theta(s, a)$ to approximate the target distribution in the Wasserstein distance $W_1$ through quantile regression. In other words, this approach aims to estimate the quantiles of the target distribution.

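Let $\tau_i = i / N$ for each $i = 1, \ldots, N$, with $\tau_0 = 0$. For an arbitrary value distribution $Z$, we have

$$W_1(Z, Z_\theta) = \sum_{i=1}^{N} \int_{\tau_{i-1}}^{\tau_i} \left| F_Z^{-1}(\tau) - \theta_i \right| d\tau \tag{6}$$

The key observation used in [

$$\sum_{i=1}^{N} \mathbb{E}_{\hat{Z} \sim Z}\left[ \rho_{\hat{\tau}_i}(\hat{Z} - \theta_i) \right] \tag{7}$$

which are given by $\theta_i = F_Z^{-1}(\hat{\tau}_i)$, where $\hat{\tau}_i = \frac{\tau_i + \tau_{i-1}}{2}$ and $\rho_{\hat{\tau}_i}(u) = u \left( \hat{\tau}_i - \mathbb{1}_{\{u < 0\}} \right)$, $\forall u \in \mathbb{R}$. Each $\theta_i$ is the $\hat{\tau}_i$ quantile. These $\{\theta_i\}$ are the quantiles of the target distribution $Z$.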
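The relationship between the quantile loss in Equation (7) and the midpoint quantiles can be checked numerically. The sketch below is a minimal illustration, not the training procedure used in this paper: it runs plain subgradient descent on the sample version of the loss for evenly spaced samples in [0, 1], for which the true quantile at level τ is τ itself. The sample grid, learning rate, step count, and function names are all assumptions made for the example.

```python
import numpy as np

def quantile_midpoints(n):
    """tau_hat_i = (tau_{i-1} + tau_i) / 2 with tau_i = i / n."""
    return (2 * np.arange(1, n + 1) - 1) / (2 * n)

def quantile_loss(theta, tau_hat, samples):
    """Sample average of rho_tau(z - theta_i), summed over i (Equation (7))."""
    u = samples[None, :] - theta[:, None]
    rho = u * (tau_hat[:, None] - (u < 0).astype(float))
    return float(rho.mean(axis=1).sum())

def fit_quantiles(samples, n, lr=0.1, steps=2000):
    """Subgradient descent on the quantile loss (illustrative settings)."""
    tau_hat = quantile_midpoints(n)
    theta = np.full(n, samples.mean())
    for _ in range(steps):
        # d/d theta_i of E[rho_tau(Z - theta_i)] = P(Z < theta_i) - tau_hat_i
        below = (samples[None, :] < theta[:, None]).mean(axis=1)
        theta -= lr * (below - tau_hat)
    return theta, tau_hat

# Evenly spaced samples on [0, 1]: the quantile at level tau is about tau.
samples = (np.arange(1000) + 0.5) / 1000
theta, tau_hat = fit_quantiles(samples, n=4)
print(tau_hat)  # midpoints 0.125, 0.375, 0.625, 0.875
print(theta)    # fitted supports, close to the midpoints above
```

Note that the update depends only on whether each sample lies below the current estimate, not on how far away it is, which is the magnitude-independence of the gradient that motivates the Huber variant.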

The gradients of quantile regression loss are independent of the magnitude of the error, which can increase gradient variance. As an improvement, Huber loss is defined as:

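$$L_\kappa(u) = \begin{cases} \frac{1}{2} u^{2}, & \text{if } |u| \le \kappa \\ \kappa \left( |u| - \frac{1}{2} \kappa \right), & \text{otherwise} \end{cases} \tag{8}$$

And the quantile Huber loss is:

$$\rho_\tau^\kappa(u) = \left| \tau - \mathbb{1}_{\{u < 0\}} \right| L_\kappa(u) \tag{9}$$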

which is a smooth version of the quantile loss defined in

Recall in Q learning, the Bellman optimality operator is defined as (notice the expectation used in the definition):

$$T Q(s, a) = \mathbb{E}\left[ R(s, a) \right] + \gamma \, \mathbb{E}\left[ \max_{a'} Q(s', a') \right] \tag{10}$$

So the distributional version of this operator is:

$$T Z(s, a) = R(s, a) + \gamma Z(s', a^{*}) \tag{11}$$

where $a^{*} = \arg\max_{a'} \mathbb{E}_{z \sim Z(s', a')}[z]$ and $s'$ is the next state of $s$. The quantile regression Q learning from [

Quantile regression approximates the inverse cumulative distribution function $F_Z^{-1}$ with the quantile distribution defined in Equation (5). The number of quantiles $N$ determines how many equal-probability pieces the distribution is divided into. For example, 5 quantiles divide the probability mass into 5 intervals of width 0.20, with cumulative probabilities $\tau_i$ of [0.20, 0.40, 0.60, 0.80, 1.00]. The QR distributional Q learning algorithm minimizes the Wasserstein distance to the distributional Bellman target using distributional Bellman updates through quantile regression, which adjusts the support of each of the equally divided probabilities.
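A single update step of this kind can be sketched classically as follows. This is only an illustration of Equations (8), (9), and (11) on NumPy arrays, not the quantum network implementation used in this paper. The function names, the `done` flag for terminal states, the averaging over target atoms, and the example numbers are assumptions made for the sketch.

```python
import numpy as np

def huber(u, kappa=1.0):
    """Huber loss L_kappa(u), Equation (8)."""
    abs_u = np.abs(u)
    return np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))

def quantile_huber_loss(theta, target, kappa=1.0):
    """Quantile Huber loss (Equation (9)) between predicted atoms theta_i
    and target atoms; averaged over target atoms, summed over i."""
    n = len(theta)
    tau_hat = (2 * np.arange(1, n + 1) - 1) / (2 * n)  # midpoint quantile levels
    u = target[None, :] - theta[:, None]               # u[i, j] = target_j - theta_i
    weight = np.abs(tau_hat[:, None] - (u < 0).astype(float))
    return float((weight * huber(u, kappa)).mean(axis=1).sum())

def qr_target(reward, gamma, next_atoms, done=False):
    """Atoms of the distributional Bellman target, Equation (11).

    next_atoms has shape (num_actions, N): one quantile distribution
    per action in the next state. The done flag is an assumption here.
    """
    if done:
        return np.full(next_atoms.shape[1], float(reward))
    a_star = int(np.argmax(next_atoms.mean(axis=1)))  # greedy in expected return
    return reward + gamma * next_atoms[a_star]

# Hypothetical quantile outputs for two actions in the next state (N = 3).
next_atoms = np.array([[0.0, 1.0, 2.0],
                       [0.5, 0.5, 3.5]])
target = qr_target(reward=1.0, gamma=0.9, next_atoms=next_atoms)
theta = np.array([1.0, 1.5, 4.0])  # hypothetical predicted atoms for (s, a)
loss = quantile_huber_loss(theta, target)
print(target, loss)
```

The greedy action is chosen from the means of the quantile outputs, while the loss compares every predicted atom with every target atom, so the network is trained on the whole distribution even though actions are selected by expectation.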

The quantum neural networks used to implement the QR distributional Q learning algorithm are created by following the design in [

Typically the expected sum of future rewards is used to train an agent in RL. Distributional RL takes this idea one step further by computing the full distribution of the random returns. C51 fixes supports at equal intervals and finds the probability distribution from the output of the network. However, QR distributional Q learning sets probability values at equal intervals and gets the supports from the network output (middle plot in

The final action selection is still based on the average of the value distribution. However, knowing the whole distribution gives more information than simply estimating one expectation. The Q learning process reduces the gap between the predicted value and the target value. Analogously, the distributional RL process minimizes the distance between the predicted distribution and the target distribution.

In Q learning, the output of the network is the Q value for each action, but the output of the QR network in distributional RL is the quantile distribution for each action as shown in

The numerical simulations of our quantum neural networks are conducted with Strawberryfields [

The performance of our algorithms on the grid world is shown in

Two competing strategies of machine learning are the exploration of new possibilities and the exploitation of old certainties. Balancing exploration and exploitation is a fundamental issue in RL. The dilemma is between acting on what the agent already knows and taking risks to try something it has not experienced, which could potentially lead to better rewards than the known ones. Finding the correct balance between these two strategies is not easy as neither is consistently better than the other. Exploitation might be a good decision for achieving short term goals but exploration might result in a long term success.

Common strategies for exploration such as ε-greedy do not work well when deep exploration is required. Bayesian techniques that can explore a shallow environment efficiently do not do well in a deep structure, as this could lead to an exponentially larger number of trials. Inspired by the work in [_{0}. The results in this section provide further evidence that knowing the value distribution, rather than only its expected value, is beneficial.

Numerical summary of the influence of the number of quantiles on quantum QR distributional Q learning

| Number of Quantiles | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| Average Rewards | 0.89 | 0.982 | 0.954 | 0.952 | 0.954 |
| Average Episode Lengths | 14.04 | 13.44 | 12.45 | 11.63 | 11.28 |

In RL, the state-action value function Q(s, a) represents the expected return for taking action a in state s. Instead of using a scalar expected return, distributional RL algorithms compute a full distribution over these returns, which can make learning more accurate and faster than methods based only on the expectation. QR distributional Q learning employs quantile regression to minimize the Wasserstein distance between the target distribution and the predicted distribution by adjusting the supports of the predicted distribution. The output of the QR network is a set of supports that form the core of the quantile distribution definition.

This study is concerned with quantum RL. Research has shown that the quantum approach to machine learning can result in improved performance. The work covered in this report examines the implementation and performance of QR distributional Q learning on quantum computers. Learning the full distribution over returns rather than their expectation is the main idea of distributional RL. Therefore, our aim is to evaluate the features of quantum QR distributional Q learning by testing its ability to collect rewards in a grid world with different numbers of quantiles and then comparing its capability of exploring an MDP chain environment with that of standard quantum Q learning. Our findings demonstrate that quantum QR distributional Q learning can explore the environment more efficiently than quantum Q learning.

The authors declare no conflicts of interest regarding the publication of this paper.

Hu, W. and Hu, J. (2019) Distributional Reinforcement Learning with Quantum Neural Networks. Intelligent Control and Automation, 10, 63-78. https://doi.org/10.4236/ica.2019.102004