
Applying quantum computing techniques to machine learning has attracted widespread attention recently and quantum machine learning has become a hot research topic. There are three major categories of machine learning: supervised, unsupervised, and reinforcement learning (RL). However, quantum RL has made the least progress when compared to the other two areas. In this study, we implement the well-known RL algorithm Q learning with a quantum neural network and evaluate it in the grid world environment. RL is learning through interactions with the environment, with the aim of discovering a strategy to maximize the expected cumulative rewards. Problems in RL bring in unique challenges to the study with their sequential nature of learning, potentially long delayed reward signals, and large or infinite size of state and action spaces. This study extends our previous work on solving the contextual bandit problem using a quantum neural network, where the reward signals are immediate after each action.

Great success has been achieved in artificial intelligence, machine learning, deep learning, and reinforcement learning (RL). Deep learning based on neural networks has demonstrated its power in supervised learning problems such as computer vision and machine translation. On the other hand, when applied to RL, it makes remarkable results like AlphaGo possible. The goal of RL is to teach an agent how to act from a given state in an unknown environment.

Unlike the one-step decisions in supervised learning, decision making in RL is sequential: the agent takes an action, receives a reward and the next state, and then acts upon that state. The purpose of RL is for an agent to learn a strategy or policy that will obtain the maximum long-term cumulative rewards. In general, a policy is a distribution over actions given the states, which the agent uses to determine the next action based on the current state. The major RL algorithms are value-based methods, policy gradient methods, or a combination of both [1 - 3].

The development of machine learning today depends on three pillars: new algorithms, big data, and more powerful computers. To push current machine learning to a higher level of achievement, applying quantum computing to machine learning is an obvious and natural choice. A classical computer processes a classical bit of 0 or 1, while a digital quantum computer processes a qubit that can be in both states of 0 and 1 due to superposition. Furthermore, a continuous variable quantum computer can process a quantum state (one qumode) that can represent a complex number x + ip, where x is a superposition over all possible positions of a quantum particle and p is its superposed momentum [

When a quantum computer is used to solve a classical machine learning task, the classical data set typically has to be encoded into quantum states. The quantum computer then processes the quantum states, and the result of the quantum computation is read out by measuring the quantum system. Quantum computers can easily process quantum states that correspond to vectors in very high-dimensional or even infinite-dimensional vector spaces, a setting in which classical machine learning typically runs into the curse of dimensionality. With the current state of quantum computers, hybrid quantum-classical models are very popular. One well-known method is the variational approach, which designs quantum circuits with free parameters that can be optimized by both classical and quantum devices for a given machine learning objective. Therefore, the ability to compute gradients of variational quantum circuits is essential for this technique [

In the domain of quantum machine learning, RL has received relatively less attention, considering the quantum enhancements in supervised and unsupervised learning [6 - 18]. In our previous work [

Because of the impressive performance of deep learning, creating neural networks on quantum computers has been a long-time effort [13, 20, 21]. The work in [

The nonlinearity of classical neural networks, realized by a nonlinear activation function in each layer, plays a key role in their success. In photonic quantum networks, this nonlinearity is accomplished with non-Gaussian photonic gates such as the cubic phase gate and the Kerr gate. The common representation $f(Wx + b)$ of a classical layer, where $f$ is the nonlinear activation function, $W$ is the weight matrix and $b$ is the bias, can be emulated in the CV model by the layer of quantum gates $\phi \circ D \circ U_2 \circ S \circ U_1 |x\rangle$, where $D$ is an array of displacement gates, $U_i$ are interferometers, $S$ are squeezing gates, and $\phi$ are non-Gaussian gates that provide the nonlinearity.

One quantum advantage of this type of quantum network is that, for certain problems, a classical neural network would need exponentially many resources to approximate the quantum network. It is shown in [

RL is characterized by the agent, the environment and their interaction. Therefore, each RL algorithm needs to be tested in certain environments. In this report, we use the grid world environment, which is a commonly adopted benchmark in RL. Compared with the contextual bandit problem studied in [

The agent in this environment has four possible moves: up, down, left and right. The standard Frozen Lake environment has an option to make moves slippery. If this option is enabled, any action can result in a random move, because the agent slips in different directions on the frozen surface. The episode ends when the agent reaches the goal or falls into the hole. The agent receives a reward of 1 if it reaches the goal, and zero otherwise. There is a clear contrast between the contextual bandit problem [

The aim of RL is to learn sequences of actions that will help an agent gain the maximum long-term rewards. Thus, the end product of RL is a strategy that tells an agent how to act from a particular state, and this strategy is usually named the policy in RL. The policy can be deterministic or stochastic. In the latter case, the policy produces a probability distribution over all the possible actions. In general, a policy is a mapping from states to actions.

The interaction of an agent and an environment in a typical RL problem can be formulated as a Markov decision process (MDP) [

A discrete, finite-state MDP is defined by a tuple $M = (S, A, p, R, \gamma)$, specifying a set of states $S$ and actions $A$, a transition probability $p$, a reward function $R$, and a discount factor $\gamma \in [0, 1]$. The Markov property of an MDP indicates that future state transitions depend only on the present state and not on the whole history of events that precede it. The MDP describes how the environment works and the policy controls how the agent behaves. The hard part of RL is to learn how a current action will affect future rewards, whose discounted sum is commonly called the return.

The return is defined as the total discounted sum of the rewards received:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \tag{1}$$

Equation (1) describes a discounted return, where $0 \le \gamma \le 1$ is the discount factor. For $\gamma < 1$, it makes immediate rewards more valuable than those in the future.
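As an illustration of Equation (1), the short sketch below computes the discounted return of a finite reward sequence. The function name `discounted_return` and the sample rewards are hypothetical and chosen only for this example.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: three steps, with the only reward on the last step.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
```

With a reward of 1 arriving only on the third step, the return seen from the first step is scaled by $\gamma^2$, which shows how a delayed reward is worth less than an immediate one.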

Another common concept in RL is the Q function that can be formulated as below:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]$$

where the expectation is taken over potentially random actions, transitions, and rewards. The randomness comes from the fact that the policy can be stochastic, the transition from one state to another can be probabilistic, and the rewards returned by the environment can be varied, delayed or affected by unknown factors. The state-action value function $Q^{\pi}(s, a)$ under a policy $\pi$ is the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$. The optimal state-action value function is defined as $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. Given an MDP, an agent can learn an optimal policy by simulating interactions with the model. In this setting, policy iteration and value iteration algorithms can find an optimal policy, because the rewards and transition probability are known and there is no need to explore the environment [
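When the model is known, the optimal Q function can be computed by repeatedly applying the optimality backup, as in value iteration. The following is a minimal sketch for a non-slippery 2 × 3 grid world. The layout (start in state 0, hole in state 4, goal in state 5), the rule that a move off the grid leaves the state unchanged, and the value of `gamma` are all assumptions made for this example and are not taken from the experiments reported here.

```python
N_ROWS, N_COLS = 2, 3
START, HOLE, GOAL = 0, 4, 5  # hypothetical layout, row-major state indices
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


def step(state, action):
    """Deterministic (non-slippery) transition: (next_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = ACTIONS[action]
    # Assumption: a move that would leave the grid keeps the agent in place.
    new_row = min(max(row + d_row, 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = new_row * N_COLS + new_col
    reward = 1.0 if next_state == GOAL else 0.0
    done = next_state in (HOLE, GOAL)
    return next_state, reward, done


def value_iteration(gamma=0.9, tol=1e-9):
    """Iterate Q(s, a) = r + gamma * max_a' Q(s', a') until it stops changing."""
    n_states = N_ROWS * N_COLS
    q = [[0.0] * len(ACTIONS) for _ in range(n_states)]
    while True:
        delta = 0.0
        for s in range(n_states):
            if s in (HOLE, GOAL):
                continue  # terminal states keep a value of zero
            for a in ACTIONS:
                s_next, r, done = step(s, a)
                target = r if done else r + gamma * max(q[s_next])
                delta = max(delta, abs(target - q[s][a]))
                q[s][a] = target
        if delta < tol:
            return q


q_star = value_iteration()
# Greedy action in each non-terminal state under the computed Q values.
policy = {s: max(ACTIONS, key=lambda a: q_star[s][a])
          for s in range(N_ROWS * N_COLS) if s not in (HOLE, GOAL)}
print(policy)
```

Under this assumed layout, the greedy policy moves right twice and then down, which reaches the goal in three steps.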

In general, there are two main approaches to RL: 1) model-based, which learns a policy using a given model, or learns a model of the world from experience and then plans with that learned model to discover a policy, and 2) model-free, which learns a policy directly from experience, or learns a value function and derives a policy from it indirectly. The policy gradient approach tries to learn a good policy directly, while Q learning first learns the value of $Q(s, a)$ and then deduces an optimal policy from this Q function. Q learning can learn the optimal policy even when actions are selected by a different policy, including a random policy.

Q learning [24, 25] is a model-free RL method that estimates the long-term expected return of executing an action $a$ in state $s$. Without a model or in an unknown environment, the agent has to learn from experience, using an iterative process. The estimated returns, known as Q values, can be learned iteratively and the updating rule of Q values can be described as:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{2}$$

where the max is taken over all the possible actions in state $s_{t+1}$, $\gamma \in [0, 1)$ is the discount factor, and $\alpha \in (0, 1]$ is the learning rate. The updating formula in Equation (2) suggests that the Q function is formed by following a greedy policy. To address the challenge of delayed rewards in many RL problems, Equation (2) gives credit to past actions by propagating the final reward backward with a discount factor. Equation (2) is derived from the Bellman optimality equation, which is a key tool in RL. In a simple problem where the state space is small and every state is known, the Q function can be represented as a table with states as rows and actions as columns. However, in a problem with a large state space or unknown states, deep neural networks have to be used instead, in which case the loss function is the mean squared error between the predicted Q value and the TD-target Q value.
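For the tabular case, one application of the updating rule can be sketched as follows. The function name `q_update`, the table size of 6 states by 4 actions, and the treatment of terminal states (no future value after the episode ends) are assumptions made for this illustration.

```python
def q_update(q, s, a, reward, s_next, done, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the table in place; return the TD-error."""
    # Assumption: a terminal next state contributes no future value.
    best_next = 0.0 if done else max(q[s_next])
    td_target = reward + gamma * best_next
    td_error = td_target - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


# Q table with states as rows and actions as columns, initialized to zero.
q_table = [[0.0] * 4 for _ in range(6)]

# Hypothetical transitions: the step into the goal is rewarded first,
# then the step leading to that state picks up a discounted share.
q_update(q_table, s=2, a=1, reward=1.0, s_next=5, done=True)
q_update(q_table, s=1, a=3, reward=0.0, s_next=2, done=False)
print(q_table[2][1], q_table[1][3])
```

The second call receives no reward itself, yet its Q value becomes positive because the maximum Q value of the next state is already nonzero. This is how credit for a delayed reward travels back along the path.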

Mathematically, $Q(s, a)$ is a combination of the immediate reward with all future rewards. The learned state-action value function $Q$ is an approximation to the optimal value function $Q^{*}$, which can be used to find the optimal policy, independent of the policy being employed in the learning process. The goal of RL is to choose the best action for any given state, which means the actions have to be ranked, and the Q values can be used for this purpose. In this sense, Q learning is an iterative process of learning how to rank the actions in a given state. Naturally, there is a tension between the exploitation of known rewards and exploration to discover new actions that could lead to better rewards. The quantity $R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)$ in Equation (2) is usually named the TD-target, and the difference between the TD-target and $Q(s_t, a_t)$ is the TD-error. The TD-target combines the immediate reward with the discounted maximum Q value of the next state, while $Q(s_t, a_t)$ is the predicted Q value for the current state.
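The tension between exploitation and exploration is often handled with an epsilon-greedy rule. The text does not state which exploration scheme was used in the experiments, so the sketch below is only one common choice, and the name `epsilon_greedy` and its parameters are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    best = max(q_values)
    # Break ties at random so equal Q values do not always favor action 0.
    return rng.choice([a for a, v in enumerate(q_values) if v == best])


# With epsilon = 0 the rule is purely greedy and ranks actions by Q value.
print(epsilon_greedy([0.1, 0.5, 0.2, 0.0], epsilon=0.0))
```

A larger `epsilon` early in training favors discovering new actions, and reducing it over time shifts the agent toward exploiting the Q values it has learned.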

However, it is not straightforward to apply Q learning to a continuous action space. When the environment has an infinite number of actions, it is more efficient to use policy-based methods than value-based ones. For example, in Q learning, the updating rule (Equation (2)) needs to compute the quantity $\max_{a} Q(s_{t+1}, a)$, which requires a full scan of the action space.

For a problem with a small number of states and actions, tables can represent the Q function efficiently, but they do not scale well to large state and action spaces. In this case, neural networks can be employed to approximate the Q function with $Q(s, a; \theta)$ parameterized by $\theta$ (

The objective of training this neural network is to tune the values of the parameter $\theta$ to reduce the gap between the TD-target and the Q value, which is called the TD-error. This method highlights the fact that the implementation of Q learning makes use of supervised learning techniques, and the training of the quantum network is done by both quantum and classical devices with help from TensorFlow and Strawberry Fields simulation [

The training of this network is similar to the backpropagation of error in supervised learning. However, supervised learning provides ground-truth labels, whereas RL can only use the TD-target, which can be varied, delayed or affected by unknown variables. In a sense, it is a moving target.
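The idea of treating the TD-target as a label can be sketched with a purely classical stand-in: a single linear layer acting on a one-hot state. This is not the photonic quantum network used in this study. The names, the one-hot encoding, the sizes (6 states, 4 actions), and the factor of 0.5 in the loss are assumptions made for the illustration.

```python
def one_hot(state, n_states):
    """Encode an integer state as a one-hot list."""
    return [1.0 if i == state else 0.0 for i in range(n_states)]


def predict(weights, x):
    """Q values for all actions; weights has shape n_actions x n_states."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def td_step(weights, s, a, reward, s_next, done, n_states, lr=0.1, gamma=0.9):
    """One gradient step on 0.5 * (td_target - Q(s, a))**2; returns the loss."""
    x = one_hot(s, n_states)
    q_sa = predict(weights, x)[a]
    q_next = predict(weights, one_hot(s_next, n_states))
    # The target is treated as a fixed label even though it depends on the
    # same weights, which is why it behaves like a moving target.
    td_target = reward if done else reward + gamma * max(q_next)
    td_error = td_target - q_sa
    # Gradient of the loss w.r.t. the chosen action's row is -td_error * x.
    for i, xi in enumerate(x):
        weights[a][i] += lr * td_error * xi
    return 0.5 * td_error ** 2


w = [[0.0] * 6 for _ in range(4)]
# Repeating the same hypothetical transition shrinks the loss step by step.
losses = [td_step(w, s=2, a=1, reward=1.0, s_next=5, done=True, n_states=6)
          for _ in range(3)]
print(losses)
```

With a one-hot input, this linear layer is equivalent to a Q table, so the sketch only shows the training signal. A network with shared parameters would generalize across states, at the cost of the target shifting whenever the parameters change.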

In the case of our grid world example, we use a one-layer network which takes the state encoded as an integer (0 - 5), and produces a vector of 4 Q values, one for each action (

The grid world environment is widely used to evaluate RL algorithms. Our quantum Q learning is evaluated in this environment, which is explained in Section 3.1. The aim of Q learning in this environment of size 2 × 3 is to discover a strategy that controls the behavior of an agent and tells it how to act from a particular state. We run the algorithm in a series of experiments and then take the average of the rewards and path lengths in each case, demonstrating that our quantum network is able to solve the grid world problem. In the non-slippery grid world, the optimal path length is 3, while in the slippery case, it takes more than 3 steps to arrive at the goal state G. One episode is defined as a sequence of moves from the start state S to a terminal state, which can be either the hole H or the goal G in our grid world. Our results indicate that after about 5 episodes, the quantum network is able to learn how to reach the goal state G in three steps for the non-slippery case, but the learning in the slippery case is much harder (

In RL, there are model-based and model-free algorithms. A model-based algorithm assumes that the agent has access to a model of the environment and learns from it, while a model-free algorithm assumes that the agent has no knowledge of a model and therefore must learn from direct experience with the environment. The Q learning that we use in this study is a model-free algorithm.

Neural networks are widely used in deep learning, where they have proven highly effective in supervised learning, and they can also be used in RL. As a well-known RL algorithm, Q learning learns a Q function. The value of $Q(s, a)$ shows how favorable action $a$ in state $s$ is, measured by the long-term return. The updating rule of Q learning (Equation (2)) suggests that the value of an action depends on both the immediate reward and future rewards. In challenging RL problems, there are many states and actions, which makes it hard to store the Q values in a table and inefficient to produce and update them there, so a neural network is used to approximate the Q function.

Applying quantum computing to classical machine learning is increasingly popular. In this report, we create a photonic quantum neural network to implement Q learning and test it in the grid world environment. The grid world problem is a more difficult task than the contextual bandit problem in [

The interest in developing quantum machine learning algorithms has been on the rise in recent years and their potential needs to be further explored. In this regard, our work provides another example of how to implement a classical RL algorithm using a quantum device, which extends our previous work on solving the contextual bandit problem using a quantum network.

The authors declare no conflicts of interest regarding the publication of this paper.