Double BP Q-Learning Algorithm for Local Path Planning of Mobile Robot

Aiming at the curse of dimensionality caused by the growth of state information, the poor generalization ability of the model, and the deadlock problem in special obstacles environments during local path planning of a mobile robot, this paper proposed a Double BP Q-learning algorithm based on the fusion of the Double Q-learning algorithm and a BP neural network. To solve the curse of dimensionality, two BP neural networks with the same structure were used to fit the value functions in place of the two Q-value tables of the Double Q-learning algorithm, overcoming the inability of a Q-value table to store excessive state information. By adding a priority experience replay mechanism and using parameter transfer to initialize the model parameters in different environments, the convergence of the algorithm was accelerated and both the learning efficiency and the generalization ability of the model were improved. By designing a specific action selection strategy for special environments, the deadlock state could be avoided and the mobile robot could reach the target point. Finally, the designed Double BP Q-learning algorithm was simulated and verified, and the probability of the mobile robot reaching the target point during the parameter update process was compared with that of the Double Q-learning algorithm under the same planned path length. The results showed that the model trained by the improved Double BP Q-learning algorithm had a higher success rate in finding the optimal or sub-optimal path in dense discrete environments, stronger model generalization ability and fewer redundant path sections, and could reach the target point without entering the deadlock zone in special obstacles environments.


Introduction
Path planning is the core part of the mobile robot control system and the key to autonomous navigation technology, and local path planning is one of the important directions of path planning research. In an unknown environment, the mobile robot detects and collects map information through the real-time state information acquired by its sensors, updates the environment model in real time, and plans a collision-free path from the start point to the target point [1].
The quality of the path planning result determines whether the mobile robot can complete its task efficiently and accurately [2]. When the pose information of the mobile robot is known, a short running path and high smoothness can be taken as the optimization objectives [3]. At present, the commonly used local path planning algorithms include the artificial potential field method [4], fuzzy control [5], neural networks [6] and reinforcement learning [7]. The artificial potential field method has good real-time performance in path planning, but the mobile robot is affected by the attractive and repulsive potential field functions, which easily leads to local minimum points, oscillation or stagnation [8]; additional constraints can be added to the function model to eliminate local minimum points [9]. The fuzzy control method summarizes complete control rules and achieves local path planning through fuzzy reasoning and fuzzy judgment, but simple fuzzification of information reduces the control precision [10]; it can be combined with a neural network to form a fuzzy neural network that outputs a higher-precision model [11]. Reinforcement learning lets the mobile robot interact with the environment and accumulate rewards to optimize its strategy [12]. Literature [13] proposed using deep reinforcement learning for path planning in an unknown environment, but the input state did not contain enough feature information and the model generalization was poor. Literature [14] abandoned the Q-value table and used a neural network to fit the value function; although the input contains enough feature information, there is no clear division of the surrounding obstacles, so the method easily falls into a local minimum state and the path planning efficiency is low. Literature [15] used sensors to explore obstacle information and fed the returned obstacle information and actions into a neural network; although no local minimum is generated, the information of the target point is not considered, so path redundancy easily occurs. The application of reinforcement learning to local path planning requires multi-step decisions, and most tasks are correlated, so the learning of strategies in different scenarios can be accelerated through parameter transfer learning [16]. Transfer learning is a machine learning method that uses a model developed for one task as the starting point of the model for a second task [17]. In reinforcement learning, it mainly transfers from a single source task to a target task within a fixed domain [18]. Literature [19] proposed case-based reasoning (CBR) as a heuristic method to accelerate the Q-learning process through transfer learning.

Design of State Space and Action Space
The appropriate state space and action space are designed as the input and output of the neural network. The information about the surrounding environment is returned by the sensors.
Taking the mobile robot as the origin of coordinates and its current running direction as the Y-axis, the coordinate system of the mobile robot is established, as shown in Figure 1. Omnidirectional radar sensors are mounted on the mobile robot [22], and the sector areas in the figure represent the detection angle ranges of the sensors. The detection angle range of 20˚ - 160˚ returned by the radar sensors is discretized into three directions λ_1, λ_2 and λ_3, with angle ranges of 50˚, 40˚ and 50˚, respectively. d_1, d_2 and d_3 are the distances to the nearest obstacle returned from the directions λ_1, λ_2 and λ_3, and the detection distance is discretized into a maximum of 3 steps, so each d_i takes a discrete value. D_g is the distance between the mobile robot and the target point, and θ_g is the angle between the running direction of the mobile robot and the target point. The pose information of the robot thus consists of the obstacle distances d_i returned by the three directional sensors together with D_g and θ_g, so the state space s_j of the mobile robot is defined as

s_j = {d_1, d_2, d_3, D_g, θ_g}

With (x_r, y_r) the position of the mobile robot at the current moment and (x_g, y_g) the position of the target point, the distance to the target point is calculated as

D_g = √((x_g − x_r)² + (y_g − y_r)²)

and θ_g is the angle between the running direction of the robot and the line connecting (x_r, y_r) to (x_g, y_g).
The action space of the robot is represented by a = {a_1, a_2, a_3}. Taking the current running direction of the mobile robot as the reference direction, the three actions are defined as: a_1: move one step 45˚ to the left front; a_2: move one step forward; a_3: move one step 45˚ to the right front.
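As a concrete illustration (not taken from the paper), the following Python sketch shows one way the state vector s_j and the three discrete actions could be encoded. The grid step length, the distance-discretization thresholds, and the function and variable names are assumptions made for illustration only.

```python
import math

# Hypothetical encoding of the state s_j = {d1, d2, d3, Dg, theta_g} and of the
# action space a = {a1, a2, a3}; step length and thresholds are illustrative only.
ACTIONS = {0: -45.0, 1: 0.0, 2: 45.0}  # a1: 45 deg left-front, a2: forward, a3: 45 deg right-front

def discretize_obstacle_distance(dist, step=1.0, max_steps=3):
    """Discretize a raw sensor distance into at most 3 steps (assumed scheme)."""
    return min(max_steps, max(1, math.ceil(dist / step)))

def build_state(robot_xy, heading_deg, target_xy, sensor_dists):
    """Assemble s_j = (d1, d2, d3, Dg, theta_g) relative to the robot's heading."""
    xr, yr = robot_xy
    xg, yg = target_xy
    d1, d2, d3 = (discretize_obstacle_distance(d) for d in sensor_dists)
    dg = math.hypot(xg - xr, yg - yr)                     # distance to the target point
    bearing = math.degrees(math.atan2(xg - xr, yg - yr))  # target bearing w.r.t. +Y axis
    theta_g = bearing - heading_deg                       # angle between heading and target
    return (d1, d2, d3, dg, theta_g)
```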

Design of Double BP Q-Learning Neural Network Structure
The Double BP Q-learning algorithm is used to learn the obstacle avoidance strategy and plan the path. The state of the mobile robot is taken as the input of the network, and the network outputs the action to be performed. The Double BP Q-learning algorithm contains two BP neural networks with the same structure, the estimation network BP_eval and the target network BP_target; the input of each network is the state space s_j of the robot. An excessive range of input data affects the accuracy of gradient descent in a neural network [24].
Therefore, the variables in the state s_j of the mobile robot are normalized before being fed into the network. Let ω_ij and b_ij denote the weight and bias from the i-th input layer node to the j-th hidden layer node; the input of the j-th hidden layer node is then

h_j = Σ_i (ω_ij s_i + b_ij)

Let φ_ij and c_ij denote the weight and bias from the i-th hidden layer node to the j-th output layer node; the input of the j-th output layer node is

o_j = Σ_i (φ_ij h_i + c_ij)

The parameters ω of the estimation network BP_eval and ω⁻ of the target network BP_target comprise the weights and biases of each layer. BP_eval updates ω through training, BP_target keeps its parameters fixed (the network structure is shown in Figure 2), and the algorithm copies ω to ω⁻ regularly. The Memory part is the memory bank that stores the samples collected during learning, and r_t represents the reward or punishment value received when the mobile robot reaches the next state. BP_eval outputs three estimated Q values, Q(e_a_1), Q(e_a_2) and Q(e_a_3), and BP_target outputs three target Q values, Q(t_a_1), Q(t_a_2) and Q(t_a_3). BP_eval outputs Q(e_a_t) corresponding to the executed action a_t, and BP_target computes Q(t_a_s) for the action a_s that has the maximum Q value output by BP_eval. The loss function is calculated as

Loss = [r_t + γ Q(t_a_s) − Q(e_a_t)]²

where γ is the discount factor. The gradient descent algorithm is used to update the network parameters, and the network update block diagram is shown in Figure 3.
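For illustration, a minimal numpy sketch of the two BP networks and the Double Q-learning target is given below. The layer sizes, the tanh activation, and all class and function names are assumptions, not details from the paper.

```python
import numpy as np

class BPNet:
    """Minimal two-layer BP network: 5 state inputs -> hidden layer -> 3 action Q values.
    The hidden size and tanh activation are illustrative assumptions."""
    def __init__(self, n_in=5, n_hidden=16, n_out=3, rng=np.random.default_rng(0)):
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden));  self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_out)); self.b2 = np.zeros(n_out)

    def forward(self, s):
        s = np.asarray(s, dtype=float)
        self.h = np.tanh(s @ self.W1 + self.b1)   # hidden layer output
        return self.h @ self.W2 + self.b2          # Q(s, a1), Q(s, a2), Q(s, a3)

    def copy_from(self, other):
        """Copy omega -> omega^- (used for the periodic target-network update)."""
        self.W1, self.b1 = other.W1.copy(), other.b1.copy()
        self.W2, self.b2 = other.W2.copy(), other.b2.copy()

def double_q_target(bp_eval, bp_target, r, s_next, gamma=0.9, done=False):
    """r + gamma * Q_target(s', argmax_a Q_eval(s', a)); r alone at a terminal state."""
    if done:
        return r
    a_s = int(np.argmax(bp_eval.forward(s_next)))      # action chosen by BP_eval
    return r + gamma * bp_target.forward(s_next)[a_s]  # evaluated by BP_target
```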

Double BP Q-Learning Algorithm
A neural network is used to fit the value function to solve the curse of dimensionality. Aiming at the problems of slow learning speed and poor model generalization ability, the reward and punishment function and the action selection strategy are redesigned, and the mechanisms of priority experience replay and transfer learning are added, so that the mobile robot can overcome the problem of sparse rewards and learn from experience faster.

Design of Reward and Punishment Functions
The design of the reward and punishment function determines how good or bad the actions taken by the mobile robot in a given state are [25]. Sparse rewards have always been a problem that affects the convergence of reinforcement learning [26].

A continuous reward and punishment function is designed to solve the problem of sparse rewards. It is defined as

r_t = r_tar, if the target point is reached
r_t = r_obs, if an obstacle is hit                                   (8)
r_t = μ(D_before − D_now) + η(d_now − d_before), otherwise

where r_tar represents the reward for reaching the target point; r_obs represents the punishment for colliding with an obstacle; D_before represents the distance between the mobile robot and the target point before the action is performed; D_now represents the distance between the mobile robot and the target point after the action is performed; d_now represents the shortest distance between the mobile robot and an obstacle in the current state; d_before represents the shortest distance between the mobile robot and an obstacle before the action is performed; μ and η represent normalized discount factors.
After the mobile robot performs an action, if it gets closer to the target point or farther from the obstacle, it is rewarded according to Formula (8); if it gets farther from the target point or closer to the obstacle, it is punished.
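A minimal Python sketch of this continuous reward is shown below; the numerical values of r_tar, r_obs, μ and η are illustrative assumptions and are not taken from the paper.

```python
def reward(reached_target, hit_obstacle,
           D_before, D_now, d_before, d_now,
           r_tar=10.0, r_obs=-10.0, mu=0.5, eta=0.5):
    """Continuous reward: positive when moving toward the target or away from
    obstacles, negative otherwise. The constants are illustrative assumptions."""
    if reached_target:
        return r_tar
    if hit_obstacle:
        return r_obs
    # closer to target (D_before > D_now) and farther from obstacle (d_now > d_before)
    return mu * (D_before - D_now) + eta * (d_now - d_before)
```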

Design of Dynamic ε-Greedy Strategy
To improve the exploration-exploitation balance of the ε-greedy strategy, a dynamic ε-greedy strategy is designed for the Double BP Q-learning algorithm. The ε-greedy strategy balances exploration and exploitation: in essence, actions are explored and selected randomly with probability ε, and the existing experience is exploited with probability 1 − ε [27].
Given an initial value of the exploitation rate ε_0 and a peak value ε_end, as the number of learning rounds increases the exploration rate ε_t decays toward zero and then remains unchanged, while the exploitation rate ε_r increases from the initial value ε_0 to the peak value ε_end and then remains unchanged. The exploitation rate is computed as

ε_r = ε_0 + ε_i · t_step,  t_step < step_max
ε_r = ε_end,               t_step ≥ step_max

where ε_i represents the increase factor of the rate, t_step represents the number of learning rounds, and step_max represents the maximum number of running rounds.
The change of exploration-exploitation rate is shown in Figure 4.
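A sketch of how such a dynamic ε-greedy action selection might look in Python is given below; the schedule constants and function names are placeholders, not the paper's values.

```python
import numpy as np

def exploitation_rate(t_step, eps0=0.1, eps_end=0.9, eps_i=0.01, step_max=200):
    """Exploitation rate rises linearly from eps0 to eps_end, then stays flat."""
    return min(eps_end, eps0 + eps_i * t_step) if t_step < step_max else eps_end

def select_action(q_values, t_step, rng=np.random.default_rng()):
    """Exploit (greedy) with probability eps_r, otherwise explore a random action."""
    eps_r = exploitation_rate(t_step)
    if rng.random() < eps_r:
        return int(np.argmax(q_values))      # exploit learned Q values
    return int(rng.integers(len(q_values)))  # explore
```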

The Mechanism of Priority Experience Replay
Although random sampling breaks the correlation between samples, it ignores the importance of individual experiences. The mechanism of priority experience replay is therefore introduced, storing sample priorities in the binary tree structure SumTree; high-priority samples are selected first to speed up learning [28].
The Double BP Q-learning algorithm collects samples during training, each consisting of the state s_t, the action a_t, the reward r_t and the next state s_{t+1}. The samples (s_t, a_t, r_t, s_{t+1}) are stored in the experience pool. The TD-error, i.e. the difference between the Q value output by the estimation network and the Q value computed from the target network, is used to define the sample priority: the larger the TD-error, the higher the corresponding priority [29]. The probability of extracting sample i is defined as

P(i) = p_i^α / Σ_k p_k^α

where α determines how much the priority is taken into account; if α is 0, the sampling is uniform. The priority is taken as

p_i = 1 / rank(i)

where rank(i) is the rank of sample i when the samples are sorted by TD-error.
The higher the priority of a sample, the higher the probability that it is extracted, while low-priority samples are still guaranteed a certain probability of being selected.
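The following Python sketch illustrates rank-based prioritized sampling consistent with the formulas above. For clarity it uses a simple sorted list rather than a SumTree, and the capacity, α value and class name are assumptions.

```python
import numpy as np

class RankBasedReplay:
    """Illustrative prioritized replay: P(i) ∝ (1/rank(i))**alpha, ranked by |TD-error|."""
    def __init__(self, capacity=2000, alpha=0.6, rng=np.random.default_rng(0)):
        self.capacity, self.alpha, self.rng = capacity, alpha, rng
        self.buffer = []  # list of (|TD-error|, transition) pairs

    def store(self, transition, td_error):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop()                      # drop the lowest-priority sample
        self.buffer.append((abs(td_error), transition))
        self.buffer.sort(key=lambda x: -x[0])      # keep sorted by |TD-error|

    def sample(self, batch_size=32):
        n = len(self.buffer)
        p = (1.0 / np.arange(1, n + 1)) ** self.alpha  # p_i = (1/rank(i))^alpha
        p /= p.sum()                                    # P(i) = p_i^a / sum_k p_k^a
        idx = self.rng.choice(n, size=min(batch_size, n), replace=False, p=p)
        return [self.buffer[i][1] for i in idx]
```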

Transfer Learning
Most tasks in local path planning by reinforcement learning are correlated, so the parameters of related tasks are initialized by parameter transfer across different map environments to speed up strategy learning in different scenarios [30]. First, the pre-trained model is loaded to obtain all of its parameters. Using hierarchical Bayes, the model parameters ω_s and ω_t to be trained are divided into two parts. One part is common to all tasks, such as the parameters ω_i of the function of moving toward the target point; the other part is unique to each model, such as the dense discrete obstacles avoidance strategy v_s and the special obstacles avoidance rules v_t. The transfer learning framework of this paper is as follows: the model parameters ω_i for approaching the target point are obtained by training from random initialization, and are used to initialize the model parameters ω_i' of the sparse discrete obstacles environment and the model parameters ω_t of the special obstacles environment. Then, combining the special-state operation rules set in Table 1 with the direction and distance information between the mobile robot and the target point, the obstacle avoidance strategy is learned. Finally, ω_i' is used to initialize the model parameters ω_s of the dense discrete obstacles environment to perfect the obstacle avoidance rules. The structure diagram of transfer learning for local path planning is shown in Figure 5.
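As a rough illustration of this kind of parameter transfer (not the paper's exact implementation), the sketch below copies the weights of a pre-trained toward-target network into a new network before further training. It reuses the hypothetical BPNet class from the earlier sketch; train_toward_target is likewise a hypothetical helper.

```python
def transfer_parameters(pretrained_net, new_net):
    """Initialize the new environment's network from the toward-target model
    (omega_i -> omega_i' or omega_t); training then continues in the new map."""
    new_net.copy_from(pretrained_net)  # copy_from as sketched in the BPNet class above
    return new_net

# Illustrative usage (the training helpers and environments are hypothetical):
# toward_target_net = train_toward_target(BPNet())                # learn omega_i
# sparse_net  = transfer_parameters(toward_target_net, BPNet())   # omega_i -> omega_i'
# special_net = transfer_parameters(toward_target_net, BPNet())   # omega_i -> omega_t
```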

Learning Process of Double BP Q-Learning Algorithm
The mobile robot is trained with the Double BP Q-learning neural network and the strategies designed above, and the obtained model is used to plan the local path.
The learning process of the algorithm is as follows.
Step 1. A standard blank map Q_Map is generated, and the number and positions of the starting point, target point and obstacles are set randomly. The initial direction is the direction of the line connecting the mobile robot and the target point.
Step 2. Initialize the parameters of the Double BP Q-learning algorithm.
Step 3. The state s_t is obtained after preprocessing the map environment information returned by the sensors.
Step 4. The neural network parameters are initialized randomly in the initial training; otherwise, they are initialized by parameter transfer.
Step 5. The dynamic ε-greedy strategy is used to select the action a_t.
Step 6. The mobile robot performs the action a_t and obtains the reward r_t and the next state s_{t+1}.
Step 7. Store {s_t, a_t, r_t, s_{t+1}} in the Memory bank as a replay unit.
Step 8. Extract a small batch of experience for learning.
Step 9. If the next state is a termination state, the target Q value is r_t; otherwise, it is calculated through the target network as

Q = r_t + γ Q(t_a_s)

where a_s is the action with the maximum Q value output by BP_eval for the state s_{t+1}.
Step 10. Update the estimation network parameters ω.
Step 11. Update the target network parameters ω⁻ by copying ω to ω⁻ every C steps.
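Putting the pieces together, a condensed and heavily simplified Python sketch of this learning loop is shown below. The environment interface, episode count and update period C are assumptions, and the loop reuses the hypothetical helpers defined in the earlier sketches.

```python
def train(env, bp_eval, bp_target, memory, episodes=500, gamma=0.9, copy_every=50):
    """Simplified Double BP Q-learning loop (Steps 2-11); env, BPNet, memory,
    select_action and double_q_target are the hypothetical pieces sketched above."""
    step = 0
    for ep in range(episodes):
        s = env.reset()                                # Step 3: initial state
        done = False
        while not done:
            a = select_action(bp_eval.forward(s), ep)  # Step 5: dynamic epsilon-greedy
            s_next, r, done = env.step(a)              # Step 6: execute action
            td = double_q_target(bp_eval, bp_target, r, s_next, gamma, done) \
                 - bp_eval.forward(s)[a]
            memory.store((s, a, r, s_next, done), td)  # Step 7: prioritized storage
            batch = memory.sample()                    # Step 8: sample a mini-batch
            for (bs, ba, br, bs2, bdone) in batch:     # Steps 9-10
                y = double_q_target(bp_eval, bp_target, br, bs2, gamma, bdone)
                # a gradient-descent step on (y - Q_eval(bs)[ba])**2 updates omega here
            step += 1
            if step % copy_every == 0:                 # Step 11: omega -> omega^-
                bp_target.copy_from(bp_eval)
            s = s_next
```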

Simulation Experiment Test
After adding the series of improvement strategies described above, the Double BP Q-learning algorithm is simulated and tested on the PyCharm platform.

Simulation Test of the Function toward the Target Point
The model parameters of the function toward the target point are initialized randomly, and the other parameters are initialized as shown in Table 2.
Using the set parameters, the mobile robot roams in an obstacle-free environment at the initial stage. After reaching the preset number of roaming steps, small batches of experience are selected for learning. After the exploration rate rises to the preset peak value, it remains unchanged, and training continues until the preset pre-training times are reached; the probability of successfully reaching the target point during the parameter update process is then output. The probability exceeds 90% in the later period of parameter update, which shows that the robot can reach the target point with a high probability in the obstacle-free map and that the model generalization ability is high.
The Double BP Q-learning algorithm is used to train the model and complete the test of reaching the target point. Figure 8 shows two scenarios in which the starting point and target point are set arbitrarily; the optimal or near-optimal path toward the target point can be planned with the trained model.

Simulation in the Sparse Discrete Obstacles Environment
In the sparse discrete obstacles environment, the model parameters for approaching the target point are used for initialization to carry out model transfer. Some parameters are the same as those for approaching the target point, and the other parameters are initialized as shown in Table 3.
The initial exploration rate ε_0 is increased, and after training for the preset number of times, the probability of successfully reaching the target point in each round of parameter update in the sparse discrete obstacles environment is output. The training results of the two algorithms are shown in Figure 9. Figure 10(a) and Figure 10(b) show the path comparison between the two algorithms in the sparse discrete obstacles environment; the planned path lengths are both 21 steps. As shown in Figure 9, when the number of parameter update rounds p_step reaches 60, the Double BP Q-learning algorithm gives the robot a greater probability of reaching the target point than the Double Q-learning algorithm, and the planned path is smoother with fewer redundant sections, which indicates that the Double BP Q-learning algorithm has learned part of the obstacle avoidance rules in the sparse discrete obstacles environment and the output model has higher generalization ability.
After several tests in different sparse discrete obstacles environments, the success rate of path planning with the Double BP Q-learning model reaches 76%, while the success rate of the Double Q-learning algorithm is only 65%. Figure 11 shows partial simulation results of the trained Double BP Q-learning algorithm in sparse discrete obstacles environments.

Simulation in Dense Discrete Obstacles Environment
In the sparse discrete obstacles environment, because the obstacle map is not complex enough, the model generalization ability is not high and the strategy learned by the algorithm cannot deal with most maps. Therefore, the model of the sparse discrete obstacles environment is used, through parameter transfer, to initialize the model parameters of the dense discrete obstacles environment. Some parameters are the same as those of the sparse discrete obstacles environment, and the other parameters are initialized as shown in Table 4. The number of roaming steps in the initial stage is increased and the growth rate of the exploration rate is reduced, so that in the initial training process the mobile robot can find more possible paths and learn more possible scenarios; at the same time, the memory bank capacity is increased to store more diverse experience. After reaching the preset peak value, the exploration rate remains unchanged and training continues until the preset number of training times is reached, after which the probability of successfully reaching the target point in each round of parameter update is output. The simulation results of the two algorithms are shown in Figure 12.
After several tests in different dense discrete obstacles environments, the success rate of planning paths on different maps with the Double BP Q-learning model reaches 89%, while the success rate of the Double Q-learning algorithm is 82%. Figure 14 shows partial simulation results of the trained Double BP Q-learning model in dense discrete obstacles environments.

Simulation in the Special Obstacles Environment
In the special obstacles environment, there are some special scenarios that easily make the mobile robot enter a deadlock state. The model parameters for approaching the target point are used for initialization through parameter transfer. Some parameters are the same as those for approaching the target point, and the other parameters are initialized as shown in Table 5.
The special obstacles environment mainly includes "U"-type obstacles and "一"-type (straight wall) obstacles. The mobile robot needs to avoid the special obstacles in advance or be able to escape after entering them. Specific operation rules are set based on the obstacle information detected by the three sensors, as shown in Table 1, and combined with the direction and distance information of the target point so that the mobile robot avoids entering the deadlock area in advance, reducing unnecessary learning and speeding up the convergence of the model. Figure 15 shows the probability of successfully reaching the target point after each round of parameter update for the two algorithms. Figure 16(a) and Figure 16(b) show the path comparison of the two algorithms in the special obstacles environment; the path lengths are basically the same. As shown in Figure 15, the Double Q-learning algorithm converges faster in the early stage of training and has a higher probability of reaching the target point. However, when the number of parameter update rounds reaches 50, the probability of reaching the target point for the model trained by the Double BP Q-learning algorithm exceeds 90%; it tends to be stable with less fluctuation, and the planned path has fewer redundant sections, which indicates that the learned obstacle avoidance strategy can avoid certain deadlocks and reach the target point.
After several tests, the results are in line with the expected experimental results. The simulation results of partial planned paths in special obstacles environments using the trained Double BP Q-learning model are shown in Figure 17.

Conclusions
A Double BP Q-learning algorithm based on the fusion of the Double Q-learning algorithm and a BP neural network is proposed. Compared with the traditional Double Q-learning algorithm, the designed algorithm can be applied to operation environments with more complex obstacles, converges faster, and the trained model has stronger generalization ability and fewer redundant path sections. The success rate of path planning in the dense discrete obstacles environment is higher, reaching 89%, and in the special obstacles environment the robot can reach the target point while avoiding the deadlock area. The proposed Double BP Q-learning algorithm has the following characteristics: 1) using a BP neural network to fit the value function instead of a Q-value table allows paths to be planned in a wider range of environments with more complex obstacles; 2) the mechanism of priority experience replay is added to speed up strategy learning; 3) special operating rules are designed to help the mobile robot escape from the deadlock area in special obstacles environments; 4) parameter transfer is used to initialize the model parameters in different obstacles environments, reducing unnecessary training, accelerating the convergence of the algorithm and increasing the generalization ability of the model.
Future research will combine deep reinforcement learning with neural networks that have memory capability to process continuous high-dimensional state information and continuous action information, further improving path smoothness and reducing path redundancy.

Funding
This work is supported by the National Natural Science Foundation of China grant nos. 61473179, 61973184, 61573213.

Conflicts of Interest
The authors declare no conflict of interest.