Multiple Action Sequence Learning and Automatic Generation for a Humanoid Robot Using RNNPB and Reinforcement Learning

This paper proposes a method for learning and generating multiple action sequences of a humanoid robot. First, all the basic action sequences, also called primitive behaviors, are learned by a recurrent neural network with parametric bias (RNNPB), and the values of its internal parametric bias (PB) nodes, which determine which primitive behavior the network outputs, are obtained. The RNN is trained with the back-propagation through time (BPTT) method. Then, to generate the learned behaviors, or a more complex behavior that combines the primitive behaviors, a reinforcement learning algorithm, Q-learning (QL), is adopted to determine which PB value is appropriate for the generation. Finally, the effectiveness of the proposed method was confirmed by experiments using a real humanoid robot.


Introduction
Recognizing, learning, and generating adaptive behaviors for an intelligent social robot is an attractive theme that has drawn the attention of researchers for more than a decade. From the view that dynamic complex behaviors of a robot are composed of spatiotemporally changing actions, the so-called "primitive behaviors" or "element actions", gesture recognition has been approached by many methods such as 3D models [1], self-organizing map (SOM) [2,3], hidden Markov model (HMM) [4][5][6][7], dynamic Bayesian network (DBN) [8], recurrent neural network (RNN) [9][10][11], and dynamic programming (DP) [12].
Tani and his colleagues proposed an RNN with parametric bias (RNNPB), which realizes not only the recognition of multiple behaviors but also their learning and generation, based on the finding of the mirror neuron system in the brain [13,14]. The input of RNNPB includes sensory (visual or auditory) information and the teacher's motor information during the learning period, and in the generation period the network outputs (generates) imitative behaviors according to the observation of the robot.
In this paper, we propose to combine RNNPB and reinforcement learning (RL) [15] to realize the generation of multiple behaviors either (i) automatically or (ii) by the instruction of a human instructor. In other words, the adaptive PB values are determined as the result of RL in the generation process. Various patterns of primitive behaviors are learned by back-propagation through time (BPTT) [16][17], and PB vectors are obtained as the result. Considering the PB vectors as finite states of a Markov decision process (MDP), a complex behavior can be learned as an optimal state transition process over these primitive behavior patterns using an RL algorithm such as Q-learning. Experimental results using a humanoid robot "PALRO" (Fujisoft Inc., 2010) confirmed the effectiveness of the proposed method.

Proposed System
A system for multiple behavior instruction learning and complex behavior learning of a robot is proposed here. It works by the following process: (i) Time series data of the joint angles of the robot for primitive behaviors are given by a user (instructor) of the robot and recorded as teacher signals; (ii) A recurrent neural network with parametric bias (RNNPB) [9][10][11] is trained with the error back-propagation method [16,17], whose outputs are the time series of joint angles when arbitrary initial angles are set as

RNNPB
The recurrent neural network with parametric biases (RNNPB) [9][10][11] is a Jordan-type recurrent feed-forward neural network [18] with three kinds of internal layers: a hidden layer, a context layer and a parametric bias (PB) layer (Figure 1). Nodes in the Hidden layer and the Context layer have internal states with a sigmoid function:

$$f(z) = \frac{1}{1 + e^{-\alpha z}} \qquad (1)$$

where $\alpha$, a positive constant, is the gradient of the function, and $z$ is the input of the node. Specifically, the input $z_h$ of the Hidden layer nodes is:

$$z_h = \sum v_i u_i + \sum v_{pb} u_{pb} + \sum v_c u_c \qquad (2)$$

where $u_i = x(t)$, $u_{pb}$, $u_c$ and $v_i$, $v_{pb}$, $v_c$ are the outputs and the connection weights of the Input layer, the PB layer and the lower Context layer, respectively. The inputs $z_o$ and $z_c$ of the Output layer and the Context layer, which yield $x(t+1)$ and $x_c(t+1)$, are given by:

$$z_o = \sum w_o u_h \qquad (3)$$

$$z_c = \sum w_c u_h \qquad (4)$$

where $u_h = f(z_h)$ is the output of the Hidden layer given by Equation (1), and $w_o$, $w_c$ are the connection weights between the Hidden layer and the Output layer and the Context layer, respectively.
For the nodes of the PB layer, the internal state $u_{pb}$ changes with the delta errors $\delta_t^{bp}$ during a period (a time-series window) $l$ when the network is trained by the error back-propagation (BP) method [16,17]. The parameters of this update are the learning coefficient, the learning rate, and the internal coefficients of the PB nodes.
The connection weights are modified by back-propagation through time (BPTT) [16,17]; that is, errors between the output of the network and the teacher data are used to adjust the weights of the connections. The detailed formulas are omitted here.
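As an illustration of the layer structure described above, the following minimal sketch computes one forward step of an RNNPB-like network in plain Python. It is not the authors' implementation: the helper names (`rnnpb_step`, `make_params`), the random weight initialization, the PB and Context layer sizes, and the use of the same sigmoid on the Output and Context layers are all assumptions made for the example. The PB output is passed in directly rather than computed from an internal PB state.

```python
import math
import random


def sigmoid(z, alpha):
    """Sigmoid whose gradient (slope) is controlled by alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * z))


def matvec(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * u for w, u in zip(row, vector)) for row in weights]


def rnnpb_step(x, pb, context, params, alpha=2.0):
    """One forward step; returns (next output, next context)."""
    # Hidden input: contributions of the Input, PB and Context layers.
    z_h = [a + b + c for a, b, c in zip(
        matvec(params["v_i"], x),
        matvec(params["v_pb"], pb),
        matvec(params["v_c"], context),
    )]
    u_h = [sigmoid(z, alpha) for z in z_h]
    # Output and Context layers are both driven by the Hidden layer output.
    # Applying the sigmoid here is an assumption of this sketch.
    x_next = [sigmoid(z, alpha) for z in matvec(params["w_o"], u_h)]
    c_next = [sigmoid(z, alpha) for z in matvec(params["w_c"], u_h)]
    return x_next, c_next


def random_matrix(rows, cols, rng):
    """Small random weights (hypothetical initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_params(n_in, n_hidden, n_pb, n_context, seed=0):
    """Build untrained weight matrices with the given layer sizes."""
    rng = random.Random(seed)
    return {
        "v_i": random_matrix(n_hidden, n_in, rng),
        "v_pb": random_matrix(n_hidden, n_pb, rng),
        "v_c": random_matrix(n_hidden, n_context, rng),
        "w_o": random_matrix(n_in, n_hidden, rng),
        "w_c": random_matrix(n_context, n_hidden, rng),
    }


# 20 input/output nodes and 30 hidden nodes; PB and Context sizes are assumed.
params = make_params(n_in=20, n_hidden=30, n_pb=2, n_context=10)
x_next, c_next = rnnpb_step([0.5] * 20, [0.1, 0.9], [0.0] * 10, params)
```

Changing only the `pb` argument while keeping the weights fixed changes the generated output, which is the mechanism that lets one network hold several primitive behaviors.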

Q-Learning algorithm
Reinforcement learning (RL) is a kind of active learning method in which a learner finds its optimal action policy by an iterative process of exploration and exploitation [15]. For a finite state transition process, usually a Markov decision process (MDP), in which a transition is decided only by the transition probability from the last state, RL intends to find the optimal transition probabilities by adopting value functions of states and state-action pairs. The state-action value function, usually called the Q function, serves as an index variable in a stochastic action selection policy. In this study, we use a traditional RL algorithm named Q-learning (QL) [15], and its learning algorithm is as follows.

QL algorithm:
Step 1 Initialize Q(s,a) = 0.0, where s and a belong to the finite state space and the action space of the robot, respectively.
Step 2 Observe the state s of the environment around the learner.
Step 3 Select an action a to change the state according to a stochastic function. For example, select the action which has the highest value of Q(s,a).
Step 4 Execute the action a, then observe the reward and the next state.
Step 5 Update Q(s,a) by the Q-learning rule [15] (Equation (6)), where 0 ≤ γ, λ ≤ 1 are the learning rate and the discount rate, respectively.
Step 6 Repeat Step 2 to Step 5 until the value of Q(s,a) converges.

The state space in our system is defined as the set of different PB vectors, and the action space of QL consists of the same PB vectors, which are fixed after the BP learning process. The optimal state transition process therefore corresponds to the correct combination of primitive behaviors into a complex behavior of the robot, or to the correct execution of a primitive behavior.
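The QL steps can be sketched as follows for the case where both states and actions index the PB vectors, so that choosing an action moves the learner to the corresponding PB state. This is a minimal tabular sketch under stated assumptions, not the configuration used in the paper: the reward function, the target order `HYPOTHETICAL_NEXT`, the epsilon-greedy selection, and all parameter values are hypothetical.

```python
import random

# Hypothetical target order of primitive behaviors: 0 -> 1 -> 2 -> 3 -> 0.
HYPOTHETICAL_NEXT = {0: 1, 1: 2, 2: 3, 3: 0}


def hypothetical_reward(state, action):
    """+1 when the chosen PB index continues the target order, else -1."""
    return 1.0 if action == HYPOTHETICAL_NEXT[state] else -1.0


def q_learning(n_states, n_actions, reward_fn, episodes=500, steps=10,
               learning_rate=0.1, discount=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning where taking action a leads to state a."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]       # Step 1
    for _ in range(episodes):
        s = rng.randrange(n_states)                        # Step 2
        for _ in range(steps):
            # Step 3: epsilon-greedy stochastic action selection.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda k: q[s][k])
            # Step 4: observe the reward and the next state.
            r = reward_fn(s, a)
            s_next = a
            # Step 5: standard Q-learning update.
            target = r + discount * max(q[s_next])
            q[s][a] += learning_rate * (target - q[s][a])
            s = s_next                                     # Step 6: repeat
    return q


def greedy_policy(q):
    """For each state, the action with the highest Q value."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q]


q_table = q_learning(4, 4, hypothetical_reward)
policy = greedy_policy(q_table)
```

With this reward, the greedy policy read from the learned table follows the target order, which mirrors how a sequence of PB values can be selected to compose a complex behavior.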

Experiments
The proposed method was applied to a complex behavior learning and generation of a humanoid robot named "PALRO" (Fujisoft, Inc., 2010) as shown in Figure 2.
There are 20 joints (actuators) in PALRO (arms, legs, neck, and body), and controlling these joint angles in time series composes various actions of the robot. Two kinds of experiments were designed. Experiment I: a time series of joint angles yields a primitive behavior, such as raising a hand or turning to the left/right, and several primitive behaviors yield a complex behavior of the robot such as a "dance" behavior. Experiment II: 3 kinds of voice instructions corresponding to 3 kinds of behaviors of the robot were learned and recognized.
Details of the experiments and results are described in this section. In the generation process, the input signal was given as a combination of x(t-1), the output of the network at time t-1, and x_d(t-1), the teacher data, where r is the ratio of the teacher signal.
When r = 0.0, that is, when no teacher signal was added during the generation process, the output of the network easily fell into a static state. This problem needs to be addressed in a future study.
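One plausible reading of the teacher-signal ratio r is a convex combination of the previous network output and the previous teacher data. The sketch below assumes that form; the function names and the `step_fn` callable (standing in for one forward step of a trained network) are hypothetical and are not taken from the paper.

```python
def mix_input(x_prev, x_teacher_prev, r):
    """Blend the previous network output with the previous teacher data.

    r = 0.0 gives pure closed-loop generation (network output only);
    r = 1.0 feeds the teacher data only.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return [(1.0 - r) * x + r * d for x, d in zip(x_prev, x_teacher_prev)]


def generate(step_fn, x0, teacher, r):
    """Roll the network forward, mixing in the teacher data at each step.

    step_fn maps an input vector to the next output vector (hypothetical).
    teacher is a list of teacher vectors aligned with the outputs.
    """
    outputs = [x0]
    for d_prev in teacher:
        x_in = mix_input(outputs[-1], d_prev, r)
        outputs.append(step_fn(x_in))
    return outputs


# Toy usage with an identity "network" and a constant teacher signal.
trajectory = generate(lambda x: list(x), [0.0], [[1.0], [1.0]], r=0.5)
```

With r = 0.0 the teacher data has no influence, so a network whose dynamics settle to a fixed point would stay there, which is consistent with the static-state behavior reported above.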

Complex Behavior Learning / Generation
Using the Q-learning algorithm (QL) described in the previous section, we determined the required order of the primitive patterns that compose the complex behavior: a "dance" of the robot.
The QL was defined with 4 states, that is, the 4 values of the PB nodes, and 4 actions, which were the same as these PB values. The training results gave the order of PB values used in the generation process of the robot as follows: "
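To show how an order of PB values can be read from a trained Q table, the following sketch follows the greedy action from a start state, assuming as above that taking action a leads to PB state a. The Q values in the example are made up for illustration and are not the values obtained in the experiment.

```python
def greedy_order(q_table, start_state, length):
    """Follow the highest-valued action from each state for `length` steps."""
    order = [start_state]
    state = start_state
    for _ in range(length - 1):
        row = q_table[state]
        state = max(range(len(row)), key=lambda a: row[a])
        order.append(state)
    return order


# Illustrative Q table for 4 PB states and 4 actions (values are made up).
example_q = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
]
order = greedy_order(example_q, start_state=0, length=4)
```

Each index in the returned order selects one PB vector, and feeding those PB vectors to the trained RNNPB in sequence would generate the corresponding primitive behaviors one after another.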

Experiment II: Behavior Instruction Learning and Recognition
Voice instructions can be captured and recognized by the recorder and microphone of PALRO. However, specific behaviors need to be taught by the instructor and learned by the RNNPB learning system, and the relationship between PB values and the voice instructions can be decided by the QL algorithm, in the same way as the order of primitive behaviors was decided for complex behavior learning and generation. In this experiment, we designed 3 kinds of behaviors for PALRO, whose static pictures are shown in Figure 6, with the RNNPB parameters listed in Table 3. Parameters used in QL were the same as in Experiment I (Table 2). Figure 7 shows a scene of the teaching process, in which the angles of 8 joints were changed by the instructor and recorded as time series data, forming the teacher signal of a behavior. The voice instruction learning and recognition also achieved a success rate of 100.0%; the details are omitted here for lack of space.

Conclusion
The combination of a recurrent neural network with parametric bias (RNNPB) and a reinforcement learning algorithm was proposed to realize complex behavior learning and generation for a robot. All joint angles of the robot were used as the input and output of RNNPB, and their time series data first formed several patterns of primitive behaviors of the robot; the complex behavior of the robot was then composed of a time series of different primitive behaviors. RNNPB was trained with the back-propagation through time (BPTT) method, and Q-learning (QL) was adopted in the training process to generate a series of primitive behaviors in the correct order. The effectiveness of the proposed method was confirmed by the results of two kinds of experiments using a humanoid robot "PALRO". The generation of primitive behaviors gave a satisfactory representation of the required movement when a certain ratio of the teacher signal was added during the generation process, and a success rate of 100.0% for the complex behavior "dance" was obtained after training with the QL algorithm. Voice instruction learning and recognition also reached a 100.0% success rate in the experiment.

Figure 1. The structure of RNNPB. Internal layers are shown in gray, and connections with synaptic weights between layers are depicted with broken arrow lines. The Context layers are the same layer, but with temporally varying values of its internal state, input and output.
of patterns of primitive behaviors of the robot as shown in Figure 3: (a) Turn to left and shake the hands; (b) Turn to right and shake the hands; (c) Turn to right and shake the hands; (d) Raise the left hand and stop in a special pose. An angle vector with 20 dimensions served as the input of RNNPB, that is, the numbers of nodes in the Input layer and the Output layer were both 20. Teacher signals of the primitive behaviors were recorded by the storage function of the robot, that is, as the time series values of the joint angles of the movements (primitive behaviors) produced when an instructor moved the robot. Parameters of RNNPB and its learning process used in the experiment are listed in Table 1. Training results of RNNPB for the 4 primitive behaviors are shown in Figure 4, where (a) shows the learning curve (iteration time vs. RMSE); (b) shows the PB values of the 4 patterns of behaviors; and (c)-(f) show the time series values of the 20 angles of the behaviors (generation with 30% teacher signals). The time interval "step" was set to 0.1 second/step in (c)-(f).

Table 1. Parameters used in RNNPB in Experiment I.
Description                             Symbol   Value
The number of nodes in Input layer      N        20
The number of nodes in Output layer     N        20
The number of nodes in Hidden layer     H        30
The number of nodes in PB layer         P

Figure 5. Learning results of a complex behavior "dance".

Figure 6: (a) Shake a hand; (b) Raise 2 hands; (c) A handclap. Because the behaviors were limited to several joints, 8 input/output nodes were designed in RNNPB, and the other parameters used in Experiment II are listed in Table 3.

Figure 6. Learning results of a complex behavior "dance".
Description                             Symbol   Value
series (width of time window)           l        20
Internal coefficients of PB nodes       k1, k2   0.8, 0.5
Gradient of sigmoid function            α        2.0
Gradient of sigmoid function for PB     α_pb     5.0

Figure 7. Learning results of a complex behavior "dance".