This research presents a novel way of labelling human activities from the skeleton output computed from RGB-D data by vision-based motion capture systems. The activities are labelled by means of a Compound Hidden Markov Model, which is built by linking several Linear Hidden Markov Models through common states. Each Linear Hidden Markov Model holds the motion information of one human activity. The sequence of most likely states, computed from a sequence of observations, indicates which activities a person performs in an interval of time. The purpose of this research is to provide a service robot with human activity awareness, which can be used for action planning with implicit and indirect Human-Robot Interaction. The proposed Compound Hidden Markov Model, made of one Linear Hidden Markov Model per activity, labels activities of unknown subjects with an average accuracy of 59.37%, which is higher than the average labelling accuracy for unknown subjects of an Ergodic Hidden Markov Model (6.25%) and of a Compound Hidden Markov Model with each activity modelled by a single state (18.75%).

In daily life, human beings perform activities to accomplish diverse tasks at different times throughout the day. These activities are made of one or several simpler actions which are performed at different times and which have a chronological relationship to each other.

The motivation for this work is to analyse human behaviour by labelling the activities which are performed by a person. Human activity is both complex and dynamic: a person can be performing any action, whether a pose or a motion, and can then change to another action.

The scope of this work is the presentation of a method for labelling human activity. The pattern classification algorithm for the skeleton data uses a Euclidean measure. The learning model uses a single large Hidden Markov Model, called a Compound Hidden Markov Model, to infer the activities of a person from the output of the motion analysis.

The contribution of this work consists of two parts. Firstly, we present a novel way of computing features of a skeleton using distances between certain joints of both upper body and lower body. Secondly, we propose a Compound Hidden Markov Model for labelling cyclic and non-cyclic human activities; the Compound Hidden Markov Model is made of smaller Hidden Markov Models which connect to common states.

The taxonomy of human activities depends on the complexity of the activity [

Some applications of the activity recognition are [

A particular use case for activity labelling in Domestic Robotics is a robot that helps with cooking. A person is preparing food in the kitchen. The vision system of the robot captures motion data of the person. The Activity Recognition System analyses the motion to obtain the activities performed. The output of the Activity Recognition System provides information to the Action Planning System, which has information about the world and the robot. The Action Planning System picks a plan of action, such as getting closer to the person and offering to help.

There are a number of challenges at each stage of activity recognition. When motion data is acquired, sensor noise from both internal and external sources alters the captured values of the motion, and occlusion of the sensor by other objects or persons produces inaccurate or incomplete data. Some issues are exclusive to Computer Vision-based systems: the orientation of the body towards the sensor can obscure some body parts, generating inaccurate or incomplete data, and bad lighting conditions, if not compensated, reduce the accuracy of the capture. The challenges in classifying motion data are the following: the raw motion data can be high-dimensional, so it is necessary to pick the features which provide the best description; and the position of the person in the motion data is not absolute, which is solved by making the motion data relative to a reference frame. The challenges in recognizing activities are the following: activities can involve interaction with other persons or objects, which is solved by segmenting the data into separate entities and tracking them; and several activities can have the same motion, which is solved by segmenting the motion data before training a classification model which provides the input for the recognition model.

There are two approaches for activity recognition, according to how the motion data is represented and recognized [

The taxonomy of the single-layered approach depends on the way of modelling human activities: space-time approach and sequential approach [

The space-time approach views an input video as a three-dimensional (XYT) volume. This approach can be categorized further depending on the features used for the XYT volume: volumes of images [

The sequential approach uses sequences of features from a human motion source. An activity has occurred if a particular sequence of features is observed after analysing the features. There are two main types of sequential approaches: exemplary-based and state model-based [

In the exemplary-based approach, human activities are defined as sequences of features which have been trained directly. A human activity is recognized by computing the similarity of a new sequence of features against a set of reference sequences of features; if the similarity is high enough, the system deduces that the new sequence belongs to a certain activity. Humans do not perform the same activity at the same rate or in the same style, so the similarity measuring algorithm must account for those variations.

An approach to account for those changes is Dynamic Time Warping [

In the state model-based approach, human activities are defined as statistical models with a set of states which generate corresponding sequences of feature vectors. The models generate those sequences with a certain probability. This approach accounts for rate and style changes. One of the most used mathematical models for recognizing activities is the Hidden Markov Model [

Hidden Markov Models are statistical Markov Models in which the signal or process to model is assumed to be a Markov Process with unobserved states [

The states in a stochastic process hold the probability distributions for the collection of random variables, and the transitions from one state to another depend on probabilities (non-determinism).

The Markov property indicates that the probability distribution of future states depends only upon the present state; in other words, the process keeps no record of the states that preceded it (memoryless).

The unobserved states in a Hidden Markov Model indicate that the states are not visible directly, but output depends probabilistically on the state (

The most common applications of a Hidden Markov Model are in temporal pattern recognition, such as speech recognition, handwriting recognition, gesture recognition, part-of-speech tagging, following of musical scores, and DNA sequencing.

The output values for the random variables in a Hidden Markov Model can be discrete, drawn from a categorical distribution, or continuous, drawn from a Gaussian distribution.

The elements of a Hidden Markov Model (λ) are:
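As a minimal sketch of these elements, assuming the conventional notation for a discrete Hidden Markov Model (a transition matrix A, an emission matrix B, and an initial probability vector, called `pi` here), the model can be held in a small container. The class name `DiscreteHMM` and its `validate` helper are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscreteHMM:
    """Illustrative container for a discrete HMM (names are assumptions)."""
    A: List[List[float]]   # A[i][j]: probability of moving from state i to state j
    B: List[List[float]]   # B[i][k]: probability of emitting symbol k in state i
    pi: List[float]        # pi[i]: probability of starting in state i

    def validate(self, tol: float = 1e-9) -> bool:
        """Check that shapes agree and that every distribution sums to one."""
        n = len(self.pi)
        if len(self.A) != n or len(self.B) != n:
            return False
        if any(len(row) != n for row in self.A):
            return False
        rows = self.A + self.B + [self.pi]
        return all(abs(sum(row) - 1.0) <= tol for row in rows)


# Toy model with two states and two observation symbols.
model = DiscreteHMM(
    A=[[0.7, 0.3], [0.4, 0.6]],
    B=[[0.9, 0.1], [0.2, 0.8]],
    pi=[0.5, 0.5],
)
```

Every row of A and B, and the vector `pi`, is a probability distribution, which is what `validate` checks.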

For a Hidden Markov Model to be useful in real world applications, three basic problems must be solved [

Evaluation Problem: Given a sequence of observations

Optimal State Sequence Problem: Given a sequence of observations

Training Problem: How to adjust the parameters of the model

The Forward Procedure solves the Evaluation Problem. The forward variable

indicates the probability of the partial observation sequence,

The inductive solution of

1) Initialization:

2) Induction:

3) Termination:

The initialization step sets the forward probabilities as the joint probability of state
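The three steps of the Forward Procedure can be sketched as follows. This is a minimal illustration that assumes the standard formulation for a discrete Hidden Markov Model with a non-empty observation sequence; the function name `forward` and the toy parameters are hypothetical.

```python
def forward(A, B, pi, obs):
    """Return P(obs | model) for a discrete HMM (illustrative sketch).

    A[i][j]: transition probability, B[i][k]: emission probability,
    pi[i]: initial probability, obs: non-empty list of symbol indices.
    """
    n = len(pi)
    # 1) Initialization: joint probability of state i and the first symbol.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # 2) Induction: propagate through the transitions, then emit the next symbol.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
            for j in range(n)
        ]
    # 3) Termination: sum the forward variables over all final states.
    return sum(alpha)


# Toy parameters (assumptions, not taken from the experiments).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
probability = forward(A, B, pi, [0, 1, 1])
```

A quick sanity check is that the probabilities of all possible observation sequences of a given length add up to one.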

The Optimal State Sequence Problem is solved by the Viterbi Algorithm, which computes the most likely sequence of connected states

The Viterbi Algorithm uses the variable

The highest probability along a single path, at time

The most likely path is the sequence of these maximized variables, for each time t and each state j. The array

1) Initialization:

2) Recursion:

3) Termination:

4) Backtracking:
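The four steps listed above can be sketched in code. This is an illustrative version of the textbook Viterbi Algorithm for a discrete Hidden Markov Model; the names `viterbi`, `delta` and `psi` and the toy parameters are assumptions made for the example.

```python
def viterbi(A, B, pi, obs):
    """Return (most likely state path, probability of that path)."""
    n = len(pi)
    # 1) Initialization.
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    psi = []  # psi[t][j]: best predecessor of state j at step t + 1
    # 2) Recursion: keep only the best path into each state.
    for symbol in obs[1:]:
        back, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][symbol])
        psi.append(back)
        delta = new_delta
    # 3) Termination: pick the most probable final state.
    last = max(range(n), key=lambda i: delta[i])
    # 4) Backtracking: follow the stored predecessors to the first step.
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return path, delta[last]


# Toy parameters (assumptions, not taken from the experiments).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
path, path_probability = viterbi(A, B, pi, [0, 0, 1, 1])
```

For long sequences the products underflow, so a practical version would work with logarithms of the probabilities.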

An approach for solving the Training Problem is the Viterbi Learning algorithm [

The initialization of the transition matrix is done with random values. The random values on each row are normalized so that each row sums to one. A bit mask matrix describing the transitions of a specific graph topology can be used to set the probabilities. The transition probabilities under a bit mask value equal to zero get a very small value, while the transition probabilities under a bit mask value equal to one get a random value.
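A minimal sketch of this masked initialization is shown below. The function name `init_transitions`, the value of `epsilon` used as the very small probability, and the choice to normalize after applying the mask are assumptions made for illustration.

```python
import random


def init_transitions(mask, rng, epsilon=1e-6):
    """Build a row-stochastic transition matrix from a bit mask.

    mask[i][j] == 1 allows the transition i -> j and gives it a random
    weight; mask[i][j] == 0 gives it only the small weight epsilon.
    """
    matrix = []
    for mask_row in mask:
        weights = [rng.random() + epsilon if bit else epsilon for bit in mask_row]
        total = sum(weights)
        matrix.append([w / total for w in weights])  # each row sums to one
    return matrix


# Bit mask of a three-state linear topology: self-loop and next state only.
linear_mask = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
A = init_transitions(linear_mask, random.Random(0))
```

Using a very small value instead of zero keeps later logarithms finite while still making the masked transitions practically unreachable.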

The initialization step for the emission matrix uses one of these approaches: random values or segmented observation sequences. When initializing with random values, all the values must be larger than zero and each row must be normalized so that it sums to one. In the segmented observation sequences approach, the sequence is split into as many segments as the Hidden Markov Model has states. If the length of the sequence is not a multiple of the number of states, the last state gets fewer observations. For each state, the emission probability of each symbol is equal to the count of that symbol divided by the total number of symbols assigned to that state.

The initial probability vector can be initialized either to uniform probabilities or by assigning the larger probability to a state or a number of states. The probabilities are normalized so that their sum is equal to one.
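The segmented approach can be sketched as follows. This is an illustrative version: the function name `init_emissions` is hypothetical, the segment length is taken as the ceiling of the sequence length divided by the number of states, and the uniform fallback for a state that receives no observations is an assumption that the text does not specify.

```python
import math


def init_emissions(obs, n_states, n_symbols):
    """Initialize the emission matrix from one segmented observation sequence."""
    segment = math.ceil(len(obs) / n_states)
    matrix = []
    for state in range(n_states):
        chunk = obs[state * segment:(state + 1) * segment]
        if not chunk:
            # Assumption: a state with no observations gets a uniform row.
            matrix.append([1.0 / n_symbols] * n_symbols)
            continue
        counts = [0] * n_symbols
        for symbol in chunk:
            counts[symbol] += 1
        # Frequency of each symbol among the symbols assigned to this state.
        matrix.append([count / len(chunk) for count in counts])
    return matrix


# Seven observations over three states: segments of 3, 3 and 1 symbols.
B = init_emissions([0, 0, 1, 1, 2, 2, 2], n_states=3, n_symbols=3)
```

In this toy case the last state receives a single observation, which illustrates the shorter final segment mentioned above.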

In the induction step, for each sequence of observations for training, the Most Likely State Path is computed with the Viterbi Algorithm on the initial Hidden Markov Model, and the Likelihood Probability is computed either with the Viterbi Algorithm or the Forward Algorithm on the initial Hidden Markov Model. The Most Likely State Path of each sequence is stored for computing the parameters of an updated Hidden Markov Model. The Forward Probability of each sequence is accumulated in the variable

The values of the updated transition matrix A are computed by counting the transitions from the state

The values of the updated emission matrix B are the frequencies of each observation symbol in the observation sequence,

The initial probability vector

A new Hidden Markov Model is built from the updated model parameters

The conditions for terminating the algorithm are: either the absolute of the difference of
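The counting step of the update can be sketched as follows. This is a minimal illustration of re-estimating the transition matrix A and the emission matrix B from state paths that have already been computed; the function name `reestimate` and the small pseudo-count `floor`, which avoids empty rows, are assumptions. The update of the initial probability vector and the termination test are left out.

```python
def reestimate(paths, sequences, n_states, n_symbols, floor=1e-6):
    """Re-estimate A and B by counting along most likely state paths.

    paths[k][t] is the state assigned to observation sequences[k][t].
    """
    transitions = [[floor] * n_states for _ in range(n_states)]
    emissions = [[floor] * n_symbols for _ in range(n_states)]
    for path, obs in zip(paths, sequences):
        for state, symbol in zip(path, obs):
            emissions[state][symbol] += 1          # symbol seen in this state
        for state, next_state in zip(path, path[1:]):
            transitions[state][next_state] += 1    # transition used by the path

    def normalize(row):
        total = sum(row)
        return [value / total for value in row]

    return [normalize(r) for r in transitions], [normalize(r) for r in emissions]


# Two toy training sequences with their (assumed) most likely state paths.
paths = [[0, 0, 1], [0, 1, 1]]
sequences = [[0, 0, 1], [0, 1, 1]]
A_new, B_new = reestimate(paths, sequences, n_states=2, n_symbols=2)
```

In a full training loop the paths would be recomputed with the updated model and the counting repeated until the termination condition is met.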

Both the Forward Algorithm and the Viterbi Algorithm store the result of floating-point operations in a single variable. The accumulated product of fractional values can become so small that it falls below the smallest value representable by the floating-point variable which stores the result. That variable can instead be represented in logarithmic scale, where multiplication and division operations are represented as addition and subtraction respectively. The range of values in logarithmic scale goes from

The logarithmic scale in the Forward Algorithm applies at each iteration in the Induction step, a scale variable accumulates the value of the forward variable
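One common way to realize this scaling is sketched below: at every step the forward variables are divided by their sum, and the logarithm of that sum is accumulated. The function name `log_forward` and the toy parameters are hypothetical, and the sketch assumes that the sequence has non-zero probability under the model.

```python
import math


def log_forward(A, B, pi, obs):
    """Return log P(obs | model) using a scaled forward pass."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_probability = 0.0
    for t in range(len(obs)):
        if t > 0:
            symbol = obs[t]
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
                for j in range(n)
            ]
        scale = sum(alpha)                     # assumed to be strictly positive
        alpha = [value / scale for value in alpha]
        log_probability += math.log(scale)     # products become sums of logs
    return log_probability


# Toy parameters (assumptions, not taken from the experiments).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
long_sequence = [0, 1] * 2000
log_p = log_forward(A, B, pi, long_sequence)
```

For a sequence of this length the unscaled probability would underflow to zero, while the accumulated logarithm remains a finite negative number.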

For the Viterbi Algorithm, the elements of the model

Depending on the process that generates a signal, the contents of the signal can have a stationary structure, or a chronological structure. The structure of the contents of the signal indicates which is the most suitable Hidden Markov Model [

The classical case of the set of bowls containing different proportions of coloured balls is an example of a stationary process: any ball is drawn from any bowl at any time. For this case, the most suitable topology for the Hidden Markov Model is the ergodic model (

In automated motion recognition and activity recognition applications, the input data to be processed has a chronological or linear structure [

The simplest topology for linear processes is the linear model (

The flexibility in the modelling of the duration can increase if it is possible to skip individual states in the sequence. One of the most used topology variations for automated speech and handwriting recognition is the Bakis model (

The largest variations in the chronological structure are achieved by allowing a state to have transitions to any posterior states in the chronological sequence. The only forbidden transition is going from a state

Any of the Hidden Markov Models for signals with chronological structure (Linear, Bakis, Left-to-Right) can model cyclic signals by adding a transition from the last state to the first state (Figures 2(e)-(f)) [
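The topologies discussed in this section can be summarized as bit masks over the transition matrix. The sketch below assumes the usual definitions (linear: self-loop and next state; Bakis: additionally skipping one state; left-to-right: any later state); the function name `topology_mask` and the string labels are hypothetical.

```python
def topology_mask(n_states, kind, cyclic=False):
    """Return a bit mask where mask[i][j] == 1 allows the transition i -> j."""
    rules = {
        "ergodic": lambda i, j: True,               # every transition allowed
        "linear": lambda i, j: j in (i, i + 1),     # stay or advance by one
        "bakis": lambda i, j: i <= j <= i + 2,      # may also skip one state
        "left-to-right": lambda i, j: j >= i,       # any later state
    }
    if kind not in rules:
        raise ValueError(f"unknown topology: {kind}")
    allowed = rules[kind]
    mask = [[int(allowed(i, j)) for j in range(n_states)] for i in range(n_states)]
    if cyclic and n_states > 1:
        mask[n_states - 1][0] = 1  # close the cycle: last state back to the first
    return mask


linear = topology_mask(4, "linear")
cyclic_linear = topology_mask(4, "linear", cyclic=True)
bakis = topology_mask(4, "bakis")
```

Such a mask can then drive the initialization of the transition probabilities, so that forbidden transitions start with a negligible value.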

The Hidden Markov Model is one of the most commonly used statistical models in the state model-based approach to Activity Recognition. There are two approaches for recognizing activities with Hidden Markov Models: Maximum Likelihood Probability (MLP) [

• Features:

-Each Activity has a Hidden Markov Model.

-Each Hidden Markov Model computes the Forward Probability of a sequence of observation symbols.

-The Hidden Markov Model with the largest Forward Probability identifies the activity.

• Advantages:

-New activities can be added easily by training another Hidden Markov Model.

-The evaluation of a sequence of observation symbols can be performed by parallel tasks.

• Disadvantages:

-Motion segmentation is required when recognizing connected activities.
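The approach with one Hidden Markov Model per activity can be sketched as follows: every model scores the same observation sequence with the Forward Probability, and the best score identifies the activity. The toy models, the symbol meanings and the function names are assumptions made for illustration.

```python
def forward_probability(model, obs):
    """Forward Probability of obs under model = (A, B, pi)."""
    A, B, pi = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
            for j in range(n)
        ]
    return sum(alpha)


def classify(models, obs):
    """Return the activity whose model gives the largest Forward Probability."""
    scores = {name: forward_probability(m, obs) for name, m in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical models over three symbols (0: sitting pose, 1 and 2: step poses).
models = {
    "sit still": ([[1.0]], [[0.8, 0.1, 0.1]], [1.0]),
    "walk": (
        [[0.5, 0.5], [0.5, 0.5]],
        [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
        [1.0, 0.0],
    ),
}
label, scores = classify(models, [1, 2, 1, 2])
```

Because each model is evaluated independently, the scores can be computed in parallel, and a new activity only requires adding one more entry to `models`.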

• Features:

-All the activities are embedded in a single large Hidden Markov Model.

-Each activity is represented by a subset of states.

-A sequence of observation symbols is processed to obtain the sequence of most likely states which generates it.

• Advantages:

-The evaluation of connected activities is possible without motion segmentation.

-Reconstruction of activities from the sequence of most likely states.

• Disadvantages:

-Adding a new activity is complicated: the Hidden Markov Model for the new activity is trained separately, the Hidden Markov Model is merged with the single large Hidden Markov Model and the single large Hidden Markov Model must be retrained to update the probabilities of emission and transition.

-The computation of likelihood probability with Viterbi Algorithm is slower than with Forward Algorithm.

-Reconstruction of activities requires an index which associates each activity with a subset of states.
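The index mentioned in the last item can be sketched as a mapping from state numbers to activity names. The state numbering below is hypothetical; the sketch collapses consecutive states of the same activity so that a state path becomes a sequence of activity labels.

```python
from itertools import groupby

# Hypothetical index: states 1-3 belong to the "stand up" sub-model.
STATE_TO_ACTIVITY = {
    0: "sit still",
    1: "stand up",
    2: "stand up",
    3: "stand up",
    4: "stand still",
}


def label_path(path, index):
    """Turn a most likely state path into a sequence of activity labels."""
    activities = (index[state] for state in path)
    # groupby merges runs of identical consecutive labels.
    return [activity for activity, _ in groupby(activities)]


labels = label_path([0, 0, 0, 1, 2, 2, 3, 4, 4], STATE_TO_ACTIVITY)
```

With this index, a path that dwells in the sitting state, crosses the three stand up states and ends in the standing state is read as sit still, stand up, stand still.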

A limitation of the Hidden Markov Models is that they do not allow for complex activities, interactions between persons and objects, and group interactions. To enhance the probability of recognizing activities with Hidden Markov Models, variations to the model have been studied in previous works.

In the Conditioned Hidden Markov Model [

The Coupled Hidden Markov Model [

The states of a Hidden Semi-Markov Model [

The Maximum Entropy Markov Model represents [

The Compound Hidden Markov Model [

The Dynamic Multiple Link Hidden Markov Model [

The Two-Stage Linear Hidden Markov Model [

The Layered Hidden Markov Model [

The Hidden Markov Model topology chosen for this work is the Compound Hidden Markov Model, because the purpose of this work is labelling activities performed by a person, during a period of time.

The method for activity recognition proposed in this work uses a representation of skeleton data based on the Euclidean distance between body parts, and a Compound Hidden Markov Model for activity labelling.

The features of the skeleton are a variation of those presented in Glodek et al. [

The observations for the Hidden Markov Model are computed from the Euclidean Distance between the features of two skeletons. To get the observations of a new sequence of motion data, the skeletons of each frame have their features computed. A set of similarities is computed for each frame of the new sequence of motion data. Those similarities come from the Euclidean Distance of the features of a frame of motion data and the features of each element of the codebook of key frames. The index of the key frame with the smallest Euclidean Distance becomes the observation of each frame.
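This quantization step can be sketched as a nearest-neighbour search over the codebook. The feature vectors below are toy two-dimensional values and the function name `quantize` is hypothetical; the real features are the joint distances computed from each skeleton.

```python
import math


def quantize(frames, codebook):
    """Map each frame's feature vector to the index of the nearest key frame."""
    observations = []
    for features in frames:
        distances = [math.dist(features, key_frame) for key_frame in codebook]
        observations.append(distances.index(min(distances)))
    return observations


# Toy codebook with three key frames and a short sequence of frame features.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [[0.1, 0.0], [0.9, 1.2], [1.9, 0.1], [1.8, 0.2]]
observations = quantize(frames, codebook)
```

The resulting list of indices is the discrete observation sequence that is fed to the Hidden Markov Model.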

The model proposed for activity labelling is a Compound Hidden Markov Model [

The Compound Hidden Markov Model is formed by several simpler Hidden Markov Models, whose topologies are configured according to the type of activity to model: the stationary activities, like sit still and stand still, have a single state; the non-periodic activities, like stand up and sit down, are modelled with Linear Hidden Markov Models; and the periodic activities, such as walk, are modelled by a Cyclical Linear Hidden Markov Model.

The activities are connected using context information. For example, the sit still activity connects to the first state of the stand up activity, and receives a connection from the last state of the sit down activity. The stand still activity connects to the first state of the sit down activity, and receives a connection from the last state of the stand up activity. Also, the stand still activity connects to the first state of the walk activity and receives a connection from the last state of the walk activity (

The stationary activities (sit still, stand still) are modelled with a Hidden Markov Model formed by a single state. The emission probabilities of each Hidden Markov Model are initialized to the averaged frequency of the observations for the corresponding idle activity.

The non-periodic activities (stand up, sit down) and the periodic activities (walk) are trained using the following procedure: the observations from motion data of each activity are segmented into three sections: the anticipation (

The Hidden Markov Models for stationary activities, non-periodic activities, and periodic activities are merged into a Compound Hidden Markov Model, as specified in Section 3.3, and its parameters are re-estimated using Viterbi Learning with all the elements of the training set.

In order to assess the labelling accuracy of both the Compound Hidden Markov Model and some reference Hidden Markov Models, they were tested with a data set of human activities.

The tests were performed using the Microsoft Research Daily Activity 3D Data set (MSRDaily) [

The data set is composed of 16 activities: a) drink; b) eat; c) read book; d) call cellphone; e) write on a paper; f) use laptop; g) use vacuum cleaner; h) cheer up; i) remain still; j) toss paper; k) play game; l) lay down on sofa; m) walk; n) play guitar; o) stand up; and p) sit down. The activities are performed by 10 persons, who execute each activity twice, once in standing position and once in sitting position. There is a sofa in the scene. Three channels are recorded: depth maps (.bin), skeleton joint positions (.txt), and RGB video (.avi). There are

For the purpose of this work, only the skeleton joint positions were used as input for labelling the actions, as well as a subset of activities: a) remain still (sitting pose) (

At the training step, the Hidden Markov Model is generated using a training set of motion data. The training set is made of the motion data from the first 6 subjects of the MSRDaily data set, while the motion data of the last 4 subjects constitute the testing set.

The Microsoft Kinect sensor captures the depth map

capture, a skeleton represents a single frame of the motion, therefore, a whole motion sequence contains several skeletons. The training set of an activity is formed by captures of motion sequences of the same activity performed by several people.

First of all, the skeletons have their features extracted, using the algorithm described in the Section 3.1. All the features from the skeletons of the training set are clustered with the k-means algorithm. The centroids of the clusters become the codebook of key frames.
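The clustering step can be sketched with a plain implementation of the k-means algorithm. This is an illustration only: the function name `kmeans_codebook`, the random choice of initial centroids, the fixed number of iterations and the toy feature vectors are assumptions, since the text does not specify these details.

```python
import math
import random


def kmeans_codebook(points, k, iterations=20, seed=0):
    """Cluster feature vectors and return the k centroids (the codebook)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy two-dimensional feature vectors forming two groups.
features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
codebook = kmeans_codebook(features, k=2)
```

Each centroid in the returned list plays the role of one key frame, and its position in the list is the observation symbol.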

The codebook used in this work has 255 symbols, because that size provided the best labelling accuracy on the testing set among the codebook sizes tested: 31, 63, 127, 255, 511, 1023, 2047, and 4095 centroids.

The Hidden Markov Model for a non-stationary activity has the following structure for its states: the amount of states is N, where

The transition probabilities from the Anticipation State to the Action States are initialized to uniform values. There are no transitions from the Action States to the Anticipation State. The transition probabilities from the Action States to the Reaction State are initialized to uniform values. The transition probabilities from the Reaction State to the Anticipation State are also set to uniform values.

The observations from the anticipation section are used to initialize the emission probabilities of the Anticipation State. The observations from the reaction section are used to initialize the emission probabilities of the Reaction State. The emission probabilities of the Action States are initialized to random values.

Both the transition probabilities and the emission probabilities for all the States will be refined after applying Viterbi Learning [

The assessment of the quality of a labelled activity is done on the results of computing the Most Likely State Sequence from the observations of an activity.

The joints of the skeleton

The key frame with the minimum distance becomes an observation o, which is appended to a sequence of observations

The first Hidden Markov Model to test is an Ergodic Hidden Markov Model where each state represents a single activity, giving a total of 5 states (

For the second Hidden Markov Model, the proposal is a Hidden Markov Model organized like a Finite State Machine. Each activity is represented by a single state, giving a total of 5 states. The connections between the states of each activity use a language model.

The third Hidden Markov Model is a variation of the second Hidden Markov Model, where its parameters are retrained with Viterbi Learning. The connections between the states of each activity use a language model.

Both the second and the third Hidden Markov Model have a Graph-like structure (

The fourth Hidden Markov Model is the Compound Hidden Markov Model proposed in the Section 2.3 (

The language model for connecting coherent activities is the following:

• Sit Still → Sit Still.

• Sit Still → Stand Up → Stand Still.

• Stand Still → Stand Still.

• Stand Still → Sit Down → Sit Still.

• Stand Still → Walk → Stand Still.
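The language model above can be written as a table of allowed successors for each activity. The sketch below is an illustration that works on activity labels rather than on individual states; the names `LANGUAGE_MODEL` and `is_coherent` are hypothetical.

```python
# Allowed successor activities, read off the five rules listed above.
LANGUAGE_MODEL = {
    "sit still": {"sit still", "stand up"},
    "stand up": {"stand still"},
    "stand still": {"stand still", "sit down", "walk"},
    "sit down": {"sit still"},
    "walk": {"stand still"},
}


def is_coherent(labels):
    """Check that every consecutive pair of activities is allowed."""
    return all(
        following in LANGUAGE_MODEL[current]
        for current, following in zip(labels, labels[1:])
    )


sequence = ["sit still", "stand up", "stand still", "walk", "stand still"]
coherent = is_coherent(sequence)
```

A sequence such as sit still followed directly by walk is rejected, because the person has to stand up first.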

The sequence of observations

The criteria for determining the accuracy of the sequence of most likely states

The tests were performed on the four different Hidden Markov Models specified in the section 23. Each Hidden Markov Model was tested with the following numbers of symbols: 31, 63, 127, 255, 511, 1023, 2047, and 4095, while keeping the number of states of each Hidden Markov Model topology. The tables show each Hidden Markov Model with the numbers of symbols that provided the highest labelling accuracy.

The assessed data was the sequence of most likely states

Activity | Expected sequence of motions |
---|---|
Sit still | {Sit still} |
Stand still | {Stand still} |
Stand up | {Sit still, stand up, stand still} |
Sit down | {Stand still, sit down, sit still} |
Walk | {Stand still, walk, stand still (optional)} |

Model (#states) | 255 symbols | 511 symbols | 2047 symbols |
---|---|---|---|
Ergodic (5) | 25.00% | 37.50% | 58.33% |
Graph-like (5) | 41.66% | 43.75% | 54.16% |
Graph-like retrained (5) | 41.66% | 43.75% | 68.75% |
Compound (14) | 54.16% | 54.16% | 77.08% |

Model (#states) | 255 symbols | 511 symbols | 2047 symbols |
---|---|---|---|
Ergodic (5) | 6.25% | 15.62% | 18.75% |
Graph-like (5) | 18.75% | 12.50% | 6.25% |
Graph-like retrained (5) | 18.75% | 18.75% | 53.12% |
Compound (14) | 59.37% | 56.25% | 53.12% |

Subjects tested: 6

Model (#states) | Sit, 255 symbols | Sit, 511 symbols | Sit, 2047 symbols | Stand, 255 symbols | Stand, 511 symbols | Stand, 2047 symbols |
---|---|---|---|---|---|---|
Ergodic (5) | 4 | 6 | 6 | 2 | 3 | 6 |
Graph-like (5) | 6 | 6 | 6 | 0 | 0 | 0 |
Graph-like retrained (5) | 6 | 6 | 6 | 0 | 0 | 0 |
Compound (14) | 6 | 6 | 6 | 0 | 0 | 0 |

Subjects tested: 6

Model (#states) | Walk, 255 symbols | Walk, 511 symbols | Walk, 2047 symbols | Walk occluded, 255 symbols | Walk occluded, 511 symbols | Walk occluded, 2047 symbols |
---|---|---|---|---|---|---|
Ergodic (5) | 1 | 4 | 6 | 0 | 0 | 0 |
Graph-like (5) | 6 | 6 | 5 | 0 | 0 | 0 |
Graph-like retrained (5) | 6 | 6 | 6 | 0 | 0 | 2 |
Compound (14) | 6 | 6 | 6 | 4 | 4 | 3 |

Subjects tested: 6

Model (#states) | Stand up 1, 255 symbols | Stand up 1, 511 symbols | Stand up 1, 2047 symbols | Stand up 2, 255 symbols | Stand up 2, 511 symbols | Stand up 2, 2047 symbols |
---|---|---|---|---|---|---|
Ergodic (5) | 3 | 4 | 6 | 0 | 0 | 0 |
Graph-like (5) | 5 | 5 | 5 | 0 | 0 | 0 |
Graph-like retrained (5) | 5 | 5 | 6 | 0 | 0 | 4 |
Compound (14) | 5 | 6 | 6 | 1 | 0 | 4 |

Subjects tested: 6

Model (#states) | Sit down 1, 255 symbols | Sit down 1, 511 symbols | Sit down 1, 2047 symbols | Sit down 2, 255 symbols | Sit down 2, 511 symbols | Sit down 2, 2047 symbols |
---|---|---|---|---|---|---|
Ergodic (5) | 2 | 1 | 4 | 0 | 0 | 0 |
Graph-like (5) | 3 | 3 | 6 | 0 | 1 | 4 |
Graph-like retrained (5) | 3 | 3 | 5 | 0 | 1 | 4 |
Compound (14) | 3 | 3 | 6 | 1 | 1 | 6 |

Tables 4(a)-(d) show the results of the tests on the 4 subjects of the testing set from the MSRDaily data set. The first column shows the topology of the Hidden Markov Model, the columns 2-4 show the three sizes of codebooks which gave high accuracy for a first activity, and the columns 5-7 show three sizes of codebooks which gave high accuracy for a second activity.

Subjects tested: 4

Model (#states) | Sit, 255 symbols | Sit, 511 symbols | Sit, 2047 symbols | Stand, 255 symbols | Stand, 511 symbols | Stand, 2047 symbols |
---|---|---|---|---|---|---|
Ergodic (5) | 0 | 0 | 0 | 1 | 2 | 2 |
Graph-like (5) | 0 | 0 | 0 | 1 | 1 | 0 |
Graph-like retrained (5) | 0 | 1 | 2 | 1 | 1 | 1 |
Compound (14) | 2 | 2 | 2 | 3 | 2 | 2 |

Subjects tested: 4

Model (#states) | Walk, 255 symbols | Walk, 511 symbols | Walk, 2047 symbols | Walk occluded, 255 symbols | Walk occluded, 511 symbols | Walk occluded, 2047 symbols |
---|---|---|---|---|---|---|
Ergodic (5) | 1 | 2 | 4 | 0 | 0 | 0 |
Graph-like (5) | 4 | 1 | 1 | 0 | 1 | 0 |
Graph-like retrained (5) | 4 | 2 | 4 | 1 | 1 | 3 |
Compound (14) | 4 | 4 | 4 | 2 | 2 | 2 |

Subjects tested: 4

Model (#states) | Stand up 1, 255 symbols | Stand up 1, 511 symbols | Stand up 1, 2047 symbols | Stand up 2, 255 symbols | Stand up 2, 511 symbols | Stand up 2, 2047 symbols |
---|---|---|---|---|---|---|
Ergodic (5) | 0 | 1 | 0 | 0 | 0 | 0 |
Graph-like (5) | 0 | 1 | 1 | 0 | 0 | 0 |
Graph-like retrained (5) | 0 | 1 | 2 | 0 | 0 | 2 |
Compound (14) | 3 | 2 | 2 | 2 | 2 | 3 |

Subjects tested: 4

Model (#states) | Sit down 1, 255 symbols | Sit down 1, 511 symbols | Sit down 1, 2047 symbols | Sit down 2, 255 symbols | Sit down 2, 511 symbols | Sit down 2, 2047 symbols |
---|---|---|---|---|---|---|
Ergodic (5) | 0 | 0 | 0 | 0 | 0 | 0 |
Graph-like (5) | 0 | 0 | 0 | 1 | 0 | 0 |
Graph-like retrained (5) | 0 | 0 | 2 | 0 | 0 | 1 |
Compound (14) | 1 | 1 | 2 | 2 | 3 | 0 |
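The average accuracies reported for the testing set can be reproduced from the per-activity counts for the 255-symbol codebook. The sketch below assumes that each table cell is the number of correctly labelled sequences out of the 4 subjects tested, which gives 32 sequences per model; the variable and function names are hypothetical.

```python
# Correctly labelled sequences with 255 symbols, testing set (4 subjects).
# Column order: sit, stand, walk, walk occluded,
#               stand up 1, stand up 2, sit down 1, sit down 2.
CORRECT_255 = {
    "Ergodic": [0, 1, 1, 0, 0, 0, 0, 0],
    "Graph-like": [0, 1, 4, 0, 0, 0, 0, 1],
    "Graph-like retrained": [0, 1, 4, 1, 0, 0, 0, 0],
    "Compound": [2, 3, 4, 2, 3, 2, 1, 2],
}
SUBJECTS = 4


def average_accuracy(counts, subjects):
    """Percentage of correctly labelled sequences over all cases and subjects."""
    return 100.0 * sum(counts) / (len(counts) * subjects)


accuracies = {
    model: average_accuracy(counts, SUBJECTS)
    for model, counts in CORRECT_255.items()
}
```

Under that assumption the Compound model labels 19 of 32 sequences correctly (59.375%), while the Ergodic model labels 2 of 32 (6.25%) and both Graph-like models label 6 of 32 (18.75%), which agrees with the reported percentages.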

The results for both the training set and the testing set show that the Compound Hidden Markov Model correctly labels a sequence of motion more often than an Ergodic Hidden Markov Model or the Graph-like Hidden Markov Models when the number of symbols is less than 2047 (

In the Hidden Markov Models whose codebooks are of

It must be noted that the “Walk Occluded” activity is labelled incorrectly by all the Hidden Markov Models. The reason for such failure is that the skeleton data is incorrect or noisy because a sofa occludes the person who is walking. The algorithm which computes the skeleton [

We present results for labelling human activity from the skeleton data of a single Microsoft Kinect sensor. We present a novel way of computing features of a skeleton using distances between certain joints of both the upper body and the lower body. We also propose a Compound Hidden Markov Model for labelling cyclic and non-cyclic human activities, which performs better than the reference Hidden Markov Models: an Ergodic Hidden Markov Model and a Graph-like Hidden Markov Model. The results for labelling 5 activities from 4 non-trained subjects show that the Compound Hidden Markov Model, with a codebook of 255 symbols, correctly labels a sequence of motion with an average accuracy of 59.37%, which is higher than the average labelling accuracy for unknown subjects of an Ergodic Hidden Markov Model (6.25%) and of a Compound Hidden Markov Model with activities modelled by a single state (18.75%), both with a codebook of 255 symbols. The contributions of this work are the representation of a full body pose with Euclidean distances between certain pairs of body joints, and the method for training a Compound Hidden Markov Model for activity labelling by segmenting the training data into the Anticipation-Action-Reaction sections from the theory of animation. Future work involves using a new representation for the skeleton, based on Orthogonal Direction Change Chain Codes [

This work was supported by PAPIIT-DGAPA UNAM under Grant IN-107609.

Jose Israel Figueroa-Angulo, Jesus Savage, Ernesto Bribiesca, Luis Enrique Sucar, Boris Escalante (2015) Compound Hidden Markov Model for Activity Labelling. International Journal of Intelligence Science, 05, 177-195. doi: 10.4236/ijis.2015.55016