J. Software Engineering & Applications, 2010, 3, 1060-1066
doi:10.4236/jsea.2010.311125 Published Online November 2010 (http://www.SciRP.org/journal/jsea)

3D Human Pose Estimation from a Monocular Image Using Model Fitting in Eigenspaces

Geli Bo, Katsunori Onishi, Tetsuya Takiguchi, Yasuo Ariki
Graduate School of Engineering, Kobe University, Kobe, Japan.
Email: {takigu, ariki}@kobe-u.ac.jp

Received October 4th, 2010; revised October 26th, 2010; accepted November 3rd, 2010.

ABSTRACT

Generally, there are two approaches to solving the problem of human pose estimation from monocular images. One is the learning-based approach, and the other is the model-based approach. The former can estimate poses rapidly but has the disadvantage of low estimation accuracy; the latter can estimate poses accurately, but its computational cost is high. In this paper, we propose a method that integrates the learning-based and model-based approaches to improve estimation precision. In the learning-based approach, we use regression analysis to model the mapping from visual observations to human poses. In the model-based approach, a particle filter is employed on the results of the regression analysis. To address the curse of dimensionality, the eigenspace of each motion is learned using Principal Component Analysis (PCA). Finally, the proposed method was evaluated using the CMU Graphics Lab Motion Capture Database. The RMS error of human joint angles was 6.2 degrees using our method, an improvement of up to 0.9 degrees compared to the method without eigenspaces.

Keywords: HOG, Regression Analysis, Eigenspaces, Particle Filter, Pose Estimation

1. Introduction

The estimation of the 3D configuration of complex articulated objects from monocular images has been widely studied. Once the technology is perfected, there will be potential applications in many fields related to human pose and kinematic information, such as computer interfaces that utilize gesture input, interaction with robots, video surveillance, and entertainment. However, monocular human pose estimation is extremely challenging due to the complicated nature of human motion and the limited amount of information in 2D images.

The methods of human pose estimation can be summarized into two approaches: learning-based and model-based. In the learning-based method [1-4], features are directly extracted from the image, and the mapping function to human poses is trained using the image features. Through this mapping, the human pose in an image can be estimated. Once the training process is completed, the pose estimation is performed rapidly. However, the estimation precision decreases when the input image is not covered by the training data. In the model-based method [5-8], pose estimation follows Bayes' theorem and models the posterior probability density using an observation likelihood or cost function. This method is computationally expensive in general and depends on an initial pose.

To solve these problems, we propose a method that integrates the learning-based and model-based methods to improve the estimation accuracy. An initial pose is determined using regression analysis in the learning-based approach, and the estimation method is then switched to a particle filter in the model-based approach to improve the precision.
Unfortunately, given the large dimensionality of a 3D human model space, it is almost impractical to apply particle filtering directly, as a large number of particles is required to adequately approximate the underlying probability distribution in the human pose space. Therefore, we first use PCA to learn the eigenspace of each motion. Then, the optimal human pose is efficiently searched in the eigenspaces selected according to the estimated type of human motion in the input images.

2. Features

2.1. Image Features

We use the HOG feature [9], which can describe the shape of an object in an appearance-based approach [10]. HOG was proposed as a gradient-based feature for general object recognition, and it describes the feature over a given region. This means that HOG can represent the rough shape of an object. Moreover, since HOG can tolerate a range of varying illumination, it is suitable for pose estimation [11]. Figure 1 presents the complete processing chain of HOG feature extraction. We discuss the HOG encoding algorithm in detail in this section.

Figure 1. The flow of feature extraction: input image → background subtraction & normalization → gradient computation → weighted vote into spatial & orientation cells → normalization over overlapping blocks → HOG descriptor.

2.1.1. Gradient Computation

Before extraction of the HOG feature, we first separate the human region from the input image using background subtraction, where the size of the human region is normalized and the human region is placed in the center of the image. Then the image gradient is computed as follows:

$$f_x(x,y) = I(x+1,y) - I(x-1,y), \quad f_y(x,y) = I(x,y+1) - I(x,y-1) \tag{1}$$

where $f_x$ and $f_y$ denote the $x$ and $y$ components of the image gradient, respectively, and $I(x,y)$ denotes the pixel intensity at position $(x,y)$. The magnitude $m(x,y)$ and orientation $\theta(x,y)$ are computed by

$$m(x,y) = \sqrt{f_x(x,y)^2 + f_y(x,y)^2} \tag{2}$$

$$\theta(x,y) = \tan^{-1}\!\left( f_y(x,y) \,/\, f_x(x,y) \right) \tag{3}$$

In order to make the HOG feature insensitive to clothing and facial expression, we use the unsigned orientation of the image gradient, which is computed as follows:

$$\bar{\theta}(x,y) = \begin{cases} \theta(x,y) + \pi & \text{if } \theta(x,y) < 0 \\ \theta(x,y) & \text{otherwise} \end{cases} \tag{4}$$

2.1.2. Orientation Histograms

The gradient image is divided into cells of $c_w \times c_h$ pixels, as shown in Figure 2. In each cell, the orientation $\bar{\theta}(x,y)$ is quantized into $c_b$ orientation bins, each vote weighted by its magnitude $m(x,y)$. That is, a histogram with $c_b$ orientations is computed for each cell.

2.1.3. Block Normalization

Figure 2 shows the orientation histogram extracted at every cell and the larger spatial blocks of $b_w \times b_h$ cells. Since a cell has $c_b$ orientations, the feature dimension of each block is $d_b = b_w b_h c_b$. Let $v$ denote the feature vector of a block and $h_{ij}$ denote the unnormalized histogram of the cell at position $(i,j)$, $1 \le i \le b_h$, $1 \le j \le b_w$, within the block. The feature vector of a block is normalized as follows:

$$h_{ij} \leftarrow \frac{h_{ij}}{\sqrt{\|v\|_2^2 + 1}} \tag{5}$$

Since the normalization is done over overlapping blocks, the histograms $h_{ij}$ are repeatedly normalized by different blocks.

Figure 2. Block normalization: blocks of $b_w \times b_h$ cells, cells of $c_w \times c_h$ pixels, $c_b$ orientations; overlapping blocks and normalization.
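To make the above concrete, the following is a minimal sketch of the HOG computation of Equations (1)-(5) in Python with NumPy. The cell size, block size, and bin count are illustrative assumptions rather than the parameters used in the paper, and border pixels are ignored for brevity.

```python
import numpy as np

def hog_descriptor(gray, cell=(8, 8), block=(2, 2), bins=9):
    """Minimal HOG sketch following Equations (1)-(5).

    gray: 2-D float array holding the normalized human region.
    The cell/block sizes and bin count are illustrative choices.
    """
    # Equation (1): centered differences; border pixels are left at zero.
    fx = np.zeros_like(gray)
    fy = np.zeros_like(gray)
    fx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    fy[1:-1, :] = gray[2:, :] - gray[:-2, :]

    # Equations (2)-(3): gradient magnitude and orientation.
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx)

    # Equation (4): fold the orientation into the unsigned range [0, pi).
    ang[ang < 0] += np.pi

    ch, cw = cell
    n_y, n_x = gray.shape[0] // ch, gray.shape[1] // cw
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)

    # Per-cell histograms: each pixel votes for its orientation bin,
    # weighted by its gradient magnitude.
    hist = np.zeros((n_y, n_x, bins))
    for i in range(n_y):
        for j in range(n_x):
            sl = (slice(i * ch, (i + 1) * ch), slice(j * cw, (j + 1) * cw))
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)

    # Equation (5): L2 normalization over overlapping blocks.
    bh, bw = block
    feats = []
    for i in range(n_y - bh + 1):
        for j in range(n_x - bw + 1):
            v = hist[i:i + bh, j:j + bw].ravel()
            feats.append(v / np.sqrt(np.sum(v ** 2) + 1.0))
    return np.concatenate(feats)
```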
2.2. 3D Human Model

The human body can be regarded as a multi-joint object that transforms into various shapes, and each segment connecting two joints can be regarded as a rigid object. Therefore, it is possible to express a 3D human model with joint angles. In this research, we used the motion capture data in the CMU Graphics Lab Motion Capture Database [12]. The 3D human model is represented by 56 joint angles, so the dimension of the pose state vector is 56. Figure 3 shows an example of the 3D human model.

Figure 3. An example of the 3D human model.

3. The Pose Estimation Method

Human pose estimation is carried out using two approaches. In the model-fitting method, human pose is estimated by an iterative procedure [5]. However, it is problematic in that the initial value has to be given manually. Therefore, we adopt the learning-based method to automatically obtain the initial value, which is integrated into the iterative procedure of the model-based method. One drawback of the model-based method is the high dimensionality of the state space, which makes the algorithm computationally inefficient. Thus we use PCA to reduce the dimension of the pose state and establish the eigenspace of each motion. Figure 4 shows the proposed pose estimation system.

Figure 4. Pose estimation system: the input image $z$ is mapped by regression to a 3D human pose used as the initial value $x$; an eigenspace is selected and the pose is projected into it; the particle filter generates samples and evaluates their likelihood in the eigenspace; the best pose is back-projected to the high-dimensional space as the output (3D human pose).

3.1. Learning-Based Method Using Regression Analysis

In the learning-based method, we adopt regression analysis [1,13] to estimate the pose of the input image. Let $x$ denote the vector composed of the angles at the joints of the 3D human model. The relation between the HOG feature vector $z$ and the 3D pose vector $x$ is linearly approximated by the following formula:

$$x = Rz + \epsilon \tag{6}$$

where $\epsilon$ is the residual error vector. The 3D human pose is estimated by converting the input image feature $z$ to the 3D human model vector $x$. In model training (estimating $R$), a set of $n$ training pairs $\{(x_i, z_i)\}_{i=1}^{n}$ is given (in our case, 3D poses and the corresponding image HOG features). The conversion matrix $R$ is estimated by minimizing the mean square error. Packing the training data into the 3D pose matrix $X = (x_1 \, x_2 \, \cdots \, x_n)$ and the image feature matrix $Z = (z_1 \, z_2 \, \cdots \, z_n)$, the training is performed as follows:

$$\hat{R} := \arg\min_R \| RZ - X \|^2 \tag{7}$$

In the testing phase, the 3D human posture vector $x$ is estimated by converting the HOG feature vector $z$ using the computed conversion matrix $R$. Figure 5 illustrates the regression-based estimation method.

Figure 5. Regression-based estimation method: $N$ pairs of images and 3D human model angles are used to estimate $R$ by least mean squares; at test time the HOG feature $z$ is converted to the 3D human model pose $x$.
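Equation (7) has the familiar closed-form least-squares solution $\hat{R} = X Z^{+}$, where $Z^{+}$ is the Moore-Penrose pseudo-inverse of $Z$. A minimal sketch, assuming features and poses are stacked as columns of $Z$ and $X$:

```python
import numpy as np

def train_regressor(Z, X):
    """Equation (7): least-squares estimate of the conversion matrix R.

    Z : (d_z, n) matrix whose columns are HOG feature vectors.
    X : (d_x, n) matrix whose columns are 56-dimensional pose vectors.
    Returns R such that X is approximated by R @ Z.
    """
    return X @ np.linalg.pinv(Z)

def estimate_pose(R, z):
    """Equation (6) with the residual dropped: x = R z."""
    return R @ z
```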
3.2. Model-Based Method Using a Particle Filter

In the model-based method, a particle filter [14] is employed. Following a notation similar to [14], we define $x_t$ as the state vector at time $t$, with $z_t$ denoting the measurement at time $t$. Furthermore, let all the measurements up to time $t$ be given by $Z_t = (z_1, \ldots, z_t)$. Particle filtering based on Bayes' theorem is used to obtain the posterior probability $p(x_t \mid Z_t)$ at each time-step using all the available information, as shown below:

$$p(x_t \mid Z_t) \propto p(z_t \mid x_t)\, p(x_t \mid Z_{t-1}) \tag{8}$$

This equation is evaluated recursively as described below. The fundamental idea of particle filtering is to approximate the posterior probability density function (pdf) over $x_t$ by a weighted sample set $S_t$. Suppose that $N$ samples from the posterior pdf $p(x_t \mid Z_t)$ are available and denote them as $x_t^{(i)}$. Then the $i$-th weighted sample at time $t$ is represented by $s_t^{(i)} = (x_t^{(i)}, \pi_t^{(i)}) \in S_t$. First, a cumulative histogram of all the samples' weights at time $t-1$ is computed. Then, according to each particle's weight $\pi_{t-1}^{(i)}$, its number of successors is determined by its relative probability in this cumulative histogram. At the prediction step, the new state $x_t$ is computed using the following Chapman-Kolmogorov equation:

$$p(x_t \mid Z_{t-1}) = \int_{x_{t-1}} p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Z_{t-1})\, dx_{t-1} \tag{9}$$

At the measurement step, the new state $x_t$ is weighted according to its likelihood given the new measurement $z_t$. The posterior density $p(x_t \mid Z_t)$ is represented by the set of weighted particles $\{(s_t^{(1)}, \pi_t^{(1)}), \ldots, (s_t^{(N)}, \pi_t^{(N)})\}$, where the weights $\pi_t^{(n)} \propto p(z_t \mid x_t = s_t^{(n)})$ are normalized so that $\sum_{i=1}^{N} \pi_t^{(i)} = 1$. The new state $x_t$ can then be estimated by

$$\hat{x}_t = \sum_{n=1}^{N} \pi_t^{(n)} s_t^{(n)} \tag{10}$$

The measurement step of Equation (10) and the prediction step of Equation (9) together form the Bayes formulation of Equation (8).

In the ordinary model-based method, a particle filter is utilized to match the 3D model with the input image, and the initial value needs to be set manually [15,16]. In our method, the initial value can be obtained from the learning-based method, so the manual configuration is replaced by an automatic estimation process. The pose estimated by regression analysis is used as the initial value, and the particles are sampled around it. The likelihood of each particle is evaluated as its weight, and the particles are regenerated by a resampling process based on these weights. After repeating resampling several times, the particle with the highest likelihood is considered the final state.
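The following is a minimal sketch of one resampling / prediction / measurement cycle corresponding to Equations (8)-(10). The Gaussian diffusion used for the prediction step is an illustrative assumption, since the paper does not specify its dynamics model, and the `likelihood` callable stands in for the weighting function of Section 3.4.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, noise_std=0.05, rng=None):
    """One resampling / prediction / measurement cycle (Equations (8)-(10)).

    particles : (N, d) array of pose states in the selected eigenspace.
    weights   : (N,) normalized weights pi from the previous step.
    likelihood: callable returning a positive score for a state vector.
    noise_std is an illustrative diffusion parameter (an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)

    # Resampling: each particle's number of successors is proportional
    # to its weight (the cumulative-histogram selection in the text).
    particles = particles[rng.choice(n, size=n, p=weights)]

    # Prediction (Equation (9)): diffuse particles with Gaussian noise.
    particles = particles + rng.normal(0.0, noise_std, particles.shape)

    # Measurement: reweight each particle by its likelihood and normalize.
    weights = np.array([likelihood(x) for x in particles])
    weights = weights / weights.sum()

    # Equation (10): the state estimate as the weighted mean.
    estimate = weights @ particles
    return particles, weights, estimate
```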
3.3. State Space

The 3D human model was introduced in Subsection 2.2. In this section, it is treated as the state vector of the particle filter. However, in a real environment, the dimension of the state space is normally very high, which causes both low computational efficiency and poor convergence performance.

We therefore utilize an eigenspace of each motion, constructed by PCA, as a motion prior that constrains the corresponding motion [17]. At the same time, the dimension reduction makes it possible to search efficiently in the low-dimensional space. Suppose that the number of motion types is $M$. When PCA is carried out using the training data of a certain motion $m \in \{1, \ldots, M\}$, the 3D human pose $x_m$ is projected into the eigenspace as follows:

$$\tilde{x}_m = P_m^{\top} (x_m - \bar{x}_m) \tag{11}$$

where $\bar{x}_m$ denotes the mean pose vector of motion $m$, $P_m$ is the base vector matrix, and $\tilde{x}_m$ denotes the pose vector in the eigenspace. Because PCA is carried out for the $M$ types of motion in the training data, $M$ kinds of eigenspaces are constructed. The pose vector is projected into the eigenspace of each motion and used as the state of a particle, as shown in Figure 6. The reduced dimension of each PCA is chosen so that the cumulative proportion rate reaches 95%.

Figure 6. Original space and eigenspaces: the eigenspace of each motion is embedded in the high-dimensional pose space.

3.4. Likelihood

In the likelihood calculation stage, the state vector of every particle is converted to a 3D human pose in the high-dimensional space using the PCA inverse transformation. Then, we use MAYA to produce the CG image that represents the pose of every particle. This CG image is compared with the input image.

The performance of the particle filter depends largely on the image features used to calculate the likelihood. The ideal image features should remain stable in various scenarios and be easy to extract. In our case, we adopt two features to construct the weighting function: HOG for representing the human conformation, and the silhouette for evaluating the human region.

Configuration: HOG represents the human conformation and is robust to changes in color, clothes, and illumination. The CG image is generated from the particle state, and its HOG is extracted (Figure 7(b)). The distance between the input image feature $z$ and the state feature $z'$ is calculated as follows:

$$E_{hog}(x, z) = \frac{1}{Dim}\, \chi^2(z, z') \tag{12}$$

$$\chi^2(z, z') = \sum_{c=1}^{Dim} \frac{(z_c - z'_c)^2}{z_c + z'_c} \tag{13}$$

where $Dim$ denotes the dimension of the HOG, $\chi^2(z, z')$ is the $\chi^2$-distance between $z$ and $z'$, and $z_c$ indicates the $c$-th element of the HOG feature vector.

Region: The silhouette image is extracted using the background subtraction method (Figure 7(c)). The silhouette can be used for stable evaluation because it is also robust to changes of color, clothes, and illumination. After the silhouette is extracted, a pixel map is constructed with foreground pixels set to 1 and background pixels set to 0, and the distance is computed as follows:

$$E_{region}(x, z) = \frac{1}{K} \sum_{i=1}^{K} p_i(x, z) \tag{14}$$

where $K$ is the number of pixels and $p_i(x, z)$ is the value of the binary XOR operation between the input image silhouette and the state silhouette at pixel $i$.

Figure 7. Feature extraction: (a) CG image, (b) HOG descriptor, (c) silhouette.

Next, the fitness $C(x)$ on the image is computed as follows:

$$C(x) = \exp\!\left( -\left( E_{hog}(x, z) + E_{region}(x, z) \right) \right) \tag{15}$$

In our method, the solution search is restricted to the eigenspace of the corresponding motion. Nevertheless, the pose of an image can be further constrained to a certain range of the eigenspace, where we use the trajectory of the motion as a prior constraint. We regard the sequence of training vectors in the eigenspace as the trajectory of this motion. The distance $\rho$ between the state vector and the training-data vectors in the eigenspace is then used as a penalty in the likelihood evaluation. Finally, the best solution is obtained by the rule shown in Equation (16):

$$L = \frac{1}{1 + \rho}\, C(x) \tag{16}$$
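A minimal sketch of this weighting function, following the reconstruction of Equations (12)-(16) above; the `penalty` argument stands in for the trajectory distance $\rho$, and the small constant guarding the $\chi^2$ denominator is an implementation assumption:

```python
import numpy as np

def chi_square(z, z_ref, eps=1e-9):
    """Equation (13): chi-square distance between two HOG vectors.
    eps guards against zero denominators (an implementation assumption)."""
    z, z_ref = np.asarray(z, float), np.asarray(z_ref, float)
    return np.sum((z - z_ref) ** 2 / (z + z_ref + eps))

def fitness(z_cg, z_in, sil_cg, sil_in):
    """Equations (12), (14), (15): fitness C(x) of one particle.

    z_cg, z_in    : HOG vectors of the rendered CG image and the input image.
    sil_cg, sil_in: binary (0/1) silhouette maps of identical shape.
    """
    e_hog = chi_square(z_cg, z_in) / len(z_in)   # Equation (12)
    e_region = np.mean(sil_cg != sil_in)         # Equation (14): mean XOR over pixels
    return np.exp(-(e_hog + e_region))           # Equation (15)

def likelihood(c, penalty):
    """Equation (16): fitness discounted by the trajectory penalty rho."""
    return c / (1.0 + penalty)
```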
3.5. Selecting the Eigenspace

In our method, it is necessary to select the proper eigenspace according to the estimated motion of the input image, because an eigenspace is constructed for each motion. First, the training samples $s_m^i$ ($m = 1, \ldots, M$; $i = 1, \ldots, |S_m|$) are embedded into each eigenspace. The mean distance between the samples and the initial pose $x$ obtained by the regression analysis is computed, and the input motion is decided as the motion with the nearest mean distance. The mean distances are computed in the eigenspace and in the high-dimensional space, respectively, as follows:

$$\tilde{\rho}_m = \frac{1}{\tilde{D}} \frac{1}{|S_m|} \sum_{i=1}^{|S_m|} \left\| \tilde{x}_m - \tilde{s}_m^i \right\|^2 \tag{17}$$

$$\rho_m = \frac{1}{D} \frac{1}{|S_m|} \sum_{i=1}^{|S_m|} \left\| x - x_m^i \right\|^2 \tag{18}$$

Here $\tilde{D}$ and $D$ denote the dimensions of the eigenspace and the original space, respectively; $\tilde{x}_m$ denotes the low-dimensional pose vector to which $x$ is mapped in the eigenspace; $x_m^i$ denotes the high-dimensional pose vector obtained by reverting $\tilde{s}_m^i$ to the original space using the PCA inverse transformation; and $|S_m|$ is the number of samples. The motion of the input image is decided by minimizing the sum of the two distances defined by Equation (17) and Equation (18), as shown in Equation (19), where $f$ is the function that determines which eigenspace $\hat{m}$ the motion belongs to:

$$\hat{m} = f(x) = \arg\min_{m \in \{1, \ldots, M\}} \left( \tilde{\rho}_m + \rho_m \right) \tag{19}$$

3.6. Estimation of the Human Orientation

Even if the orientation of the human is unknown, the human pose can be estimated with regression analysis, because the image feature changes according to the orientation. However, if the orientation is not estimated in the model-based method, the image cannot be matched. Consequently, images in all 360 degrees of orientation are generated by CG using the results of the regression analysis, and the human orientation is estimated using Equation (15) as follows:

$$\hat{\theta} = \arg\max_{1 \le \theta \le 360} C_\theta(x) \tag{20}$$

where $C_\theta(x)$ denotes the fitness between the input image and the image generated from pose $x$ with orientation $\theta$. The model fitting is carried out using the estimated orientation $\hat{\theta}$ obtained by Equation (20).

4. Experiments

4.1. Experiment Setup

We conducted the experiment using the CMU Graphics Lab Motion Capture Database. First, we used motion capture data to produce a CG animation with a resolution of 640 × 480 pixels. Then we rotated the figure on the horizontal plane in eight directions and took the CG image in each direction as experiment data. We carried out experiments for three kinds of motion (walking, running, and jumping). The images used for training are summarized in Table 1.

If the test motion is a cyclic movement, only four images of the typical pose are needed to represent a certain cyclic motion. In order to represent the motion sufficiently, we used eight images in each direction for a continuous action. Table 2 lists the details of the test data.

Table 1. The number of training data (frames).

            1 orientation    Total (8 orientations)
  Walking        90                  720
  Running       189                1,512
  Jumping       192                1,536
  Total         471                3,768

Table 2. The number of test data (frames).

            1 orientation    Total (8 orientations)
  Walking         8                   64
  Running         8                   64
  Jumping         8                   64
  Total          24                  192

4.2. Experiment Results

The RMS of the absolute difference errors between the true joint angles $x$ and the estimated joint angles $x'$ was computed by Equation (21), where $m$ indicates the number of joints:

$$D(x, x') = \frac{1}{m} \sum_{i=1}^{m} \left( \left| x_i - x'_i \right| \bmod 180 \right) \tag{21}$$
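A minimal sketch of this error metric, under the reading of Equation (21) given above (mean absolute joint-angle difference, wrapped modulo 180 degrees):

```python
import numpy as np

def joint_angle_error(x_true, x_est):
    """Equation (21): mean absolute joint-angle error in degrees,
    with each difference wrapped into [0, 180)."""
    diff = np.abs(np.asarray(x_true, float) - np.asarray(x_est, float)) % 180.0
    return diff.mean()

# Hypothetical usage with two 56-dimensional joint-angle vectors:
# err = joint_angle_error(true_angles, estimated_angles)
```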
Figure 8 shows the RMS error over all joint angles for all the motions. The estimation precision was improved by the iterative procedure. The horizontal axis indicates the number of iterations: the regression-analysis result is shown at the first iteration, and from the next iteration onward, pose estimation proceeds by repeatedly applying the particle filter. The results show that the accuracy of estimation is improved significantly by the use of the trajectory as a constraint.

Figure 8. Pose estimation results: RMS error (degrees) versus the number of iterations for the proposed method and the method without the trajectory constraint.

Table 3 shows the result of the eigenspace selection described in Subsection 3.5. One reason for the limited accuracy is that, in our method, eigenspace selection is applied not to sequence data but to just one frame image; therefore, the motion recognition accuracy is not especially high.

Table 3. Confusion matrix of eigenspace selection [%] (rows: actual motion, columns: selected motion).

             Walking   Running   Jumping
  Walking     84.38      3.12     12.50
  Running      9.37     78.13     12.50
  Jumping     15.63      4.68     79.69

The selection of a proper eigenspace greatly affects the final estimation results. Figure 9 compares the accuracy of two estimation scenarios. The blue broken line represents the scenario in which the motion type of the input image is given and estimation is carried out in the corresponding eigenspace. Comparing these results with the red line for an unknown motion type, it can be concluded that our method achieves good estimation precision even when the eigenspace must be selected automatically.

Figure 9. Estimation precision in eigenspaces: RMS error (degrees) versus the number of iterations when the motion type is given and when it is unknown.

The results of human pose estimation from real images [18,19] are shown in Figure 10. The first row shows the input real images, and the second row shows the synthetic images generated from the estimated poses. This confirms that our method works effectively on real images.

Figure 10. Experimental result on real images.

5. Conclusions

In this paper, we presented an approach for estimating 3D human pose from a monocular image that integrates the learning-based and model-based estimation methods into one framework. Furthermore, through the construction of eigenspaces, more efficient particle filter performance is obtained. Consequently, the precision of estimation is clearly improved, and the experimental results demonstrated that our approach is effective. The particle filter did not need to have its initial value provided manually, and it could reach a converged solution in fewer iterations. In future work, in order to further improve the estimation accuracy, we plan to use video data as input, because the temporal coherence between frames may provide useful information for selecting the eigenspace of each motion.

REFERENCES

[1] A. Agarwal and B. Triggs, "3D Human Pose from Silhouettes by Relevance Vector Regression," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, 2004, pp. 882-888.

[2] X. Zhao, H. Ning, Y. Liu and T. Huang, "Discriminative Estimation of 3D Human Pose Using Gaussian Processes," Proceedings of the 19th International Conference on Pattern Recognition (ICPR 2008), December 2008, pp. 1-4.

[3] C. Sminchisescu, A. Kanaujia and D. N. Metaxas, "BM³E: Discriminative Density Propagation for Visual Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 11, November 2007, pp. 2030-2044.

[4] H. Ning, Y. Hu and T. Huang, "Efficient Initialization of Mixtures of Experts for Human Pose Estimation," 15th IEEE International Conference on Image Processing (ICIP 2008), October 2008, pp. 2164-2167.

[5] M. Lee and I. Cohen, "A Model-Based Approach for Estimating Human 3D Poses in Static Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 6, June 2006, pp. 905-916.

[6] T. Jaeggli, E. Koller-Meier and L. V. Gool, "Learning Generative Models for Monocular Body Pose Estimation," Proceedings of the 8th Asian Conference on Computer Vision, Vol. 1, 2007.

[7] S. Hou, A. Galata, F. Caillette, N. Thacker and P. Bromiley, "Real-Time Body Tracking Using a Gaussian Process Latent Variable Model," IEEE 11th International Conference on Computer Vision, October 2007, pp. 1-8.

[8] P. Guan, A. Weiss, A. O. Balan and M. J. Black, "Estimating Human Shape and Pose from a Single Image," IEEE 12th International Conference on Computer Vision (ICCV 2009), September 2009, pp. 1381-1388.

[9] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1, June 2005, pp. 886-893.

[10] G. Mori and J. Malik, "Recovering 3D Human Body Configurations Using Shape Contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 7, 2006, pp. 1052-1062.

[11] K. Onishi, T. Takiguchi and Y. Ariki, "3D Human Posture Estimation Using the HOG Features from Monocular Image," 19th International Conference on Pattern Recognition (ICPR 2008), December 2008, pp. 1-4.

[12] CMU Graphics Lab Motion Capture Database. Available online: http://mocap.cs.cmu.edu/
[13] A. Fossati, M. Salzmann and P. Fua, "Observable Subspaces for 3D Human Motion Recovery," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), June 2009, pp. 1137-1144.

[14] M. Isard and A. Blake, "Condensation - Conditional Density Propagation for Visual Tracking," International Journal of Computer Vision, Vol. 29, No. 1, 1998, pp. 5-28.

[15] J. Deutscher, A. Blake and I. Reid, "Articulated Body Motion Capture by Annealed Particle Filtering," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, June 2000, pp. 126-133.

[16] L. Ye, Q. Zhang and L. Guan, "Use Hierarchical Genetic Particle Filter to Figure Articulated Human Tracking," International Conference on Multimedia and Expo (ICME 2008), 2008, pp. 1561-1564.

[17] X. Zhao and Y. Liu, "Tracking 3D Human Motion in Compact Base Space," IEEE Workshop on Applications of Computer Vision (WACV 2007), February 2007, p. 39.

[18] H. Sidenbladh, M. Black and D. Fleet, "Stochastic Tracking of 3D Human Figures Using 2D Image Motion," Computer Vision - ECCV 2000, Vol. 1843, 2000, pp. 702-718.

[19] R. Urtasun, D. Fleet and P. Fua, "Monocular 3D Tracking of the Golf Swing," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 2, June 2005, pp. 932-938.