Multi-Scale Human Pose Tracking in 2D Monocular Images


In this paper we address the problem of tracking human poses across multiple perspective scales in 2D monocular images and videos. Most state-of-the-art 2D tracking approaches rarely discuss scale variation. In practice, however, videos often contain human motion whose scale changes dynamically. We propose a tracking framework that handles this problem, built around a scale checking and adjusting algorithm that automatically corrects the perspective scale during tracking. Two metrics are proposed for detecting and adjusting the scale change. The first is the height of the tracked target, which is suitable for sequences in which the target stays upright and does not stretch its limbs. The second is more generic and is invariant to motion type: the ratio between the pixel count of the target silhouette and that of the detected bounding boxes of the target body. The proposed algorithm is tested on the publicly available HumanEva dataset. The experimental results show that our method achieves higher accuracy and efficiency than state-of-the-art approaches.
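The two scale metrics described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the update rule, so the square-root mapping from area ratio to scale, the reference values, the tolerance, and all function names below are assumptions made for illustration. The silhouette is taken to be a binary mask (a list of rows), and the body bounding boxes are taken to be (x0, y0, x1, y1) rectangles placed at the current model scale.

```python
from math import sqrt

def silhouette_height(mask):
    """Pixel height of the foreground: span of rows containing any foreground pixel."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    return rows[-1] - rows[0] + 1 if rows else 0

def silhouette_area(mask):
    """Number of foreground pixels in the mask."""
    return sum(1 for row in mask for v in row if v)

def box_area(boxes):
    """Summed area of (x0, y0, x1, y1) boxes; overlapping regions are counted twice."""
    return sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)

def scale_from_height(mask, ref_height):
    """Metric 1 (assumed form): relative scale from target height.
    Only meaningful while the target is upright with no limbs stretched."""
    return silhouette_height(mask) / ref_height

def scale_from_area_ratio(mask, boxes, ref_ratio):
    """Metric 2 (assumed form): relative scale from silhouette/box pixel ratio.
    Area grows with the square of the scale, hence the square root."""
    ratio = silhouette_area(mask) / box_area(boxes)
    return sqrt(ratio / ref_ratio)

def needs_adjustment(scale, tol=0.1):
    """Hypothetical scale check: flag a change larger than tol (assumed threshold)."""
    return abs(scale - 1.0) > tol

def make_mask(height, width, top, left, fg_h, fg_w):
    """Helper for the example: a mask with one filled foreground rectangle."""
    return [[1 if top <= r < top + fg_h and left <= c < left + fg_w else 0
             for c in range(width)] for r in range(height)]

if __name__ == "__main__":
    boxes = [(0, 0, 2, 4)]                  # body-model boxes at the current scale
    ref = make_mask(12, 12, 0, 0, 4, 2)     # reference frame: 2 x 4 pixel target
    cur = make_mask(12, 12, 0, 0, 8, 4)     # later frame: target twice as large
    ref_ratio = silhouette_area(ref) / box_area(boxes)
    print(scale_from_height(cur, silhouette_height(ref)))
    print(scale_from_area_ratio(cur, boxes, ref_ratio))
```

In this toy case both metrics report the same scale change because the target is an upright rectangle. The area-ratio form does not depend on the silhouette's vertical extent, which is one way to read the abstract's claim that the second metric is invariant to motion type.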

Share and Cite:

Tian, J., Li, L. and Liu, W. (2014) Multi-Scale Human Pose Tracking in 2D Monocular Images. Journal of Computer and Communications, 2, 78-84. doi: 10.4236/jcc.2014.22014.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.