The rise of deep learning has broadly promoted the practical application of artificial intelligence in production and daily life. In computer vision, many human-centered applications, such as video surveillance, human-computer interaction, and digital entertainment, rely heavily on accurate and efficient human pose estimation. Inspired by the remarkable achievements in learning-based 2D human pose estimation, numerous studies are devoted to 3D human pose estimation via deep learning. Against this backdrop, this paper provides an extensive survey of recent literature on deep learning methods for 3D human pose estimation in order to show how this research has developed, track the latest research trends, and analyze the characteristics of the different types of methods. The literature is reviewed along the general pipeline of 3D human pose estimation, which consists of human body modeling, learning-based pose estimation, and regularization for refinement. Unlike existing reviews of the same topic, this paper focuses on deep learning-based methods. Learning-based pose estimation is discussed in two categories, single-person and multi-person, and each is further divided by data type into image-based and video-based methods. Moreover, given the significance of data for learning-based methods, this paper also surveys 3D human pose estimation methods according to a taxonomy of supervision forms. Finally, this paper lists the current and widely used datasets and compares the performance of the reviewed methods. Based on this survey, it can be concluded that each branch of 3D human pose estimation starts with fully-supervised methods, and that there is still much room for multi-person pose estimation based on other forms of supervision, from both images and video.
Despite the significant development of 3D human pose estimation via deep learning, the inherent ambiguity and occlusion problems remain challenging and need to be better addressed.
Keywords: 3D human pose estimation; deep learning; unsupervised; fully-supervised; weakly-supervised; semi-supervised
- Single-person 3D pose estimation falls into two categories: two-stage and one-stage methods.
- Two-stage methods involve two steps: first, 2D joint locations are obtained by a 2D keypoint detection model; then the 2D keypoints are lifted to 3D keypoints by a deep learning method.
- One-stage methods regress 3D joint locations directly from an RGB image. These methods require a large amount of training data with 3D annotations, but manual annotation is costly and demanding.
- Multi-person 3D pose estimation is divided into two categories: top-down and bottom-up methods.
- Top-down methods first detect the human candidates and then apply single-person pose estimation for each of them.
- Bottom-up methods first detect all keypoints and then group them into different people.
- RGB image-based methods take static images as input, only taking spatial context into account, which differs from video-based methods.
- Video-based methods face more challenges than image-based methods, such as processing temporal information, establishing the correspondence between spatial and temporal information, and handling motion changes across frames.
- Unsupervised methods do not require any multi-view image data, 3D skeletons, or correspondences between 2D and 3D points, and do not use previously learned 3D priors during training. Self-supervised methods, which can also address the shortage of 3D data, have become popular in recent years. Self-supervised learning is a form of unsupervised learning in which the data itself provides the supervision.
- Fully-supervised methods rely on large training sets annotated with ground-truth 3D positions coming from multi-view motion capture systems.
- Weakly-supervised methods draw on multiple cues for weak supervision, such as a) paired 2D ground truth, b) unpaired 3D ground truth (3D poses without the corresponding images), c) multi-view image pairs, and d) camera parameters in a multi-view setup.
- Semi-supervised methods use only part of the annotated data (e.g. 10 percent of the 3D labels), which reflects settings where labeled training data is scarce.
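
The two-stage formulation and the 2D-based weak supervision cue listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a method from the survey: `detect_2d_keypoints`, `LiftingMLP`, `project`, and `reprojection_loss` are hypothetical names, the 17-joint skeleton and pinhole camera are assumptions, the "detector" returns random points, and the lifting network is untrained.

```python
import numpy as np

NUM_JOINTS = 17  # assumption: a 17-joint skeleton


def detect_2d_keypoints(image: np.ndarray) -> np.ndarray:
    """Stage 1 placeholder (hypothetical): stands in for any 2D keypoint detector.

    Returns random pixel coordinates of shape (NUM_JOINTS, 2).
    """
    height, width = image.shape[:2]
    rng = np.random.default_rng(0)
    return rng.uniform([0.0, 0.0], [width, height], size=(NUM_JOINTS, 2))


class LiftingMLP:
    """Stage 2: an untrained two-layer MLP mapping 2D keypoints to 3D joints."""

    def __init__(self, hidden: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (NUM_JOINTS * 2, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, NUM_JOINTS * 3))
        self.b2 = np.zeros(NUM_JOINTS * 3)

    def __call__(self, kp2d: np.ndarray) -> np.ndarray:
        x = kp2d.reshape(-1)                          # flatten (J, 2) -> (2J,)
        h = np.maximum(x @ self.w1 + self.b1, 0.0)    # ReLU hidden layer
        return (h @ self.w2 + self.b2).reshape(NUM_JOINTS, 3)


def project(pose3d: np.ndarray, focal: float, center: np.ndarray) -> np.ndarray:
    """Pinhole projection; assumes camera coordinates with z > 0."""
    return focal * pose3d[:, :2] / pose3d[:, 2:3] + center


def reprojection_loss(pose3d, kp2d, focal, center) -> float:
    """Weak-supervision cue: mean 2D distance between projected 3D joints
    and the observed 2D keypoints (no 3D labels required)."""
    diff = project(pose3d, focal, center) - kp2d
    return float(np.mean(np.linalg.norm(diff, axis=1)))


# Two-stage inference: image -> 2D keypoints -> lifted 3D pose.
image = np.zeros((480, 640, 3), dtype=np.uint8)
kp2d = detect_2d_keypoints(image)
pose3d = LiftingMLP()(kp2d)
print(pose3d.shape)  # (17, 3)
```

A one-stage method would instead replace both stages with a single network from the image to the `(NUM_JOINTS, 3)` output, and a fully-supervised variant would compare `pose3d` against 3D ground truth rather than against reprojected 2D keypoints.
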
Combining single-person and multi-person 3D pose estimation with the different supervision forms yields the various branches shown in the figure below.
The result is an unbalanced tree describing deep learning-based 3D human pose estimation. Multi-person 3D pose estimation has received less attention than single-person 3D pose estimation, and video-based 3D pose estimation is less studied than image-based 3D pose estimation. Another interesting observation is that fully-supervised methods are present in every sub-category, which may indicate that fully-supervised methods help to open up a research area in its early stages.
We present the state-of-the-art results on several datasets, such as Human3.6M, MPI-INF-3DHP, MuPoTS-3D, Shelf, and Campus.
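
For context on how such results are usually compared: on Human3.6M, accuracy is commonly reported as the mean per-joint position error (MPJPE), typically in millimetres and after aligning the root joint. The sketch below shows that computation under stated assumptions: the function names are hypothetical, joint 0 is assumed to be the root, and poses are arrays of shape `(J, 3)`.

```python
import numpy as np


def root_align(pose: np.ndarray, root: int = 0) -> np.ndarray:
    """Express joints relative to the root joint (assumed to be index 0)."""
    return pose - pose[..., root:root + 1, :]


def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


gt = np.zeros((17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])   # every joint is off by a 3-4-5 offset
print(mpjpe(pred, gt))                          # 5.0
print(mpjpe(root_align(pred), root_align(gt)))  # 0.0, a pure translation vanishes
```
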
Title | Year | Supervision | Type | URL |
---|---|---|---|---|
LCR-Net: Localization-Classification-Regression for Human Pose | 2017 | weakly-supervised | monocular | code |
Single-shot multi-person 3d pose estimation from monocular rgb | 2018 | fully-supervised | monocular | code |
Camera Distance-Aware Top-Down Approach for 3D Multi-Person Pose Estimation From a Single RGB Image | 2019 | fully-supervised | monocular | code |
XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera | 2020 | semi-supervised | monocular | code |
Lcr-net++: Multi-person 2d and 3d pose detection in natural images | 2019 | weakly-supervised | monocular | code |
Multi-person 3d human pose estimation from monocular images | 2019 | weakly-supervised | monocular | - |
Title | Year | Supervision | Type | URL |
---|---|---|---|---|
3D Pictorial Structures for Multiple Human Pose Estimation | 2014 | fully-supervised | multi-view | - |
Multiple human pose estimation with temporally consistent 3D pictorial structures | 2014 | weakly-supervised | multi-view | - |
3d pictorial structures revisited: Multiple human pose estimation | 2015 | fully-supervised | multi-view | - |
Multiple human 3d pose estimation from multiview images | 2018 | weakly-supervised | multi-view | - |
Fast and Robust Multi-Person 3D Pose Estimation From Multiple Views | 2019 | weakly-supervised | multi-view | code |
Multi-Person 3D Pose Estimation and Tracking in Sports | 2019 | unsupervised | multi-view | code |
VoxelPose: Towards Multi-camera 3D Human Pose Estimation in Wild Environment | 2020 | fully-supervised | multi-view | code |
Cross-View Tracking for Multi-Human 3D Pose Estimation at over 100 FPS | 2020 | unsupervised | multi-view | code |
Title | Year | Supervision | Type | URL |
---|---|---|---|---|
3D Pictorial Structures for Multiple Human Pose Estimation | 2014 | fully-supervised | multi-view | - |
Multiple human pose estimation with temporally consistent 3D pictorial structures | 2014 | weakly-supervised | multi-view | - |
3d pictorial structures revisited: Multiple human pose estimation | 2015 | fully-supervised | multi-view | - |
Multiple human 3d pose estimation from multiview images | 2018 | weakly-supervised | multi-view | - |
Fast and Robust Multi-Person 3D Pose Estimation From Multiple Views | 2019 | weakly-supervised | multi-view | code |
Multi-Person 3D Pose Estimation and Tracking in Sports | 2019 | unsupervised | multi-view | code |
VoxelPose: Towards Multi-camera 3D Human Pose Estimation in Wild Environment | 2020 | fully-supervised | multi-view | code |
Light3DPose: Real-time Multi-Person 3D Pose Estimation from Multiple Views | 2021 | weakly-supervised | multi-view | - |
Cross-View Tracking for Multi-Human 3D Pose Estimation at over 100 FPS | 2020 | unsupervised | multi-view | code |
This work is supported by the National Natural Science Foundation of China (Grant No. 61802355 and 61702350) and the Open Research Project of The Hubei Key Laboratory of Intelligent Geo-Information Processing (KLIGIP-2019B04).