Pose estimation, detecting human actions from images and video feed using convolutional neural nets

Ever thought how the Microsoft Kinect works? You just have to stand in front of it in the active zone and start doing your stuff, and you can see that your character in the game does the same. Well, Kinect and other human actions/hand gestures/head and face gesture-guided systems work in the...

It has been one of the most challenging problems in computer vision to estimate the full 3D pose of a human body using a single, color image. But why? Well, for start, you have to solve 2 different problems here. First you have to locate all the joints of the human body, i.e. all the points where you will have rotational motion around an axis or degree(s) of freedom. That’s not by any means an easy task due to the ambiguities imposed by different camera viewpoints, clothing textures and relation to the background environment, external and self-occlusions, body shape or illumination variation. Secondly, you will have to change that 2D data you got (the joints data) or as we describe it, the 2D landmarks, to a 3D pose and you are still doing that in the light of a single image.

At first glance, you would recommend that the system must be trained using prior poses to be able to perform well in a very fuzzy environment. Remember, you can have infinite number of poses. Well, that’s true, a task this complex should require a complex solution, it should require a conv. Net and it definitely should require a large dataset.


3D-pose from images

Most approaches to 3D pose inference directly from images fall into one of two categories:

  1. models that learn to regress the 3D pose directly from image features and
  2. pipeline approaches where the 2D pose is first estimated, typically using discriminatively trained part models or joint predictors, and then lifted into 3D.

Regression based methods suffer from the need to annotate all images with ground truth 3D poses for pipeline approaches the challenge is how to account for uncertainty in the measurements.


Crucial to both types of approaches is the question of how to incorporate the 3D dependencies between the different body joints or to leverage other useful 3D geometric information in the inference process.

Early approaches that worked on human pose estimation from a single image relied on discriminatively trained models to learn a direct mapping from image features such as silhouettes, HOG or SIFT, to 3D human poses without passing through 2D landmark estimation, these differ from the recent deep- learning based approaches.

Regression-based approaches train an end-to-end network to predict 3D joint locations directly from the image, incorporating model joint dependencies in the CNN. As CNNs have become more prevalent, 2D joint estimation has become increasingly reliable and many recent works have looked to exploit this using a pipeline approach.


The CNN-based solution

A lot of attention was given to this problem and a lot of papers that discuss different novel approaches were introduced. However, we are discussing one of the novel approaches that proved its effectiveness. The solution encapsulates a CNN that learns to combine the image appearance-based predictions provided by 2D landmark detectors, with the geometric 3D skeletal information encoded in a novel pretrained model of 3D human pose.

Information captured by the 3D human pose model is embedded in the CNN architecture as an additional layer that maps 2D landmark coordinates into 3D and assuming that they follow physically plausible poses (guided by the training). But why was that approach adopted?

Well, integrating the output of the 2D landmark detection phase along with the 3D pose predicted by a probabilistic model has an advantage, the 2D landmark location estimates are improved with constraints imposed by the 3D pose model predicted. Think of it as a 2-way road. The 3D posed are predicted in the light of the 2D landmarks locations generated. And then the locations of the 2D landmarks are enhanced based on the 3D pose generated.

So, to recap what is being done here. Two different stages, and two different processes are being done. First, images are annotated with 2D landmarks of joints to be used in training the landmark detection stage. Secondly, the pose model is trained from 3D data. The thing is that each of these stages can be done completely standalone. Therefore, they can be done simultaneously, we can take advantage of extra 2D pose annotations without the need for 3D ground truth or extend the 3D training data to further mocap datasets without the need for synchronized 2D images.

Posestimation Image

With 2D joint locations provided as input, the approach focuses on solving the 3Dlifting problem and follow with methods that learn to estimate the 3D pose directly from images. 3D-pose from known 2D joint positions. 3D lifting is the process of mapping 2D joint locations to 3D poses. There are lots of methods that addressed this process, some approaches used the anatomy of the human skeleton or joint angle limits to recover pose from a single image. Other methods focused on learning a prior statistical model of the human body directly from 3D mocap data. Non-rigid structure from motion approaches (NRSfM) also recover 3D articulated motion given known 2D correspondences for the joints in every frame of a monocular video (single camera images).

The second advantage here is, as unsupervised methods, the approaches do not need 3D training data, instead they can learn a linear basis for the 3D poses purely from 2D data. The main drawback is the need for significant camera movement to guarantee accurate 3D reconstruction.

One fundamental challenge in creating models of human poses lies in the lack of access to 3D data of sufficient variety to characterize the space of human poses. To compensate for this lack of data we identify and eliminate confounding factors such as rotation in the plane, limb length, and left-right symmetry that lead to conceptually similar poses being unrecognized in the training data. Simple preprocessing eliminates some factors. Also, variation in sizes can be addressed by normalizing the data such that the sum of squared limb lengths on the human skeleton is one; while left-right symmetry is exploited by flipping each pose in the x-axis and re-annotating left as right and vice-versa.


The architecture:

A multistage deep convolutional architecture, trained end-to-end, that repeatedly fuses and refines 2D and 3D poses, and a second module which takes the final predicted 2D landmarks and lifts them one last time into 3D space for the final estimate. From an implementation point of view this is done by introducing two distinct layers, the probabilistic 3D-pose layer and the fusion layer.

So, the code to get the 2D landmarks can be something like the following:


H = out.shape[2]

W = out.shape[3]

# Empty list to store the detected key points

points = []

for i in range(len()):

    # confidence map of corresponding body's part.

    probMap = output[0, i, :, :]


    # Find global maxima of the probMap.

    minVal, prob, minLoc, point = cv2.minMaxLoc(probMap)


    # Scale the point to fit on the original image

    x = (frameWidth * point[0]) / W

    y = (frameHeight * point[1]) / H


    if prob > threshold :

        cv2.circle(frame, (int(x), int(y)), 15, (0, 255, 255), thickness=-1, lineType=cv.FILLED)

        cv2.putText(frame, "{}".format(i), (int(x), int(y)), cv2.FONT_HERSHEY_SIMPLEX, 1.4, (0, 0, 255), 3, lineType=cv2.LINE_AA)


        # Add the point to the list if the probability is greater than the threshold

        points.append((int(x), int(y)))

    else :





The output can be shown as follows:

Image tennis player

And the next step would be to draw the skeleton.


To sum up: what we have is a problem of tracking the motion of human body. So, we will have to detect the main key points which are the joints, connect between them to produce a skeleton similar to a stick man. After that these 2D landmarks can be mapped to a 3D pose based on a prior pretrained model of a CNN, both stages are cascaded into a single CNN that will give the desired output. Many approaches have tried other solutions to the problem using regression or clustering techniques, but this one proved its efficiency.

Do you have any comments to the blog article? Just log in or register and leave a comment here.




Caprico report abuse
Interesting article
communitymanager report abuse
Thanks for your praise!

Want to read on?

Find other interesting ... Show more