Nowadays, with increased interest in Augmented Reality and the release of products like Microsoft HoloLens (overview), user input such as keyboard/mouse could be substituted by hand/eye movement control. These techniques are more natural for humans, so I hope that in the near future someone will create software robust enough to perform hand recognition and tracking in real time on any device.
In this article I'm going to show how hand tracking can be used to control a Raspberry Pi. I'll use the simplest computer vision algorithms, so anyone can implement this on their own device.
Let's clean up the workspace and start development.
To make it super easy I'll use a conda environment with a prepared configuration (all required libraries included); you can create the same environment using the commands below. A how-to-launch guide is here.
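A typical setup might look like the following. This is only a sketch: the environment name and package list here are my assumptions, not the article's actual configuration file, which lists the exact dependencies.

```shell
# Hypothetical environment name and packages; use the article's
# prepared configuration for the exact versions.
conda create -n hand-tracking python=3.8 numpy
conda activate hand-tracking
pip install opencv-python
```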
The simplest way to control a Raspberry Pi is to implement mouse movement and left/right clicks. Let's define the following gestures to substitute mouse activity with the hand:
- Algorithm intuition
- Skin extraction
- Hand detection
To find skin segments in video frames and separate skin from the background (as in the next image), let's convert the image from RGB to YCrCb format (detailed explanation).
For this purpose, 9 rectangles are placed on the image (see the image below). The user places a hand in front of the camera so that the rectangles cover as much skin area as possible (note: only skin, no background), and we calculate LOW and HIGH values for each image channel that allow us to extract only skin color from the image. The pixel values under all the rectangles are collected into one array, and the MIN and MAX values for each channel are calculated:
LOW - a list of 3 values (lower bounds for channels Y, Cr, Cb)
HIGH - a list of 3 values (upper bounds for channels Y, Cr, Cb)
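The calibration step above can be sketched like this. The rectangle positions are assumed to be given as `(x, y, w, h)` tuples; the article's actual 9 rectangle coordinates are not shown here.

```python
import numpy as np

def calibrate_bounds(ycrcb_frame, rects):
    """Collect the pixels under each (x, y, w, h) rectangle and return
    per-channel (LOW, HIGH) bounds as the MIN/MAX over all samples."""
    samples = np.concatenate(
        [ycrcb_frame[y:y + h, x:x + w].reshape(-1, 3) for (x, y, w, h) in rects]
    )
    return samples.min(axis=0), samples.max(axis=0)
```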
Note: this technique is not robust when the background color is similar to skin or when lighting conditions change. But you can always recalculate the LOW and HIGH boundaries to make it more robust. (As a next step you could try deep learning techniques for skin extraction, but they can be much slower, especially on a Raspberry Pi.)
The most interesting part is implementing a simple algorithm to detect the hand among the skin segments.
Let's take a step back and define the essential steps from the beginning:
- Capture frames from the camera.
- Collect N (N=5) sequential frames.
- Average the values of each channel to create an averaged image. This step reduces noise caused by hand movement.
- Blur the averaged image with a Gaussian blur (kernel size = 15).
- Convert the blurred image to YCrCb format.
- Extract the pixels of the YCrCb image whose values fall within the LOW-HIGH range for each channel (Y, Cr, Cb). (See the explanation above.)
- Apply the erode operation to the mask (1 iteration) to remove weak and noisy parts.
- Apply the dilate operation (kernel size = 7, 3 iterations) to expand the hand area.
- Convert the mask to grayscale.
- Binarize the mask with a threshold function.
After all these steps you’ll have something like this.
Now we can classify the found blobs into Hand/Not Hand classes. Let's keep it simple: first, find and analyse the hand shape. (Here is a helpful explanation of how to use OpenCV to analyse shapes/contours.)
Find the contour with the maximum area in the mask image (blue dots). This is the best candidate to be the hand. Calculate hull points and convexity defects of the contour. Hull points represent possible fingertips (red dots in the image), and defects are the possible areas between fingers (green dots).
Let's loop over each triple of contour points - start, end, far - as shown in the next image. The angle represents a possible angle between fingers. Its value should be more than 90 degrees, like the angle between the thumb and the index finger (gesture #1 for mouse movements). Also, the distances (start-far) and (end-far) should be long enough relative to the blob size (knowing real hand proportions).
We also need to remove contour points that are too close to each other.
Now, with the clean contour, we can count the number of valid fingers (hull points). If there are no valid fingers, we consider the blob a closed palm.
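The angle at a defect's far point can be computed with the cosine rule; this is a small standalone sketch, with the 90-degree threshold applied as described above.

```python
import math

def finger_angle(start, end, far):
    """Angle (in degrees) at the far point between the (start-far)
    and (end-far) segments, via the cosine rule."""
    a = math.dist(start, end)
    b = math.dist(start, far)
    c = math.dist(end, far)
    return math.degrees(math.acos((b * b + c * c - a * a) / (2 * b * c)))
```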
Let's also define a mapping from fingertip coordinates to Raspberry Pi screen coordinates:
Suppose the grey rectangle is our frame from the web camera. The figure shows that the hand cannot be perceived from every pixel of the image, only from the green zone. You can also see the extreme hand positions that can still be interpreted as a gesture.
Knowing the size of the frame area where a hand gesture can be detected and the size of the screen, we can transform one set of coordinates into the other with a simple equation.
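The linear mapping might look like this; the active zone is assumed to be given as an `(x, y, w, h)` rectangle inside the camera frame.

```python
def to_screen(x, y, zone, screen_w, screen_h):
    """Linearly map a fingertip point inside the active frame zone
    (zx, zy, zw, zh) to screen coordinates."""
    zx, zy, zw, zh = zone
    sx = (x - zx) * screen_w / zw
    sy = (y - zy) * screen_h / zh
    # Clamp to the screen so extreme hand positions stay valid.
    return (min(max(int(sx), 0), screen_w - 1),
            min(max(int(sy), 0), screen_h - 1))
```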
Now, how do we decide on the next action and interpret the valid hand gesture shown by the user?
- Based on the farthest fingertip point we calculate the corresponding point on the screen.
- If more than two fingers are found, we consider it gesture #1 and execute a mouse movement.
- If only one finger is found, we consider it gesture #2 and execute a left click. When a click is executed, we also set up a delay to avoid too many left clicks.
- If no fingers are found (closed palm), we execute a right click, with the same delay as for left clicks.
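The decision rules above can be captured in a small pure function. The delay value here is my guess, not the article's; in the real loop the returned action names would be dispatched to mouse calls (for example PyAutoGUI's `moveTo` and `click`).

```python
CLICK_DELAY = 1.0  # seconds between clicks (assumed value)

def decide_action(finger_count, last_click, now):
    """Map a finger count to an action name, enforcing a click delay.
    Thresholds follow the gesture rules above."""
    if finger_count > 2:
        return "move"          # gesture #1: move the mouse cursor
    if now - last_click < CLICK_DELAY:
        return "wait"          # too soon after the previous click
    if finger_count == 1:
        return "left_click"    # gesture #2
    if finger_count == 0:
        return "right_click"   # closed palm
    return "none"
```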
Here are some demos:
Congratulations, you've just implemented a simple but reasonably fast program to control your PC with hand movements. You can open files, move them to folders, select items, even draw. Sure, the algorithm isn't perfect or robust, but it is fast enough to run on a Raspberry Pi. I haven't found any deep learning solution that can run hand tracking and gesture recognition there in real time.
I hope you've found some inspiration to make this program way better.