Extracting hand articulations from monocular depth images using curvature scale space descriptors
Shao-fan WANG, Chun LI, De-hui KONG, Bao-cai YIN
Front. Inform. Technol. Electron. Eng., 2016, 17(1): 41-54. DOI: 10.1631/FITEE.1500126
Abstract

We propose a framework for hand articulation detection from a monocular depth image using curvature scale space (CSS) descriptors. We extract the hand contour from an input depth image and obtain the fingertips and finger-valleys of the contour from the local extrema of a modified CSS map of the contour. We then recover the undetected fingertips according to the local depth changes of points in the interior of the contour. Compared with traditional appearance-based approaches using either angle detectors or convex-hull detectors, the modified CSS descriptor extracts the fingertips and finger-valleys more precisely, since it is more robust to noisy or corrupted data; moreover, the local depth extrema recover the fingertips of bent fingers well, whereas traditional appearance-based approaches hardly work without matching hand models. Experimental results show that our method captures hand articulations more precisely than three state-of-the-art appearance-based approaches.
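The paper itself does not ship code with this abstract, so the following Python sketch only illustrates the kind of pipeline described above: segment the hand in a depth image, extract its contour, smooth the contour at a given scale, and take strong curvature maxima as fingertip candidates (finger-valleys would correspond to strong minima). The depth band, smoothing scale, curvature threshold, and OpenCV 4 contour API are assumptions; this is not the authors' modified CSS map or their depth-based recovery step.

```python
# Illustrative sketch only; thresholds and scales are assumed values.
import numpy as np
import cv2
from scipy.ndimage import gaussian_filter1d

def hand_contour(depth, near_mm=400, far_mm=900):
    """Segment the hand by an assumed depth band and return the largest contour as (N, 2)."""
    mask = ((depth > near_mm) & (depth < far_mm)).astype(np.uint8) * 255
    # OpenCV 4 signature: returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    largest = max(contours, key=cv2.contourArea)
    return largest[:, 0, :].astype(np.float64)          # (N, 2) array of (x, y)

def curvature(contour, sigma):
    """Curvature of the closed contour after Gaussian smoothing at scale sigma."""
    x = gaussian_filter1d(contour[:, 0], sigma, mode="wrap")
    y = gaussian_filter1d(contour[:, 1], sigma, mode="wrap")
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    return (dx * ddy - dy * ddx) / (np.power(dx**2 + dy**2, 1.5) + 1e-12)

def fingertip_candidates(contour, sigma=8.0, kappa_min=0.05):
    """Indices of strong positive curvature maxima (fingertip-like points)."""
    k = curvature(contour, sigma)
    peaks = (k > np.roll(k, 1)) & (k > np.roll(k, -1)) & (k > kappa_min)
    return np.flatnonzero(peaks)
```

A full CSS descriptor would track these extrema over a range of scales sigma rather than a single one; the single-scale version here is kept only to make the idea concrete.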




Fig. 5 The RMS error (a) and the maximum error (b) of all test images for each fingertip
Extracts from the Article
Experiment 1 uses 21 depth images of hands in various poses, captured by a Kinect. We compute the pixel-wise distance between the ground truth, which we mark manually, and the location of each fingertip detected by our method, the K-cos method (Lee and Lee, 2011), and the convex-hull method (Nagarajan et al., 2012). In particular, we also compute this distance for each fingertip detected by the CSS phase alone. The root-mean-square (RMS) and the maximum of the errors over all test images for each fingertip are shown in Fig. 5. The results indicate that the CSS phase achieves the smallest pixel-wise error while the whole procedure of our method yields the largest. This is because our method adds bending fingertips by a rough estimation, but it does not imply that our method is less effective than the K-cos and convex-hull methods. To support this, we count the numbers of undetected fingertips and incorrectly detected fingertips of the CSS phase and show the results in Fig. 6. While the whole procedure of our method produces no missing fingertips and no incorrect fingertips, the CSS phase produces one incorrect fingertip, whereas the other two methods produce over 40 incorrect ones.
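The error metrics reported in Fig. 5 are simple to state: for each fingertip, the pixel-wise Euclidean distance between its manually marked ground-truth location and its detected location is aggregated over all test images as an RMS error and a maximum error. A minimal sketch is given below; the array layout (n_images x 5 fingertips x 2 coordinates) and the function name are assumptions for illustration.

```python
# Illustrative sketch of per-fingertip RMS and maximum pixel errors.
import numpy as np

def fingertip_errors(detected, ground_truth):
    """detected, ground_truth: arrays of shape (n_images, 5, 2), in pixels.

    Returns (rms, max_err), each of shape (5,): one value per fingertip."""
    dist = np.linalg.norm(detected - ground_truth, axis=-1)   # (n_images, 5)
    rms = np.sqrt(np.mean(dist**2, axis=0))                   # RMS over images
    max_err = dist.max(axis=0)                                # worst case over images
    return rms, max_err
```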