Frontiers of Information Technology & Electronic Engineering, 2016, 17(1): 41-54 doi: 10.1631/FITEE.1500126

Original article

Extracting hand articulations from monocular depth images using curvature scale space descriptors

WANG Shao-fan†1, LI Chun1, KONG De-hui†1, YIN Bao-cai2,1,3

1Beijing Key Laboratory of Multimedia and Intelligent Software Technology, College of Metropolitan Transportation, Beijing University of Technology, Beijing 100124, China
2School of Software Technology, Dalian University of Technology, Dalian 116024, China
3Collaborative Innovation Center of Electric Vehicles in Beijing, Beijing 100081, China

Corresponding authors: †E-mail: wangshaofan@bjut.edu.cn; kdh@bjut.edu.cn

Received: 2015-04-20   Accepted: 2015-10-26

Fund supported: the National Natural Science Foundation of China (Nos. 61227004, 61370120, 61390510, 61300065, and 61402024), the Beijing Municipal Natural Science Foundation, China (No. 4142010), the Beijing Municipal Commission of Education, China (No. km201410005013), and the Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality, China

Abstract

We propose a framework of hand articulation detection from a monocular depth image using curvature scale space (CSS) descriptors. We extract the hand contour from an input depth image, and obtain the fingertips and finger-valleys of the contour using the local extrema of a modified CSS map of the contour. Then we recover the undetected fingertips according to the local change of depths of points in the interior of the contour. Compared with traditional appearance-based approaches using either angle detectors or convex hull detectors, the modified CSS descriptor extracts the fingertips and finger-valleys more precisely since it is more robust to noisy or corrupted data; moreover, the local extrema of depths recover the fingertips of bending fingers well while traditional appearance-based approaches hardly work without matching models of hands. Experimental results show that our method captures the hand articulations more precisely compared with three state-of-the-art appearance-based approaches.

Keywords: Curvature scale space (CSS); Hand articulation; Convex hull; Hand contour



1 Introduction

Extracting human hand articulations such as fingertips, finger-knuckles, finger-roots, and hand contours is an interesting and important task, with various applications in human-computer interaction and virtual reality. The task is challenging because human hands, like other articulated objects, have many degrees of freedom and a constrained parameter space, and suffer from self-occlusion. Although special hardware such as data gloves and the Kinect sensor has been developed successfully, it either needs to be worn or offers low precision.

Research on detecting hand articulations can be divided into two categories: appearance-based approaches (Rosales et al., 2001; Athitsos and Sclaroff, 2002; 2003; Tomasi et al., 2003; Schlattmann et al., 2007; Feng et al., 2011; Lee and Lee, 2011; Ren et al., 2011; Cerezo, 2012; Nagarajan et al., 2012; Maisto et al., 2013) and model-based approaches (Chang et al., 2008; de La Gorce et al., 2011; Keskin et al., 2011; Oikonomidis et al., 2011; Kirac et al., 2014; Ma and Wu, 2014; Morshidi and Tjahjadi, 2014; Qian et al., 2014; Tompson et al., 2014). Appearance-based approaches, also known as discriminative approaches, learn a mapping from the space of image features to the space of hand configurations and formulate the task as a supervised classification problem. While appearance-based approaches are computationally efficient, they are limited in faithfully capturing hand poses, because hand configurations can hardly be completely recovered from image features.

Model-based approaches, also known as generative approaches, estimate hand states by matching the kinematic structure with image features and formulate the task as a high-dimensional search problem. While model-based approaches capture hand configurations more precisely, the high dimensionality of solution spaces and high nonlinearity of optimization functions make the computation highly expensive and prevent them from being applied to real-time systems.

Although some model-based approaches (Maisto et al., 2013; Kirac et al., 2014; Ma and Wu, 2014; Qian et al., 2014; Tompson et al., 2014) propose detecting hand articulations in real time using various techniques for solving optimization models or using efficient searching algorithms, complicated model parameters are difficult to obtain in a fashion that is both precise and efficient. This paper proposes an appearance-based method for hand articulation detection from a monocular depth image. While some appearance-based methods use an angle detector or a convex hull detector on the contour points of hand images to extract local maxima, our method uses the 2D curvature scale space (CSS) descriptor for this task. The CSS descriptor (Abbasi et al., 1999) is a multiscale description of the invariant local features of 2D shapes, which is standardized in MPEG-7 as one of the most important shape descriptors. Instead of using the CSS descriptor for shape matching, we extract the fingertips and finger-valleys of straight fingers using the CSS descriptor with modified curvature thresholds. The fingertips and finger-valleys of bending fingers, which do not appear on the hand contour, are then detected using predefined angle or depth thresholds of the detected fingers. We compare our method with three state-of-the-art appearance-based methods, and show that our method detects the fingertips more precisely because the CSS descriptor is more robust to noisy or corrupted depth data.

2 Related work

This section reviews previous works on hand feature extraction, including appearance-based approaches and model-based approaches.

2.1 Appearance-based approaches

Rosales et al. (2001) proposed a state recovery method by learning the mapping between hand joint configurations and the corresponding visual features, which are generated by a computer graphics module. Athitsos and Sclaroff (2002) retrieved the most similar matches hierarchically from a large database of synthetic hand images, and obtained both the ground truth labels of those matches and camera viewpoint information. They further proposed an image-to-model chamfer distance and a probabilistic line matching method, and formulated hand pose estimation as an image database indexing problem (Athitsos and Sclaroff, 2003). Tomasi et al. (2003) tracked hand gestures with fast and complex motions by imposing the Fourier shape descriptor on hand silhouettes and interpolating the missing data of each frame. Schlattmann et al. (2007) extracted the protruding fingertips from the visual hull of segmented images from different cameras, and estimated hand gestures using the position and orientation of hands. Feng et al. (2011) extracted hand features by first estimating shape features using an approximated polygon of the hand contour, and then refining them by expressing images in a multiscale space to obtain the response strength of different features. Similarly, Maisto et al. (2013) computed the convex envelope of hand contours and tracked the fingertips using both a center-of-mass-based variation and a geometry-based variation. Lee and Lee (2011) proposed a scale-invariant angle detector to locate fingertips, and recognized fingertip actions using hand contours. Ren et al. (2011) proposed a novel distance metric for measuring hand dissimilarity, which matches only the fingers rather than the whole hand shape. Nagarajan et al. (2012) proposed a hand gesture recognition framework that detects the hand skin color with the HSV color space and morphological operations, and locates fingertips with the convex hull of hand contours. Cerezo (2012) detected fingertips using a threshold on the angles formed by vectors generated from contour points of the hand, followed by a normalization of the 2D points within the depth image to obtain the 3D locations of fingertips.

2.2 Model-based approaches

Chang et al. (2008) proposed an appearance-guided particle filtering method for high degree-of-freedom hand tracking, which formulates the tracking problem as finding the maximum a posteriori solution of a probability propagation model consisting of several state space vectors. de La Gorce et al. (2011) proposed an analysis-by-synthesis approach for tracking moving hands that incorporates both shading and texture information while handling self-occlusion. The shading information is captured using a triangulated mesh-based model, and the texture estimation uses the same objective function as tracking, with a smoothness regularization term. Oikonomidis et al. (2011) proposed a 3D hand model consisting of parametrized geometric primitives built from two basic primitives, the sphere and the truncated cylinder, and estimated the model parameters by minimizing the discrepancy between predicted and observed features using particle swarm optimization. Morshidi and Tjahjadi (2014) proposed another method, a gravity optimized particle filter, for solving the model of hand configurations, where the localization and labeling of fingers are obtained using convexity defects. Ma and Wu (2014) proposed a hand tracking system using depth sequences which consists of three phases: hand segmentation, k-nearest neighbor database searching, and an improved particle swarm optimization; the search uses two features, the location of all bulges in the z-direction and a binary coding of the shape of the segmented hand masks. Qian et al. (2014) combined gradient-based and stochastic optimization to model human hands in real time, using a number of spheres and a cost function involving the alignment and relative position of the point cloud with respect to the sphere model. Keskin et al. (2011), Kirac et al. (2014), and Tompson et al. (2014) classified pixels of depth images using random decision forests, with different hierarchies of modes over the articulation points of hands.

3 Curvature scale space based hand articulation extraction

3.1 Extracting hand contour and palm center

We illustrate the main procedure for extracting the contour of the hand and the palm center in Fig. 1, which includes six steps: (1) find a hand point on the depth image using the function hand_points of OpenNI; (2) select a rectangular neighborhood of the point; (3) segment the hand part from the patch using a threshold on the difference of depth values with respect to the hand point; (4) extract the contour of the hand using find_contours of OpenCV; (5) compute the maximum inscribed circle of the contour and take its center as the palm center; (6) remove the extra contour points which belong to the wrist, and order the remaining contour points from the thumb to the little finger (i.e., in the clockwise/counterclockwise direction for the left/right hand).
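To make the pipeline concrete, the following is a minimal Python/OpenCV sketch of steps (3)-(5), assuming OpenCV 4.x; the patch size and depth band are illustrative values rather than the thresholds used in our experiments, and the contour coordinates are left relative to the cropped patch for brevity.

```python
import cv2
import numpy as np

def extract_hand_contour(depth, seed, patch=120, depth_band=80):
    """Segment the hand around a seed point, extract its contour, and
    locate the palm center (steps (3)-(5) of Section 3.1).
    depth: uint16 depth image; seed: (x, y) hand point from the tracker.
    patch and depth_band are illustrative, hand-picked thresholds."""
    x, y = seed
    roi = depth[max(y - patch, 0):y + patch, max(x - patch, 0):x + patch]
    # Step (3): keep pixels whose depth is close to the seed depth.
    d0 = int(depth[y, x])
    mask = (np.abs(roi.astype(np.int32) - d0) < depth_band).astype(np.uint8) * 255
    # Step (4): extract the outer contour of the hand region.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    hand = max(contours, key=cv2.contourArea)
    # Step (5): the palm center is the center of the maximum inscribed
    # circle, i.e., the interior point farthest from the boundary, which
    # the distance transform yields directly.
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    cy, cx = np.unravel_index(np.argmax(dist), dist.shape)
    return hand, (cx, cy), dist[cy, cx]  # contour, palm center, palm radius
```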

Fig. 1

Fig. 1   A flowchart of extracting the hand contour and palm center: (a) finding a point on the hand; (b) selecting a rectangular neighborhood; (c) segmenting the hand part; (d) extracting the contour of the hand part; (e) computing the maximum inscribed circle of the contour; (f) removing additional contour points


3.2 Curvature scale space descriptors

The CSS descriptor is a contour-based shape descriptor derived from the zero-crossing points of the curvature of the curve convolved with a Gaussian function. The greater the variance at which the curvature of the convolved curve vanishes, the sharper the corresponding point of the curve.

Let $(x(t), y(t))$ be a parametric representation of a planar curve. The shape is evolved into different scales by applying Gaussian smoothing $x_{\sigma}(t) = x(t)\otimes g(t, \sigma)$, $y_{\sigma}(t)=y(t)\otimes g(t, \sigma)$, where '$\otimes$' denotes the univariate convolution operator and $g(t, \sigma)$ is a Gaussian function. The curvature of the evolving curve is given by

$k_{\sigma}(t)=\dfrac{x'_{\sigma}(t)y''_{\sigma}(t)-x''_{\sigma}(t)y'_{\sigma}(t)}{\big(x'^{2}_{\sigma}(t)+y'^{2}_{\sigma}(t)\big)^{3/2}},$

where $x'_{\sigma}(t), y'_{\sigma}(t)$ and $x''_{\sigma}(t), y''_{\sigma}(t)$ are the first and second derivatives of $x_{\sigma}(t), y_{\sigma}(t)$ at location $t$, respectively. The CSS contour map

$\begin{equation}\label{eq-css1}\textrm{css}(t, \sigma):=\{(t, \sigma):k_{\sigma}(t)=0\}\tag{1}\end{equation}$

is defined as the collection of all zero-crossing points $(t, \sigma)$, where $\sigma$ is the scale at which the curvature of the evolving curve vanishes at location $t$.
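As a reference implementation of the evolved curvature $k_{\sigma}(t)$, the following Python sketch smooths a closed contour with a Gaussian and applies the formula above; it uses simple finite differences in place of the quadratic fitting scheme adopted later in Section 3.3.1.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def evolved_curvature(x, y, sigma):
    """Curvature k_sigma(t) of the contour (x(t), y(t)) evolved at scale
    sigma. The contour is closed, hence the periodic mode 'wrap'.
    Finite differences stand in for the quadratic fitting of the paper."""
    xs = gaussian_filter1d(np.asarray(x, float), sigma, mode='wrap')
    ys = gaussian_filter1d(np.asarray(y, float), sigma, mode='wrap')
    x1, y1 = np.gradient(xs), np.gradient(ys)    # first derivatives
    x2, y2 = np.gradient(x1), np.gradient(y1)    # second derivatives
    return (x1 * y2 - x2 * y1) / (x1 ** 2 + y1 ** 2) ** 1.5
```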

3.3 Extracting fingertips and finger-valleys using CSS descriptors

3.3.1 Extracting fingertips

For each input hand contour $\{\boldsymbol{f}_t\}_{t=1}^{n}$, we compute its CSS contour map $\textrm{css}_f(t, \sigma)$. Instead of directly using Eq. (1), we modify the traditional CSS contour map in two ways. First, since the obtained hand contour is a discrete sequence of points, both the convolution operator and the curvature need discretized forms (we use the traditional discrete convolution operator for computing the convolution, and a quadratic fitting scheme for computing the curvature). Second, we define the CSS contour map by small absolute values of the curvature of the evolving curve, instead of directly by the zero points of the curvature. The reason is two-fold: for one thing, the peak points of the hand contour do not all achieve exactly zero curvature after the Gaussian smoothing (the first row of subfigures of Fig. 2); for another, setting the CSS contour as the points with curvature between zero and a small positive number leads to an extremely large number of CSS contour points (the second row of subfigures of Fig. 2). Alternatively, our method defines a modified CSS contour map (Eq. (2)) by the collection of points whose curvature lies within an interval $[c_1, c_2]$ of two small positive numbers (the third row of subfigures of Fig. 2), and extracts the fingertips as the local maxima of the contour with a threshold on $\sigma$.
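A sketch of the modified CSS map and the fingertip peak extraction follows, reusing evolved_curvature from above. The band $[c_1, c_2]=[2, 2.5]$ matches the third row of Fig. 2, while sigma_min is a hypothetical name for the scale threshold; this is an illustrative reading of the step, not our exact implementation.

```python
def modified_css_map(x, y, sigmas, c1=2.0, c2=2.5):
    """Modified CSS map of Eq. (2): a boolean matrix marking the
    (t, sigma) pairs whose evolved curvature lies in the band [c1, c2]."""
    css = np.zeros((len(sigmas), len(x)), dtype=bool)
    for i, s in enumerate(sigmas):
        k = evolved_curvature(x, y, s)
        css[i] = (c1 <= k) & (k <= c2)
    return css

def fingertip_indices(css, sigmas, sigma_min):
    """Fingertips as local maxima of the CSS contours above a scale
    threshold: for each contour position take the largest scale still in
    the band, then keep positions that are local peaks above sigma_min."""
    top = np.where(css.any(axis=0),
                   css.shape[0] - 1 - np.argmax(css[::-1], axis=0), -1)
    n = len(top)
    return [t for t in range(n)
            if top[t] >= 0 and sigmas[top[t]] > sigma_min
            and top[t] >= top[t - 1] and top[t] >= top[(t + 1) % n]]
```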

Fig. 2

Fig. 2   Curvature scale space contours for two (a), three (b), four (c), and five (d) straight fingers in three cases: the first row satisfies $\textrm{css}(t,\sigma)=\{(t,\sigma): k_\sigma(t)=0\}$, the second row satisfies $\textrm{css}(t,\sigma)=\{(t,\sigma): 0\leq k_\sigma(t)\leq2.5\}$, and the third row satisfies $\textrm{css}(t,\sigma)=\{(t,\sigma): 2\leq k_\sigma(t)\leq2.5\}$


$\begin{align}&\textrm{css}_f^{\textrm{tip}}(t, \sigma):=\{(t, \sigma): 0<c_1 \leq k_{\sigma}(t)\leq c_2\},\tag{2}\end{align}$

$\begin{align}&\textrm{css}_f^{\textrm{valley}}(t, \sigma):=\{(t, \sigma): c_3 \leq k_{\sigma}(t)\leq c_4<0\}.\tag{3}\end{align}$

3.3.2 Extracting finger-valleys

Next we extract finger-valleys. Although these may not be considered hand articulations, they are important for computing finger-roots. We collect the points whose curvature lies within an interval of two small negative numbers to obtain another CSS contour map (Eq. (3)), and extract the finger-valleys using the local maxima of the contour.

Even if all fingers are straight, neither the outer finger-valley of the thumb nor the outer finger-valley of the little finger can be detected using CSS, as both of them are smooth (when the thumb is bending, the outer finger-valleys of the first and the last straight fingers cannot be detected either; we find them using the same method). The outer finger-valley of the thumb is added as the point symmetric to the first finger-valley with respect to the fingertip of the thumb. Similarly, the outer finger-valley of the little finger is added as the point symmetric to the last finger-valley with respect to the fingertip of the little finger.
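In code, this symmetric completion is a one-line reflection (a sketch; points are 2D pixel coordinates):

```python
def outer_valley(tip, inner_valley):
    """Outer finger-valley as the reflection of the inner finger-valley
    about the fingertip: the fingertip is the midpoint of the two."""
    return 2 * np.asarray(tip) - np.asarray(inner_valley)
```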

3.3.3 Extracting finger-roots

Currently, we compute a finger-root only for each detected fingertip; when a fingertip is not detected by the CSS contour, the corresponding finger-root cannot be located, and we discuss this issue in Section 3.4. For each detected fingertip, the corresponding finger-root is simply the middle point of the two finger-valleys neighboring the fingertip.
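Correspondingly, the finger-root computation is a midpoint (sketch):

```python
def finger_root(valley_left, valley_right):
    """Finger-root of a detected fingertip: the middle point of its two
    neighboring finger-valleys (Section 3.3.3)."""
    return (np.asarray(valley_left) + np.asarray(valley_right)) / 2.0
```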

3.4 Recovering missing fingertips and missing finger-roots

The CSS contour map can effectively detect straight or slightly bending fingers, but is ineffective for bending fingers. To recover bending fingers whose fingertips do not appear on the hand contour, we propose angle thresholds for recovering bending non-thumb fingertips. We first determine whether the thumb is bending according to whether a fingertip exists within the $[15\%n]$-th to $[25\%n]$-th points of the whole hand contour, where $n$ denotes the number of contour points, ordered from the thumb to the little finger. If the thumb is bending, we roughly take the $[20\%n]$-th point of the hand contour as the finger-root of the thumb. Then we recover the fingertips of bending non-thumb fingers, and leave the recovery of the fingertip of the thumb (if it is missing) to the end of this subsection. We connect each detected finger-root to the palm center and compute the angle of the $i$th non-thumb finger-root with respect to the finger-root of the thumb, denoted by $\theta_i$, $1\leq i\leq s$, where $s\leq4$ denotes the number of detected non-thumb fingers (Fig. 3). We also define the following four intervals of angles, each of which is associated with a non-thumb finger:

$\begin{align}\Omega_i:=\big[d_i+(45-\textrm{depth}(\textbf{pc}))d_0,\ d_{i+1}+(45-\textrm{depth}(\textbf{pc}))d_0\big],\quad i=1, 2, 3, 4,\tag{4}\end{align}$

where $\textrm{depth}(\textbf{pc})$ denotes the real depth value of the palm center, and $d_i~(i=0, 1, \ldots, 5)$ are parameters given in Section 4. The forefinger, middle finger, ring finger, and little finger are judged missing and bending if $\Omega_i\cap\{\theta_j\}_{j=1}^s=\varnothing$ holds for $i=1, 2, 3, 4$, respectively.
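The test can be sketched as follows, where omegas is a hypothetical list of the four $(\mathrm{lo}, \mathrm{hi})$ bounds of $\Omega_1, \ldots, \Omega_4$ built from $d_0, \ldots, d_5$ and $\textrm{depth}(\textbf{pc})$:

```python
def bending_fingers(nonthumb_roots, thumb_root, palm, omegas):
    """Return the indices i (0: forefinger, ..., 3: little finger) of
    non-thumb fingers judged missing and bending: no detected angle
    theta_j falls inside the finger's interval Omega_i."""
    v0 = np.asarray(thumb_root, float) - np.asarray(palm, float)
    thetas = []
    for r in nonthumb_roots:                 # detected non-thumb roots
        v = np.asarray(r, float) - np.asarray(palm, float)
        c = np.dot(v0, v) / (np.linalg.norm(v0) * np.linalg.norm(v))
        thetas.append(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
    return [i for i, (lo, hi) in enumerate(omegas)
            if not any(lo <= th <= hi for th in thetas)]
```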

Fig. 3

Fig. 3   A flowchart for detecting fingertips and finger-valleys and recovering bending non-thumb fingertips


Once the $i$th non-thumb finger is judged bending, we select the ray $\boldsymbol{L}$ starting from the palm center at the middle angle of the corresponding interval (i.e., $(d_i+d_{i+1})/2+(45-\textrm{depth}(\textbf{pc}))d_0$), and compute the intersections of $\boldsymbol{L}$ with the hand contour. The missing finger-root is the intersection nearest to the palm center (usually there is only one intersection between $\boldsymbol{L}$ and the contour; however, when $\boldsymbol{L}$ has a biased direction it may intersect the contour points of other fingers, in which case the point nearest to the palm center is the correct finger-root). The missing fingertip is then the point on the line segment between the finger-root and the palm center which achieves the greatest directional derivative of depths along the direction opposite to $\boldsymbol{L}$. We illustrate the whole procedure in the case of a straight thumb in Fig. 3.
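A sketch of the fingertip recovery step follows, assuming the recovered finger-root and palm center are (x, y) pixel coordinates; the number of samples is illustrative, and the forward difference of depth stands in for the directional derivative.

```python
def recover_bent_fingertip(depth, palm, root, n_samples=100):
    """Sample the segment from the recovered finger-root toward the palm
    center and return the sample with the greatest forward difference of
    depth, i.e., the greatest directional derivative of depths along the
    direction opposite to the ray L."""
    p = np.asarray(palm, float)
    r = np.asarray(root, float)
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = r[None, :] + ts[:, None] * (p - r)[None, :]  # root -> palm
    d = np.array([depth[int(y), int(x)] for x, y in pts], dtype=float)
    i = int(np.argmax(np.diff(d)))   # largest depth jump along the path
    return tuple(pts[i].astype(int))
```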

Finally, we recover the fingertip of the thumb if the thumb is bending. If all fingers except the thumb are straight (see the third row of Fig. 10), we extract the points whose depths are smaller than the depth of the palm center, and take the point farthest from the finger-root of the thumb as the fingertip of the thumb. If the hand contains bending fingers other than the thumb (see the fourth to seventh rows of Fig. 10), we handle this complicated case carefully. We collect all the points within the polygon whose vertices are the five finger-roots and the first and last points of the contour. Among this collection, we extract the points whose depth is smaller than the depth of the palm center. To separate the points of the bending thumb from the points of bending non-thumb fingers, we remove the points whose depth equals the depth of any bending non-thumb fingertip or any bending non-thumb finger-root (note that this removal involves all fingertips and finger-roots of bending non-thumb fingers). Finally, the fingertip of the thumb is given by the point of this collection farthest from the finger-root of the thumb. We illustrate this procedure of recovering the fingertip of the bending thumb in Fig. 4. The whole procedure for both cases is given in Algorithm 1.

Fig. 4

Fig. 4   Recovery of the undetected fingertip of the thumb when the thumb is bending


Algorithm 1

Algorithm 1   Detecting hand articulations from the hand contour


4 Experimental results

We give experimental results in this section. Experiment 1 uses both intensity images and depth images of human hands captured with Kinect, while Experiment 2 uses only depth images provided by Microsoft Research Asia (http://research.microsoft.com/en-us/um/people/yichenw/handtracking). In Experiment 1, we compare our method with the K-cos method (Lee and Lee, 2011) and the convex-hull method (Nagarajan et al., 2012); in Experiment 2, we compare our method with the K-cos method, the convex-hull method, and the K-curvature method (Cerezo, 2012). Because the Kinect SDK outputs only three hand points (https://msdn.microsoft.com/en-us/library/dn799273.aspx), fewer than those produced by our method and the other appearance-based methods, we omit a comparison with the Kinect SDK.

4.1 Parameter setting

The experiments are run on a Core (TM) 2 Quad CPU Q9450 2.66 GHz machine with 4 GB RAM using Visual Studio 2010. The parameters used in Sections 3.3 and 3.4 are given as follows:

We explain the choice of parameters as follows: the parameters $c_i~(i=1, 2, 3, 4)$ are determined by evolving hand contours with different ranges of $k_{\sigma}$, and $d_i~(i=0, 1, \ldots, 5)$ are determined by testing a standard hand with five straight fingers at distances from 30 cm to 60 cm. Readers may doubt the robustness of our system since it involves many manually chosen parameters. We argue that, because both the positions of the curvature peak points and the bounds of the angles between two adjacent fingers remain unchanged across different hands, the selection of those parameters leads to a robust system. The only cost is the lengthy off-line testing.

4.2 Experiment 1: Kinect data

4.2.1 Quantitative results

Experiment 1 takes 21 depth images with various hand poses captured by Kinect. We compute the pixel-wise distance between the manually marked ground truth and the location of each detected fingertip obtained by our method, the K-cos method (Lee and Lee, 2011), and the convex-hull method (Nagarajan et al., 2012). In particular, we also compute the distance for each fingertip detected using only the CSS phase. Both the root-mean-square (RMS) and the maximum of the errors over all test images for each fingertip are shown in Fig. 5. The results indicate that the CSS phase achieves the smallest pixel-wise error while the whole procedure of our method achieves the greatest. This is because our method adds bending fingertips by a rough estimation, but it does not imply that our method is less effective than the K-cos and convex-hull methods. To support this, we count the numbers of undetected and incorrectly detected fingertips of the CSS phase and show the results in Fig. 6. While the whole procedure of our method produces no missing and no incorrect fingertips, the CSS phase produces one incorrect fingertip, whereas the other two methods produce over 40 incorrect ones.
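For reference, the two error statistics of Fig. 5 amount to the following computation over the per-image pixel errors of one fingertip (a sketch):

```python
def rms_and_max_error(detected, ground_truth):
    """Pixel-wise RMS and maximum error over all test images for one
    fingertip; both arguments are N x 2 arrays of (x, y) locations."""
    e = np.linalg.norm(np.asarray(detected, float)
                       - np.asarray(ground_truth, float), axis=1)
    return float(np.sqrt(np.mean(e ** 2))), float(np.max(e))
```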

Fig. 5

Fig. 5   The RMS error (a) and the maximum error (b) of all test images for each fingertip


Fig. 6

Fig. 6   The number of undetected fingertips

(a) and the number of incorrectly detected fingertips (b) among all test images


4.2.2 Qualitative results

To show the qualitative results of Experiment 1 in a vivid fashion, we model a kinematic hand configuration with 60 articulation parameters, i.e., the three-dimensional coordinates of five fingertips, five finger-roots, the palm center, and nine finger joints. The x- and y-coordinates are given by the locations of the hand articulations, while the z-coordinate is given by the depth value of the corresponding location. Each finger contains two joints except the thumb, which contains one. The joints are placed equidistantly using the coordinates of fingertips and finger-roots. The hand model consists of fourteen cylinders and a plane, where each cylinder is determined by a finger joint and a fingertip (or a finger-root), and the plane is fitted to the five finger-roots and the palm center (Fig. 7).
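The equidistant joint placement can be sketched as follows (two joints per non-thumb finger, one for the thumb; points are (x, y, depth) triples):

```python
def finger_joints(root, tip, n_joints):
    """Joints placed equidistantly on the segment from finger-root to
    fingertip; n_joints is 2 for non-thumb fingers and 1 for the thumb."""
    root = np.asarray(root, float)
    tip = np.asarray(tip, float)
    return [root + k / (n_joints + 1.0) * (tip - root)
            for k in range(1, n_joints + 1)]
```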

Fig. 7

Fig. 7   The hand model consists of fourteen cylinders and a plane, characterized by the three-dimensional coordinates of a palm center, five finger-roots, nine finger joints, and five fingertips


We show the qualitative results of Experiment 1 in Figs. 8-10. We mark the fingertips and finger-valleys obtained by the three methods on both color images and depth images, and show the final model obtained by our method. In the examples of Fig. 8, all fingertips and finger-valleys are captured using only the CSS phase. However, the CSS phase cannot capture all fingertips or finger-valleys in the examples of Figs. 9 and 10, because those examples contain bending fingers whose fingertips are not located on the hand contour. Fortunately, the phase of recovering bending fingertips works for those examples. The third and fourth rows show a good recovery of the locations of bending fingertips. However, when the thumb is occluded or more than two fingers are bending, as in the fourth to seventh rows of Fig. 10, the detected locations lack accuracy. In the fifth row, the fingertip of the thumb is actually occluded by the ring finger, while our method recovers it in front of the ring finger. In particular, the sixth and seventh rows of Fig. 10 indicate that our method cannot effectively handle highly occlusive cases.

Fig. 8

Fig. 8   Qualitative results of hand articulations of Experiment 1 (red: fingertips; blue: finger-valleys):

(a) hand models generated by our method; (b) and (c) the CSS phase; (d) and (e) the K-cos method (Lee and Lee, 2011); (f) and (g) the convex-hull method (Nagarajan et al., 2012). All the fingertips and finger-valleys are detected by CSS contours. References to color refer to the online version of this figure


Fig. 9

Fig. 9   Qualitative results of hand articulations of Experiment 1 (red: fingertips; blue: finger-valleys):

(a) hand models generated by our method; (b) and (c) the CSS phase; (d) and (e) the K-cos method (Lee and Lee, 2011); (f) and (g) the convex-hull method (Nagarajan et al., 2012). One or two fingers are bending. References to color refer to the online version of this figure


Fig. 10

Fig. 10   Qualitative results of hand articulations of Experiment 1 (red: fingertips; blue: finger-valleys):

(a) hand models generated by our method; (b) and (c) the CSS phase; (d) and (e) the K-cos method (Lee and Lee, 2011); (f) and (g) the convex-hull method (Nagarajan et al., 2012). Three or four fingers are bending, or the thumb is bending. References to color refer to the online version of this figure


4.3 Experiment 2: MSRA database

We show the qualitative results of Experiment 2 in Fig. 11. While the CSS method works much better than the K-cos and convex-hull methods, the K-curvature method also provides a few good results, except for the 2nd and 12th rows on the left and the 1st, 8th, and 10th rows on the right. We argue that the K-curvature method is more robust in boundary detection than the previous two appearance-based methods (especially when the hand contour cannot be clearly captured), but handles occlusion cases ineffectively. In addition, the K-curvature method cannot detect finger-valley points and thus cannot help locate finger joints.

Fig. 11

Fig. 11   Qualitative results of hand articulations of Experiment 2 (red: fingertips; blue: finger-valleys; yellow: palm center):

(a1) and (a2) are the original depth images, (b1) and (b2) the CSS method, (c1) and (c2) the K-cos method (Lee and Lee, 2011), (d1) and (d2) the convex-hull method (Nagarajan et al., 2012), (e1) and (e2) the K-curvature method (Cerezo, 2012). References to color refer to the online version of this figure


Fig. 12 shows some failure examples of Experiment 2 using the CSS method. We consider these examples failures since our method misses or incorrectly detects at least two fingertips of the target hand. In general, our method fails on examples with large occlusions or low resolutions. This disadvantage is common to appearance-based methods and can be alleviated by model-based methods, which we shall consider in future work.

Fig. 12

Fig. 12   Failure examples of the CSS method of Experiment 2 (red: fingertips; blue: finger-valleys; yellow: palm center).

References to color refer to the online version of this figure


4.4 Computational time

The average running times of our method, the K-cos method (Lee and Lee, 2011), the convex-hull method (Nagarajan et al., 2012), and the K-curvature method (Cerezo, 2012) are 1.96825 s, 0.648467 s, 0.755888 s, and 0.02 s, respectively. The CSS method is slower than the others because it spends much time obtaining the CSS contours, which involves a quadratic fitting step. More efficient discretization schemes for the curvature of the contour would improve our method.

5 Conclusions

We have proposed an appearance-based method for extracting both straight and bending fingers, using a modified CSS descriptor and angle thresholds characterized by the finger-root of the thumb and the palm center. Experimental results showed that our method detects hand articulations more precisely than three state-of-the-art appearance-based approaches, because the CSS descriptors are more robust to noisy or corrupted data. When input images contain two or three bending fingers, our method detects them only with low accuracy, because 2D curvature and other geometric descriptors can hardly handle such highly occlusive cases.

Our method has the following disadvantages. (1) We can handle only hand images with normalized orientation; for depth images with non-normalized orientations, our method fails to detect bending fingers because the thresholds of depth values no longer work. (2) When images contain many bending fingers, the 2D CSS descriptor may incorrectly recognize finger joints as fingertips, as the joints appear on the 2D hand contour. (3) The many parameters require lengthy and extensive off-line testing.

To improve our method, we shall consider extracting 3D CSS contours for 3D hand contours (i.e., the 3D curves around the five fingers, including bending fingers) and integrating the 3D CSS descriptors with a parametric model of human hands, which might handle bending fingers better.

References

Abbasi S, Mokhtarian F, Kittler J, 1999. Curvature scale space image in shape similarity retrieval. Multimedia Syst., 7(8):467-476. doi:10.1007/s005300050147

Athitsos V, Sclaroff S, 2002. An appearance-based framework for 3D hand shape classification and camera viewpoint estimation. Proc. 5th IEEE Int. Conf. on Automatic Face and Gesture Recognition, p.40-45. doi:10.1109/AFGR.2002.1004129

Athitsos V, Sclaroff S, 2003. Estimating 3D hand pose from a cluttered image. Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, p.432-439. doi:10.1109/CVPR.2003.1211500

Cerezo T, 2012. 3D hand and finger recognition using Kinect. Technical Report, Universidad de Granada, Spain. Available at http://frantracerkinectft.codeplex.com

Chang WY, Chen CS, Jian YD, 2008. Visual tracking in high-dimensional state space by appearance-guided particle filtering. IEEE Trans. Image Process., 17(7):1054-1067. doi:10.1109/TIP.2008.924283

de La Gorce M, Fleet DJ, Paragios N, 2011. Model-based 3D hand pose estimation from monocular video. IEEE Trans. Patt. Anal. Mach. Intell., 33(9):1793-1805. doi:10.1109/TPAMI.2011.33

Feng Z, Yang B, Chen Y, et al., 2011. Features extraction from hand images based on new detection operators. Patt. Recog., 44(5):1089-1105. doi:10.1016/j.patcog.2010.08.007

Keskin C, Kıraç F, Kara YE, et al., 2011. Real time hand pose estimation using depth sensors. In: Fossati A, Gall J, Grabner H, et al. (Eds.), Consumer Depth Cameras for Computer Vision. Springer, London, p.119-137. doi:10.1007/978-1-4471-4640-7_7

Kirac F, Kara YE, Akarun L, 2014. Hierarchically constrained 3D hand pose estimation using regression forests from single frame depth data. Patt. Recog. Lett., 50(3):415-422. doi:10.1016/j.patrec.2013.09.003

Lee D, Lee S, 2011. Vision-based finger action recognition by angle detection and contour analysis. ETRI J., 33(3):415-422. doi:10.4218/etrij.11.0110.0313

Ma Z, Wu E, 2014. Real-time and robust hand tracking with a single depth camera. Vis. Comput., 30(10):1133-1144. doi:10.1007/s00371-013-0894-1

Maisto M, Panella M, Liparulo L, et al., 2013. An accurate algorithm for the identification of fingertips using an RGB-D camera. IEEE J. Emerg. Sel. Topics Circ. Syst., 3(2):272-283. doi:10.1109/JETCAS.2013.2256830

Morshidi M, Tjahjadi T, 2014. Gravity optimised particle filter for hand tracking. Patt. Recog., 47(1):194-207. doi:10.1016/j.patcog.2013.06.032

Nagarajan S, Subashini T, Ramalingam V, 2012. Vision based real time finger counter for hand gesture recognition. Int. J. Technol., 2(2):1-5.

Oikonomidis I, Kyriazis N, Argyros AA, 2011. Efficient model-based 3D tracking of hand articulations using Kinect. Proc. British Machine Vision Conf., 1(2):1-11.

Qian C, Sun X, Wei Y, et al., 2014. Realtime and robust hand tracking from depth. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, p.1106-1113. doi:10.1109/CVPR.2014.145

Ren Z, Yuan J, Zhang Z, 2011. Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. Proc. 19th ACM Int. Conf. on Multimedia, p.1093-1096. doi:10.1145/2072298.2071946

Rosales R, Athitsos V, Sigal L, et al., 2001. 3D hand pose reconstruction using specialized mappings. Proc. 8th IEEE Int. Conf. on Computer Vision, p.378-385. doi:10.1109/ICCV.2001.937543

Schlattmann M, Kahlesz F, Sarlette R, et al., 2007. Markerless 4 gestures 6 DOF real-time visual tracking of the human hand with automatic initialization. Comput. Graph. Forum, 26(3):467-476. doi:10.1111/j.1467-8659.2007.01069.x

Tomasi C, Petrov S, Sastry A, 2003. 3D tracking = classification + interpolation. Proc. 9th IEEE Int. Conf. on Computer Vision, p.1441-1448. doi:10.1109/ICCV.2003.1238659

Tompson J, Stein M, Lecun Y, et al., 2014. Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph., 33(5):169.1-169.10.
