Visual Sign and Gesture Recognition

Tracking

Detecting Faces and Hands using Boosted Classifiers

Boosting is a general method that can be used for improving the accuracy of a given learning algorithm [3]. More specifically, it is based on the principle that a highly accurate or "strong" classifier can be produced through the linear combination of many inaccurate or "weak" classifiers.

Classifier efficiency is increased by organising the weak classifiers into a ‘cascade’, where the number of weak classifiers in each layer of the cascade increases with depth. The purpose of this is that initial layers (which are required to test many possible hypotheses) are simple to compute. They should reject large numbers of hypotheses such that later layers (which are more complex and therefore computationally expensive) need only be applied to a small subset of the original hypotheses. In this fashion an exhaustive search over all positions and scales is possible as in excess of 90% of possible hypotheses can be rejected at each stage of the cascade.

Model	N' Layers	Cascade Layers	+ve Training Eg	-ve Training Eg
Head Hands	10 8	2,5,5,20,50,50,150,150,150,150 2,5,5,20,50,50,150,150	2500 2400	6000 6000

boost.png (81638 bytes) In order to perform detection, we train two different strong classifiers for the head and hands respectively. The details of which are given in the table above. The entire image is searched over all position and scale. To perform detection on a section of an image the weak classifiers are transformed such that they are applied to that image section. If the image content does not contain the object of interest (head or hands), it will be rejected by a particular layer on the cascade. Otherwise it will filter down to the final layer to be accepted by it. The strongest detected head and hand hypotheses are then passed to stage I classification. The figure shows some sample images taken from 2 sequences with the detected location of head and hands outlined.

Signers generally face the viewer as directly as possible to ease understanding and remove ambiguities and occlusions that occur at more oblique angles. The system uses the boosted detectors coupled with a contour model of the head and shoulders. This provides a bodycentred co-ordinate system in which to describe the position and motion of the hands. The 2D contour is a coarse approximation to the shape of the shoulders and head and consists of 18 connected points. The contour is a mathematical mean shape taken from a number of sample images of signers.

The contour is fitted to the image by estimating the similarity transform which minimises the contour's distance to local image features. Estimates for key body locations, are placed relative to the location of the head contour. This means that as the contour is transformed to fit the location of the user within the video stream, so the approximate locations of the key body components are also transformed.

Next we derive our feature vector and perform classification