Visual Sign and Gesture Recognition

1Richard Bowden, 2Timor Kadir, 1David Windridge, 1Eng-Jon Ong, 1Antonio Micilotta, 2Andrew Zisserman, 2Michael Brady

1 CVSSP, School of EPS, University of Surrey, Guildford, Surrey, UK
2 Department of Engineering Science, University of Oxford, Oxford, UK.

r.bowden@eim.surrey.ac.uk

Objective

The objective of this work is to efficiently and accurately recognise signed words from British Sign Language using a minimal number of training examples. Furthermore, our aim is to use natural image sequences, without the signer having to wear data gloves or coloured gloves, and to be able to recognise hundreds of signs. The motivation for this work is to provide a real-time interface so that signers can easily and quickly communicate with non-signers.

Why is it hard?

  1. Each country has its own sign language with different vocabularies and grammar. Any system that is to be of use must be independent of the specific language model used.
  2. Individuals are different shapes and sizes and will vary the way in which a specific sign is performed. For example, someone who is new to signing will sign more slowly, use a larger sign space (the volume in which the sign is performed) and show minimal co-articulation between signs. A fluent signer will be far faster, with heavy co-articulation, and will typically use a far smaller sign space. This is similar to speech, where a fluent speaker, with their own dialect, will blur words together and use slang and abbreviations to communicate faster. In addition to these fundamental variations, our sensor modality is video: cameras have different lenses and responses, and an individual may be arbitrarily placed relative to the camera, which further complicates matters.
  3. Traditional approaches, such as the HMMs used in speech recognition, require large amounts of labelled data in order to generalise over the variations raised in point 2 above: the feature space, sign variation and co-articulation artefacts. No such databases exist for sign (unlike for speech). Given the storage requirements of video and the effort of labelling it, the acquisition of labelled data becomes a limiting factor in the size of lexicon that can be addressed. This limitation also has serious implications for point 1 above: one would have to generate training data for each sign language to be learnt.

How do we do it?

We break the problem down into two areas:

  1. Generic tracking of the human, regardless of size, camera type and placement.
  2. A novel two-stage classification architecture, which reduces training requirements by generating a high-level feature description based upon sign linguistics.

An overview of the system is given in the figure below.

[Figure: overview of the system (newoverview.png)]
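
In code terms, the pipeline in the figure can be sketched roughly as follows. This is an illustrative skeleton only: the function names, data structures and signatures are hypothetical and do not come from the actual system, which is described in the publications listed at the end of this page.

    # Illustrative skeleton of the pipeline; all names here are hypothetical.
    from dataclasses import dataclass
    from typing import List

    import numpy as np


    @dataclass
    class BodyState:
        """Per-frame tracking output: rough 2D positions of hands and body landmarks."""
        left_hand: np.ndarray    # (x, y)
        right_hand: np.ndarray   # (x, y)
        shoulders: np.ndarray    # 2 x (x, y)
        face: np.ndarray         # (x, y)


    def track_body(frame: np.ndarray) -> BodyState:
        """Area 1: generic tracking of the person, independent of size, camera and placement."""
        raise NotImplementedError  # placeholder for the tracking component


    def stage1_features(state: BodyState) -> str:
        """Stage I: map tracking output to a broad linguistic symbol,
        e.g. 'open_hand/left_shoulder/moving_right'."""
        raise NotImplementedError  # placeholder for the linguistic feature description


    def stage2_classify(symbols: List[str]) -> str:
        """Stage II: temporal classification of the symbol sequence into a sign
        (a bank of Markov models in the actual system; see the sketch further down)."""
        raise NotImplementedError


    def recognise(frames: List[np.ndarray]) -> str:
        """End to end: frames -> tracking -> linguistic symbols -> sign label."""
        symbols = [stage1_features(track_body(f)) for f in frames]
        return stage2_classify(symbols)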

The novelty of our approach is that we structure the classification model around a linguistic definition of signed words, rather than an HMM. This enables signs to be learnt reliably from just a handful of training examples. The classification process is divided into two stages. The first generates a description of hand shape and movement at the level of 'the hand has shape 5 (an open hand) and is over the left shoulder moving right'. This level of feature is based directly upon those used within sign linguistics to document signs. Its broad description aids generalisation and therefore significantly reduces the requirements of further stages of classification. In the second stage, we apply Independent Component Analysis (ICA) to separate the channels of information from uncorrelated noise. Final (stage II) classification uses a bank of Markov models to recognise the temporal transitions of individual words/signs.
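
As a concrete illustration of the stage II idea, the sketch below builds one first-order Markov chain per sign over discrete stage I feature symbols, trains each chain from only one or two example sequences, and classifies a new sequence by maximum log-likelihood. This is a minimal sketch of the general technique rather than the authors' implementation; it omits the ICA step, and the smoothing parameter and toy symbols are assumptions.

    # Minimal sketch: a bank of first-order Markov chains over discrete
    # stage I feature symbols, one chain per sign. Not the authors' code;
    # the ICA step described above is omitted here.
    import numpy as np


    class MarkovChainBank:
        def __init__(self, n_symbols: int, alpha: float = 1.0):
            self.n = n_symbols
            self.alpha = alpha   # additive smoothing, important with very few examples
            self.models = {}     # sign label -> (log initial probs, log transition matrix)

        def fit_sign(self, label, sequences):
            """Estimate a chain for one sign from one or more symbol sequences."""
            init = np.full(self.n, self.alpha)
            trans = np.full((self.n, self.n), self.alpha)
            for seq in sequences:
                init[seq[0]] += 1
                for a, b in zip(seq[:-1], seq[1:]):
                    trans[a, b] += 1
            self.models[label] = (
                np.log(init / init.sum()),
                np.log(trans / trans.sum(axis=1, keepdims=True)),
            )

        def log_likelihood(self, label, seq):
            """Log-probability of a symbol sequence under one sign's chain."""
            log_init, log_trans = self.models[label]
            ll = log_init[seq[0]]
            for a, b in zip(seq[:-1], seq[1:]):
                ll += log_trans[a, b]
            return ll

        def classify(self, seq):
            """Return the sign whose chain gives the sequence the highest log-likelihood."""
            return max(self.models, key=lambda label: self.log_likelihood(label, seq))


    # Toy usage with 3 feature symbols and one training example per sign.
    bank = MarkovChainBank(n_symbols=3)
    bank.fit_sign("sign_hello", [[0, 1, 1, 2]])
    bank.fit_sign("sign_thanks", [[2, 2, 0, 0]])
    print(bank.classify([0, 1, 2]))  # favours "sign_hello" given the training above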

Results

The system runs in real time and achieves extremely high recognition rates on large lexicons with as little as a single training instance per sign. We have demonstrated classification rates as high as 92% for a lexicon of 164 signs with extremely low training requirements, outperforming previous approaches that require thousands of training examples.

How do I find out more?

We have published a number of papers in this area. The most recent description of the system is [1]; a slightly older version (without the boosting and with a smaller lexicon) appears in [5]. For details of the boosting see [1] and [6]. For discussions of the feature selection process see [4]. For body tracking and estimating elbows see [3]. A demonstration of the system was given at [8] and [5], and further demonstrations will shortly appear at [3] and [1]. For older work on hand modelling see [9] and [10]. Failing all that, you are more than welcome to email me at r.bowden@eim.surrey.ac.uk.

Dataset Availability

We have assembled a number of datasets for this work. Our latest dataset consists of two individuals performing 10 repetitions of 164 different signs taken from British Sign Language. The movies are available as MPEG-2, PAL-resolution DivX and half-PAL-resolution DivX, with associated ground-truth label files for each of the signs performed. For ease of segmentation, each signer remains relatively static in front of a uniform dark background, wearing a red shirt and two differently coloured gloves. Ground-truthed test sequences without gloves are also available. If you would like to obtain this dataset, we are happy to make it available for a small charge to cover the cost of media duplication. For more information contact r.bowden@eim.surrey.ac.uk.

Publications and further information

  1. Bowden R. Progress in Sign and Gesture Recognition. Invited speaker (to appear), AMDO 2004, Third International Workshop on Articulated Motion and Deformable Objects, Palma de Mallorca, Spain.
  2. Bowden R, Kadir T, Ong E, Windridge D, Zisserman A, Brady M. Minimal Training, Large Lexicon, Unconstrained Sign Language Recognition. To appear in Proc. BMVC 2004.
  3. Micilotta A, Bowden R. View-based Location and Tracking of Body Parts for Visual Interaction. To appear in Proc. BMVC 2004.
  4. Windridge D, Bowden R. A General Strategy for Hidden Markov Chain Parameterisation in Composite Feature Space. To appear in Proc. SSPR 2004, Syntactical and Structural Pattern Recognition.
  5. Bowden R, Windridge D, Kadir T, Zisserman A, Brady M. A Linguistic Feature Vector for the Visual Interpretation of Sign Language. In Tomas Pajdla, Jiri Matas (Eds.), Proc. 8th European Conference on Computer Vision, ECCV 2004, LNCS 3022, Springer-Verlag (2004), Volume 1, pp. 391-401.
  6. Ong E, Bowden R. Detection and Segmentation of Hand Shapes using Boosted Classifiers. In Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, FGR 2004, IEEE Comp. Soc. TC PAMI, Korea, 2004, pp. 889-894.
  7. Windridge D, Bowden R. Induced Decision Fusion in Automatic Sign Language Interpretation: Using ICA to Isolate the Underlying Components of Sign. In 5th International Workshop on Multiple Classifier Systems, MCS 2004, Cagliari, Italy, 2004.
  8. Bowden R, Zisserman A, Kadir T, Brady M. Vision based Interpretation of Natural Sign Languages. Exhibition at ICVS 2003, The 3rd International Conference on Computer Vision Systems, Graz, Austria, April 2003. Short paper and exhibit poster.
  9. Bowden R, Sarhadi M. A non-linear Model of Shape and Motion for Tracking Finger Spelt American Sign Language. Image and Vision Computing, vol. 20/9-10, pp. 597-607, Aug 2002, Elsevier Science Ltd.
  10. Bowden R, Sarhadi M. Building Temporal Models for Gesture Recognition. In Proc. BMVC 2000, M. Mirmehdi & B. Thomas (Eds.), Vol. 1, pp. 32-41, Bristol, UK, Sept 2000.