Visual Sign and Gesture Recognition
1,2Richard Bowden, 2Timor Kadir, 1David Windridge, 1Eng-Jon Ong, 1Antonio Micilotta, 2Andrew Zisserman, 2Michael Brady
1 CVSSP, School of EPS, University of Surrey, Guildford, Surrey, UK
2 Department of Engineering Science, University of Oxford, Oxford, UK
r.bowden@eim.surrey.ac.uk
Objective
The objective of this work is to efficiently and accurately recognise signed words from British Sign Language using a minimal number of training examples. Furthermore, our aim is to use natural image sequences, without the signer having to wear data gloves or coloured gloves, and to be able to recognise hundreds of signs. The motivation for this work is to provide a real-time interface so that signers can easily and quickly communicate with non-signers.
Why is it hard?
- Each country has its own sign language with different
vocabularies and grammar. Any system that is to be of use must be independent of the
specific language model used.
- Individuals are different shapes and sizes and will vary the way in which a specific sign is performed. For example, someone who is new to signing will sign more slowly and with a larger sign space (the volume in which the sign is performed), with minimal co-articulation between signs, whereas a fluent signer will be far faster, with heavy co-articulation, and will typically use a far smaller sign space. This is similar to speech, where a fluent speaker, with their own dialect, will blur words together and use slang and abbreviations to communicate faster. In addition to these fundamental variations, our sensor modality is video: cameras have different lenses and responses, and an individual may be arbitrarily placed relative to the camera, which further complicates matters.
- Traditional approaches such as those used in speech recognition (e.g. the HMM) require large amounts of labelled data in order to generalise over the variations raised in the previous point, such as the feature space, sign variation and co-articulation artefacts. No such databases exist for sign (unlike for speech). Considering the storage requirements of video and the task of labelling this data, the acquisition of labelled data becomes a limiting factor in the size of lexicon that can be addressed. Obviously this limitation also has serious implications for the issues raised in the first point above: one would have to generate training data for each sign language to be learnt.
How do we do it?
We break the problem down into two areas:
- Generic tracking of the human, regardless of size, camera type and placement.
- A novel two-stage classification architecture which reduces training requirements by generating a high-level feature description based upon sign linguistics (illustrated by the sketch below).
An overview of the system is given in the figure.
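By way of illustration, the sketch below shows the kind of broad, binary "linguistic" feature description the first stage produces: hand shape, position relative to body landmarks, and coarse motion direction. The landmark names, thresholds and hand-shape classes here are hypothetical and chosen only to make the idea concrete; they are not the system's actual feature definitions.

```python
# Illustrative sketch only: a broad, binary feature description in the spirit of
# the stage-one linguistic features described above. All names, landmarks and
# thresholds are assumptions for this example, not the system's real ones.
from dataclasses import dataclass

@dataclass
class HandObservation:
    x: float          # tracked hand centre, image x coordinate
    y: float          # tracked hand centre, image y coordinate
    dx: float         # frame-to-frame displacement in x
    dy: float         # frame-to-frame displacement in y
    shape_id: int     # discrete hand-shape class from a shape classifier

def linguistic_features(hand: HandObservation, landmarks: dict) -> dict:
    """Map a tracked hand into sign-linguistics-style binary features."""
    feats = {}
    # Position relative to body landmarks, e.g. landmarks["left_shoulder"] = (x, y).
    # In image coordinates y grows downwards, so "over" means a smaller y.
    for name, (lx, ly) in landmarks.items():
        feats[f"over_{name}"] = abs(hand.x - lx) < 40 and hand.y < ly
    # Coarse movement direction (pixels per frame; purely illustrative thresholds).
    feats["moving_right"] = hand.dx > 5
    feats["moving_left"] = hand.dx < -5
    feats["moving_up"] = hand.dy < -5
    feats["moving_down"] = hand.dy > 5
    # One-of-N hand shape, e.g. shape 5 = open hand.
    feats["open_hand"] = hand.shape_id == 5
    return feats
```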

The novelty of our approach is that we structure the
classification model around a linguistic definition of signed words, rather than an HMM.
This enables signs to be learnt reliably from just a handful of training examples. The
classification process is divided into two stages. The first generates a description of hand shape and movement at the level of 'the hand has shape 5 (an open hand) and is over the left shoulder moving right'. This level of feature is based directly upon those used
within sign linguistics to document signs. Its broad description aids in generalisation
and therefore significantly reduces the requirements of further stages of classification.
In the second stage, we apply Independent Component Analysis (ICA) to separate the
channels of information from uncorrelated noise. Final (stage II) classification uses a
bank of Markov models to recognise the temporal transitions of individual words/signs.
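As a rough sketch of the stage II idea (not the actual implementation), the following shows a bank of first-order Markov chains, one per sign, each estimated from the transitions of discrete stage-one feature symbols and used to classify a new sequence by maximum log-likelihood. The symbol alphabet, Laplace smoothing and the omission of the preceding ICA step are simplifying assumptions made for the example.

```python
# Minimal sketch of a bank of first-order Markov chains over discrete feature
# symbols, one chain per sign, classifying by maximum log-likelihood.
# An illustration of the general technique, not the system's code.
import numpy as np

class MarkovChainBank:
    def __init__(self, n_symbols: int, alpha: float = 1.0):
        self.n_symbols = n_symbols
        self.alpha = alpha      # Laplace smoothing keeps unseen transitions non-zero
        self.models = {}        # sign label -> row-stochastic transition matrix

    def fit(self, label: str, sequences: list) -> None:
        """Estimate one transition matrix, possibly from a single example sequence."""
        counts = np.full((self.n_symbols, self.n_symbols), self.alpha)
        for seq in sequences:
            for a, b in zip(seq[:-1], seq[1:]):
                counts[a, b] += 1
        self.models[label] = counts / counts.sum(axis=1, keepdims=True)

    def classify(self, seq: list) -> str:
        """Return the sign whose chain assigns the sequence the highest log-likelihood."""
        def loglik(T: np.ndarray) -> float:
            return float(sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:])))
        return max(self.models, key=lambda label: loglik(self.models[label]))
```

Because each chain needs only transition statistics, even one example sequence per sign yields a usable model, which is in the spirit of the minimal-training goal described above.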
Results
The system is capable of running in real time and generates extremely high recognition rates for large lexicons from as little as a single training instance per sign. We have demonstrated classification rates as high as 92% for a lexicon of 164 signs with extremely low training requirements, outperforming previous approaches where thousands of training examples are required.
How do I find out more?
We have published a number of papers in this area; the most recent that describes the system is [1], or, slightly older (without the boosting and with a smaller lexicon), [5]. For details of the boosting see [1] and [6]. For discussions about the feature selection process see [4]. For body tracking and estimating elbows see [3]. A demonstration of the system was performed at [8] and [5], and demonstrations are shortly to appear at [3] and [1]. For older work on hand modelling see [9] and [10]. Failing all that, you are more than welcome to email me at r.bowden@eim.surrey.ac.uk.
Dataset Availability
We have a number of datasets that we have assembled for this work. Our latest dataset consists of two individuals performing 10 repetitions of 164 different signs taken from British Sign Language. The movies are available as MPEG-2, PAL-resolution DivX and half-PAL-resolution DivX, with associated ground-truth label files for each of the signs performed. For ease of segmentation, each signer remains relatively static in front of a uniform dark background, wearing a red shirt and two differently coloured gloves. Ground-truthed test sequences are also available without gloves. If you would like to obtain this dataset, we are happy to make it available for a small charge to cover the cost of media duplication. For more info contact r.bowden@eim.surrey.ac.uk.
Publications and further information
[1] Bowden R. Progress in Sign and Gesture Recognition. Invited Speaker (to appear), AMDO2004, Third International Workshop on Articulated Motion and Deformable Objects, Palma de Mallorca, Spain, 2004.
[2] Bowden R, Kadir T, Ong E-J, Windridge D, Zisserman A, Brady M. Minimal Training, Large Lexicon, Unconstrained Sign Language Recognition. To appear in Proc. BMVC'04, 2004.
[3] Micilotta A, Bowden R. View-based Location and Tracking of Body Parts for Visual Interaction. To appear in Proc. BMVC'04, 2004.
[4] Windridge D, Bowden R. A General Strategy for Hidden Markov Chain Parameterisation in Composite Feature Space. To appear in Proc. SSPR'04, Syntactical and Structural Pattern Recognition, 2004.
[5] Bowden R, Windridge D, Kadir T, Zisserman A, Brady M. A Linguistic Feature Vector for the Visual Interpretation of Sign Language. In Tomas Pajdla, Jiri Matas (Eds), Proc. 8th European Conference on Computer Vision, ECCV'04, LNCS 3022, Springer-Verlag, 2004, Volume 1, pp. 391-401.
[6] Ong E-J, Bowden R. Detection and Segmentation of Hand Shapes using Boosted Classifiers. In Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, FGR'04, IEEE Comp. Soc. TC PAMI, Korea, 2004, pp. 889-894.
[7] Windridge D, Bowden R. Induced Decision Fusion in Automatic Sign Language Interpretation: Using ICA to Isolate the Underlying Components of Sign. In 5th International Workshop on Multiple Classifier Systems, MCS'04, Cagliari, Italy, 2004.
[8] Bowden R, Zisserman A, Kadir T, Brady M. Vision-based Interpretation of Natural Sign Languages. Exhibition at ICVS'03, The 3rd International Conference on Computer Vision Systems, Graz, Austria, April 2003. Short paper and exhibit poster.
[9] Bowden R, Sarhadi M. A Non-linear Model of Shape and Motion for Tracking Finger Spelt American Sign Language. Image and Vision Computing, vol. 20(9-10), pp. 597-607, Aug 2002, Elsevier Science Ltd.
[10] Bowden R, Sarhadi M. Building Temporal Models for Gesture Recognition. In Proc. BMVC'00, M Mirmehdi & B Thomas (Eds), Vol. 1, pp. 32-41, Bristol, UK, Sept 2000.