Text2Sign: Towards Sign Language Production using Neural Machine Translation and Generative Adversarial Networks
by Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield and Richard Bowden
Abstract:
We present a novel approach to automatic Sign Language Production (SLP) using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks (GANs), and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph (MG). The resulting pose information is then used to condition a generative model that produces photo-realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach for both multi-signer and high-definition settings.
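As a rough illustration of the text-to-gloss NMT stage mentioned in the abstract, the sketch below shows a minimal GRU encoder-decoder trained with teacher forcing. It assumes PyTorch; the vocabulary sizes, dimensions, and random toy data are hypothetical, and this is not the authors' text2gloss implementation (see the linked repository for that).

```python
# Illustrative sketch only: a minimal encoder-decoder for text-to-gloss
# translation, loosely following the NMT stage described in the abstract.
# All sizes and the toy data are made up for demonstration purposes.
import torch
import torch.nn as nn

class Text2Gloss(nn.Module):
    def __init__(self, src_vocab, trg_vocab, emb=64, hid=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.trg_emb = nn.Embedding(trg_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, trg_vocab)

    def forward(self, src_ids, trg_ids):
        # Encode the spoken-language sentence into a context vector.
        _, context = self.encoder(self.src_emb(src_ids))
        # Decode gloss tokens conditioned on that context (teacher forcing).
        dec_out, _ = self.decoder(self.trg_emb(trg_ids), context)
        return self.out(dec_out)  # (batch, trg_len, trg_vocab) logits

# Toy usage with random token ids standing in for a tokenized sentence/gloss pair.
model = Text2Gloss(src_vocab=1000, trg_vocab=500)
src = torch.randint(0, 1000, (2, 12))  # batch of 2 sentences, 12 tokens each
trg = torch.randint(0, 500, (2, 8))    # corresponding gloss sequences
logits = model(src, trg)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 500), trg.reshape(-1))
loss.backward()
print(logits.shape, float(loss))
```

In the full pipeline described by the paper, the predicted gloss sequence would then drive a Motion Graph to produce pose sequences, which in turn condition the generative video model.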
Reference:
Text2Sign: Towards Sign Language Production using Neural Machine Translation and Generative Adversarial Networks (Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield and Richard Bowden), In International Journal of Computer Vision (IJCV), Springer, 2019. (Recorded presentation, text2gloss code)
Bibtex Entry:
@article{Stoll19,
  Title                    = {Text2Sign: Towards Sign Language Production using Neural Machine Translation and Generative Adversarial Networks},
  Author                   = {Stephanie Stoll and Necati Cihan Camgoz and Simon Hadfield and Richard Bowden},
  Journal                  = {International Journal of Computer Vision (IJCV)},
  Year                     = {2019},

  Publisher                = {Springer},

  Abstract                 = {We present a novel approach to automatic Sign Language Production (SLP) using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks (GANs), and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph (MG). The resulting pose information is then used to condition a generative model that produces photo-realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach for both multi-signer and high-definition settings.},
  Comment                  = {<a href="https://youtu.be/VisZLaZyblE?t=3457">Recorded presentation</a>, <a href="https://github.com/neccam/text2gloss">text2gloss code</a>},
  Url                      = {http://personalpages.surrey.ac.uk/s.hadfield/papers/Stoll19.pdf},
  Pages                    = {1--18},
}