Introduction
Automatic speech recognition (ASR) is a key technology in our modern
information society, as, along with other spoken language technologies, it
enables human-computer interaction in the most natural and universal way,
while leaving the eyes and hands free to carry out secondary or complementary
functions.
Commercial research into ASR tends to concentrate on long-term development of
in-house systems, which necessarily favours incremental modifications and
improvements to the existing state of the art.
The proposed research, by contrast, takes the form of basic innovation and
deals initially with a simpler recognition task.
It employs a fundamentally different approach to modelling, based on
incorporating knowledge of speech dynamics into the recognizer architecture.
For the sake of mathematical tractability, traditional methods, such as the
hidden Markov model (HMM), have necessarily started from a naïve
representation of the speech production process:
that it is piecewise stationary.
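Concretely, in standard HMM notation (symbols introduced here purely for
illustration), a state sequence $q_1, \ldots, q_T$ generates the observations
$o_1, \ldots, o_T$ independently from fixed state-conditional densities:

    $p(o_1, \ldots, o_T \mid q_1, \ldots, q_T) = \prod_{t=1}^{T} b_{q_t}(o_t)$

Each output pdf $b_j(\cdot)$ depends neither on how long state $j$ has been
occupied nor on the neighbouring observations, so every state models a
statistically constant stretch of signal, while the articulators in fact move
continuously.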
The proposed research seeks to take advantage of the stochastic formalism while
extending the theory to create more realistic models of articulator movement,
derived from the study of direct articulatory measurements.
A revolution in ASR occurred in the 1970s when statistical methods based on
HMMs overtook rule-based ones.
In the 1980s and 1990s, these systems were expanded, refined and furnished
with additional training data, allowing incremental improvements that advanced
the technology from isolated-word, speaker-dependent dictation to noise-robust,
large-vocabulary continuous speech recognition.
Now we need systems that can adapt to different users and speaking styles for
spontaneous speech,
yet it is under these conditions that HMMs have revealed their limitations,
which is partly why interest in articulatory approaches was revived in the
1990s.
The rigour of the HMM formalism has sustained the last 30 years of development,
though HMMs continue to suffer from simplistic assumptions, such as a
time-invariant state probability density function (pdf), and fail to
capitalise on our knowledge of speech science in many ways.
Research has progressed with artificial neural networks [RobEtAl02],
but our physical understanding of the speech production process offers a
substantial opportunity for improving performance, provided the appropriate
parameters can be learnt.
Speech is part of our daily lives and impinges on many facets of scientific
endeavour, so understanding how it is produced affects researchers from many
different disciplines.
The literature on speech articulation has contributions from physiology and
acoustics,
as well as psychology, linguistics, engineering, mathematics and medical
imaging, defining the field as essentially multidisciplinary.
To relate these areas to the statistical modelling of speech gestures, the
literature is divided into six themes:
- the structure of speech,
- coarticulation,
- duration,
- articulatory models,
- the work of Deng, and
- acoustic modelling techniques.
Background
The debate continues as to whether auditory perception,
articulatory features or syllables govern the structure of
speech [ChoHal68, Pic80, Lin96, RicEtAl00, Ost00, GreEtAl03].
The expression of language through the human vocal apparatus is unquestionably
an interplay between words and articulation that is designed to allow
recognition of those words by a human listener.
Hence, the process of speech production may be viewed as the transformation
whereby the message is encoded as an utterance.
This project seeks to take advantage of this bottleneck in information
transmission by developing compact models of linguistically meaningful speech
gestures.
The variety of opinions is related to the practical difficulty of turning
phrases into sound waves:
the message is typically organized into words, sentences, and paragraphs (in
text) or
turns (in dialogue), but the utterance itself is structured by the fast
consonantal movements of intrinsic muscles, the slower vowel movements of the
extrinsic musculature (corresponding to syllables) and breath groups.
Each constraint plays its part according to the circumstances, while the final
realization as a speech utterance is coarticulated: a compromise among all
these competing factors, plus those affecting its reception by the listener.
Models of speech production that capture these factors will enable us to
analyse their effects quantitatively.
Once it is shown that trained models have learnt such behaviours
automatically from data, significant improvements in ASR accuracy are most
likely to follow.
Traditional recognition systems address the problem of coarticulation by
learning separate models of the phonemes for each phonetic context
(i.e., context-sensitive triphones),
which requires large amounts of training data.
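To illustrate the cost of this strategy, the following minimal Python sketch
(the function name, boundary label and phone symbols are ours, purely for
illustration) expands a monophone sequence into the widely used
'left-centre+right' triphone notation; with a phone set of size N, the
inventory grows from N models towards N^3, which is why so much training data
is needed:

    def to_triphones(phones, boundary="sil"):
        """Expand a monophone sequence into context-dependent triphone
        labels, 'left-centre+right', padding utterance edges with silence."""
        padded = [boundary] + list(phones) + [boundary]
        return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
                for i in range(1, len(padded) - 1)]

    # The word "cat" /k ae t/ requires three distinct context-dependent models:
    print(to_triphones(["k", "ae", "t"]))
    # -> ['sil-k+ae', 'k-ae+t', 'ae-t+sil']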
Meanwhile, knowledge that we already have is discarded: for instance,
that "the vowel and consonant gestures are largely independent" [Ohm67],
that a vowel's left context is more influential than its right context, and
vice versa for consonants [Pic80],
that pauses, phonetic context and stress affect segment
duration [Kla87, GreEtAl03],
and that effects can span several phonemes [West00].
Nevertheless, some attempts have been made to imitate the smoothness of
articulatory movements [RicBri99] and to model the correlations of
articulatory targets with those of adjacent phonemes [BlaYou00].
In terms of segment durations, research has tended to polarise between
linguistic studies of mean phone duration on small data sets [LisAbr64] and
phonetic studies of temporal cues [Hou61, LisAbr64] on the one hand, and the
design of suitable parametric distribution models for use in
ASR [RusMoo85, Bur96] on the other.
The latter tend to concentrate on how best to model the statistical properties
of context-independent phonemes (i.e., monophones),
rather than considering how best to incorporate the kinds of dependencies that
are typically observed.
Nevertheless, recent pilot studies indicate that improvements in performance
can be forthcoming [Jac03a].
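For concreteness: a standard HMM state with self-transition probability $a$
implies a geometric duration distribution, $P(d) = (1 - a)\,a^{d-1}$, whose
mode is always a single frame, whereas measured phone durations peak well away
from zero and are better captured by, for example, a gamma distribution, as in
explicit-duration variants [RusMoo85, Bur96]. A minimal Python sketch of such
a fit (synthetic durations and SciPy; every value is illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Stand-in for measured phone durations, in frames.
    durations = rng.gamma(shape=4.0, scale=2.0, size=1000)

    # Maximum-likelihood gamma fit, with the location pinned at zero.
    shape, loc, scale = stats.gamma.fit(durations, floc=0)

    # Geometric duration model with the same mean, as implied by an HMM
    # self-transition probability a = 1 - 1/mean; its mode is one frame.
    a = 1.0 - 1.0 / durations.mean()

    print(f"gamma fit: shape={shape:.2f}, scale={scale:.2f}, "
          f"mode {(shape - 1.0) * scale:.1f} frames")
    print(f"matched geometric: a={a:.3f}, mode 1 frame")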
There are good reasons other than ASR for developing articulatory
models [ShaDam01, Huc02],
as is evident from the varied attempts over the years:
for production [Mer73, Cok76, KabHon01], and
for speech synthesis [Dud40, ParCok92, KurEtAl99];
other studies have profited from advances in measurement techniques
[Wes96, KabHon01].
The work of Deng is notable, however, for his radical attempt to meld many
of these conflicting ideas into one holistic approach to ASR from speech
production, using Bayesian networks [DengMa00].
While his results are promising, they have not been independently verified,
and it is not clear from his analysis whether the complicated multi-tiered
recognition system he proposes has learnt attributes of actual speech dynamics
or merely captured statistical characteristics of the speech signal.
This is a question that equally hangs over other recent research that uses a
hidden dynamical model within the recognizer
[FraKing01, RosGal01].
Moreover, these attempts have not yet delivered the substantial improvement
that is anticipated, which calls for further investigation to analyse the
emergent behaviour of the dynamical models and to determine how it
corresponds to actual articulation.
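To make the class of model concrete, the following minimal Python sketch (one
deliberately simple linear formulation, broadly in the spirit of the hidden
dynamic model of [RicBri99]; all names and parameter values are illustrative)
generates a hidden 'articulatory' state that relaxes towards each segment's
target, undershooting when segments are short, much as coarticulation does:

    import numpy as np

    def hidden_dynamic_trajectory(targets, durations, rho=0.7,
                                  noise=0.02, seed=0):
        """First-order linear dynamic, x[t+1] = rho*x[t] + (1-rho)*target,
        plus process noise, so the state approaches each segment's target
        smoothly and carries its history across segment boundaries."""
        rng = np.random.default_rng(seed)
        x, trajectory = 0.0, []
        for target, dur in zip(targets, durations):
            for _ in range(dur):
                x = rho * x + (1.0 - rho) * target + rng.normal(0.0, noise)
                trajectory.append(x)
        return np.array(trajectory)

    # Three segments; the short middle segment never reaches its target.
    print(np.round(hidden_dynamic_trajectory([1.0, -0.5, 0.8], [8, 4, 8]), 2))

In a full recognizer, the acoustics would be generated from this hidden state
through a further (possibly nonlinear) observation mapping; the open question
above is whether the state learnt from data behaves like real articulators.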
Relevance
From the perspective of speech science, this project will develop segmental
models so as to capture certain known characteristics of fluent
speech, with parameters estimated in a quantitative and statistical way (i.e.,
by maximum likelihood).
Such characteristics include: accurately modelling the way that phones vary in
duration, and the way that the distributions of phone durations vary with
their context; finding the articulatory parameters, and the form of their
trajectories, that best represent meaningful speech acts; and describing the
behaviour of both redundant and critical articulators.
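In the simplest case (notation ours, for illustration only), the
context-dependent duration models above would be fitted by maximising the
log-likelihood of the observed segment durations $d_i$ given their contexts
$c_i$ over the model parameters $\theta$:

    $\hat{\theta} = \arg\max_{\theta} \sum_{i} \log p(d_i \mid c_i;\, \theta)$

with analogous criteria for the trajectory and articulator models.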
Hence, we plan to make significant advances in charting out and
quantifying those aspects of the timing and posture of articulators during
speech production that are important for recognition.
We expect there to be strong synergy between the advances in knowledge and
understanding of speech dynamics and the ability of consistent models to
recognize correctly.
For example, phone recognition accuracy improvements from better state
alignment of linear mappings could reveal a new coarticulatory phenomenon,
while understanding the behaviour of redundant articulators could lead to a
more comprehensive modelling strategy with commensurate benefits in performance.
Thus, in addition to the fundamental scientific and technological
progress, there would be potential benefits across a whole gamut of
application areas.
The use of articulatory constraints based on human speech production is most
likely to deliver improvements for spontaneous speech, where the style is
casual and continuous,
and in noisy environments: two of the most demanding speech recognition tasks
today, yet crucial to the technology's wider deployment in society.
The modelling paradigm could also aid adaptation across a number of
influential human factors, e.g., vocal-tract length, accent, and speaking rate
and style.
Equally, improvements in modelling accuracy would help to capture voice
characteristics for biometric tasks, like speaker recognition and
authentication.
There would be indirect benefits for speech synthesis that stem from the novel
ability to extract the essence of articulatory dynamics from a speech
database, encapsulated in the models.
Furthermore, these models would provide parameterisation for a model-based
synthesiser that could be readily integrated with a talking head, for
instance, since information concerning tongue, lip and jaw movements would be
present inherently.
Speaking agents have obvious applications in gaming, education,
foreign-language training and speech therapy.
Finally, it is conceivable that the enhanced representational power of dynamic
articulatory models learnt through generic statistical methods could offer
very low bit rate transmission of speech for extremely efficient speech coding
or, indeed, of other forms of gestural communication where there are
accompanying opportunities for audio-visual data fusion (e.g., for ASR in
noise) and multimodal integration.
Notwithstanding the range and extent of the prospective technological
beneficiaries, the primary ambition and driving force behind the project
remains the enhancement of speech recognition performance through better
modelling of the articulatory dynamics of speech production.
References
[BlaYou00] | C. S. Blackburn and S. J. Young.
A self-learning predictive model of articulator movements during
speech production.
J. Acoust. Soc. Am., 107(3):1659-1670, 2000.
|
[Bur96] | D. Burshtein.
Robust parametric modeling of durations in hidden Markov models.
IEEE Trans. SAP, 4(3):240-242, 1996.
|
[ChoHal68] | N. Chomsky and M. Halle.
The Sound Pattern of English.
Harper and Row, New York, NY, 1968.
|
[Cok76] | C. H. Coker.
A model of articulatory dynamics and control.
Proc. IEEE, 64(4):452-460, 1976.
|
[DengMa00] | L. Deng and J. Ma.
Spontaneous speech recognition using a statistical coarticulatory
model for the vocal-tract-resonance dynamics.
J. Acoust. Soc. Am., 108(6):3036-3048, 2000.
|
[Dud40] | H. Dudley.
The carrier nature of speech.
Bell Syst. Tech. J., 19:495-513, 1940.
|
[FraKing01] | J. Frankel and S. King.
Mixture density networks, human articulatory data and
acoustic-to-articulatory inversion of continuous speech.
Proc. Inst. of Acoust., Stratford-upon-Avon, UK,
23(3):37-46, 2001.
|
[GreEtAl03] | S. Greenberg, H. M. Carvey, L. Hitchcock, and S. Chang.
Temporal properties of spontaneous speech -- a syllable-centric
perspective.
J. Phon., in review, 2003.
|
[Hou61] | A. S. House.
On vowel duration in English.
J. Acoust. Soc. Am., 33(9):1174-1178, 1961.
|
[Huc02] | M. A. Huckvale.
Speech synthesis, speech simulation and speech science.
In Proc. Int. Conf. on Spoken Lang. Proc., Denver, CO,
pages 1261-1264, 2002.
|
[Jac03a] | P. J. B. Jackson.
Improvements in phone-classification accuracy from modelling
duration.
In Proc. Int. Cong. of Phon. Sci., Barcelona, pages
1349-1352, 2003.
|
[KabHon01] | T. Kaburagi and M. Honda.
Dynamic articulatory model based on multidimensional
invariant-feature task representation.
J. Acoust. Soc. Am., 110(1):441-452, 2001.
|
[Kla87] | D. Klatt.
Review of text-to-speech conversion for English.
J. Acoust. Soc. Am., 82(3):737-793, 1987.
|
[KurEtAl99] | T. Kuratate et al.
Audio-visual synthesis of talking faces from speech production
correlates.
In Proc. Eurospeech '99, Budapest, volume 3, pages
1279-1282, 1999.
|
[Lin96] | B. Lindblom.
Role of articulation in speech perception.
J. Acoust. Soc. Am., 99(3):1683-1692, 1996.
|
[LisAbr64] | L. Lisker and A. S. Abramson.
A cross-language study of voicing in initial stops: acoustical
measurements.
Acoustic Characteristics of Speech, reprinted from Word,
20(3):527-565, 1964.
|
[Mer73] | P. Mermelstein.
Articulatory model for the study of speech production.
J. Acoust. Soc. Am., 53(4):1070-1082, 1973.
|
[Ost00] | M. Ostendorf.
Moving beyond the 'beads-on-a-string' models of speech.
Proc. IEEE ASRU, 2000.
|
[ParCok92] | S. Parthasarathy and C. H. Coker.
On automatic estimation of articulatory parameters in a
text-to-speech system.
Comp. Speech & Lang., 6:37-75, 1992.
|
[Pic80] | J. M. Pickett.
The Sounds of Speech Communication.
Univ. Pk. Press, Baltimore, MD, USA, 1980.
|
[RicBri99] | H. B. Richards and J. S. Bridle.
The HDM: a segmental Hidden Dynamic Model of coarticulation.
In Proc. IEEE-ICASSP, Phoenix, AZ, pages 357-360, 1999.
|
[RicEtAl00] | M. Richardson, J. Bilmes, and C. Diorio.
Hidden-articulator Markov models for speech recognition.
In Proc. ISCA ITRW ASR2000, Paris, pages 133-139, 2000.
|
[RobEtAl02] | A. J. Robinson et al.
Connectionist speech recognition of broadcast news.
Speech Comm., 37:27-45, 2002.
|
[RosGal01] | A.-V. Rosti and M. J. F. Gales.
Generalised linear Gaussian models.
Tech. Rpt. 420, CUED, UK, 2001.
|
[RusMoo85] | M. J. Russell and R. K. Moore.
Explicit modelling of state occupancy in Hidden Markov Models for
automatic speech recognition.
In Proc. IEEE-ICASSP, volume 1, pages 5-8, 1985.
|
[ShaDam01] | C. H. Shadle and R. I. Damper.
Prospects for articulatory synthesis: A position paper.
In Proc. 4th ITRW on Spch. Synth., Blair Atholl,
Scotland, volume 116, 2001.
[http://www.ssw4.org/].
|
[Wes96] | J. Westbury et al.
X-ray microbeam speech production database user's handbook.
Waisman Center, Univ. of Wisconsin, Madison, WI, Beta rev. 2
edition, 1996.
[http://www.medsch.wisc.edu/ubeam/].
|
[West00] | P. West.
Long-distance coarticulatory effects of British English /l/ and /r/:
an EMA, EPG and acoustic study.
In Proc. 5th Spch. Prod. Sem., Seeon, Germany, pages
105-108, 2000.
|