Introduction
Automatic speech recognition (ASR) is a key technology in our modern
information society, as, along with other spoken language technologies, it
enables human-computer interaction in the most natural and universal way,
while leaving the eyes and hands free to carry out secondary or complementary
functions.
Commercial research into ASR tends to concentrate on long-term development of
in-house systems, which necessarily favours incremental modifications and
improvements to the existing state of the art.
The proposed research, by contrast, takes the form of basic innovation and
deals initially with a simpler recognition task.
It employs a fundamentally different approach to modelling, based on
incorporating knowledge of speech dynamics into the recognizer architecture.
For the sake of mathematical tractability, traditional methods, such as the
hidden Markov model (HMM), have necessarily started from a naïve
representation of the speech production process:
that it is piecewise stationary.
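Concretely, in standard HMM notation (symbols introduced here purely for
illustration), a state sequence $q_1, \ldots, q_T$ generates the observations
$o_1, \ldots, o_T$ independently from fixed state-conditional densities:

    $p(o_1, \ldots, o_T \mid q_1, \ldots, q_T) = \prod_{t=1}^{T} b_{q_t}(o_t)$

Each output pdf $b_j(\cdot)$ depends neither on how long state $j$ has been
occupied nor on the neighbouring observations, so every state models a
statistically constant stretch of signal, while the articulators in fact move
continuously.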
The proposed research seeks to take advantage of the stochastic formalism while
extending the theory to create more realistic models of articulator movement,
derived from the study of direct articulatory measurements.
A revolution in ASR occurred in the 1970s when statistical methods based on
HMMs overtook rule-based ones.
In the 1980s and 1990s, these systems were expanded, refined and furnished
with additional training data, allowing incremental improvements that advanced
the technology from isolated-word, speaker-dependent dictation to noise-robust,
large-vocabulary continuous speech recognition.
Now we need systems that can adapt to different users and speaking styles for
spontaneous speech,
yet it is under these conditions that HMMs have revealed their limitations,
which is partly why interest in articulatory approaches was revived in the
1990s.
The rigour of the HMM formalism has sustained the last 30 years of development,
though HMMs continue to suffer from simplistic assumptions, such as a
time-invariant state probability density function (pdf), and fail to
capitalise on our knowledge of speech science in many ways.
Research has progressed with artificial neural networks [RobEtAl02],
but our physical understanding of the speech production process offers a
substantial opportunity for improving performance, provided the appropriate
parameters can be learnt.
Speech is part of our daily lives and impinges on many facets of scientific
endeavour, so understanding how it is produced affects researchers from many
different disciplines.
The literature on speech articulation has contributions from physiology and
acoustics,
as well as psychology, linguistics, engineering, mathematics and medical
imaging, defining the field as essentially multidisciplinary.
To relate these areas to the statistical modelling of speech gestures, the
literature is divided into six themes:
- the structure of speech,
- coarticulation,
- duration,
- articulatory models,
- the work of Deng, and
- acoustic modelling techniques.
Background
The debate continues as to whether auditory perception,
articulatory features or syllables govern the structure of
speech [ChoHal68, Pic80, Lin96, RicEtAl00, Ost00, GreEtAl03].
The expression of language through the human vocal apparatus is unquestionably
an interplay between words and articulation that is designed to allow
recognition of those words by a human listener.
Hence, the process of speech production may be viewed as the transformation
whereby the message is encoded as an utterance.
This project seeks to take advantage of this bottleneck in information
transmission by developing compact models of linguistically meaningful speech
gestures.
The variety of opinions is related to the practical difficulty of turning
phrases into sound waves:
the message is typically organized into words, sentences, and paragraphs (in
text) or
turns (in dialogue), but the utterance itself is structured by the fast
consonantal movements of intrinsic muscles, the slower vowel movements of the
extrinsic musculature (corresponding to syllables) and breath groups.
Each constraint plays its part according to the circumstances, while the final
realization as a speech utterance is coarticulated: a compromise among all
these competing factors, plus those affecting its reception by the listener.
Models of speech production that capture these factors will enable us to
analyse their effects quantitatively.
Once it is shown that trained models have learnt such behaviours
automatically from data, significant improvements in ASR accuracy are most
likely to follow.
Traditional recognition systems address the problem of coarticulation by
learning separate models of the phonemes for each phonetic context
(i.e., context-sensitive triphones),
which requires large amounts of training data.
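To illustrate the cost of this strategy, the following minimal Python sketch
(the function name, boundary label and phone symbols are ours, purely for
illustration) expands a monophone sequence into the widely used
'left-centre+right' triphone notation; with a phone set of size N, the
inventory grows from N models towards N^3, which is why so much training data
is needed:

    def to_triphones(phones, boundary="sil"):
        """Expand a monophone sequence into context-dependent triphone
        labels, 'left-centre+right', padding utterance edges with silence."""
        padded = [boundary] + list(phones) + [boundary]
        return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
                for i in range(1, len(padded) - 1)]

    # The word "cat" /k ae t/ requires three distinct context-dependent models:
    print(to_triphones(["k", "ae", "t"]))
    # -> ['sil-k+ae', 'k-ae+t', 'ae-t+sil']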
Meanwhile, knowledge that we already have is discarded: for instance,
that "the vowel and consonant gestures are largely independent" [Ohm67],
that a vowel's left context is more influential than its right context, and
vice versa for consonants [Pic80],
that pauses, phonetic context and stress affect segment
duration [Kla87, GreEtAl03],
and that effects can span several phonemes [West00].
Nevertheless, some attempts have been made to imitate the smoothness of
articulatory movements [RicBri99] and to model the correlations of
articulatory targets with those of adjacent phonemes [BlaYou00].
In terms of segment durations, research has tended to polarise between
linguistic studies of mean phone duration on small data sets [LisAbr64] and
phonetic studies of temporal cues [Hou61, LisAbr64] on the one hand, and the
design of suitable parametric distribution models for use in
ASR [RusMoo85, Bur96] on the other.
The latter tend to concentrate on how best to model the statistical properties
of context-independent phonemes (i.e., monophones),
rather than considering how best to incorporate the kinds of dependencies that
are typically observed.
Nevertheless, recent pilot studies indicate that improvements in performance
can be forthcoming [Jac03a].
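For concreteness: a standard HMM state with self-transition probability $a$
implies a geometric duration distribution, $P(d) = (1 - a)\,a^{d-1}$, whose
mode is always a single frame, whereas measured phone durations peak well away
from zero and are better captured by, for example, a gamma distribution, as in
explicit-duration variants [RusMoo85, Bur96]. A minimal Python sketch of such
a fit (synthetic durations and SciPy; every value is illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Stand-in for measured phone durations, in frames.
    durations = rng.gamma(shape=4.0, scale=2.0, size=1000)

    # Maximum-likelihood gamma fit, with the location pinned at zero.
    shape, loc, scale = stats.gamma.fit(durations, floc=0)

    # Geometric duration model with the same mean, as implied by an HMM
    # self-transition probability a = 1 - 1/mean; its mode is one frame.
    a = 1.0 - 1.0 / durations.mean()

    print(f"gamma fit: shape={shape:.2f}, scale={scale:.2f}, "
          f"mode {(shape - 1.0) * scale:.1f} frames")
    print(f"matched geometric: a={a:.3f}, mode 1 frame")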
There are good reasons other than ASR for developing articulatory
models [ShaDam01, Huc02],
as is evident from the varied attempts over the years:
for production [Mer73, Cok76, KabHon01], and
for speech synthesis [Dud40, ParCok92, KurEtAl99];
other studies have profited from advances in measurement techniques
[Wes96, KabHon01].
The work of Deng is notable, however, for his radical attempt to meld many
of these conflicting ideas into one holistic approach to ASR from speech
production, using Bayesian networks [DengMa00].
While his results are promising, they have not been independently verified,
and it is not clear from his analysis whether the complicated multi-tiered
recognition system he proposes has learnt attributes of actual speech dynamics
or merely captured statistical characteristics of the speech signal.
This is a question that equally hangs over other recent research that uses a
hidden dynamical model within the recognizer
[FraKing01, RosGal01].
Moreover, these attempts have not yet delivered the substantial improvement
that is anticipated, which calls for further investigation to analyse the
emergent behaviour of the dynamical models and to determine how it
corresponds to actual articulation.
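To make the class of model concrete, the following minimal Python sketch (one
deliberately simple linear formulation, broadly in the spirit of the hidden
dynamic model of [RicBri99]; all names and parameter values are illustrative)
generates a hidden 'articulatory' state that relaxes towards each segment's
target, undershooting when segments are short, much as coarticulation does:

    import numpy as np

    def hidden_dynamic_trajectory(targets, durations, rho=0.7,
                                  noise=0.02, seed=0):
        """First-order linear dynamic, x[t+1] = rho*x[t] + (1-rho)*target,
        plus process noise, so the state approaches each segment's target
        smoothly and carries its history across segment boundaries."""
        rng = np.random.default_rng(seed)
        x, trajectory = 0.0, []
        for target, dur in zip(targets, durations):
            for _ in range(dur):
                x = rho * x + (1.0 - rho) * target + rng.normal(0.0, noise)
                trajectory.append(x)
        return np.array(trajectory)

    # Three segments; the short middle segment never reaches its target.
    print(np.round(hidden_dynamic_trajectory([1.0, -0.5, 0.8], [8, 4, 8]), 2))

In a full recognizer, the acoustics would be generated from this hidden state
through a further (possibly nonlinear) observation mapping; the open question
above is whether the state learnt from data behaves like real articulators.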
Relevance
From the perspective of speech science, this project will develop segmental
models so as to capture certain known characteristics of fluent
speech, with parameters estimated in a quantitative and statistical way (i.e.,
by maximum likelihood).
Such characteristics include: accurately modelling the way that phones vary in
duration, and the way that the distributions of phone durations vary with
their context; finding the articulatory parameters, and the form of their
trajectories, that best represent meaningful speech acts; and describing the
behaviour of both redundant and critical articulators.
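In the simplest case (notation ours, for illustration only), the
context-dependent duration models above would be fitted by maximising the
log-likelihood of the observed segment durations $d_i$ given their contexts
$c_i$ over the model parameters $\theta$:

    $\hat{\theta} = \arg\max_{\theta} \sum_{i} \log p(d_i \mid c_i;\, \theta)$

with analogous criteria for the trajectory and articulator models.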
Hence, we plan to make significant advances in charting out and
quantifying those aspects of the timing and posture of articulators during
speech production that are important for recognition.
We expect there to be strong synergy between the advances in knowledge and
understanding of speech dynamics and the ability of consistent models to
recognize correctly.
For example, phone recognition accuracy improvements from better state
alignment of linear mappings could reveal a new coarticulatory phenomenon,
while understanding the behaviour of redundant articulators could lead to a
more comprehensive modelling strategy with commensurate benefits in performance.
Thus, in addition to the fundamental scientific and technological
progress, there would be potential benefits across a whole gamut of
application areas.
The use of articulatory constraints based on human speech production is most
likely to deliver improvements for spontaneous speech, where the style is
casual and continuous,
and in noisy environments: two of the most demanding speech recognition tasks
today, yet crucial to the technology's wider deployment in society.
The modelling paradigm could also aid adaptation across a number of
influential human factors, e.g., vocal-tract length, accent, and speaking rate
and style.
Equally, improvements in modelling accuracy would help to capture voice
characteristics for biometric tasks, like speaker recognition and
authentication.
There would be indirect benefits for speech synthesis that stem from the novel
ability to extract the essence of articulatory dynamics from a speech
database, encapsulated in the models.
Furthermore, these models would provide parameterisation for a model-based
synthesiser that could be readily integrated with a talking head, for
instance, since information concerning tongue, lip and jaw movements would be
present inherently.
Speaking agents have obvious applications in gaming, education,
foreign-language training and speech therapy.
Finally, it is conceivable that the enhanced representational power of dynamic
articulatory models learnt through generic statistical methods could offer
very low bit rate transmission of speech for extremely efficient speech coding
or, indeed, of other forms of gestural communication where there are
accompanying opportunities for audio-visual data fusion (e.g., for ASR in
noise) and multimodal integration.
Notwithstanding the range and extent of the prospective technological
beneficiaries, the primary ambition and driving force behind the project
remains the enhancement of speech recognition performance through better
modelling of the articulatory dynamics of speech production.
References
[BlaYou00] | C. S. Blackburn and S. J. Young.
A self-learning predictive model of articulator movements during
speech production.
J. Acoust. Soc. Am., 107(3):1659-1670, 2000.
|
[Bur96] | D. Burshtein.
Robust parametric modeling of durations in hidden Markov models.
IEEE Trans. SAP, 4(3):240-242, 1996.
|
[ChoHal68] | N. Chomsky and M. Halle.
The Sound Pattern of English.
Harper and Row, New York, NY, 1968.
|
[Cok76] | C. H. Coker.
A model of articulatory dynamics and control.
Proc. IEEE, 64(4):452-460, 1976.
|
[DengMa00] | L. Deng and J. Ma.
Spontaneous speech recognition using a statistical coarticulatory
model for the vocal-tract-resonance dynamics.
J. Acoust. Soc. Am., 108(6):3036-3048, 2000.
|
[Dud40] | H. Dudley.
The carrier nature of speech.
Bell Syst. Tech. J., 19:495-513, 1940.
|
[FraKing01] | J. Frankel and S. King.
Mixture density networks, human articulatory data and
acoustic-to-articulatory inversion of continuous speech.
Proc. Inst. of Acoust., Stratford-upon-Avon, UK,
23(3):37-46, 2001.
|
[GreEtAl03] | S. Greenberg, H. M. Carvey, L. Hitchcock, and S. Chang.
Temporal properties of spontaneous speech -- a syllable-centric
perspective.
J. Phon., in review, 2003.
|
[Hou61] | A. S. House.
On vowel duration in English.
J. Acoust. Soc. Am., 33(9):1174-1178, 1961.
|
[Huc02] | M. A. Huckvale.
Speech synthesis, speech simulation and speech science.
In Proc. Int. Conf. on Spoken Lang. Proc., Denver, CO,
pages 1261-1264, 2002.
|
[Jac03a] | P. J. B. Jackson.
Improvements in phone-classification accuracy from modelling
duration.
In Proc. Int. Cong. of Phon. Sci., Barcelona, pages
1349-1352, 2003.
|
[KabHon01] | T. Kaburagi and M. Honda.
Dynamic articulatory model based on multidimensional
invariant-feature task representation.
J. Acoust. Soc. Am., 110(1):441-452, 2001.
|
[Kla87] | D. Klatt.
Review of text-to-speech conversion for English.
J. Acoust. Soc. Am., 82(3):737-793, 1987.
|
[KurEtAl99] | T. Kuratate et al.
Audio-visual synthesis of talking faces from speech production
correlates.
In Proc. Eurospeech '99, Budapest, volume 3, pages
1279-1282, 1999.
|
[Lin96] | B. Lindblom.
Role of articulation in speech perception.
J. Acoust. Soc. Am., 99(3):1683-1692, 1996.
|
[LisAbr64] | L. Lisker and A. S. Abramson.
A cross-language study of voicing in initial stops: acoustical
measurements.
Acoustic Characteristics of Speech, reprinted from Word,
20(3):527-565, 1964.
|
[Mer73] | P. Mermelstein.
Articulatory model for the study of speech production.
J. Acoust. Soc. Am., 53(4):1070-1082, 1973.
|
[Ost00] | M. Ostendorf.
Moving beyond the 'beads-on-a-string' models of speech.
Proc. IEEE ASRU, 2000.
|
[ParCok92] | S. Parthasarathy and C. H. Coker.
On automatic estimation of articulatory parameters in a
text-to-speech system.
Comp. Speech & Lang., 6:37-75, 1992.
|
[Pic80] | J. M. Pickett.
The Sounds of Speech Communication.
Univ. Pk. Press, Baltimore, MD, USA, 1980.
|
[RicBri99] | H. B. Richards and J. S. Bridle.
The HDM: a segmental Hidden Dynamic Model of coarticulation.
In Proc. IEEE-ICASSP, Phoenix, AZ, pages 357-360, 1999.
|
[RicEtAl00] | M. Richardson, J. Bilmes, and C. Diorio.
Hidden-articulator Markov models for speech recognition.
In Proc. ISCA ITRW ASR2000, Paris, pages 133-139, 2000.
|
[RobEtAl02] | A. J. Robinson et al.
Connectionist speech recognition of broadcast news.
Speech Comm., 37:27-45, 2002.
|
[RosGal01] | A.-V. Rosti and M. J. F. Gales.
Generalised linear Gaussian models.
Tech. Rpt. 420, CUED, UK, 2001.
|
[RusMoo85] | M. J. Russell and R. K. Moore.
Explicit modelling of state occupancy in Hidden Markov Models for
automatic speech recognition.
In Proc. IEEE-ICASSP, volume 1, pages 5-8, 1985.
|
[ShaDam01] | C. H. Shadle and R. I. Damper.
Prospects for articulatory synthesis: A position paper.
In Proc. 4th ITRW on Spch. Synth., Blair Atholl,
Scotland, volume 116, 2001.
[http://www.ssw4.org/].
|
[Wes96] | J. Westbury et al.
X-ray microbeam speech production database user's handbook.
Waisman Center, Univ. of Wisconsin, Madison, WI, Beta rev. 2
edition, 1996.
[http://www.medsch.wisc.edu/ubeam/].
|
[West00] | P. West.
Long-distance coarticulatory effects of British English /l/ and /r/:
an EMA, EPG and acoustic study.
In Proc. 5th Spch. Prod. Sem., Seeon, Germany, pages
105-108, 2000.
|