Abstracts of my publications
Academic Journal Papers
Russell, M.J., Zheng, X. and Jackson, P.J.B. (2007).
Modelling speech signals using formant frequencies as an intermediate representation.
IET Signal Processing,
Vol. 1 (1), pp. 43-50.
Abstract:
Multiple-level segmental hidden Markov models (M-SHMMs) in which the relationship between symbolic and acoustic representations of speech is regulated by a formant-based intermediate representation are considered. New TIMIT phone recognition results are presented, confirming that the theoretical upper-bound on performance is achieved provided that either the intermediate representation or the formant-to-acoustic mapping is sufficiently rich. The way in which M-SHMMs exploit formant-based information is also investigated, using singular value decomposition of the formant-to-acoustic mappings and linear discriminant analysis. The analysis shows that if the intermediate layer contains information which is linearly related to the spectral representation, that information is used in preference to explicit formant frequencies, even though the latter are useful for phone discrimination. In summary, these results confirm the utility of M-SHMMs for automatic speech recognition, and they also provide empirical evidence of the value of nonlinear formant-to-acoustic mappings.
INSPEC codes: A4370; B6130E; C5260S; A4360; A0210; A0250; B0210; B0240J; C1110; C1140J
Pincas, J. and Jackson, P.J.B. (2006).
Amplitude modulation of turbulence noise by voicing in fricatives.
Journal of the Acoustical Society of America,
Vol. 120 (6), pp. 3966-3977.
Abstract:
The two principal sources of sound in speech, voicing and frication, occur simultaneously in voiced fricatives as well as at the vowel-fricative boundary in phonologically voiceless fricatives.
Instead of simply overlapping, the two sources interact.
This paper is an acoustic study of one such interaction effect: the amplitude modulation of the frication component when voicing is present.
Corpora of sustained and fluent-speech English fricatives were recorded and analyzed using a signal-processing technique designed to extract estimates of modulation depth.
Results reveal a pattern, consistent across speaking style, speakers and places of articulation, for modulation at f0 to rise at low voicing strengths and subsequently saturate.
The voicing strength needed to produce saturation varied between 60 and 66 dB across subjects and experimental conditions.
Modulation depths at saturation varied little across speakers, but significantly with place of articulation (with [z] showing particularly strong modulation), clustering at approximately 0.4-0.5 (a 40-50% fluctuation above and below the unmodulated amplitude). Spectral analysis of the modulating signals revealed weak but detectable modulation at the second and third harmonics (i.e., 2f0 and 3f0).
PACS numbers: 43.70.Bk, 43.72.Ar
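The modulation-depth measurement described in this abstract can be illustrated with a toy numpy sketch. This is not the authors' signal-processing technique: the synthetic envelope, the rectify-and-smooth envelope recovery, and all parameters are invented for illustration. Depth is defined as in the abstract, the fluctuation at f0 relative to the mean noise amplitude.

```python
import numpy as np

rng = np.random.default_rng(0)

fs, f0, m_true = 16000, 120.0, 0.4
t = np.arange(fs) / fs  # one second of synthetic "frication"

# Noise whose amplitude fluctuates m_true above/below its mean at f0,
# mimicking voiced frication (all parameters illustrative)
envelope = 1.0 + m_true * np.cos(2 * np.pi * f0 * t)
noise = envelope * rng.normal(size=t.size)

# Crude envelope recovery: full-wave rectify, then smooth with a
# moving average much shorter than one voicing period
period = int(fs / f0)
width = period // 4
env_est = np.convolve(np.abs(noise), np.ones(width) / width, mode="same")

# Modulation depth: amplitude of the f0 component of the recovered
# envelope, relative to its mean level
coeff = 2.0 * np.mean(env_est * np.exp(-2j * np.pi * f0 * t))
m_est = np.abs(coeff) / np.mean(env_est)
print(abs(m_est - m_true) < 0.1)  # recovers roughly the true depth
```

The smoothing window slightly attenuates the f0 component, so the estimate is biased a little low; a matched-filter or demodulation approach would reduce this.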
Russell, M.J. and Jackson, P.J.B. (2005).
A multiple-level linear/linear segmental HMM with a formant-based
intermediate layer.
Computer Speech and Language,
Vol. 19 (2), pp. 205-225.
Abstract:
A novel multi-level segmental HMM (MSHMM)
is presented in which the relationship between symbolic (phonetic) and
surface (acoustic) representations of speech is regulated by an
intermediate `articulatory' representation.
Speech dynamics are characterised as linear trajectories in the
articulatory space, which are transformed into the acoustic space using
an articulatory-to-acoustic mapping.
Recognition is then performed.
The results of phonetic classification experiments are presented for
monophone and triphone MSHMMs using three formant-based `articulatory'
parameterisations and sets of between 1 and 49 linear
articulatory-to-acoustic mappings.
The NIST Matched Pair Sentence Segment (Word Error) test shows that, for
a sufficiently rich combination of articulatory parameterisation and
mappings, differences between these results and those obtained with an
optimal classifier are not statistically significant. It is also shown
that, compared with a conventional HMM, superior performance can be
achieved using an MSHMM with 25% fewer parameters.
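A minimal numpy sketch of the central construction, a linear trajectory in a formant-like intermediate space pushed through a linear articulatory-to-acoustic mapping. The dimensions, midpoint, slope and random mapping below are invented for illustration, not the paper's trained models; the point is that a linear map sends a linear trajectory to a linear trajectory in the acoustic space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 3 formant-like parameters, 12 acoustic features
n_art, n_ac, n_frames = 3, 12, 20

# Segment model: linear trajectory in the intermediate space, defined by
# a midpoint and a slope per articulatory dimension
midpoint = np.array([500.0, 1500.0, 2500.0])  # formant-like values (Hz)
slope = np.array([10.0, -20.0, 5.0])          # Hz per frame
t = np.arange(n_frames) - (n_frames - 1) / 2.0
trajectory = midpoint + np.outer(t, slope)     # shape (n_frames, n_art)

# Linear articulatory-to-acoustic mapping: y = W x + b
W = rng.normal(size=(n_ac, n_art)) * 0.01
bias = rng.normal(size=n_ac)
acoustic = trajectory @ W.T + bias             # shape (n_frames, n_ac)

# Linearity is preserved: second differences of the acoustic frames
# vanish (up to floating-point error)
second_diff = np.diff(acoustic, n=2, axis=0)
print(np.max(np.abs(second_diff)) < 1e-8)
```

In the paper, several such mappings (between 1 and 49) are used, selected per phone class; here a single random mapping suffices to show the geometry.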
Jackson, P.J.B., Lo, B.-H. and Russell, M.J. (2002).
Data-driven, non-linear, formant-to-acoustic mapping for ASR.
IEE Electronics Letters, Vol. 38 (13),
pp. 667-669.
Abstract:
The underlying dynamics of speech can be captured in an
automatic speech recognition system via an articulatory
representation, which resides in a domain other than that of
the acoustic observations.
Thus, given a set of models in this hidden domain, it is
essential that a mapping can be obtained to relate the
intermediate representation to the acoustic domain.
In this paper, two methods for mapping from formants to
short-term spectra are compared: multi-layered perceptrons
(MLPs) and radial-basis function (RBF) networks.
Both are capable of providing non-linear transformations, and
were trained using features extracted from the TIMIT database.
Various schemes for dividing the frames of speech data
according to their phone class were also investigated.
Results showed that the RBF networks performed approximately
10% better than the MLPs in terms of the RMS error, and
that a classification based on discrete regions of the
articulatory space gave the greatest improvements over a
single network.
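A toy numpy version of an RBF mapping of the kind compared here: Gaussian basis activations over fixed centres, with a linear output layer fitted by least squares. The "formant" inputs, "spectral" targets, centres and widths are all invented for illustration, not the TIMIT-derived features or network sizes of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_design(X, centres, width):
    """Gaussian radial-basis activations for inputs X of shape (n, d)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

# Toy data: 2-D "formant" inputs mapped to 4-D "spectral" targets by a
# smooth nonlinear function
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 2))
freqs = np.arange(1, 5)
Y = np.sin(np.pi * X[:, :1] * freqs) + np.cos(np.pi * X[:, 1:2] * freqs)

# Fixed grid of centres; only the linear output weights are trained,
# which reduces fitting to a least-squares problem
g = np.linspace(0.1, 0.9, 6)
centres = np.array([[a, c] for a in g for c in g])
H = rbf_design(X, centres, width=0.25)
W, *_ = np.linalg.lstsq(H, Y, rcond=None)

rms = np.sqrt(np.mean((H @ W - Y) ** 2))
print(rms < 0.3)  # the smooth map is fitted well by 36 Gaussian bases
```

Training only the output layer is one reason RBF networks are attractive against MLPs, whose hidden weights require iterative optimisation.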
Jackson, P.J.B. and Shadle, C.H. (2001).
Pitch-scaled estimation of simultaneous voiced and turbulence-noise
components in speech.
IEEE Transactions on Speech and Audio Processing,
Vol. 9 (7),
pp. 713-726.
Abstract:
Almost all speech contains simultaneous contributions from more than
one acoustic source within the speaker's vocal tract.
In this paper we propose a method -
the pitch-scaled harmonic filter (PSHF) -
which aims to separate the voiced and turbulence-noise components of the
speech signal during phonation, based on a maximum likelihood approach.
The PSHF outputs periodic and aperiodic components that are estimates of the
respective contributions of the different types of acoustic source.
It produces four reconstructed time-series signals by decomposing the
original speech signal, first according to the amplitude, and then
according to the power, of the Fourier coefficients.
Thus, one pair of periodic and aperiodic signals is optimized for subsequent
time-series analysis, and another pair for spectral analysis.
The performance of the PSHF algorithm was tested on synthetic signals,
using three forms of disturbance (jitter, shimmer and additive
noise), and the results were used to predict the performance on real
speech.
Processing recorded speech examples elicited latent features from the
signals, demonstrating the PSHF's potential for analysis of mixed-source
speech.
EDICS number: 1-ANLS
Keywords:
Periodic-aperiodic decomposition,
speech modification,
speech pre-processing.
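The core of the pitch-scaled idea can be sketched as follows: over an analysis frame spanning an integer number b of pitch periods (here b = 4), the harmonics of f0 fall exactly on every b-th DFT bin, so those bins estimate the periodic component and the remaining bins the aperiodic one. This toy numpy version uses invented parameters and omits the paper's separate amplitude- and power-based reconstructions and its maximum-likelihood refinements.

```python
import numpy as np

rng = np.random.default_rng(0)

def pshf_frame(frame, b=4):
    """Split a frame spanning exactly b pitch periods into periodic and
    aperiodic estimates: harmonics of f0 fall on every b-th DFT bin."""
    spec = np.fft.rfft(frame)
    harmonic = np.zeros_like(spec)
    harmonic[::b] = spec[::b]            # bins 0, b, 2b, ... are harmonics
    periodic = np.fft.irfft(harmonic, len(frame))
    return periodic, frame - periodic

# Toy mixed-source frame: f0 = 100 Hz at fs = 8000 Hz -> 80-sample period
b, period = 4, 80
n = np.arange(b * period)
voiced = sum((0.8 ** k) * np.cos(2 * np.pi * k * n / period)
             for k in range(1, 6))
noise = 0.2 * rng.normal(size=n.size)
periodic, aperiodic = pshf_frame(voiced + noise, b)

# The periodic estimate is much closer to the true voiced part than the
# raw mixed frame is: only ~1/b of the noise lands on harmonic bins
print(np.mean((periodic - voiced) ** 2) < np.mean(noise ** 2))
```

In practice the frame length must track the local pitch period, which is why the filter is "pitch-scaled"; a fixed-length frame would smear the harmonics across bins.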
Jackson, P.J.B. and Shadle, C.H. (2000).
Frication noise modulated by voicing, as
revealed by pitch-scaled decomposition.
Journal of the Acoustical Society of America, Vol. 108 (4),
pp. 1421-1434.
Abstract:
A decomposition algorithm that uses a pitch-scaled harmonic filter
was evaluated using synthetic signals and applied to mixed-source speech,
spoken by three subjects, to separate the voiced and unvoiced parts.
Pulsing of the noise component was observed in voiced frication,
which was analyzed by complex demodulation of the signal envelope.
The timing of the pulsation, represented by the phase of the anharmonic
modulation coefficient, showed a step change during a vowel-fricative
transition corresponding to the change in location of the sound source
within the vocal tract.
Analysis of fricatives / , v, , z, , , /
demonstrated a relationship between steady-state phase and place, and
f0 glides confirmed that the main cause was a
place-dependent delay.
PACS numbers: 43.70.Bk, 43.72.Ar
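Complex demodulation of the noise envelope, as used in this paper to time the pulsation, can be sketched in numpy: multiply the envelope by exp(-j 2 pi f0 t) and average (the low-pass step), and the angle of the result gives the modulation phase. The step change between two segments below mimics the vowel-fricative transition described in the abstract; the phases, f0 and durations are invented for illustration.

```python
import numpy as np

def demod_phase(env, f0, fs):
    """Phase of the modulation at f0 via complex demodulation: multiply
    by a complex exponential at -f0 and average over the segment."""
    t = np.arange(len(env)) / fs
    c = np.mean(env * np.exp(-2j * np.pi * f0 * t))
    return np.angle(c)

fs, f0 = 16000, 100.0
n = fs // 2                      # half a second per segment
t = np.arange(n) / fs

# Noise envelopes pulsing at f0, with a step change in modulation phase
# between the "vowel" and "fricative" segments (values illustrative)
vowel = 1.0 + 0.3 * np.cos(2 * np.pi * f0 * t + 0.2)
fricative = 1.0 + 0.3 * np.cos(2 * np.pi * f0 * t + 1.7)

step = demod_phase(fricative, f0, fs) - demod_phase(vowel, f0, fs)
print(round(step, 2))  # ~1.5 rad phase step across the transition
```

On real speech the envelope must first be extracted from the aperiodic component, and the segment boundaries are not known in advance; this sketch only shows why a source-location change appears as a phase step.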