Abstracts of my publications
Refereed Conference Proceedings
Singampalli, V.D. and Jackson, P.J.B. (2007).
Statistical identification of critical, dependent and redundant articulators.
In Proceedings of Interspeech 2007,
4pp., Antwerp, Belgium.
[ abstract | pdf | slides ]
Turkmani, A., Hilton, A., Jackson, P.J.B. and Edge, J. (2007).
Visual analysis of lip coarticulation in VCV utterances.
In Proceedings of Interspeech 2007,
4pp., Antwerp, Belgium.
[ abstract | pdf ]
Jackson, P.J.B. (2007).
Time-frequency-modulation representation of stochastic signals.
In Proceedings of IEEE DSP 2007,
4pp., Cardiff, UK.
[ abstract | pdf | slides ]
Every, M. and Jackson, P.J.B. (2006).
Enhancement of harmonic content of speech based on a dynamic
programming pitch tracking algorithm.
In Proceedings of Interspeech 2006,
4pp., Pittsburgh PA.
Abstract:
For pitch tracking of a single speaker, a common requirement
is to find the optimal path through a set of voiced or voiceless
pitch estimates over a sequence of time frames.
Dynamic programming (DP) algorithms have been applied before to this problem.
Here, the pitch candidates are provided by a multi-channel
autocorrelation-based estimator, and DP is extended to pitch tracking
of multiple concurrent speakers.
We use the resulting pitch information to enhance harmonic content in noisy speech and to obtain separations of target from interfering speech.
Index Terms: speech enhancement, dynamic programming
[ abstract | pdf ]
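The optimal-path search over per-frame pitch candidates can be sketched as a standard dynamic program with a frame-to-frame continuity penalty. This is an illustrative simplification, not the paper's multi-channel estimator: the log-frequency transition cost and the `jump_cost` weighting are assumptions for the sketch.

```python
import numpy as np

def dp_pitch_track(candidates, scores, jump_cost=0.1):
    """Pick one pitch candidate per frame by dynamic programming,
    trading per-frame candidate strength against pitch continuity
    (octave jumps are penalised via the log-frequency distance).

    candidates: list of arrays of candidate F0s (Hz), one array per frame
    scores:     list of arrays of candidate strengths (higher = better)
    """
    n = len(candidates)
    # cost[t][j]: best cumulative cost ending at candidate j of frame t
    # (negated scores, so lower cost = better path)
    cost = [-np.asarray(scores[0], float)]
    back = []
    for t in range(1, n):
        prev_f = np.asarray(candidates[t - 1], float)
        cur_f = np.asarray(candidates[t], float)
        # transition cost: absolute log2-F0 difference, i.e. octaves moved
        trans = jump_cost * np.abs(np.log2(cur_f[None, :] / prev_f[:, None]))
        total = cost[-1][:, None] + trans - np.asarray(scores[t], float)[None, :]
        back.append(np.argmin(total, axis=0))   # best predecessor per candidate
        cost.append(np.min(total, axis=0))
    # backtrack the optimal path from the cheapest final candidate
    j = int(np.argmin(cost[-1]))
    path = [j]
    for t in range(n - 2, -1, -1):
        j = int(back[t][j])
        path.append(j)
    path.reverse()
    return [candidates[t][path[t]] for t in range(n)]
```

With a sufficient continuity weight, the track stays near a consistent F0 even when an octave-error candidate momentarily scores higher in one frame.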
Pincas, J. and Jackson, P.J.B. (2005b).
Amplitude modulation of frication noise by voicing saturates.
In Proceedings of Interspeech 2005,
4pp., Lisbon.
Abstract:
The two distinct sound sources comprising voiced frication, voicing and
frication, interact.
One effect is that the periodic source at the glottis modulates
the amplitude of the frication source originating in the vocal tract above the
constriction.
Voicing strength and modulation depth for frication noise were measured for sustained English voiced fricatives using high-pass filtering, spectral analysis in the
modulation (envelope) domain, and a variable pitch compensation procedure.
Results show a positive relationship between strength of the glottal source
and modulation depth at
voicing strengths below 66 dB SPL, at which point the modulation
index was approximately 0.5 and saturation occurred.
The alveolar [z] was found to be more modulated than other fricatives.
[ abstract | pdf | poster ]
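The modulation-depth measurement can be illustrated by a simplified envelope-domain analysis: high-pass filter to isolate the frication noise, rectify to obtain the envelope, then read off the envelope-spectrum component at the voicing frequency relative to the envelope mean. This is a sketch of the idea only; the paper's procedure also includes variable pitch compensation, and the cutoff value here is an assumption.

```python
import numpy as np

def modulation_index(x, fs, f0, hp_cutoff=2000.0):
    """Estimate the depth of voicing-rate amplitude modulation in
    frication noise: high-pass to isolate the noise band, rectify for
    the envelope, and compare the envelope's f0 component to its mean."""
    n = len(x)
    # crude FFT high-pass: zero all components below hp_cutoff
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    X[freqs < hp_cutoff] = 0.0
    noise = np.fft.irfft(X, n)
    # envelope by full-wave rectification
    env = np.abs(noise)
    # envelope-spectrum component at the voicing rate f0
    t = np.arange(n) / fs
    comp = np.abs(np.sum(env * np.exp(-2j * np.pi * f0 * t))) * 2.0 / n
    return comp / np.mean(env)
```

For noise whose amplitude follows 1 + m*cos(2*pi*f0*t), the returned value approximates the modulation index m.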
Dewhirst, M., Zielinski, S., Jackson, P.J.B. and Rumsey, F. (2005).
Objective assessment of spatial localisation attributes of surround-sound reproduction systems.
In Proceedings of 118th Convention of the Audio Engineering Society,
AES 2005,
16pp., Barcelona, Spain.
Abstract:
A mathematical model for objective assessment of perceived spatial quality was
developed for comparison across the listening area of various sound reproduction
systems: mono, two-channel stereo (TCS), 3/2 stereo (i.e., 5.0 surround sound),
Wave Field Synthesis (WFS) and Higher Order Ambisonics (HOA).
Models for mono, TCS and 3/2 stereo are based on conventional microphone
techniques and loudspeaker configurations for each system.
WFS and HOA models use circular arrays of thirty-two loudspeakers driven by
signals derived from a virtual microphone array and the Fourier-Bessel spatial
decomposition of the soundfield respectively.
Directional localisation, ensemble width and ensemble envelopment of
monochromatic tones,
extracted from binaural signals, are analysed under a range of test conditions.
[ abstract | pdf ]
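The paper's objective model is considerably more elaborate, but a basic directional-localisation cue extracted from binaural signals is the interaural time difference (ITD), which can be estimated by cross-correlation. The following is a generic illustration under stated assumptions (a plausible-lag window of about one millisecond), not the paper's method.

```python
import numpy as np

def itd_from_binaural(left, right, fs, max_itd=0.001):
    """Estimate the interaural time difference of a binaural pair by
    cross-correlation over a physiologically plausible lag range.
    Positive ITD means the left-ear signal leads (source to the left)."""
    n = len(right)
    c = np.correlate(left, right, mode='full')
    lags = np.arange(-(n - 1), len(left))        # lag applied to `left`
    keep = np.abs(lags) <= int(round(max_itd * fs))
    k = lags[keep][np.argmax(c[keep])]           # best lag within the window
    return -k / fs
```

Mapping ITD (together with interaural level differences) to an azimuth estimate is then the basis of many objective localisation models.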
Ypsilos, I.A., Hilton, A., Turkmani, A. and Jackson, P.J.B. (2004).
Speech-driven face synthesis from 3D video.
In IEEE Proceedings of
the 2nd International Symposium on 3D Data Processing, Visualization and
Transmission (3DPVT'04),
pp. 58-65, Thessaloniki, Greece.
Abstract:
This paper presents a framework for speech-driven synthesis of real faces
from a corpus of 3D video of a person speaking.
Video-rate capture of dynamic 3D face shape and colour appearance
provides the basis for a visual speech synthesis model.
A displacement map representation combines face shape and colour into
a 3D video.
This representation is used to efficiently register and integrate shape and
colour information captured from multiple views.
To allow visual speech synthesis, viseme primitives are identified from the
corpus using automatic speech recognition.
A novel non-rigid alignment algorithm is introduced to estimate dense
correspondence between 3D face shape and appearance for different
visemes.
The registered displacement map representation, together with a novel
optical flow optimisation using both shape and colour, enables accurate
and efficient non-rigid alignment.
Face synthesis from speech is performed by concatenation of the
corresponding viseme sequence using the non-rigid correspondence
to reproduce both 3D face shape and colour appearance.
Concatenative synthesis reproduces both viseme timing and
co-articulation.
Face capture and synthesis have been performed for a database of 51
people.
Results demonstrate synthesis of 3D visual speech animation with a
quality comparable to the captured video of a person.
[ abstract | pdf ]
Pincas, J. and Jackson, P.J.B. (2004).
Acoustic correlates of voicing-frication interaction in fricatives.
In Proceedings of From Sound to Sense,
J Slifka, S Manuel and M Matthies (eds.),
pp. C73-C78, Cambridge MA.
Abstract:
This paper investigates the acoustic effects of source interaction in fricative
speech sounds.
A range of parameters has been employed, including a measure designed
specifically to describe quantitatively the amplitude modulation of frication
noise by voicing, a phenomenon which has mainly been qualitatively
reported.
The signal processing technique to extract this measure is presented.
Results suggest that fricative duration is the main determinant of how much
the sources overlap at the VF boundary of voiceless fricatives and that the
amount of modulation occurring in voiced fricatives is chiefly dependent on
voicing strength.
Furthermore, it appears that individual speakers have differing tendencies
for amount of source-source overlap and degree of modulation where
overlap does occur.
[ abstract | pdf | poster ]
Jackson, P.J.B., Moreno, D.M., Russell, M.J. and Hernando, J. (2003).
Covariation and weighting of harmonically decomposed streams for ASR.
In Proceedings of Eurospeech 2003,
pp. 2321-2324, Geneva.
Abstract:
Decomposition of speech signals into simultaneous streams of periodic and
aperiodic information has been successfully applied to speech analysis,
enhancement, modification and recently recognition.
This paper examines the effect of different weightings of the two streams
in a conventional HMM system in digit recognition tests on the Aurora 2.0
database.
Comparison of the results from using matched weights during training showed a
small improvement of approximately 10% relative to unmatched ones,
under clean test conditions.
Principal component analysis of the covariation amongst the periodic and
aperiodic features indicated that only 45 (51) of the 78 coefficients were
required to account for 99% of the variance, for clean (multi-condition)
training, which yielded an 18.4% (10.3%) absolute increase in accuracy with
respect to the baseline.
These findings provide further evidence of the potential for
harmonically-decomposed streams to improve performance and
substantially to enhance recognition accuracy in noise.
Session:
OWeDc, Speech Modeling & Features 2 (oral).
[ abstract | pdf | slides ]
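The principal component analysis used to find how many coefficients account for 99% of the variance can be sketched as an eigenanalysis of the feature covariance matrix, counting components until the cumulative eigenvalue sum reaches the target fraction:

```python
import numpy as np

def n_components_for_variance(features, target=0.99):
    """Number of principal components needed to retain `target` of the
    total variance of the feature vectors (rows = frames, cols = dims)."""
    centred = features - features.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]          # descending order
    cum = np.cumsum(eigvals) / np.sum(eigvals)       # cumulative variance
    return int(np.searchsorted(cum, target) + 1)     # first index reaching target
```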
Russell, M.J. and Jackson, P.J.B. (2003).
The effect of an intermediate articulatory layer on the performance of
a segmental HMM.
In Proceedings of Eurospeech 2003,
pp. 2737-2740, Geneva.
Abstract:
We present a novel multi-level HMM in which an intermediate 'articulatory'
representation is included between the state and surface-acoustic levels.
A potential difficulty with such a model is that advantages gained by the
introduction of an articulatory layer might be compromised by limitations
due to an insufficiently rich articulatory representation, or by
compromises made for mathematical or computational expediency. This paper
describes a simple model in which speech dynamics are modelled as linear
trajectories in a formant-based 'articulatory' layer, and the
articulatory-to-acoustic mappings are linear. Phone classification
results for TIMIT are presented for monophone and triphone systems with a
phone-level syntax. The results demonstrate that, provided the
intermediate representation is sufficiently rich, or a sufficiently large
number of phone-class-dependent articulatory-to-acoustic mappings are
employed, classification performance is not compromised.
Session:
PThBf, Robust Speech Recognition 3 (poster).
[ abstract | pdf ]
Jackson, P.J.B. (2003).
Improvements in phone-classification accuracy from modelling duration.
In Proceedings of the 15th International Congress of Phonetic
Sciences, ICPhS 2003,
pp. 1349-1352, Barcelona.
Abstract:
Durations of real speech segments do not generally exhibit exponential
distributions, as modelled implicitly by the state transitions of Markov
processes. Several duration models were considered for integration within a
segmental-HMM recognizer: uniform, exponential, Poisson, normal, gamma and
discrete. The gamma distribution fitted that measured for silence best, by an
order of magnitude. Evaluations determined an appropriate weighting for duration
against the acoustic models. Tests showed a reduction of 2% absolute (6+%
relative) in the phone-classification error rate with gamma and discrete models;
exponential ones gave approximately 1% absolute reduction, and uniform no
significant improvement. These gains in performance recommend the wider
application of explicit duration models.
[http://www.ee.surrey.ac.uk/Personal/P.Jackson/Balthasar/]
Session:
T.3.P2, Automatic speech recognition / Auditory mechanisms (poster).
[ abstract | pdf | poster ]
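Fitting a gamma duration model to measured segment durations and comparing it against the implicit exponential model can be sketched as follows. This uses a method-of-moments fit for illustration; the paper's fitting and evaluation details may differ.

```python
import math
import numpy as np

def fit_gamma_moments(durations):
    """Method-of-moments fit of a gamma distribution to segment durations:
    shape k = mean^2 / var, scale theta = var / mean."""
    d = np.asarray(durations, float)
    mean, var = d.mean(), d.var()
    return mean ** 2 / var, var / mean

def gamma_loglik(durations, shape, scale):
    """Log-likelihood of the durations under Gamma(shape, scale)."""
    d = np.asarray(durations, float)
    return float(np.sum((shape - 1) * np.log(d) - d / scale)
                 - len(d) * (shape * math.log(scale) + math.lgamma(shape)))

def expon_loglik(durations):
    """Log-likelihood under the best-fit exponential distribution, the
    duration model implied by Markov state transitions, for comparison."""
    d = np.asarray(durations, float)
    return float(-len(d) * math.log(d.mean()) - np.sum(d) / d.mean())
```

For duration data that are unimodal with a non-zero mode, the gamma model typically yields a substantially higher log-likelihood than the exponential, mirroring the fit advantage reported above.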
Moreno, D.M., Jackson, P.J.B., Hernando, J. and Russell, M.J. (2003).
Improved ASR in noise using harmonic decomposition.
In Proceedings of the 15th International Congress of Phonetic
Sciences, ICPhS 2003,
pp. 751-754, Barcelona.
Abstract:
Application of the pitch-scaled harmonic filter (PSHF) to automatic speech
recognition in noise was investigated using the Aurora 2.0 database.
The PSHF decomposed the original speech into periodic and aperiodic streams.
Digit-recognition tests with the extended features compared the noise robustness
of various parameterisations against standard 39 MFCCs. Separately, each stream
reduced word accuracy by less than 1% absolute; together, the combined streams
gave substantial increases under noisy conditions. Applying PCA to concatenated
features proved better than to separate streams, and to static coefficients better
than after calculation of deltas. With multi-condition training, accuracy
improved by 7.8% at 5dB SNR, thus providing resilience from corruption by noise.
[http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/]
Session:
M.4.5, Automatic speech recognition I (oral).
[ abstract | pdf | ppt ]
Russell, M.J., Jackson, P.J.B. and Wong, M.L.P. (2003).
Development of articulatory-based multi-level segmental HMMs for phonetic
classification in ASR.
In Proceedings of EURASIP Conference on Video/Image Processing and
Multimedia Communications,
EC-VIP-MC 2003, Vol. 2, pp. 655-660, Zagreb, Croatia.
Abstract:
A simple multiple-level HMM is presented in which speech dynamics are modelled
as linear trajectories in an intermediate, formant-based representation and
the mapping between the intermediate and acoustic data is achieved using one
or more linear transformations. An upper-bound on the performance of such a
system is established. Experimental results on the TIMIT corpus demonstrate
that, if the dimension of the intermediate space is sufficiently high or the
number of articulatory-to-acoustic mappings is sufficiently large, then this
upper-bound can be achieved.
Keywords:
Automatic speech recognition, Hidden Markov Models, segment models.
[ abstract | pdf ]
Jackson, P.J.B. and Russell, M.J. (2002).
Models of speech dynamics in a segmental-HMM recognizer
using intermediate linear representations.
In Proceedings of the International Conference on Spoken Language
Processing, ICSLP 2002,
pp. 1253-1256, Denver CO.
Abstract:
A theoretical and experimental analysis of a simple multi-level segmental HMM
is presented in which the relationship between symbolic (phonetic) and surface
(acoustic) representations of speech is regulated by an intermediate
(articulatory) layer, where speech dynamics are modeled using linear
trajectories.
Three formant-based parameterizations and measured articulatory
positions are considered as intermediate representations, from the TIMIT and
MOCHA corpora respectively.
The articulatory-to-acoustic mapping was performed by between 1 and 49 linear
transformations.
Results of phone-classification experiments demonstrate that, by appropriate
choice of intermediate parameterization and mappings, it is possible to
achieve close to optimal performance.
Session:
Acoustic modelling
[ abstract | pdf | ppt ]
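The linear-trajectory modelling of speech dynamics within a segment can be illustrated by a least-squares line fitted through a segment's feature frames. This sketches the trajectory layer only, with the intermediate-to-acoustic mapping omitted; with a centred time axis the two parameters decouple and have closed-form solutions.

```python
import numpy as np

def linear_trajectory_fit(Y):
    """Least-squares linear trajectory through a segment's feature frames:
    Y[t] ~ a + b * t, with t centred on the segment midpoint.
    Returns (midpoint a, slope b, mean squared residual)."""
    T = Y.shape[0]
    t = np.arange(T) - (T - 1) / 2.0             # centred time axis
    # centred t makes [1, t] orthogonal, so a and b decouple:
    a = Y.mean(axis=0)                           # intercept = frame mean
    b = (Y * t[:, None]).sum(axis=0) / (t ** 2).sum()
    resid = Y - (a[None, :] + t[:, None] * b[None, :])
    return a, b, float((resid ** 2).mean())
```

The mean squared residual then serves as a per-segment goodness-of-fit score for the linear dynamics assumption.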
Jackson, P.J.B. (2001).
Acoustic cues of voiced and voiceless plosives
for determining place of articulation.
In Proceedings of Workshop on
Consistent and Reliable Acoustic Cues
for sound analysis, CRAC 2001, pp. 19-22, Aalborg, Denmark.
Abstract:
Speech signals from stop consonants with trailing vowels were
analysed for cues consistent with their place of articulation.
They were decomposed into periodic and aperiodic components
by the pitch-scaled harmonic filter to improve the quality of
the formant tracks, to which exponential trajectories were fitted
to get robust formant loci at voice onset.
Ensemble-average power spectra of the bursts exhibited dependence on place
(and on vowel context for velar consonants), but not on voicing.
By extrapolating the trajectories back to the release time, formant
estimates were compared with spectral peaks, and connexions
were made between these disparate acoustic cues.
Keywords:
acoustic cues, plosive, stop consonants.
[ abstract | pdf ]
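Fitting exponential trajectories to formant tracks and extrapolating back to the release to obtain loci can be sketched as a grid search over the time constant combined with linear least squares for the remaining parameters. This is an illustrative reconstruction, not the paper's exact procedure; the time-constant range (5-50 ms) is an assumption.

```python
import numpy as np

def fit_formant_locus(t, f, taus=np.linspace(0.005, 0.05, 46)):
    """Fit F(t) = target + (locus - target) * exp(-t / tau) to a formant
    track (t in seconds, t = 0 at the release/onset), and report the
    extrapolated value at t = 0 as the formant-locus estimate.
    Grid search over tau; linear least squares for locus and target."""
    best = None
    for tau in taus:
        # for fixed tau the model is linear in (target, delta)
        basis = np.column_stack([np.ones_like(t), np.exp(-t / tau)])
        coef, _res, *_ = np.linalg.lstsq(basis, f, rcond=None)
        err = np.sum((basis @ coef - f) ** 2)
        if best is None or err < best[0]:
            best = (err, coef, tau)
    _, (target, delta), tau = best
    return target + delta, target, tau   # locus is the fitted value at t = 0
```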
Jackson, P.J.B. and Shadle, C.H. (2001).
Uses of the pitch-scaled harmonic
filter in speech processing.
In Proceedings of the Institute of Acoustics, Workshop on Innovation in
Speech Processing 2001, Vol. 23 (3),
pp. 309-321, Stratford-upon-Avon, UK.
Abstract:
The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech
signals into their periodic and aperiodic constituents, during periods of
phonation.
In this paper, the use of the PSHF for speech analysis and processing tasks is
described.
The periodic component can be used as an estimate of the part attributable
to voicing,
and the aperiodic component can act as an estimate of that
attributable to turbulence noise, i.e., from fricative, aspiration and plosive
sources.
Here we present the algorithm for separating the periodic and aperiodic
components from the pitch-scaled Fourier transform of a short section of
speech, and show how to derive signals suitable for time-series analysis and
for spectral analysis.
These components can then be processed in a manner appropriate to their
source type, for instance, extracting zeros as well as poles from the
aperiodic spectral envelope.
A summary of tests on synthetic speech-like signals demonstrates the
robustness of the PSHF's performance to perturbations from additive noise,
jitter and shimmer.
Examples are given of speech analysed in various ways:
power spectrum, short-time power and short-time harmonics-to-noise ratio,
linear prediction and mel-frequency cepstral coefficients.
Besides being valuable for speech production and perception studies, the
latter two analyses show potential for incorporation into speech coding and
speech recognition systems.
Further uses of the PSHF are revealing normally-obscured acoustic
features, exploring interactions of turbulence-noise sources with
voicing, and pre-processing speech to enhance subsequent operations.
Keywords:
periodic/aperiodic decomposition, acoustic features.
[ abstract | pdf | ppt ]
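The bin-allocation idea at the heart of the PSHF can be sketched for a single frame: with an analysis frame of exactly b pitch periods, voicing energy falls on every b-th DFT bin, and those harmonic bins are assigned to the periodic component. This is a minimal sketch that omits the pitch optimisation, windowing and noise-power reallocation described in the paper.

```python
import numpy as np

def pshf_frame(x, b=4):
    """Split one pitch-scaled analysis frame into periodic and aperiodic
    estimates. The frame must span exactly b pitch periods, so that
    voicing energy lands on bins 0, b, 2b, ... of the DFT; those harmonic
    bins form the periodic component, the remainder the aperiodic one."""
    X = np.fft.fft(x)
    harmonic = np.zeros_like(X)
    harmonic[::b] = X[::b]                    # keep only the harmonic bins
    periodic = np.fft.ifft(harmonic).real
    aperiodic = np.fft.ifft(X - harmonic).real
    return periodic, aperiodic
```

By construction the two outputs sum back to the input frame, so both time series are available for separate analysis, as described above.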
Jackson, P.J.B. and Shadle, C.H. (2000).
Performance of the pitch-scaled harmonic
filter and applications in speech analysis.
In Proceedings of IEEE International Conference on Acoustics, Speech
and Signal Processing, Vol. 3, pp. 1311-1314,
Istanbul.
Abstract:
The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech
signals into their voiced and unvoiced constituents.
In this paper, we evaluate its ability to reconstruct the
time series of the two components accurately using a variety of synthetic,
speech-like signals, and discuss its performance.
These results determine the degree of confidence that can be expected
for real speech signals: typically, 5 dB improvement in the
signal-to-noise ratio of the harmonic component and approximately
5 dB more than the initial harmonics-to-noise ratio (HNR) in the anharmonic
component.
A selection of the analysis opportunities that the decomposition offers
is demonstrated on speech recordings, including dynamic HNR estimation
and separate linear prediction analyses of the two components.
These new capabilities provided by the PSHF can facilitate
discovering previously hidden features and investigating interactions of
unvoiced sources, such as frication, with voicing.
Session:
3.2 Speech analysis
Keywords:
harmonics-to-noise ratio, voiced/unvoiced
decomposition, frication, aspiration noise.
[ abstract | pdf ]
Jackson, P.J.B. and Shadle, C.H. (2000).
Aero-acoustic modelling of voiced and unvoiced
fricatives based on MRI data.
In Proceedings of the 5th Seminar on Speech Production,
pp. 185-188, Seeon, Germany.
Abstract:
We would like to develop a more realistic production model of
unvoiced speech sounds, namely fricatives, plosives and aspiration noise.
All three involve turbulence noise generation, with place-dependent
source characteristics that vary with time (rapidly, in plosives).
In this study, we aimed to produce, using an aero-acoustic model of the
vocal-tract filter and source, voiced as well as unvoiced fricatives
that provide a good match to analyses of speech recordings.
The vocal-tract transfer function (VTTF) was computed by the vocal-tract
acoustics program, VOAC [Davies, McGowan and Shadle. Vocal Fold
Physiology: Frontiers in Basic Science, ed. Titze, Singular Pub., CA, 93-142,
1993], using geometrical data, in the form of
cross-sectional area and hydraulic radius functions, along the length of the
tract.
VOAC incorporates the effects of net flow into the transmission of plane
waves through a tubular representation of the tract, and relaxes assumptions
of rigid walls and isentropic propagation.
The geometry functions were derived from multiple-slice, dynamic, magnetic
resonance images (MRI) [Mohammad. PhD thesis, Dept. ECS, U. Southampton, UK,
1999; Shadle, Mohammad, Carter, and Jackson. Proc. ICPhS, S.F. CA, 1:623-626,
1999], using a method of converting from the pixel
outlines that was improved over earlier efforts on vowels.
A coloured noise source signal was combined with the VTTF and radiation
characteristic to synthesize the unvoiced fricative [s].
For its voiced counterpart [z], many researchers have noted that the noise
source appears to be modulated by voicing.
Furthermore, the phase of the modulation has been shown to be perceptually
significant.
Based on our analysis [Jackson and Shadle. Proc. IEEE-ICASSP, Istanbul, 2000.]
of recordings by the same subject, the frication source of [z] was varied
periodically according to fluctuations in the flow velocity at the constriction
exit, and the modulation phase was governed by the convection time for the flow
perturbation to travel from the constriction to the obstacle.
The synthesized fricatives were compared to the speech recordings in a simple
listening test, and comparisons of the predicted and measured time series
suggested that the model, which brings together physical, aerodynamic and
acoustic information, can replicate characteristics of real speech, such as
the modulation in voiced fricatives.
(Please note the change of URL, Nov '02:
http://www.ee.surrey.ac.uk/Personal/P.Jackson/Nephthys/)
[ abstract | pdf ]
Shadle, C.H., Mohammad, M., Carter, J.N. and Jackson, P.J.B. (1999).
Dynamic Magnetic Resonance Imaging: new tools
for speech research.
In Proceedings of the 14th International Congress of Phonetic
Sciences, Vol. 1, pp. 623-626, San Francisco, CA.
Abstract:
A multiplanar Dynamic Magnetic Resonance Imaging (MRI) technique that extends
our earlier work on single-plane Dynamic MRI is described.
Scanned images acquired while an utterance is repeated are recombined to form
pseudo-time-varying images of the vocal tract using a simultaneously recorded
audio signal.
There is no technical limit on the utterance length or number of slices that
can be so imaged, though the number of repetitions required may be limited by
the subject's stamina.
An example of [pasi] imaged in three sagittal planes is shown; with a Signa GE
0.5T MR scanner, 360 tokens were reconstructed to form a sequence of 39
3-slice 16ms frames.
From these, a 3-D volume was generated for each time frame, and tract surfaces
outlined manually.
Parameters derived from these include: palate-tongue distances for [a,s,i];
estimates of tongue volume and of the area function using only the
midsagittal, and then all three slices.
These demonstrate the accuracy and usefulness of the technique.
[ abstract | pdf ]
Jackson, P.J.B. and Shadle, C.H. (1998).
Pitch-synchronous decomposition of mixed-source speech signals.
In Proceedings of the International Congress on Acoustics and Meeting
of the Acoustical Society of America, Vol. 1, pp. 263-264,
Seattle, WA.
Abstract:
As part of a study of turbulence-noise sources in speech production, a
method has been developed for decomposing an acoustic signal into harmonic
(voiced) and anharmonic (unvoiced) components, based on a hoarseness
metric (Muta et al., 1988, J. Acoust. Soc. Am. 84, pp.1292-1301). Their
pitch-synchronous harmonic filter (PSHF) has been extended (to EPSHF) to
yield time histories of both harmonic and anharmonic components. Our corpus
includes many examples of turbulence noise, including aspiration, voiced and
unvoiced fricatives, and a variety of voice qualities (e.g. breathy, whispered).
The EPSHF algorithm plausibly decomposed breathy vowels, but the harmonic
component of voiced fricatives still contained significant noise, similar in shape
to (though weaker than) the ensemble-averaged anharmonic spectrum. In
general the algorithm performed best on sustained sounds. Tracking errors at
rapid transitions, and due to jitter and shimmer, were spuriously attributed to
the anharmonic component. However, the extracted anharmonic component
clearly exhibited modulation in voiced fricatives. While such modulation has
been previously reported (and also in hoarse voice), it was verified by tests on
synthetic signals, where constant and modulated noise signals were extracted
successfully. The results suggest that the EPSHF will continue to enable
exploration of the interaction of phonation and turbulence noise.
[ abstract | pdf ]
Jackson, P.J.B. and Ross, C.F. (1996).
Application of active noise control to
corporate aircraft.
In Proceedings of the American Society of
Mechanical Engineers, Vol. DE93, pp. 19-25, Atlanta, GA.
Abstract:
Following the successful introduction of Active Noise Control (ANC) systems
as standard production fits on commuter aircraft (Saab2000, Saab340B and
Dash8Q series 100, 200 & 300), recent efforts have focused on developing
low-cost, low-weight systems for smaller corporate aircraft.
This paper describes the approach taken by Ultra to the new technical
challenges and the resulting improvements to the design methodology.
A review of system performance on corporate (King Air & Twin Commander)
turboprop aircraft shows repeatable global Tonal Noise Reductions (TNRs) of
>8 dBA throughout the whole cabin, achieving reductions >20 dB in some
locations at the blade-pass frequency (BPF), and major comfort benefits
throughout the flight envelope with a weight penalty of less than 20 kg.
[ abstract | preprint ]