Philip Jackson    

Abstracts of my publications

The University of Surrey

 

Listing  

Journal papers  

Conferences  
    Interspeech 2007  
    DSP 2007  
    Interspeech 2006  
    Interspeech 2005  
    AES 2005  
    3DPVT 2004  
    FSTS 2004  
    Eurospeech 2003  
    ICPhS 2003  
    EC-VIP-MC 2003  
    ICSLP 2002  
    CRAC 2001  
    WISP 2001  
    ICASSP 2000  
    SPS5 2000  
    ICPhS 1999  
    ICA-ASA 1998  
    ASME 1996  

Book chapter  

Doctoral thesis  

FTP site  


Refereed Conference Proceedings

VD Singampalli, PJB Jackson (2007). "Statistical identification of critical, dependent and redundant articulators". In Proc. Interspeech 2007, 4 pp., Antwerp, Belgium. [ abstract | pdf | slides ]

A Turkmani, A Hilton, PJB Jackson, J Edge (2007). "Visual analysis of lip coarticulation in VCV utterances". In Proc. Interspeech 2007, 4 pp., Antwerp, Belgium. [ abstract | pdf ]

PJB Jackson (2007). "Time-frequency-modulation representation of stochastic signals". In Proc. IEEE DSP 2007, 4 pp., Cardiff, UK. [ abstract | pdf | slides ]


Every, M. and Jackson, P.J.B. (2006). Enhancement of harmonic content of speech based on a dynamic programming pitch tracking algorithm.
In Proceedings of Interspeech 2006, 4 pp., Pittsburgh PA.

Abstract:

For pitch tracking of a single speaker, a common requirement is to find the optimal path through a set of voiced or voiceless pitch estimates over a sequence of time frames. Dynamic programming (DP) algorithms have been applied before to this problem. Here, the pitch candidates are provided by a multi-channel autocorrelation-based estimator, and DP is extended to pitch tracking of multiple concurrent speakers. We use the resulting pitch information to enhance harmonic content in noisy speech and to obtain separations of target from interfering speech.

Index Terms: speech enhancement, dynamic programming
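As a rough illustration of the dynamic-programming step described in this abstract (not the paper's actual implementation, and not the multi-speaker extension), the sketch below finds the best path through per-frame pitch candidates, trading candidate salience against a penalty on log-frequency jumps. All candidate and salience values are invented example numbers.

```python
import math

# Hypothetical sketch: Viterbi-style DP over pitch candidates per frame.
def dp_pitch_track(candidates, saliences, jump_cost=1.0):
    """candidates[t]: list of pitch candidates (Hz) for frame t
       saliences[t]:  matching scores (higher = more likely)
       Returns the candidate sequence maximising score minus jump penalties."""
    T = len(candidates)
    score = [list(saliences[0])]          # best cumulative score per candidate
    back = [[0] * len(candidates[0])]     # backpointers for traceback
    for t in range(1, T):
        row, bp = [], []
        for i, f in enumerate(candidates[t]):
            # transition penalty grows with the log-frequency jump
            best, arg = max(
                (score[t - 1][j] - jump_cost * abs(math.log(f / g)), j)
                for j, g in enumerate(candidates[t - 1]))
            row.append(best + saliences[t][i])
            bp.append(arg)
        score.append(row)
        back.append(bp)
    # trace back from the best final candidate
    i = max(range(len(score[-1])), key=score[-1].__getitem__)
    path = [i]
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    path.reverse()
    return [candidates[t][path[t]] for t in range(T)]

# the smooth low path wins over the locally salient octave jump
track = dp_pitch_track([[100, 200], [105, 210], [100, 95]],
                       [[1.0, 0.9], [0.8, 1.0], [1.0, 0.2]])
```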
 

[ abstract | pdf ]


Pincas, J. and Jackson, P.J.B. (2005b). Amplitude modulation of frication noise by voicing saturates.
In Proceedings of Interspeech 2005, 4 pp., Lisbon.

Abstract:

The two distinct sound sources comprising voiced frication, voicing and frication, interact. One effect is that the periodic source at the glottis modulates the amplitude of the frication source originating in the vocal tract above the constriction. Voicing strength and modulation depth for frication noise were measured for sustained English voiced fricatives using high-pass filtering, spectral analysis in the modulation (envelope) domain, and a variable pitch compensation procedure. Results show a positive relationship between strength of the glottal source and modulation depth at voicing strengths below 66 dB SPL, at which point the modulation index was approximately 0.5 and saturation occurred. The alveolar [z] was found to be more modulated than other fricatives.
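A crude numerical illustration of the modulation-depth idea (not the authors' pipeline, which used high-pass filtering, modulation-domain spectral analysis and pitch compensation): fold a noise signal at the voicing period, average the rectified periods to recover the envelope, and read off a peak-to-trough modulation index. The signal here is synthetic amplitude-modulated white noise.

```python
import numpy as np

def modulation_index(noise, fs, f0):
    """Estimate the depth of periodic amplitude modulation of a noise
    signal, given the modulating (voicing) frequency f0 in Hz.
    Returns m = (E_max - E_min) / (E_max + E_min) of the averaged envelope."""
    period = int(round(fs / f0))
    n_cycles = len(noise) // period
    # fold successive pitch periods and average the rectified signal,
    # which suppresses the uncorrelated noise fluctuations
    folded = np.abs(noise[:n_cycles * period]).reshape(n_cycles, period)
    env = folded.mean(axis=0)
    return (env.max() - env.min()) / (env.max() + env.min())

# Example: white noise modulated at 100 Hz with an imposed index of 0.5
rng = np.random.default_rng(0)
fs, f0, m_true = 16000, 100.0, 0.5
t = np.arange(fs) / fs                      # one second of samples
carrier = rng.standard_normal(fs)
am_noise = (1 + m_true * np.sin(2 * np.pi * f0 * t)) * carrier
m_est = modulation_index(am_noise, fs, f0)  # roughly recovers m_true
```

The estimate is biased upward by residual noise in the averaged envelope, so it only roughly recovers the imposed depth; the paper's envelope-domain spectral measure is more robust.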

[ abstract | pdf | poster ]


Dewhirst, M., Zielinski, S., Jackson, P.J.B. and Rumsey F. (2005). Objective assessment of spatial localisation attributes of surround-sound reproduction systems.
In Proceedings of the 118th Convention of the Audio Engineering Society, AES 2005, 16 pp., Barcelona, Spain.

Abstract:

A mathematical model for objective assessment of perceived spatial quality was developed for comparison across the listening area of various sound reproduction systems: mono, two-channel stereo (TCS), 3/2 stereo (i.e., 5.0 surround sound), Wave Field Synthesis (WFS) and Higher Order Ambisonics (HOA). Models for mono, TCS and 3/2 stereo are based on conventional microphone techniques and loudspeaker configurations for each system. WFS and HOA models use circular arrays of thirty-two loudspeakers driven by signals derived from a virtual microphone array and the Fourier-Bessel spatial decomposition of the soundfield respectively. Directional localisation, ensemble width and ensemble envelopment of monochromatic tones, extracted from binaural signals, are analysed under a range of test conditions.

[ abstract | pdf ]


Ypsilos, I.A., Hilton, A., Turkmani, A. and Jackson, P.J.B. (2004). Speech-driven face synthesis from 3D video.
In IEEE Proceedings of the 2nd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT'04), pp. 58-65, Thessaloniki, Greece.

Abstract:

This paper presents a framework for speech-driven synthesis of real faces from a corpus of 3D video of a person speaking. Video-rate capture of dynamic 3D face shape and colour appearance provides the basis for a visual speech synthesis model. A displacement map representation combines face shape and colour into a 3D video. This representation is used to efficiently register and integrate shape and colour information captured from multiple views. To allow visual speech synthesis viseme primitives are identified from the corpus using automatic speech recognition. A novel non-rigid alignment algorithm is introduced to estimate dense correspondence between 3D face shape and appearance for different visemes. The registered displacement map representation together with a novel optical flow optimisation using both shape and colour, enables accurate and efficient non-rigid alignment. Face synthesis from speech is performed by concatenation of the corresponding viseme sequence using the non-rigid correspondence to reproduce both 3D face shape and colour appearance. Concatenative synthesis reproduces both viseme timing and co-articulation. Face capture and synthesis has been performed for a database of 51 people. Results demonstrate synthesis of 3D visual speech animation with a quality comparable to the captured video of a person.

[ abstract | pdf ]


Pincas, J. and Jackson, P.J.B. (2004). Acoustic correlates of voicing-frication interaction in fricatives.
In Proceedings of From Sound to Sense, J Slifka, S Manuel and M Matthies (eds.), pp. C73-C78, Cambridge MA.

Abstract:

This paper investigates the acoustic effects of source interaction in fricative speech sounds. A range of parameters has been employed, including a measure designed specifically to describe quantitatively the amplitude modulation of frication noise by voicing, a phenomenon which has mainly been qualitatively reported. The signal processing technique to extract this measure is presented. Results suggest that fricative duration is the main determinant of how much the sources overlap at the VF boundary of voiceless fricatives and that the amount of modulation occurring in voiced fricatives is chiefly dependent on voicing strength. Furthermore, it appears that individual speakers have differing tendencies for amount of source-source overlap and degree of modulation where overlap does occur.

[ abstract | pdf | poster ]


Jackson, P.J.B., Moreno, D.M., Russell, M.J. and Hernando, J. (2003). Covariation and weighting of harmonically decomposed streams for ASR.
In Proceedings of Eurospeech 2003, pp. 2321-2324, Geneva.

Abstract:

Decomposition of speech signals into simultaneous streams of periodic and aperiodic information has been successfully applied to speech analysis, enhancement, modification and recently recognition. This paper examines the effect of different weightings of the two streams in a conventional HMM system in digit recognition tests on the Aurora 2.0 database. Comparison of the results from using matched weights during training showed a small improvement of approximately 10% relative to unmatched ones, under clean test conditions. Principal component analysis of the covariation amongst the periodic and aperiodic features indicated that only 45 (51) of the 78 coefficients were required to account for 99% of the variance, for clean (multi-condition) training, which yielded an 18.4% (10.3%) absolute increase in accuracy with respect to the baseline. These findings provide further evidence of the potential for harmonically-decomposed streams to improve performance and substantially to enhance recognition accuracy in noise.

Session: OWeDc, Speech Modeling & Features 2 (oral).
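The variance-retention step in this abstract can be sketched as a standard PCA: keep the fewest principal components of the concatenated periodic-plus-aperiodic feature vectors that explain 99% of the variance. This is an illustrative sketch on random stand-in data, not the Aurora 2.0 features or the paper's exact procedure.

```python
import numpy as np

def pca_99(features, target=0.99):
    """features: (n_frames, n_dims) matrix of concatenated stream features.
       Returns (projected features, number of components kept)."""
    x = features - features.mean(axis=0)
    # eigendecomposition of the covariance of the combined streams
    cov = np.cov(x, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]              # descending variance
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(ratio, target) + 1) # fewest comps >= target
    return x @ vecs[:, :k], k

# stand-in data: 500 frames, 10 dims with sharply decaying variances,
# so most of the variance lives in a few leading components
rng = np.random.default_rng(1)
frames = rng.standard_normal((500, 10)) * np.linspace(10, 0.01, 10)
proj, k = pca_99(frames)
```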
 

[ abstract | pdf | slides ]


Russell, M.J. and Jackson, P.J.B. (2003). The effect of an intermediate articulatory layer on the performance of a segmental HMM.
In Proceedings of Eurospeech 2003, pp. 2737-2740, Geneva.

Abstract:

We present a novel multi-level HMM in which an intermediate 'articulatory' representation is included between the state and surface-acoustic levels. A potential difficulty with such a model is that advantages gained by the introduction of an articulatory layer might be compromised by limitations due to an insufficiently rich articulatory representation, or by compromises made for mathematical or computational expediency. This paper describes a simple model in which speech dynamics are modelled as linear trajectories in a formant-based 'articulatory' layer, and the articulatory-to-acoustic mappings are linear. Phone classification results for TIMIT are presented for monophone and triphone systems with a phone-level syntax. The results demonstrate that, provided the intermediate representation is sufficiently rich, or a sufficiently large number of phone-class-dependent articulatory-to-acoustic mappings is employed, classification performance is not compromised.

Session: PThBf, Robust Speech Recognition 3 (poster).
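The core structural assumption in this abstract, a linear trajectory in a low-dimensional intermediate layer mapped linearly to acoustics, can be illustrated numerically. This is a hypothetical toy (invented dimensions and random values), not the paper's trained model; it simply shows that a linear map preserves the linearity of the trajectory in the acoustic layer.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_art, d_ac = 10, 3, 13          # frames, intermediate dim, acoustic dim

# linear 'articulatory' trajectory: x(t) = x0 + slope * t
x0 = rng.standard_normal(d_art)
slope = 0.1 * rng.standard_normal(d_art)
traj = x0 + slope * np.arange(T)[:, None]

# linear articulatory-to-acoustic mapping W
W = rng.standard_normal((d_art, d_ac))
acoustics = traj @ W

# since the map is linear, the acoustic trajectory is still linear in
# time: its second differences along the time axis vanish
```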
 

[ abstract | pdf ]


Jackson, P.J.B. (2003). Improvements in phone-classification accuracy from modelling duration.
In Proceedings of the 15th International Congress of Phonetic Sciences, ICPhS 2003, pp. 1349-1352, Barcelona.

Abstract:

Durations of real speech segments do not generally exhibit exponential distributions, as modelled implicitly by the state transitions of Markov processes. Several duration models were considered for integration within a segmental-HMM recognizer: uniform, exponential, Poisson, normal, gamma and discrete. The gamma distribution fitted that measured for silence best, by an order of magnitude. Evaluations determined an appropriate weighting for duration against the acoustic models. Tests showed a reduction of 2% absolute (6+% relative) in the phone-classification error rate with gamma and discrete models; exponential ones gave approximately 1% absolute reduction, and uniform no significant improvement. These gains in performance recommend the wider application of explicit duration models.
[http://www.ee.surrey.ac.uk/Personal/P.Jackson/Balthasar/]

Session: T.3.P2, Automatic speech recognition / Auditory mechanisms (poster).
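The duration-modelling comparison in this abstract can be sketched with a simple moment-matching gamma fit: estimate shape and scale from the sample mean and variance, then compare average log-likelihoods against the exponential model implied by Markov state transitions. The durations below are invented example values, not TIMIT measurements, and moment matching is only one of several ways to fit a gamma.

```python
import math

def fit_gamma(durations):
    """Moment-matching estimates of gamma shape k and scale theta."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    return mean ** 2 / var, var / mean   # k, theta

def gamma_loglik(d, k, theta):
    # log of the gamma pdf at duration d
    return ((k - 1) * math.log(d) - d / theta
            - math.lgamma(k) - k * math.log(theta))

def expon_loglik(d, mean):
    # log of the exponential pdf with matched mean
    return -math.log(mean) - d / mean

durs = [55, 62, 70, 58, 66, 74, 60, 68, 63, 59]   # example durations (ms)
k, theta = fit_gamma(durs)
mean = sum(durs) / len(durs)
ll_gamma = sum(gamma_loglik(d, k, theta) for d in durs)
ll_expon = sum(expon_loglik(d, mean) for d in durs)
# the peaked gamma fits these tightly clustered durations far better
# than the monotonically decaying exponential
```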
 

[ abstract | pdf | poster ]


Moreno, D.M., Jackson, P.J.B., Hernando, J. and Russell, M.J. (2003). Improved ASR in noise using harmonic decomposition.
In Proceedings of the 15th International Congress of Phonetic Sciences, ICPhS 2003, pp. 751-754, Barcelona.

Abstract:

Application of the pitch-scaled harmonic filter (PSHF) to automatic speech recognition in noise was investigated using the Aurora 2.0 database. The PSHF decomposed the original speech into periodic and aperiodic streams. Digit-recognition tests with the extended features compared the noise robustness of various parameterisations against standard 39 MFCCs. Separately, each stream reduced word accuracy by less than 1% absolute; together, the combined streams gave substantial increases under noisy conditions. Applying PCA to concatenated features proved better than to separate streams, and to static coefficients better than after calculation of deltas. With multi-condition training, accuracy improved by 7.8% at 5dB SNR, thus providing resilience from corruption by noise.
[http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/]

Session: M.4.5, Automatic speech recognition I (oral).
 

[ abstract | pdf | ppt ]


Russell, M.J., Jackson, P.J.B. and Wong, M.L.P. (2003). Development of articulatory-based multi-level segmental HMMs for phonetic classification in ASR.
In Proceedings of EURASIP Conference on Video/Image Processing and Multimedia Communications, EC-VIP-MC 2003, Vol. 2, pp. 655-660, Zagreb, Croatia.

Abstract:

A simple multiple-level HMM is presented in which speech dynamics are modelled as linear trajectories in an intermediate, formant-based representation and the mapping between the intermediate and acoustic data is achieved using one or more linear transformations. An upper-bound on the performance of such a system is established. Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.

Keywords: Automatic speech recognition, Hidden Markov Models, segment models.
 

[ abstract | pdf ]


Jackson, P.J.B. and Russell, M.J. (2002). Models of speech dynamics in a segmental-HMM recognizer using intermediate linear representations.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP 2002, pp. 1253-1256, Denver CO.

Abstract:

A theoretical and experimental analysis of a simple multi-level segmental HMM is presented in which the relationship between symbolic (phonetic) and surface (acoustic) representations of speech is regulated by an intermediate (articulatory) layer, where speech dynamics are modeled using linear trajectories. Three formant-based parameterizations and measured articulatory positions are considered as intermediate representations, from the TIMIT and MOCHA corpora respectively. The articulatory-to-acoustic mapping was performed by between 1 and 49 linear transformations. Results of phone-classification experiments demonstrate that, by appropriate choice of intermediate parameterization and mappings, it is possible to achieve close to optimal performance.

Session: Acoustic modelling
 

[ abstract | pdf | ppt ]


Jackson, P.J.B. (2001). Acoustic cues of voiced and voiceless plosives for determining place of articulation.
In Proceedings of Workshop on Consistent and Reliable Acoustic Cues for sound analysis, CRAC 2001, pp. 19-22, Aalborg, Denmark.

Abstract:

Speech signals from stop consonants with trailing vowels were analysed for cues consistent with their place of articulation. They were decomposed into periodic and aperiodic components by the pitch-scaled harmonic filter to improve the quality of the formant tracks, to which exponential trajectories were fitted to get robust formant loci at voice onset. Ensemble-average power spectra of the bursts exhibited dependence on place (and on vowel context for velar consonants), but not on voicing. By extrapolating the trajectories back to the release time, formant estimates were compared with spectral peaks, and connexions were made between these disparate acoustic cues.

Keywords: acoustic cues, plosive, stop consonants.
 

[ abstract | pdf ]


Jackson, P.J.B. and Shadle, C.H. (2001). Uses of the pitch-scaled harmonic filter in speech processing.
In Proceedings of the Institute of Acoustics, Workshop on Innovation in Speech Processing 2001, Vol. 23 (3), pp. 309-321, Stratford-upon-Avon, UK.

Abstract:

The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech signals into their periodic and aperiodic constituents, during periods of phonation. In this paper, the use of the PSHF for speech analysis and processing tasks is described. The periodic component can be used as an estimate of the part attributable to voicing, and the aperiodic component can act as an estimate of that attributable to turbulence noise, i.e., from fricative, aspiration and plosive sources. Here we present the algorithm for separating the periodic and aperiodic components from the pitch-scaled Fourier transform of a short section of speech, and show how to derive signals suitable for time-series analysis and for spectral analysis. These components can then be processed in a manner appropriate to their source type, for instance, extracting zeros as well as poles from the aperiodic spectral envelope. A summary of tests on synthetic speech-like signals demonstrates the robustness of the PSHF's performance to perturbations from additive noise, jitter and shimmer. Examples are given of speech analysed in various ways: power spectrum, short-time power and short-time harmonics-to-noise ratio, linear prediction and mel-frequency cepstral coefficients. Besides being valuable for speech production and perception studies, the latter two analyses show potential for incorporation into speech coding and speech recognition systems. Further uses of the PSHF are revealing normally-obscured acoustic features, exploring interactions of turbulence-noise sources with voicing, and pre-processing speech to enhance subsequent operations.

Keywords: periodic/aperiodic decomposition, acoustic features.
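A minimal sketch of the decomposition idea behind the PSHF (not the published algorithm, which includes optimal window placement and re-apportioning of power between the components): analyse a window spanning an integer number b of pitch periods, so voicing energy falls exactly in DFT bins that are multiples of b; those bins give the periodic estimate, and the residual gives the aperiodic estimate. The test signal is a synthetic two-harmonic "voiced" tone plus white noise, with invented parameter values.

```python
import numpy as np

def pshf_frame(frame, b=4):
    """frame: samples covering exactly b pitch periods.
       Returns (periodic, aperiodic) time-series estimates."""
    n = len(frame)
    spec = np.fft.rfft(frame)
    harm = np.zeros_like(spec)
    harm[::b] = spec[::b]          # keep bins at multiples of b (harmonics)
    harm[0] = 0.0                  # exclude DC from the periodic part
    periodic = np.fft.irfft(harm, n)
    return periodic, frame - periodic

# Example: 100 Hz "voiced" tone with two harmonics plus noise, fs = 8 kHz
fs, f0, b = 8000, 100.0, 4
n = int(b * fs / f0)               # 4 pitch periods = 320 samples
t = np.arange(n) / fs
voiced = (np.sin(2 * np.pi * f0 * t)
          + 0.3 * np.sin(2 * np.pi * 2 * f0 * t))
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(n)
per, aper = pshf_frame(voiced + noise, b)
```

Because b of every b bins are kept, roughly 1/b of the noise power leaks into the periodic estimate; the full PSHF corrects for this when re-apportioning power between the two components.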
 

[ abstract | pdf | ppt ]


Jackson, P.J.B. and Shadle, C.H. (2000). Performance of the pitch-scaled harmonic filter and applications in speech analysis.
In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 3, pp. 1311-1314, Istanbul.

Abstract:

The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech signals into their voiced and unvoiced constituents. In this paper, we evaluate its ability to reconstruct the time series of the two components accurately using a variety of synthetic, speech-like signals, and discuss its performance. These results determine the degree of confidence that can be expected for real speech signals: typically, 5 dB improvement in the signal-to-noise ratio of the harmonic component and approximately 5 dB more than the initial harmonics-to-noise ratio (HNR) in the anharmonic component. A selection of the analysis opportunities that the decomposition offers is demonstrated on speech recordings, including dynamic HNR estimation and separate linear prediction analyses of the two components. These new capabilities provided by the PSHF can facilitate discovering previously hidden features and investigating interactions of unvoiced sources, such as frication, with voicing.

Session: 3.2 Speech analysis

Keywords: harmonics-to-noise ratio, voiced/unvoiced decomposition, frication, aspiration noise.
 

[ abstract | pdf ]


Jackson, P.J.B. and Shadle, C.H. (2000). Aero-acoustic modelling of voiced and unvoiced fricatives based on MRI data.
In Proceedings of the 5th Seminar on Speech Production, pp. 185-188, Seeon, Germany.

Abstract:

We would like to develop a more realistic production model of unvoiced speech sounds, namely fricatives, plosives and aspiration noise. All three involve turbulence noise generation, with place-dependent source characteristics that vary with time (rapidly, in plosives). In this study, we aimed to produce, using an aero-acoustic model of the vocal-tract filter and source, voiced as well as unvoiced fricatives that provide a good match to analyses of speech recordings. The vocal-tract transfer function (VTTF) was computed by the vocal-tract acoustics program, VOAC [Davies, McGowan and Shadle. Vocal Fold Physiology: Frontiers in Basic Science, ed. Titze, Singular Pub., CA, 93-142, 1993], using geometrical data, in the form of cross-sectional area and hydraulic radius functions, along the length of the tract. VOAC incorporates the effects of net flow into the transmission of plane waves through a tubular representation of the tract, and relaxes assumptions of rigid walls and isentropic propagation. The geometry functions were derived from multiple-slice, dynamic, magnetic resonance images (MRI) [Mohammad. PhD thesis, Dept. ECS, U. Southampton, UK, 1999; Shadle, Mohammad, Carter, and Jackson. Proc. ICPhS, S.F. CA, 1:623-626, 1999], using a method of converting from the pixel outlines that was improved over earlier efforts on vowels. A coloured noise source signal was combined with the VTTF and radiation characteristic to synthesize the unvoiced fricative [s]. For its voiced counterpart [z], many researchers have noted that the noise source appears to be modulated by voicing, and the phase of the modulation has been shown to be perceptually significant. Based on our analysis of recordings by the same subject [Jackson and Shadle. Proc. IEEE-ICASSP, Istanbul, 2000], the frication source of [z] was varied periodically according to fluctuations in the flow velocity at the constriction exit, and the modulation phase was governed by the convection time for the flow perturbation to travel from the constriction to the obstacle. The synthesized fricatives were compared to the speech recordings in a simple listening test, and comparisons of the predicted and measured time series suggested that the model, which brings together physical, aerodynamic and acoustic information, can replicate characteristics of real speech, such as the modulation in voiced fricatives.
[http://www.ee.surrey.ac.uk/Personal/P.Jackson/Nephthys/] (please note the change of URL, Nov '02)
 

[ abstract | pdf ]


Shadle, C.H., Mohammad, M., Carter, J.N. and Jackson, P.J.B. (1999). Dynamic Magnetic Resonance Imaging: new tools for speech research.
In Proceedings of the 14th International Congress of Phonetic Sciences, Vol. 1, pp. 623-626, San Francisco, CA.

Abstract:

A multiplanar Dynamic Magnetic Resonance Imaging (MRI) technique that extends our earlier work on single-plane Dynamic MRI is described. Scanned images acquired while an utterance is repeated are recombined to form pseudo-time-varying images of the vocal tract using a simultaneously recorded audio signal. There is no technical limit on the utterance length or number of slices that can be so imaged, though the number of repetitions required may be limited by the subject's stamina. An example of [pasi] imaged in three sagittal planes is shown; with a Signa GE 0.5T MR scanner, 360 tokens were reconstructed to form a sequence of 39 three-slice 16 ms frames. From these, a 3-D volume was generated for each time frame, and tract surfaces outlined manually. Parameters derived from these include: palate-tongue distances for [a,s,i]; estimates of tongue volume and of the area function using only the midsagittal, and then all three slices. These demonstrate the accuracy and usefulness of the technique.
 

[ abstract | pdf ]


Jackson, P.J.B. and Shadle, C.H. (1998). Pitch-synchronous decomposition of mixed-source speech signals.
In Proceedings of the International Congress on Acoustics and Meeting of the Acoustical Society of America, Vol. 1, pp. 263-264, Seattle, WA.

Abstract:

As part of a study of turbulence-noise sources in speech production, a method has been developed for decomposing an acoustic signal into harmonic (voiced) and anharmonic (unvoiced) components, based on a hoarseness metric (Muta et al., 1988, J. Acoust. Soc. Am. 84, pp.1292-1301). Their pitch-synchronous harmonic filter (PSHF) has been extended (to EPSHF) to yield time histories of both harmonic and anharmonic components. Our corpus includes many examples of turbulence noise, including aspiration, voiced and unvoiced fricatives, and a variety of voice qualities (e.g. breathy, whispered). The EPSHF algorithm plausibly decomposed breathy vowels, but the harmonic component of voiced fricatives still contained significant noise, similar in shape to (though weaker than) the ensemble-averaged anharmonic spectrum. In general the algorithm performed best on sustained sounds. Tracking errors at rapid transitions, and due to jitter and shimmer, were spuriously attributed to the anharmonic component. However, the extracted anharmonic component clearly exhibited modulation in voiced fricatives. While such modulation has been previously reported (and also in hoarse voice), it was verified by tests on synthetic signals, where constant and modulated noise signals were extracted successfully. The results suggest that the EPSHF will continue to enable exploration of the interaction of phonation and turbulence noise.
 

[ abstract | pdf ]


Jackson, P.J.B. and Ross, C.F. (1996). Application of active noise control to corporate aircraft.
In Proceedings of the American Society of Mechanical Engineers, Vol. DE93, pp. 19-25, Atlanta, GA.

Abstract:

Following the successful introduction of Active Noise Control (ANC) systems as standard production fits on commuter aircraft (Saab2000, Saab340B and Dash8Q series 100, 200 & 300), recent efforts have focused on developing low-cost, low-weight systems for smaller corporate aircraft. This paper describes the approach taken by Ultra to the new technical challenges and the resulting improvements to the design methodology. A review of system performance on corporate (King Air & Twin Commander) turboprop aircraft shows repeatable global Tonal Noise Reductions (TNRs) of >8 dBA throughout the whole cabin, achieving reductions >20 dB in some locations at the blade-pass frequency (BPF), and major comfort benefits throughout the flight envelope with a weight penalty of less than 20 kg.

[ abstract | preprint ]



© 2002-7, maintained by Philip Jackson, last updated on 24 August 2007.
