Demos and Data



Demos


Demonstrations for Interference Reduction in Reverberant Speech Separation with Visual Voice Activity Detection

We have developed a pose-invariant lip-tracking algorithm based on template matching under the EPSRC/Dstl project "Audio-Visual Blind Source Separation". Some lip-tracking results can be found on YouTube here (the tracked lip region occupies approximately 45x17 pixels). More details about this algorithm can be found in [1].
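As a rough illustration of the template-matching idea only (not the pose-invariant algorithm of [1]), the following Python sketch locates a lip template in each video frame by normalised cross-correlation using OpenCV; the file name, the initial lip box and all parameter values are hypothetical assumptions of this sketch.

    # Minimal sketch of template-matching lip tracking (illustration only;
    # the pose-invariant algorithm in [1] also handles head-pose changes).
    import cv2

    def track_lip_region(frame_gray, lip_template):
        """Locate the lip template in a grayscale frame via normalised
        cross-correlation; return the top-left corner and matching score."""
        result = cv2.matchTemplate(frame_gray, lip_template, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)
        return max_loc, max_val

    # Hypothetical usage: track a roughly 45x17-pixel lip region over a video.
    cap = cv2.VideoCapture("talker.avi")          # hypothetical file name
    ok, first = cap.read()
    first_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)
    x0, y0, w, h = 300, 220, 45, 17               # hypothetical initial lip box
    template = first_gray[y0:y0 + h, x0:x0 + w]

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        (x, y), score = track_lip_region(gray, template)
        # A pose-invariant tracker would select/update templates here;
        # this sketch keeps the initial template throughout.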

The lip-tracking information can be used to detect voice activity, which can then be used to enhance speech sources estimated from noisy mixtures [2] [3]. We have developed a spectral-subtraction-based technique that integrates the visual VAD cues to enhance the speech sources separated by Mandel's audio-domain BSS algorithm [4]. The interference remaining in the separated speech is detected on a block-by-block basis, using a correlation and energy-ratio map computed from the BSS outputs.

To demonstrate the results, a target speech signal was mixed with an interfering speech signal (intrusion) using the binaural room impulse responses recorded in four different rooms, as described in [5], which can be downloaded from here. We then added 10 dB noise to each speech mixture. Note that, in this demonstration, the target and interfering speakers are placed relatively close to each other, with an angle of 15 degrees between them in terms of the setup in [5]. This is a very challenging source separation scenario, due to the increased spatial ambiguity between two closely spaced sources.

Several algorithms were applied to these challenging mixtures to obtain the source estimates, including Mandel [4], Rivet [6], AV-LIU [7], and ANC [8]. The ideal binary mask (IBM) [9] (denoted "Ideal" below) is shown as a performance benchmark, since it assumes that both the target and the interfering speech are known a priori. The target speech was taken from the LILiR dataset, recorded in the visual studio at CVSSP, University of Surrey; this dataset has been used in previous work on lip tracking, such as [10].
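To give a flavour of the block-based interference detection and spectral subtraction described above, the following Python sketch compares two BSS outputs block by block and subtracts the interferer spectrum where the visual VAD indicates target silence, or where the correlation and energy-ratio cues suggest interference. This is only a simplified sketch of the general idea, not the method of [2] [3]; the thresholds, frame sizes and variable names are illustrative assumptions.

    # Minimal sketch: block-wise interference detection on two BSS outputs,
    # followed by spectral subtraction guided by a visual VAD (illustration only).
    import numpy as np
    from scipy.signal import stft, istft

    def enhance_target(y_target, y_interf, vad, fs=16000, nperseg=512,
                       block=8, corr_thr=0.7, ratio_thr=1.0, beta=1.0):
        """y_target, y_interf: time-domain BSS estimates of target and interferer;
        vad: per-frame visual VAD (1 = target active), assumed to be resampled
        to the STFT frame rate."""
        vad = np.asarray(vad)
        _, _, Zt = stft(y_target, fs=fs, nperseg=nperseg)
        _, _, Zi = stft(y_interf, fs=fs, nperseg=nperseg)
        magT, magI = np.abs(Zt), np.abs(Zi)
        out = magT.copy()
        for b in range(0, Zt.shape[1], block):                     # block-by-block
            sl = slice(b, min(b + block, Zt.shape[1]))
            t_blk, i_blk = magT[:, sl].ravel(), magI[:, sl].ravel()
            corr = np.corrcoef(t_blk, i_blk)[0, 1]                 # correlation cue
            ratio = (i_blk ** 2).sum() / ((t_blk ** 2).sum() + 1e-12)  # energy ratio
            inactive = vad[sl].mean() < 0.5                        # visual VAD cue
            if inactive or (corr > corr_thr and ratio > ratio_thr):
                # Spectral subtraction of the interferer estimate, floored at zero.
                out[:, sl] = np.maximum(magT[:, sl] - beta * magI[:, sl], 0.0)
        _, x = istft(out * np.exp(1j * np.angle(Zt)), fs=fs, nperseg=nperseg)
        return x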

Room A (RT60 = 320 ms)

Speech mixtures: Left | Right
Separated source with various methods: Mandel | Rivet | AV-LIU | ANC | Ideal-VAD | Proposed | Proposed-ANC | Ideal
Original speech: Intrusion | Target
Associated video: Video (target)

Room B (RT60 = 470 ms)

Speech mixtures: Left | Right
Separated source with various methods: Mandel | Rivet | AV-LIU | ANC | Ideal-VAD | Proposed | Proposed-ANC | Ideal
Original speech: Intrusion | Target
Associated video: Video (target)

Room C (RT60 = 680 ms)

Speech mixtures: Left | Right
Separated source with various methods: Mandel | Rivet | AV-LIU | ANC | Ideal-VAD | Proposed | Proposed-ANC | Ideal
Original speech: Intrusion | Target
Associated video: Video (target)

Room D (RT60 = 890 ms)

Speech mixtures: Left | Right
Separated source with various methods: Mandel | Rivet | AV-LIU | ANC | Ideal-VAD | Proposed | Proposed-ANC | Ideal
Original speech: Intrusion | Target
Associated video: Video (target)

References

  1. Q. Liu, W. Wang, and P. Jackson, "A Visual Voice Activity Detection Method with Adaboosting," in Proc. IEEE Sensor Signal Processing for Defence (SSPD 2011), London, UK, Sept 28-29, 2011. [PDF]
  2. Q. Liu and W. Wang, "Blind source separation and visual voice activity detection for target speech extraction," in Proc. IEEE 3rd International Conference on Awareness Science and Technology (ICAST 2011), pp. 457-460, Dalian, China, Sept 27-30, 2011. (Invited Paper) [PDF]
  3. Q. Liu, A. Aubrey, and W. Wang, "Interference Reduction in Reverberant Speech Separation with Visual Voice Activity Detection," IEEE Transactions on Multimedia, 2013 (submitted).
  4. M. I. Mandel, R. J. Weiss, and D. Ellis, "Model-based expectation-maximization source separation and localization," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 2, pp. 382-394, February 2010.
  5. C. Hummersone, "A Psychoacoustic Engineering Approach to Machine Sound Source Separation in Reverberant Environments", PhD Thesis, University of Surrey, 2011.
  6. B. Rivet, L. Girin, C. Serviere, D.-T. Pham, and C. Jutten, "Using a visual voice activity detector to regularize the permutations in blind separation of convolutive speech mixtures," in Proc. International Conference on Digital Signal Processing, 2007, pp. 223-226.
  7. Q. Liu, W. Wang, and P. Jackson, "Use of bimodal coherence to resolve the permutation problem in convolutive BSS," Signal Processing, vol. 92, no. 8, pp. 1916-1927, August 2012.
  8. S. Y. Low, S. Nordholm, and R. Togneri, "Convolutive blind signal separation with post-processing," IEEE Trans. Speech, Audio Proc, vol. 12, pp. 539-548, 2004.
  9. D. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, P. Divenyi, Ed. Springer US, 2005, ch. 12, pp. 181-197.
  10. E.-J. Ong and R. Bowden, "Robust lip-tracking using rigid flocks of selected linear predictors," in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, 2008.

Demonstrations for Multi-Speaker Tracking in a Room Environment

We have developed a particle filtering algorithm for multi-speaker tracking in a room environment, where audio measurements are used to improve the performance of the visual tracker by directly adapting the distributions of the visual particles based on the audio contributions from the directions of arrival (DOAs) of the audio sources [1]. Using the audio information, we can reduce the number of particles required by the visual tracker, and thus achieve efficient tracking. In addition, the algorithm is more robust than tracking systems using the individual modalities, especially when the moving speakers occlude each other or move out of the view of the cameras. Some results for video sequences from the AV16.3 dataset can be found here. We have also developed an adaptive particle filtering algorithm in which the number of particles is determined automatically by the algorithm rather than pre-defined at initialisation [2].
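The sketch below illustrates the general idea of constraining visual particles with audio DOA information. It is a simplified illustration, not the algorithm of [1]; the DOA-to-image mapping, the visual likelihood and all parameter values are assumptions of this sketch.

    # Minimal sketch: a fraction of the visual particles is re-drawn around the
    # image column implied by the audio DOA before the visual likelihood is applied.
    import numpy as np

    rng = np.random.default_rng(0)

    def doa_to_column(doa_deg, img_width, fov_deg=60.0):
        """Map an audio DOA (degrees, 0 = camera axis) to an image column,
        assuming a simple linear camera model (an assumption of this sketch)."""
        return img_width * (0.5 + doa_deg / fov_deg)

    def propagate(particles, sigma=10.0):
        """Random-walk motion model for the (x, y) particles."""
        return particles + rng.normal(0.0, sigma, particles.shape)

    def audio_reposition(particles, doa_deg, img_width, frac=0.3, sigma_x=20.0):
        """Re-draw a fraction of the particles around the DOA-implied column."""
        n = int(frac * len(particles))
        idx = rng.choice(len(particles), n, replace=False)
        col = doa_to_column(doa_deg, img_width)
        particles[idx, 0] = rng.normal(col, sigma_x, n)   # x drawn from audio cue
        return particles

    def resample(particles, weights):
        idx = rng.choice(len(particles), len(particles), p=weights)
        return particles[idx]

    def step(particles, visual_likelihood, doa_deg=None, img_width=640):
        """One tracking step; visual_likelihood scores each (x, y) particle,
        e.g. from a colour-histogram distance to the head model."""
        particles = propagate(particles)
        if doa_deg is not None:
            particles = audio_reposition(particles, doa_deg, img_width)
        w = np.array([visual_likelihood(p) for p in particles])
        w = w / w.sum()
        return resample(particles, w)

    # Hypothetical initialisation: 200 (x, y) particles over a 640x480 frame.
    particles = rng.uniform([0.0, 0.0], [640.0, 480.0], size=(200, 2))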

References

  1. V. Kilic, M. Barnard, W. Wang, and J. Kittler, "Audio Constrained Particle Filter Based Visual Tracking", in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), pp. 3627-3631, Vancouver, Canada, May 26-31, 2013. [PDF]

  2. V. Kilic, M. Barnard, W. Wang, and J. Kittler, "Adaptive Particle Filtering Approach to Audio-Visual Tracking", in Proc. 21st European Signal Processing Conference (EUSIPCO 2013), Marrakech, Morocco, 9-13 September, 2013.

Demonstrations for Dictionary Learning and Identity Models based Multi-Speaker Tracking

Using the particle filtering tracking framework, we have developed an algorithm for multi-speaker tracking in a room environment, based on dictionary learning and identity modelling. In this algorithm [1] [2]:

  • First, we model the appearance of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches, derived from the learned dictionaries, are used to generate the likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function is then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) is proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding (a simplified sketch of this encoding step is given after this list).

  • Second, an adaptive identity model is proposed to track multiple speakers whilst dealing with occlusions. This model is updated online using Maximum a Posteriori (MAP) adaptation, where we control the adaptation rate using the spatial relationship between the subjects.

  • Third, to enable automatic initialisation of the visual trackers, we exploit audio information, namely the Direction of Arrival (DOA) angle derived from microphone array recordings. Such information provides, a priori, the number of speakers and constrains the search space for the speakers' faces.

The proposed system is tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects. Some demos can be downloaded from here.
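As a simplified illustration of the encoding step referred to above (not the exact LcSA formulation or the classifier used in [1] [2]; the distance kernel, the sigmoid mapping and all parameter values are assumptions of this sketch), the following Python code softly assigns image patches to their nearest dictionary atoms, pools the weights into a histogram, and maps a pre-trained classifier's decision value to a particle likelihood:

    # Minimal sketch of LcSA-style soft-assignment encoding of image patches
    # against a learned dictionary, pooled into a histogram for classification.
    import numpy as np

    def lcsa_histogram(patches, dictionary, k=5, sigma=1.0):
        """patches: (n_patches, patch_dim); dictionary: (n_atoms, patch_dim).
        Each patch is softly assigned to its k nearest atoms with Gaussian
        weights; the per-atom weights are accumulated into a histogram."""
        hist = np.zeros(dictionary.shape[0])
        for p in patches:
            d2 = np.sum((dictionary - p) ** 2, axis=1)       # squared distances
            nearest = np.argsort(d2)[:k]                     # locality constraint
            w = np.exp(-d2[nearest] / (2 * sigma ** 2))      # soft assignment
            hist[nearest] += w / (w.sum() + 1e-12)
        return hist / (hist.sum() + 1e-12)

    def particle_likelihood(patches, dictionary, svm_decision, alpha=2.0):
        """Map a classifier decision value on the LcSA histogram to a positive
        likelihood for the particle filter (the sigmoid mapping is an assumption)."""
        h = lcsa_histogram(patches, dictionary)
        return 1.0 / (1.0 + np.exp(-alpha * svm_decision(h)))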

References

  1. M. Barnard, W. Wang, J. Kittler, S.M.R. Naqvi, and J.A. Chambers, "A Dictionary Learning Approach to Tracking," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), Kyoto, Japan, March 25-30, 2012. [PDF]

  2. M. Barnard, P.K. Koniusz, W. Wang, J. Kittler, S. M. Naqvi, and J.A. Chambers, "Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modelling," IEEE Transactions on Multimedia, 2013. [PDF] (in press)

Demonstrations for Underdetermined Blind Speech Separation

Input signals: Mixture 1 | Mixture 2
Output signals: Estimated Source 1 | Estimated Source 2 | Estimated Source 3 | Estimated Source 4

Notes:

  • The two speech mixtures were generated by mixing together four source signals (Source 1, Source 2, Source 3, and Source 4) using a randomly generated 4-by-2 mixing matrix.
  • The separation results were obtained using the method described in [1], where the source separation problem is reformulated as a sparse signal recovery problem, with the dictionary either pre-defined (e.g. DCT, STFT, and MDCT) or learned from the speech mixtures using dictionary learning algorithms (e.g. SimCO [2], K-SVD [6], and GAD [7]); a simplified sketch of this sparse-recovery formulation is given after these notes. Earlier work relating to [1] has been presented at several conferences, including [3], [4], and [5].
  • Demonstrations for Campaign results will be added soon.
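The following Python sketch illustrates the sparse-recovery formulation in its simplest form: a fixed DCT dictionary and a mixing matrix assumed known, with at most two sources assumed active per transform coefficient and the best pair of mixing-matrix columns chosen by least squares. It is a toy stand-in for the algorithms in [1]-[5], not an implementation of them; the frame length, the known-matrix assumption and the two-active-source rule are illustrative choices, and the matrix mapping the four sources to the two mixtures is stored here as a 2 x 4 array A so that mixtures = A @ sources.

    # Minimal sketch of sparse-recovery separation with a fixed DCT dictionary.
    import numpy as np
    from itertools import combinations
    from scipy.fft import dct, idct

    def separate_sparse(mixtures, A, frame=512):
        """mixtures: (2, n_samples) array; A: (2, 4) mixing matrix such that
        mixtures = A @ sources. Returns a (4, n) array of source estimates."""
        n_src = A.shape[1]
        n = mixtures.shape[1] - mixtures.shape[1] % frame   # drop trailing samples
        est = np.zeros((n_src, n))
        for start in range(0, n, frame):
            X = dct(mixtures[:, start:start + frame], norm='ortho')  # sparse domain
            C = np.zeros((n_src, frame))
            for k in range(frame):
                best, best_err = None, np.inf
                for pair in combinations(range(n_src), 2):  # assume <= 2 active sources
                    cols = list(pair)
                    c, *_ = np.linalg.lstsq(A[:, cols], X[:, k], rcond=None)
                    err = np.sum((A[:, cols] @ c - X[:, k]) ** 2)
                    if err < best_err:
                        best, best_err = (cols, c), err
                cols, c = best
                C[cols, k] = c                              # keep the sparse solution
            est[:, start:start + frame] = idct(C, norm='ortho')
        return est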

References

  1. T. Xu, W. Wang, and W. Dai, "Sparse Coding with Adaptive Dictionary Learning for Underdetermined Speech Separation," Speech Communication, 2012.
  2. W. Dai, T. Xu, and W. Wang, "Simultaneous Codeword Optimisation (SimCO) for Dictionary Update and Learning," IEEE Transactions on Signal Processing, 2012. [PDF]
  3. T. Xu and W. Wang, "Methods for Learning Adaptive Dictionary for Underdetermined Speech Separation," in Proc. IEEE 21st International Workshop on Machine Learning for Signal Processing (MLSP 2011), Beijing, China, Sept 18-21, 2011. [PDF]
  4. T. Xu and W. Wang, "A Block-based Compressed Sensing Method for Underdetermined Blind Speech Separation Incorporating Binary Mask," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2010), Dallas, Texas, USA, March 14-19, 2010.
  5. T. Xu and W. Wang, "A Compressed Sensing Approach for Underdetermined Blind Audio Source Separation with Sparse Representations," in Proc. IEEE International Workshop on Statistical Signal Processing (SSP 2009), Cardiff, UK, August 31-Sept 3, 2009. [PDF]
  6. M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311-4322, 2006.
  7. M. G. Jafari and M. D. Plumbley, "Fast dictionary learning for sparse representations of speech signals," IEEE J. Selected Topics in Signal Process., vol. 5, no. 5, pp. 1025-1031, 2011.
Free software

To be added soon.

Demonstrations for Blind Separation of Convolutive Speech Mixtures

  • Separation results for synthetic convolutive mixtures with different reverberation times:

    RT60 = 30 ms
    Input signals: Mixture 1 | Mixture 2
    Output signals: Estimated Source 1 (Stage 1 | Stage 2 | Stage 3), Estimated Source 2 (Stage 1 | Stage 2 | Stage 3)

    RT60 = 50 ms
    Input signals: Mixture 1 | Mixture 2
    Output signals: Estimated Source 1 (Stage 1 | Stage 2 | Stage 3), Estimated Source 2 (Stage 1 | Stage 2 | Stage 3)

    RT60 = 100 ms
    Input signals: Mixture 1 | Mixture 2
    Output signals: Estimated Source 1 (Stage 1 | Stage 2 | Stage 3), Estimated Source 2 (Stage 1 | Stage 2 | Stage 3)

    RT60 = 150 ms
    Input signals: Mixture 1 | Mixture 2
    Output signals: Estimated Source 1 (Stage 1 | Stage 2 | Stage 3), Estimated Source 2 (Stage 1 | Stage 2 | Stage 3)

    RT60 = 200 ms
    Input signals: Mixture 1 | Mixture 2
    Output signals: Estimated Source 1 (Stage 1 | Stage 2 | Stage 3), Estimated Source 2 (Stage 1 | Stage 2 | Stage 3)

    RT60 = 400 ms
    Input signals: Mixture 1 | Mixture 2
    Output signals: Estimated Source 1 (Stage 1 | Stage 2 | Stage 3), Estimated Source 2 (Stage 1 | Stage 2 | Stage 3)

  • Separation results for real recordings made in a strongly reverberant room environment:

    Input signals: Mixture 1 | Mixture 2
    Output signals: Estimated Source 1 (Stage 1 | Stage 2 | Stage 3), Estimated Source 2 (Stage 1 | Stage 2 | Stage 3)

Notes:

  • The synthetic convolutive mixtures were obtained by mixing together two source signals (Source 1, Source 2) using a simulated room model in which the reverberation time can be set explicitly. The source signals were used in [4] and may be available here.
  • The real recordings are microphone mixtures of a male speaker with a TV on, recorded by Parra and Spence (2000) [5].
  • The separation results were obtained using a multistage approach that we developed; see [1] and [2] for details.
  • Stage 1 uses constrained convolutive ICA (CCICA) [3]; Stage 2 applies the ideal binary mask (IBM) [6], with the mask estimated from the CCICA outputs; and Stage 3 smooths the Stage-2 mask in the cepstral domain, which reduces the musical noise introduced by the binary mask (a simplified sketch of Stages 2 and 3 is given below).
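The Python sketch below illustrates Stages 2 and 3 in a simplified form. It is not the exact procedure of [1] [2]; the binary-mask rule, the DCT-based cepstral smoothing and all parameter values are assumptions made for illustration.

    # Minimal sketch: binary mask from two stage-1 outputs, then cepstral smoothing
    # of the mask to suppress musical noise (simplified illustration only).
    import numpy as np
    from scipy.signal import stft, istft
    from scipy.fft import dct, idct

    def binary_mask(y1, y2, fs=16000, nperseg=512):
        """Stage 2: mask is 1 where output 1 dominates output 2 in each T-F unit."""
        _, _, Y1 = stft(y1, fs=fs, nperseg=nperseg)
        _, _, Y2 = stft(y2, fs=fs, nperseg=nperseg)
        return (np.abs(Y1) > np.abs(Y2)).astype(float), Y1

    def cepstral_smooth(mask, keep=30, floor=1e-3):
        """Stage 3: per frame, transform the log-mask to the quefrency domain
        along frequency, keep only the low quefrencies, and transform back,
        yielding a smoother soft mask."""
        logm = np.log(np.maximum(mask, floor))
        ceps = dct(logm, axis=0, norm='ortho')   # quefrency axis = frequency axis
        ceps[keep:, :] = 0.0                     # discard fine spectral structure
        return np.clip(np.exp(idct(ceps, axis=0, norm='ortho')), 0.0, 1.0)

    def stage23(y1, y2, fs=16000, nperseg=512):
        mask, Y1 = binary_mask(y1, y2, fs, nperseg)
        smoothed = cepstral_smooth(mask)
        _, x = istft(smoothed * Y1, fs=fs, nperseg=nperseg)
        return x
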
References

  1. T. Jan, W. Wang, and D.L. Wang, "A Multistage Approach to Blind Separation of Convolutive Speech Mixtures," Speech Communication, vol. 53, pp. 524-539, 2011. [PDF]
  2. T. Jan, W. Wang, and D.L. Wang, "A multistage approach for blind separation of convolutive speech mixtures," in Proc. IEEE ICASSP, 2009. [PDF]
  3. W. Wang, S. Sanei, and J.A. Chambers, "Penalty Function Based Joint Diagonalization Approach for Convolutive Blind Separation of Nonstationary Sources," IEEE Transactions on Signal Processing, vol. 53, no. 5, pp. 1654-1669, May 2005. [PDF]
  4. M.S. Pedersen, D.L. Wang, J. Larsen, and U. Kjems, "Two-microphone separation of speech mixtures," IEEE Trans. on Neural Networks, vol. 19, pp. 475-492, 2008.
  5. L. Parra and C. Spence, "Convolutive blind source separation of non-stationary sources," IEEE Trans. on Speech and Audio Processing, pp. 320-327, May 2000.
  6. D.L. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in P. Divenyi (Ed.), Speech Separation by Humans and Machines, pp. 181-197, Kluwer Academic, Norwell, MA, 2005.

Free software

To be added soon.



Data

Music Audio Samples

  • The audio signals used in the following references are from a collection that may be found here.
References

  1. W. Wang, A. Cichocki, and J.A. Chambers, "A Multiplicative Algorithm for Convolutive Non-negative Matrix Factorization Based on Squared Euclidean Distance," IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2858-2864, July 2009. [PDF]
  2. W. Wang, "Squared Euclidean Distance Based Convolutive Non-negative Matrix Factorization with Multiplicative Learning Rules for Audio Pattern Separation," in Proc. 7th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2007), Cairo, Egypt, December 15-18, 2007. [PDF]




Last updated in August 2013
First created in July 2008