Week 1
Speech recognition:
- Give a definition of automatic speech recognition that distinguishes it from other speech technologies.
- How can speech technologies increase access to computer-based
systems for people with disabilities?
Speech communication:
- What acronym denotes the convention for transcribing the sounds of
the world's languages?
- What is the difference between phones and phonemes?
- What three attributes of natural speech contribute most to the
non-linguistic aspects known as prosody?
Phonetics:
- How can a basic understanding of phonetics facilitate
the study of speech signals?
- What is a diphthong? Illustrate your answer with an example.
- What class of sounds includes /m/, /n/ and /ŋ/ (or /m,n,N/ in SAMPA)?
- What characteristics of the acoustic signal are most useful for
discriminating vowels?
- Give three environmental factors that can affect the way speech is
produced.
- What are the three places of articulation for English plosive
consonants (a.k.a. stops)?
- What is the main difference between the way that the sounds
/t/ and /s/ are produced?
- What name is given to the effect in fluent speech where, for
example, the phrase "isn't it" is pronounced as if it were "in'it"?
[solutions]
Week 2
Dynamic Time Warping:
- Write a pseudocode description of the DTW algorithm using the transitions
shown in Fig.1 (left).
Apply a distortion penalty for the horizontal (H) and steepest (S) transitions,
dH = dS = dμ/4,
where dμ denotes the mean distance found across the training data.
[An illustrative sketch follows Fig.1.]
- Modify your pseudocode to disallow two consecutive horizontal
transitions, as shown in Fig.1 (right).
- How can silence and wildcard templates be used during
enrollment to help reduce end-point detection errors?
Fig.1. Permissible DTW transitions.
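As an illustration of the kind of answer expected, here is a minimal Python sketch of the DTW recursion. It assumes Fig.1 (left) permits horizontal (H), diagonal and steepest (S, skip-one-frame) transitions; the transition set is an assumption (the figure is not reproduced here), and `dtw`, `dist` and `d_mean` are illustrative names.

```python
import numpy as np

def dtw(dist, d_mean):
    """Cumulative DTW distance over a local-distance matrix dist[i, j]
    (template frame i vs. input frame j). Assumed transitions, per
    Fig.1 (left): horizontal (i, j-1), diagonal (i-1, j-1) and
    steepest (i-2, j-1), with penalties dH = dS = d_mean / 4."""
    I, J = dist.shape
    pen = d_mean / 4.0                           # distortion penalty (H and S)
    D = np.full((I, J), np.inf)                  # cumulative distances
    D[0, 0] = dist[0, 0]
    for j in range(1, J):
        for i in range(I):
            best = D[i, j - 1] + pen             # horizontal (H)
            if i >= 1:
                best = min(best, D[i - 1, j - 1])        # diagonal, no penalty
            if i >= 2:
                best = min(best, D[i - 2, j - 1] + pen)  # steepest (S)
            D[i, j] = dist[i, j] + best
    return D[I - 1, J - 1]
```

One way to disallow two consecutive horizontal transitions (the right panel) is to split each grid point into two DP cells, "arrived via H" and "arrived otherwise", and permit the H transition only out of the latter.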
Speech production:
- What human organ produces the quasi-periodic source of voiced sounds, such
as vowels?
- What is fundamental frequency (also called
f0), and how is it produced?
- The vocal tract comprises three main passages. The pharynx and
oral cavity are two. What is the third?
- The velum, larynx and jaw cooperate in the production of speech.
Name two other articulators.
- What is a formant and how is it produced?
Speech analysis:
- What is the name of the organ in the inner ear that is responsible for
converting physical vibrations into a set of nerve responses
(i.e., electrical signals)?
- What is the bandwidth to which the human ear responds (to one significant
figure), and what are the implications, e.g., for choosing a sampling rate?
- If I calculate a DFT directly from a 40 ms section of a speech signal, what
will be the spacing of the frequency bins in the spectrum? [A worked sketch
follows this list.]
- Boxcar (rectangular), Kaiser and Blackman are names of particular window
functions. Name three other popular window functions.
- What would be an appropriate window size for a narrow-band spectrogram?
[Hint: male f0 80-200 Hz, female f0 150-400 Hz]
- Give an estimate of the SNR for a full-scale 16-bit speech signal
in relation to the quantisation noise.
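Two of these have quick numerical checks: DFT bin spacing is the reciprocal of the window duration, and the quantisation SNR of a full-scale B-bit signal is approximately 6.02B + 1.76 dB. A minimal sketch:

```python
# DFT frequency-bin spacing equals 1 / (window duration in seconds).
T = 0.040                               # 40 ms analysis window
print(f"bin spacing: {1.0 / T} Hz")     # 25.0 Hz

# Quantisation SNR for a full-scale B-bit signal: ~6.02*B + 1.76 dB.
B = 16
print(f"SNR: {6.02 * B + 1.76:.1f} dB") # ~98.1 dB
```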
[solutions]
Week 3
Markov models:
- Given an initial state vector π = [0.25 0.50 0.25] and
state-transition probability matrix
A = [0.25 0.50 0.25; 0.00 0.25 0.50; 0.00 0.00 0.25]:
- draw the model to show the states and permissible transitions;
- calculate the probability of the state sequence X = {1, 2, 2, 3}.
[A short sketch follows this list.]
- Using the Markov model presented in the lectures,
if today (Monday) has been rainy, what is the most likely weather
- tomorrow,
- in two days' time (i.e., on Wednesday)?
- Calculate the probabilities of rain-rain-sun, rain-cloud-sun and rain-sun-sun,
assuming πrain = 1.
- Hence, if we are told that it will be sunny on Wednesday for certain and that
it's rainy today, what's the most likely weather tomorrow?
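For the first part, the probability of a given state sequence is the initial-state probability times the product of the transition probabilities along the sequence. A minimal sketch using the π and A above:

```python
import numpy as np

pi = np.array([0.25, 0.50, 0.25])
A = np.array([[0.25, 0.50, 0.25],
              [0.00, 0.25, 0.50],
              [0.00, 0.00, 0.25]])

X = [0, 1, 1, 2]            # state sequence {1, 2, 2, 3}, zero-indexed
p = pi[X[0]]                # P(x1 = 1)
for s, t in zip(X, X[1:]):
    p *= A[s, t]            # multiply in each transition probability
print(p)                    # 0.25 * 0.50 * 0.25 * 0.50 = 0.015625
```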
Hidden Markov models:
- Considering the state-transition topologies shown in Figure 2:
- write an expression for the state duration probability
P(τ|λ) in Fig.2(a);
- write an expression for the duration probability for each state
P(τ|x=i,λ)
in Fig.2(b);
- hence derive the distribution of duration probabilities for the entire
model in Fig.2(b);
- how many terms are there in this expression for the model duration
with τ=5? [A starting point for (a) is sketched after Fig.2.]
Fig.2. HMM state transitions, panels (a) and (b).
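As a starting point for the duration questions: if Fig.2(a) is a single emitting state with self-loop probability a11 (an assumption, since the figure is not reproduced here), then remaining in the state for exactly τ frames means τ−1 self-loops followed by one exit, giving a geometric distribution:

```latex
P(\tau \mid \lambda) = a_{11}^{\,\tau - 1}\,(1 - a_{11}), \qquad \tau = 1, 2, \ldots
```

In a multi-state model such as Fig.2(b), each state contributes a term of this form, and the model duration distribution sums over all ways of splitting the τ frames among the states.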
Feature extraction 1:
- How can a bank of band-pass filters (each tuned to a different centre
frequency) be used to extract a feature vector that describes the overall
spectral shape of an acoustic signal at any particular time?
- The real cepstrum of a digital signal sampled at 48 kHz is defined
as cs(m) = IDFT( ln|S(k)| ), where S(k) is the signal's discrete
Fourier spectrum, ln|.| denotes the natural logarithm and IDFT is the inverse
discrete Fourier transform.
Considering only real, symmetric elements in the log-magnitude spectrum (i.e.,
the cos terms), draw the shapes of the first four cepstral coefficients
c0, c1, c2 and c3 in the
log-magnitude spectral domain. [A numpy sketch follows this list.]
- What properties of the Mel-frequency cepstrum make it more like human
auditory processing, compared to the real cepstrum?
- In calculating MFCCs, what is the purpose of:
- the log operation;
- mel-frequency binning;
- Discrete Cosine Transform?
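A minimal numpy sketch for the cepstral-shape question: each real cepstral coefficient cm weights a cosine with m full cycles across the log-magnitude spectrum, so c0 contributes a constant level, c1 a single slow ripple, and so on.

```python
import numpy as np

N = 512                                   # DFT length (arbitrary here)
k = np.arange(N)                          # frequency-bin index
for m in range(4):                        # c0, c1, c2, c3
    # Shape contributed by cepstral coefficient c_m in the
    # log-magnitude spectral domain: cos of m cycles over the N bins.
    shape = np.cos(2 * np.pi * m * k / N)
    print(m, shape[:4].round(3))          # m = 0 is flat; ripples speed up
```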
[solutions]
Week 4
Hidden Markov models:
- Draw a trellis diagram for the HMM in Fig.2(b) and a 5-frame observation
sequence.
- Show all the paths that arrive in the final null node after the fifth
frame.
- How many different paths are there for this model and number of
observations? [A counting sketch follows this list.]
- Imagine you are designing an optical character recognition (OCR) system
for converting images of written words into text, based on HMMs.
The observations you're given come in the form of pixellated grey-scale bitmaps
of a single line of writing.
Explain in general terms how you would construct the following components of the
system:
- frames of feature vectors to make up the observation sequences;
- the models (each one comprising a state or set of states);
- suitable annotations to be used during training;
- any special models (e.g., for dealing with blotches or blank spaces
within the pictures).
- How do the components of the OCR system compare to those for an ASR system
designed to perform an Isolated Word Recognition task?
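A hand count of trellis paths can be checked by dynamic programming with unit weights. The topology below is an assumption standing in for Fig.2(b) (a two-emitting-state left-to-right model with self-loops and an exit from the last state); substitute the actual transition structure.

```python
import numpy as np

# Assumed topology (stand-in for Fig.2(b)): 1 marks an allowed transition.
A_mask = np.array([[1, 1],        # state 1 -> {1, 2}
                   [0, 1]])       # state 2 -> {2}
entry = np.array([1, 0])          # paths may only start in state 1
exit_ = np.array([0, 1])          # only state 2 reaches the final null node

T = 5                             # number of observation frames
counts = entry.copy()             # paths ending in each state after frame 1
for _ in range(T - 1):
    counts = counts @ A_mask      # extend every path by one more frame
print(int(counts @ exit_))        # distinct paths into the null node: 4
```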
[solutions]
Week 5
HMM decoding:
- In the Viterbi algorithm, what is the purpose of the variable
ψt(i)?
- What is the meaning of Δ*?
- Using the Viterbi algorithm, calculate the path likelihoods
δt(i), the value of
Δ*, and use the values of
ψt(i) to extract the best path X*:
- for observations O¹={G,B,B} (worked example from lecture)
- for observations O²={R,B}
You may assume the following model parameters:
π=[1 0], A=[0.8 0.2; 0.0 0.6],
η=[0 0.4]T, and B=[0.5 0.2 0.3; 0.0 0.9 0.1] where the
columns of the B matrix correspond to green (G), blue (B) and red (R) events,
respectively. [An illustrative implementation follows this list.]
- What is the difference between the cumulative likelihoods
αt(i) computed in the forward procedure, and those
δt(i) computed in the Viterbi algorithm?
- Floating-point variables with double precision (i.e.,
8 bytes) can store values down to about 1e-308, typical state-transition
probabilities are of the order of 0.1, and the multi-dimensional output
probability would be around 0.01 for a good match.
- Given these approximations and assuming no re-scaling of the probabilities,
state at what stage it would become impossible to compare competing hypotheses
(i.e., different paths through the trellis).
In other words, after how many observations would you expect the likelihoods to
suffer from numerical underflow?
- With an observation frame rate of 10 ms, roughly how long would
this take?
- Instead of storing the likelihoods directly as in the previous
question, we choose to store them as negative log probabilities using a 16-bit
unsigned integer (quantising each decade into 32 levels).
- How many decades (factors of ten) can we represent with this data
type?
- How many seconds of observations (at 10 ms) could we now process
before suffering from underflow?
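A minimal Python implementation of the Viterbi recursion, using the model parameters given above; probabilities are multiplied directly (no logs) so the intermediate δ values can be compared against a hand calculation:

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.8, 0.2],
              [0.0, 0.6]])
eta = np.array([0.0, 0.4])             # exit probabilities
B = np.array([[0.5, 0.2, 0.3],         # rows: states 1, 2
              [0.0, 0.9, 0.1]])        # columns: G, B, R events
sym = {'G': 0, 'B': 1, 'R': 2}

def viterbi(obs):
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # best path likelihoods
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    delta[0] = pi * B[:, sym[obs[0]]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = scores.argmax()
            delta[t, j] = scores.max() * B[j, sym[obs[t]]]
    final = delta[-1] * eta            # terminate via exit probabilities
    best = int(final.argmax())
    path = [best]
    for t in range(T - 1, 0, -1):      # trace back through psi
        path.append(psi[t, path[-1]])
    return final.max(), [s + 1 for s in reversed(path)]

print(viterbi(['G', 'B', 'B']))        # Delta* = 0.01944, X* = [1, 2, 2]
print(viterbi(['R', 'B']))             # Delta* = 0.0216,  X* = [1, 2]
```

On the underflow questions, the rough argument is that each observation multiplies a path likelihood by about a_ij · b_j(ot) ≈ 0.1 × 0.01 = 1e-3, so the number of frames before hitting a given floor is roughly (decades available) / 3.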
[solutions]
Week 6
HMM training:
- Using the observation sequences O¹ and O² from Q.3 (week 5) and
your derived Viterbi alignments X*¹ and X*²,
update the model parameters π, A, η and B according to the Viterbi
re-estimation for multiple files.
- Assuming initial values of a prototype model to be
π=[1 0], A=[0.5 0.5; 0.0 0.5],
η=[0 0.5]T, B=[1/3 1/3 1/3; 1/3 1/3 1/3]:
- Calculate the forward and backward likelihoods, α and β,
for the first set of observations O¹={G,B,B}.
- Calculate the occupation and transition likelihoods, γ and ξ.
[A numpy sketch of this E-step follows this list.]
- Use the Baum-Welch formulae (an implementation of the
Expectation-Maximisation procedure) to re-estimate values for π, A, η and B.
- Derive an expression for Baum-Welch re-estimation using multiple
training files for the case of the discrete HMM.
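A minimal numpy sketch of the E-step (forward and backward passes, then γ and ξ) for the prototype model above and O¹ = {G, B, B}; variable names are illustrative, and η is applied at termination as in the lectures.

```python
import numpy as np

pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5],
              [0.0, 0.5]])
eta = np.array([0.0, 0.5])                 # exit probabilities
B = np.ones((2, 3)) / 3                    # flat output pdf; columns G, B, R
obs = [0, 1, 1]                            # O1 = {G, B, B}
T = len(obs)

# Forward pass: alpha[t, i] = P(o1..ot, x_t = i | lambda).
alpha = np.zeros((T, 2))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
P = alpha[-1] @ eta                        # total likelihood P(O | lambda)

# Backward pass: beta[t, i] = P(o_{t+1}..o_T, exit | x_t = i, lambda).
beta = np.zeros((T, 2))
beta[-1] = eta
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

gamma = alpha * beta / P                   # state-occupation likelihoods
# Transition likelihoods xi[t, i, j] for transitions t -> t+1:
xi = np.array([np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / P
               for t in range(T - 1)])
print(P, gamma.sum(axis=1))                # each row of gamma sums to 1
```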
Continuous HMMs:
- Based on a univariate Gaussian pdf, bi(ot),
derive an expression for the negative log probability density (neg-log-prob,
-ln bi) in the form of three terms: a constant, a term dependent
only on the variance, and a term dependent on the observations.
[The decomposition is sketched at the end of this section.]
- Derive the maximum likelihood (ML) estimate of the mean μ from a known
set of scalar observations, assuming a Gaussian pdf.
[Hint: write the likelihood function as a log probability, and then
differentiate.]
- Derive the ML estimate of the variance Σ in the same way.
- A 2-dimensional feature vector is made up of two independent observations
which have standard deviations of 2 and 3 units respectively.
- What are the variances of each of the two dimensions of the
observation vector, ot=[o1(t)
o2(t)]T?
- Hence, write down the 2×2 covariance matrix Σ for
ot.
- Evaluate the determinant of this matrix, |Σ|.
- Sketch the following pdfs by drawing contours of equal probability
density:
- μ=[0 0]T, Σ=[1 0; 0 ¼];
- μ=[3 2]T, Σ=[4 0; 0 9];
- μ=[2 -2]T, Σ=[2 1; 1 2];
- μ=[-2 -2]T, Σ=[2 -1; -1 2].
- Sketch the pdf for Gaussian mixtures with the following parameters:
- Univariate: c1=1/3, μ1=0,
Σ1=1 and c2=2/3, μ2=3,
Σ2=1;
- Bivariate: c1=½, μ1=[4; 4],
Σ1=[2 1; 1 2] and c2=½,
μ2=[4; 4], Σ2=[2 -1; -1 2].
- Derive expressions for training Gaussian-mixture output pdfs using
multiple files:
- for Viterbi re-estimation;
- for Baum-Welch re-estimation.
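For the first continuous-HMM question, writing the univariate Gaussian as bi(ot) = (2πσ²)^(-1/2) exp(−(ot − μ)²/(2σ²)) and taking the negative log gives the three-term form directly:

```latex
-\ln b_i(o_t) \;=\; \underbrace{\tfrac{1}{2}\ln 2\pi}_{\text{constant}}
 \;+\; \underbrace{\tfrac{1}{2}\ln \sigma^2}_{\text{variance only}}
 \;+\; \underbrace{\frac{(o_t - \mu)^2}{2\sigma^2}}_{\text{observation-dependent}}
```

The same log form underlies the ML questions: differentiating the summed log-likelihood with respect to μ and σ² and setting the result to zero yields μ̂ = (1/N) Σt ot and σ̂² = (1/N) Σt (ot − μ̂)².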
[solutions]