Acoustic model A model describing the
probabilistic behavior of the encoding of the linguistic information
in a speech signal. LVCSR systems use acoustic
units corresponding to phones or phones in context. The predominant
approach uses continuous density hidden Markov models (HMMs) to represent context-dependent phones.
Acoustic parametrization (or acoustic front-end)
see Speech Analysis.
Allophone A pronunciation
variant of a phoneme in a particular context, such as the realization of the phoneme /t/ in
type (aspirated /t/), butter (flapped /t/), or hot (final unreleased /t/).
Triphones and quinphones are two common models of
allophones used by speech recognizers.
ASR Accuracy The speech recognition accuracy is
defined as 1 - WER, see Word Accuracy.
Automatic Language Recognition Process
by which a computer identifies the language being spoken in a speech signal.
Automatic Speaker Recognition Process
by which a computer identifies the speaker from a speech signal.
Automatic Speech Recognition (ASR) Process
by which a computer converts a speech signal into a sequence of words. Also called Speech-to-Text Conversion.
Backoff Mechanism for smoothing the estimates of the probabilities of rare
events by relying on less specific models (acoustic or language models).
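As an illustration, here is a minimal sketch of the simple "stupid backoff" variant for a bigram language model (the toy corpus and the backoff factor alpha = 0.4 are assumptions made for this example; Katz backoff additionally renormalizes so the scores are true probabilities):

    from collections import Counter

    def backoff_score(w_prev, w, unigrams, bigrams, total, alpha=0.4):
        """Score P(w | w_prev), backing off to the unigram when the bigram is unseen."""
        if bigrams[(w_prev, w)] > 0:
            return bigrams[(w_prev, w)] / unigrams[w_prev]
        # Rare/unseen event: fall back to the less specific (unigram) model.
        return alpha * unigrams[w] / total

    words = "the cat sat on the mat".split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = sum(unigrams.values())
    print(backoff_score("the", "cat", unigrams, bigrams, total))  # seen bigram: 0.5
    print(backoff_score("cat", "the", unigrams, bigrams, total))  # unseen: backs off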
Bottleneck MLP A bottleneck MLP is
an MLP with one small layer (usually with fewer than 100
nodes). It is used in speech recognizers to extract the so-called "bottleneck
features" from raw spectral features.
CDHMM Continuous Density HMM (usually based on Gaussian mixtures).
Confidence score Posterior
probability associated with a hypothesis (e.g. a recognized word, an
identified speaker, ...). For a speech recognizer, the sum of the word
confidence scores is an estimate of the number of
correct words. Confidence scores are commonly evaluated by computing
the NCE metric.
DNN A Deep Neural Network is an MLP with
many layers. DNNs are used in hybrid speech recognizers to estimate
the HMM state posterior probabilities (see also TDNN,
RNN-T, and transformer).
Filler word Words like uhm, euh, ...
FIR filter A Finite Impulse Response (FIR) filter produces
an output that is the weighted sum of the current and past inputs.
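For example, an FIR filter computes y[n] = b[0]x[n] + b[1]x[n-1] + ... + b[M]x[n-M], which is a convolution of the input with the coefficients; a minimal sketch (the moving-average coefficients and the toy input are illustrative assumptions):

    import numpy as np

    # FIR filter: output depends on current and past inputs only.
    # y[n] = b[0]*x[n] + b[1]*x[n-1] + ... + b[M]*x[n-M]
    b = np.array([0.25, 0.25, 0.25, 0.25])   # 4-tap moving average (illustrative)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy input signal
    y = np.convolve(x, b)[:len(x)]           # convolution, truncated to input length
    print(y)                                 # [0.25 0.75 1.5  2.5  3.5 ]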
Frame An acoustic feature vector
(usually MFCC) estimated on a 20-30ms signal
window (see also Speech Analysis).
Frame Rate Number of frames per second (typically 100).
GMM Gaussian Mixture Model (i.e. a
1-state CDHMM). The speech spectrum (or frame) generation process is modeled by a mixture of
multivariate normal distributions.
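A minimal sketch of the log-likelihood of one frame under a diagonal-covariance GMM (the two-component parameters below are illustrative assumptions; real systems use parameters trained on data, in higher dimensions):

    import numpy as np

    def gmm_log_likelihood(x, weights, means, variances):
        """log sum_k w_k N(x; mu_k, sigma_k^2), with diagonal covariances."""
        # Per-component log density of a multivariate normal.
        log_dens = -0.5 * np.sum(
            np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)
        # Log-sum-exp over components for numerical stability.
        log_terms = np.log(weights) + log_dens
        m = np.max(log_terms)
        return m + np.log(np.sum(np.exp(log_terms - m)))

    weights = np.array([0.6, 0.4])                  # mixture weights
    means = np.array([[0.0, 0.0], [3.0, 3.0]])      # component means
    variances = np.array([[1.0, 1.0], [1.0, 1.0]])  # diagonal variances
    print(gmm_log_likelihood(np.array([0.5, -0.2]), weights, means, variances))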
HMM Hidden Markov Models (or
Probabilistic functions of Markov chains). The sequence of speech
spectra (or frames) is modeled by a two-level
stochastic process. The first process is an unobservable (hidden)
Markov chain while the second process (often modeled with a GMM) is observable and depends on the Markov
chain state.
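The likelihood of a frame sequence given an HMM can be computed with the standard forward algorithm; a minimal sketch (the 3-state left-to-right model and the per-state observation likelihoods below are illustrative assumptions; in practice a GMM or DNN supplies the observation terms):

    import numpy as np

    def forward_likelihood(init, trans, obs_lik):
        """P(observations | HMM): init (S,), trans (S,S), obs_lik (T,S)."""
        alpha = init * obs_lik[0]                 # initialize with the first frame
        for t in range(1, len(obs_lik)):
            alpha = (alpha @ trans) * obs_lik[t]  # propagate through the hidden chain
        return alpha.sum()

    init = np.array([1.0, 0.0, 0.0])              # start in the first state
    trans = np.array([[0.6, 0.4, 0.0],            # left-to-right phone topology
                      [0.0, 0.7, 0.3],
                      [0.0, 0.0, 1.0]])
    obs_lik = np.array([[0.9, 0.1, 0.1],          # likelihood of each frame per state
                        [0.2, 0.8, 0.1],
                        [0.1, 0.2, 0.9]])
    print(forward_likelihood(init, trans, obs_lik))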
HMM state Usually a GMM. An HMM contains one or more
states, typically 3 states for a phone model.
IIR filter An Infinite Impulse Response (IIR) filter
produces an output that is the weighted sum of the current and past
inputs, and past outputs.
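For example, the first-order recursion y[n] = 0.1 x[n] + 0.9 y[n-1] (a leaky integrator; the coefficients are an illustrative assumption) can be run with scipy.signal.lfilter:

    import numpy as np
    from scipy.signal import lfilter

    b = [0.1]        # feedforward coefficients (current input)
    a = [1.0, -0.9]  # feedback coefficients (past outputs)
    x = np.ones(10)  # step input
    y = lfilter(b, a, x)
    print(y)         # rises gradually toward 1.0: the response never fully dies out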
Language model A language model captures
the regularities in the spoken language and is used by the speech
recognizer to estimate the probability of word sequences. One of the
most popular methods is the so-called n-gram
model, which attempts to capture the syntactic and semantic
constraints of the language by estimating the frequencies of sequences
of n words (see also NN-LM and LLM).
A language model can be used to predict the next word given an input text.
A large language model (LLM) is an NN-LM
trained using a very large amount of text data (terabytes). Such a multi-domain language model
is often adapted (fine-tuned) to get the best accuracy on a specific domain. LLMs
can be used to accomplish many language processing tasks, such as sentence probability estimation,
text completion, question answering, summarization, and sentiment analysis.
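A minimal sketch of chain-rule sentence scoring with maximum-likelihood bigram estimates (the toy corpus is an illustrative assumption; real systems smooth the estimates, see Backoff):

    import math
    from collections import Counter

    corpus = "<s> the cat sat </s> <s> the cat ran </s>".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def log_prob_sentence(words):
        """log P(w1..wn) = sum_i log P(w_i | w_{i-1}) under a bigram model."""
        return sum(math.log(bigrams[(w1, w2)] / unigrams[w1])
                   for w1, w2 in zip(words, words[1:]))

    print(log_prob_sentence("<s> the cat sat </s>".split()))  # log(1*1*0.5*1)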
Lattice A word lattice is a weighted acyclic graph where
word labels are assigned either to the graph's edges (or links) or to
its vertices (or nodes). Acoustic and language model weights are associated
with each edge, and a time position is associated with each vertex.
Lexicon or pronunciation dictionary A list of words with pronunciations. For a speech
recognizer it includes all words known by the system, where each word has one or more pronunciations with associated
probabilities.
LVCSR Large Vocabulary Continuous Speech Recognition (large vocabulary means 20k words or more).
The size of the recognition vocabulary affects the processing requirements.
MAP estimation (Maximum A Posteriori) A training
procedure that attempts to maximize the posterior probability
Pr(M|X,W) of the model parameters, which are therefore treated as
random variables (X is the speech signal, W is the word transcription, and M
represents the model parameters).
MAP decoding A decoding procedure (speech recognition)
which attempts to maximize the posterior probability Pr(W|X,M) of the
word transcription given the speech signal X and the model M.
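Using Bayes' rule (a standard rewriting in the notation above), the decoder searches for
W* = argmax_W Pr(W|X,M) = argmax_W f(X|W,M) Pr(W|M),
since Pr(X|M) does not depend on W; f(X|W,M) is provided by the acoustic model and Pr(W|M) by the language model.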
MFCC Mel Frequency Cepstrum Coefficients. The Mel scale
approximates the sensitivity of the human ear. Note that there are many
other frequency scales "approximating" the human ear (e.g. the Bark
scale).
MF-PLP PLP coefficients obtained from a Mel frequency power spectrum
(see also MFCC and PLP).
MLE (Maximum Likelihood Estimation) A training procedure
(the estimation of the model parameters) that attempts to maximize the
training data likelihood given the model f(X|W,M) (X is the speech
signal, W is the word transcription, and M is the model).
MLP A Multi-Layer Perceptron is a class of artificial neural network. It
is a feedforward network mapping some input data to some desired output representation. It is composed
of three or more layers with nonlinear activation functions (usually sigmoids).
MMIE (Maximum Mutual Information Estimation) A
discriminative training procedure that attempts to maximize the
posterior probability of the word transcription Pr(W|X,M) (X is the
speech signal, W is the word transcription, and M is the model). This
training procedure is also called Conditional Maximum Likelihood
Estimation.
N-Gram Probabilistic language model based on an (N-1)-order Markov chain.
N-best Top N hypotheses.
A Neural network language model (NN-LM) is a neural-network-based language model
(also called continuous space language model) where each word is represented by a real-valued vector (word embedding).
It is common to use recurrent networks and transformer-based
networks for language modeling.
Normalized cross entropy (NCE) The
normalized cross entropy is a metric used to evaluate confidence scores.
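In the commonly used (NIST) formulation, given here for reference: with n scored words of which n_c are correct, p(w) the confidence of word w, and p_c = n_c/n,
NCE = (H_max + sum over correct words of log2 p(w) + sum over incorrect words of log2(1 - p(w))) / H_max,
where H_max = -n_c log2(p_c) - (n - n_c) log2(1 - p_c). A perfect confidence estimator approaches 1, while a system that always outputs p_c scores 0.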
OOV word Out Of Vocabulary word -- Each
OOV word causes more than one recognition error (usually between 1.5
and 2 errors). An obvious way to reduce the error rate due to OOVs is
to increase the size of the vocabulary.
%OOV Out Of Vocabulary word rate.
Percent of correct words The percentage of reference words that are correctly recognized. This measure
can be used to evaluate speech recognizers whenever
insertion errors can be ignored. It is defined as %WAcc + %Ins, where %Ins is 100 times the number of
inserted words divided by the number of reference words.
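With Sub substitutions, Del deletions, and Ins insertions against N reference words (standard definitions, see Word Error Rate below), %Corr = 100 * (N - Sub - Del) / N; since %WER = 100 * (Sub + Del + Ins) / N and %WAcc = 100 - %WER, it follows that %Corr = %WAcc + %Ins.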
Perplexity The relevance of a language model is often measured in terms of test set
perplexity, defined as pow(Prob(text|language-model), -1/n), where n
is the number of words in the test text. The test perplexity depends
on both the language being modeled and the model. It gives a combined
estimate of how good the model is and how complex the language is.
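A minimal sketch of the computation, assuming the model supplies per-word probabilities (the values below are illustrative stand-ins for model outputs):

    import math

    word_probs = [0.1, 0.25, 0.05, 0.2]              # P(w_i | history) per test word
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log Prob(text | model)
    perplexity = math.exp(-log_prob / n)             # pow(Prob, -1/n)
    print(perplexity)                                # ~8: average branching factor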
Phone Symbol used to represent the
pronunciations in the lexicon for a speech recognizer or a speech
synthesis system. The number of phones can be somewhat smaller or
larger than the number of phonemes in the language. The phone set is
chosen to optimize the system accuracy.
Phoneme An abstract representation
of the smallest phonetic unit in a language which conveys a
distinction in meaning. For example the sounds /d/ and /t/ are
separate phonemes in English because they distinguish words such as
do and to. To illustrate phoneme differences across
languages, the two /u/-like vowels in the French words tu and
tout are not distinct phonemes in English, whereas the two
/i/-like vowels in the English words seat and sit are
not distinct phonemes in French.
Pitch or F0 The pitch is the fundamental
frequency of a (periodic or nearly periodic) speech signal. In
practice, the pitch period can be obtained from the position of the
maximum of the autocorrelation function of the signal. See also
degree of voicing, periodicity and harmonicity.
(In psychoacoustics the pitch is a subjective auditory attribute).
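A minimal autocorrelation-based sketch on a synthetic tone (the 200 Hz signal and the 80-400 Hz search range are illustrative assumptions):

    import numpy as np

    sr = 16000                                  # sampling rate (Hz)
    t = np.arange(int(0.03 * sr)) / sr          # one 30 ms analysis window
    x = np.sin(2 * np.pi * 200 * t)             # periodic signal, F0 = 200 Hz

    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo, hi = sr // 400, sr // 80                # search lags for 80-400 Hz
    lag = lo + np.argmax(ac[lo:hi])             # position of the autocorrelation peak
    print(sr / lag)                             # estimated F0, close to 200 Hz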
PLP analysis Perceptual Linear
Prediction features are derived as follows: Compute the perceptual
power spectral density (Bark scale); perform equal loudness
preemphasis and take the cubic root of the intensity
(intensity-loudness power law); apply the IDFT to get the equivalent
of the autocorrelation function; fit a linear prediction (LP) model
and transform the result into cepstral coefficients (LPCC analysis).
Quinphone (or pentaphone) Phone in
context where the context usually includes the 2 left phones and the 2 right
phones.
Recording channel Means by which the
audio signal is recorded (direct microphone, telephone, radio, etc.)
RNN A Recurrent Neural Network is a type of neural network
with loops. RNNs are commonly used for language modeling. RNN layers
can also be included in a larger DNN for acoustic modeling.
RNN-T A recurrent neural network transducer converts an input sequence
into an output sequence. An RNN-T can be used to directly transcribe the speech input (acoustic frames)
using a single end-to-end model instead of a hybrid architecture with multiple components.
Sampling Rate Number of samples
per second used to code the speech signal (usually 16000, i.e. 16 kHz
for a bandwidth of 8 kHz). Telephone speech is sampled at 8 kHz. 16
kHz is generally regarded as sufficient for speech recognition and
synthesis. The audio standards use sample rates of 44.1 kHz (Compact
Disc) and 48 kHz (Digital Audio Tape). Note that signals must be
filtered prior to sampling, and the maximum frequency that can be
represented is half the sampling frequency. In practice a higher
sample rate is used to allow for non-ideal filters.
Sampling Resolution Number
of bits used to code each signal sample. Speech is normally stored in
16 bits. Telephony quality speech is sampled at 8 kHz with a 12 bit
dynamic range (stored in 8 bits with a non-linear function, i.e. A-law
or U-law). The dynamic range of the ear is about 20 bits.
Speaker diarization
Speaker diarization, also called speaker segmentation and clustering,
is the process of partitioning an input audio stream into homogeneous
segments according to speaker identity. Speaker partitioning is a
useful preprocessing step for an automatic speech transcription
system. By clustering segments from the same speaker, the amount of
data available for unsupervised speaker adaptation is increased, which
can significantly improve the transcription performance. One of the
major issues is that the number of speakers is unknown a priori and
needs to be automatically determined.
Spectrogram A spectrogram is
a plot of the short-term power of the signal in different frequency
bands as a function of time.
Speech Activity Detection (SAD)
Speech activity detection, also known as voice activity detection (VAD), is the process
of detecting the presence of speech in an audio signal.
Speech Analysis
Feature vector extraction from a windowed signal (20-30ms). It is
assumed that speech is short-time stationary and that a feature
vector representation captures the information needed (depending on
the task) for further processing. The most popular features are
cepstral coefficients obtained with a Mel Frequency Cepstral (MFC) analysis or with a Perceptual Linear Prediction
(PLP) analysis.
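A minimal sketch of the windowing step (25 ms frames every 10 ms are typical values; the random signal is a placeholder):

    import numpy as np

    sr = 16000
    x = np.random.randn(sr)                     # 1 s of signal (placeholder)
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    window = np.hamming(frame_len)

    frames = [window * x[s:s + frame_len]
              for s in range(0, len(x) - frame_len + 1, hop)]
    # Each windowed frame is then mapped to a feature vector, e.g. a power
    # spectrum followed by Mel filtering and a DCT for MFCC features.
    power_spectra = [np.abs(np.fft.rfft(f)) ** 2 for f in frames]
    print(len(frames))                          # ~100 frames per second (see Frame Rate)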
Speech-Text Alignment
Process of synchronizing a speech signal with a speech transcript or closely
related text, providing time codes for words and sentences.
Speech-to-Text Conversion
A synonym of Automatic Speech Recognition.
A Time Delay Neural Network (TDNN) is a DNN where the temporal context is modeled
in most layers of the network. Such a DNN architecture is very effective for estimating the HMM
state likelihoods in hybrid speech recognizers.
A transformer is a DNN based on a multi-head attention mechanism with no
recurrent units. This architecture is often used for NN-LMs.
Triphone (or Phone in context) A
context-dependent HMM phone model (the context usually includes the
left and right phones).
Voicing The degree of voicing is a
measure of the degree to which a signal is periodic (also called
periodicity, harmonicity or HNR). In practice, the degree of
periodicity can be obtained from the relative height of the maximum of
the autocorrelation function of the signal.
Word Accuracy The word accuracy (WAcc)
is a metric used to evaluate speech recognizers. The percent word
accuracy is defined as %WAcc = 100 - %WER. It should be noted that
the word accuracy can be negative. The Word Error Rate (WER,
see below) is a more commonly used metric and should be preferred to
the word accuracy.
Word Error Rate The word error rate (WER) is the
commonly used metric to evaluate speech recognizers. It is a
measure of the average number of word errors taking into account three
error types: substitution (the reference word is replaced by
another word), insertion (a word is hypothesized that was not in
the reference) and deletion (a word in the reference
transcription is missed). The word error rate is defined as the sum of
these errors divided by the number of reference words. Given this
definition the percent word error can be more than 100%. The WER is
roughly proportional to the correction cost.
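A minimal sketch of the standard computation via minimum edit distance (dynamic programming; the example strings are illustrative):

    def wer(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j]: minimum edit cost to turn ref[:i] into hyp[:j].
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                          # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                          # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 = 0.33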