Acoustic model A model describing the
probabilistic behavior of the encoding of the linguistic information
in a speech signal. LVCSR systems use acoustic
units corresponding to phones or phones in context. The predominant
approach uses continuous density hidden Markov models (HMMs) to represent context-dependent phones.
Acoustic parametrization (or acoustic front-end)
see Speech Analysis.
Allophone A pronunciation
variant of a phoneme in a particular context, such as the realization of the phoneme /t/ in
type (aspirated /t/), butter (flapped /t/), or hot (final unreleased /t/).
Triphones and quinphones are two common models of
allophones used by speech recognizers.
ASR Accuracy The speech recognition accuracy is
defined as 1 - WER, see Word Accuracy.
Automatic Language Recognition Process
by which a computer identifies the language being spoken in a speech signal.
Automatic Speaker Recognition Process
by which a computer identifies the speaker from a speech signal.
Automatic Speech Recognition (ASR) Process
by which a computer converts a speech signal into a sequence of words. Also called Speech-to-Text Conversion.
Backoff Mechanism for smoothing the estimates of the probabilities of rare
events by relying on less specific models (acoustic or language models).
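As an illustration, here is a minimal sketch of the simple "stupid backoff" variant for a bigram language model (the toy corpus and the backoff factor alpha = 0.4 are assumptions made for this example; Katz backoff additionally renormalizes so the scores are true probabilities):

    from collections import Counter

    def backoff_score(w_prev, w, unigrams, bigrams, total, alpha=0.4):
        """Score P(w | w_prev), backing off to the unigram when the bigram is unseen."""
        if bigrams[(w_prev, w)] > 0:
            return bigrams[(w_prev, w)] / unigrams[w_prev]
        # Rare/unseen event: fall back to the less specific (unigram) model.
        return alpha * unigrams[w] / total

    words = "the cat sat on the mat".split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = sum(unigrams.values())
    print(backoff_score("the", "cat", unigrams, bigrams, total))  # seen bigram: 0.5
    print(backoff_score("cat", "the", unigrams, bigrams, total))  # unseen: backs off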
Bottleneck MLP A bottleneck MLP is
an MLP with one small layer (usually with fewer than 100
nodes). It is used in speech recognizers to extract the so-called "bottleneck
features" from raw spectral features.
CDHMM Continuous Density HMM (usually based on Gaussian mixtures).
Confidence score Posterior
probability associated with a hypothesis (e.g. a recognized word, an
identified speaker, ...). For a speech recognizer, the sum of the word
confidence scores is an estimate of the number of
correct words. Confidence scores are commonly evaluated by computing
the NCE metric.
DNN A Deep Neural Network is an MLP with
many layers. DNNs are used in hybrid speech recognizers to estimate
the HMM state posterior probabilities (see also TDNN,
RNN-T, and transformer).
Filler word Words like uhm, euh, ...
FIR filter A Finite Impulse Response (FIR) filter produces
an output that is the weighted sum of the current and past inputs.
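For example, an FIR filter computes y[n] = b[0]x[n] + b[1]x[n-1] + ... + b[M]x[n-M], which is a convolution of the input with the coefficients; a minimal sketch (the moving-average coefficients and the toy input are illustrative assumptions):

    import numpy as np

    # FIR filter: output depends on current and past inputs only.
    # y[n] = b[0]*x[n] + b[1]*x[n-1] + ... + b[M]*x[n-M]
    b = np.array([0.25, 0.25, 0.25, 0.25])   # 4-tap moving average (illustrative)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy input signal
    y = np.convolve(x, b)[:len(x)]           # convolution, truncated to input length
    print(y)                                 # [0.25 0.75 1.5  2.5  3.5 ]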
Frame An acoustic feature vector
(usually MFCC) estimated on a 20-30ms signal
window (see also Speech Analysis).
Frame Rate Number of frames per second (typically 100).
GMM Gaussian Mixture Model (i.e. a
1-state CDHMM). The speech spectrum (or frame) generation process is modeled by a mixture of
multivariate normal distributions.
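A minimal sketch of the log-likelihood of one frame under a diagonal-covariance GMM (the two-component parameters below are illustrative assumptions; real systems use parameters trained on data, in higher dimensions):

    import numpy as np

    def gmm_log_likelihood(x, weights, means, variances):
        """log sum_k w_k N(x; mu_k, sigma_k^2), with diagonal covariances."""
        # Per-component log density of a multivariate normal.
        log_dens = -0.5 * np.sum(
            np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)
        # Log-sum-exp over components for numerical stability.
        log_terms = np.log(weights) + log_dens
        m = np.max(log_terms)
        return m + np.log(np.sum(np.exp(log_terms - m)))

    weights = np.array([0.6, 0.4])                  # mixture weights
    means = np.array([[0.0, 0.0], [3.0, 3.0]])      # component means
    variances = np.array([[1.0, 1.0], [1.0, 1.0]])  # diagonal variances
    print(gmm_log_likelihood(np.array([0.5, -0.2]), weights, means, variances))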
HMM Hidden Markov Models (or
Probabilistic functions of Markov chains). The sequence of speech
spectra (or frames) is modeled by a two-level
stochastic process. The first process is an unobservable (hidden)
Markov chain while the second process (often modeled with a GMM) is observable and depends on the Markov
chain state.
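The likelihood of a frame sequence given an HMM can be computed with the standard forward algorithm; a minimal sketch (the 3-state left-to-right model and the per-state observation likelihoods below are illustrative assumptions; in practice a GMM or DNN supplies the observation terms):

    import numpy as np

    def forward_likelihood(init, trans, obs_lik):
        """P(observations | HMM): init (S,), trans (S,S), obs_lik (T,S)."""
        alpha = init * obs_lik[0]                 # initialize with the first frame
        for t in range(1, len(obs_lik)):
            alpha = (alpha @ trans) * obs_lik[t]  # propagate through the hidden chain
        return alpha.sum()

    init = np.array([1.0, 0.0, 0.0])              # start in the first state
    trans = np.array([[0.6, 0.4, 0.0],            # left-to-right phone topology
                      [0.0, 0.7, 0.3],
                      [0.0, 0.0, 1.0]])
    obs_lik = np.array([[0.9, 0.1, 0.1],          # likelihood of each frame per state
                        [0.2, 0.8, 0.1],
                        [0.1, 0.2, 0.9]])
    print(forward_likelihood(init, trans, obs_lik))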
HMM state Usually a GMM. An HMM contains one or more
states, typically 3 states for a phone model.
IIR filter An Infinite Impulse Response (IIR) filter
produces an output that is the weighted sum of the current and past
inputs, and past outputs.
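For example, the first-order recursion y[n] = 0.1 x[n] + 0.9 y[n-1] (a leaky integrator; the coefficients are an illustrative assumption) can be run with scipy.signal.lfilter:

    import numpy as np
    from scipy.signal import lfilter

    b = [0.1]        # feedforward coefficients (current input)
    a = [1.0, -0.9]  # feedback coefficients (past outputs)
    x = np.ones(10)  # step input
    y = lfilter(b, a, x)
    print(y)         # rises gradually toward 1.0: the response never fully dies out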
Language model A language model captures
the regularities in the spoken language and is used by the speech
recognizer to estimate the probability of word sequences. One of the
most popular methods is the so-called n-gram
model, which attempts to capture the syntactic and semantic
constraints of the language by estimating the frequencies of sequences
of n words (see also NN-LM and LLM).
A language model can be used to predict the next word given an input text.
A large language model (LLM) is an NN-LM
trained using a very large amount of text data (terabytes). Such a multi-domain language model
is often adapted (fine-tuned) to get the best accuracy on a specific domain. LLMs
can be used to accomplish many language processing tasks, such as sentence probability estimation,
text completion, question answering, summarization, and sentiment analysis.
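A minimal sketch of chain-rule sentence scoring with maximum-likelihood bigram estimates (the toy corpus is an illustrative assumption; real systems smooth the estimates, see Backoff):

    import math
    from collections import Counter

    corpus = "<s> the cat sat </s> <s> the cat ran </s>".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def log_prob_sentence(words):
        """log P(w1..wn) = sum_i log P(w_i | w_{i-1}) under a bigram model."""
        return sum(math.log(bigrams[(w1, w2)] / unigrams[w1])
                   for w1, w2 in zip(words, words[1:]))

    print(log_prob_sentence("<s> the cat sat </s>".split()))  # log(1*1*0.5*1)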
Lattice A word lattice is a weighted acyclic graph where
word labels are assigned either to the graph's edges (or links) or to
its vertices (or nodes). Acoustic and language model weights are associated
with each edge, and a time position is associated with each vertex.
Lexicon or pronunciation dictionary A list of words with pronunciations. For a speech
recognizer it includes all words known by the system, where each word has one or more pronunciations with associated
probabilities.
LVCSR Large Vocabulary Continuous Speech Recognition (large vocabulary means 20k words or more).
The size of the recognition vocabulary affects the processing requirements.
MAP estimation (Maximum A Posteriori) A training
procedure that attempts to maximize the posterior probability
Pr(M|X,W) of the model parameters, which are therefore treated as
random variables (X is the speech signal, W is the word transcription, and M
represents the model parameters).
MAP decoding A decoding procedure (speech recognition)
which attempts to maximize the posterior probability Pr(W|X,M) of the
word transcription given the speech signal X and the model M.
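Using Bayes' rule (a standard rewriting in the notation above), the decoder searches for
W* = argmax_W Pr(W|X,M) = argmax_W f(X|W,M) Pr(W|M),
since Pr(X|M) does not depend on W; f(X|W,M) is provided by the acoustic model and Pr(W|M) by the language model.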
MFCC Mel Frequency Cepstrum Coefficients. The Mel scale
approximates the sensitivity of the human ear. Note that there are many
other frequency scales "approximating" the human ear (e.g. the Bark
scale).
MF-PLP PLP coefficients obtained from a Mel frequency power spectrum
(see also MFCC and PLP).
MLE (Maximum Likelihood Estimation) A training procedure
(the estimation of the model parameters) that attempts to maximize the
training data likelihood given the model f(X|W,M) (X is the speech
signal, W is the word transcription, and M is the model).
MLP A Multi-Layer Perceptron is a class of artificial neural network. It
is a feedforward network mapping some input data to some desired output representation. It is composed
of three or more layers with nonlinear activation functions (usually sigmoids).
MMIE (Maximum Mutual Information Estimation) A
discriminative training procedure that attempts to maximize the
posterior probability of the word transcription Pr(W|X,M) (X is the
speech signal, W is the word transcription, and M is the model). This
training procedure is also called Conditional Maximum Likelihood
Estimation.
N-Gram Probabilistic language model based on an (N-1)-order Markov chain.
N-best Top N hypotheses.
A Neural network language model (NN-LM) is a neural-network-based language model
(also called continuous space language model) where each word is represented by a real-valued vector (word embedding).
It is common to use recurrent networks and transformer-based
networks for language modeling.
Normalized cross entropy (NCE) The
normalized cross entropy is a metric used to evaluate confidence scores.
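In the commonly used (NIST) formulation, given here for reference: with n scored words of which n_c are correct, p(w) the confidence of word w, and p_c = n_c/n,
NCE = (H_max + sum over correct words of log2 p(w) + sum over incorrect words of log2(1 - p(w))) / H_max,
where H_max = -n_c log2(p_c) - (n - n_c) log2(1 - p_c). A perfect confidence estimator approaches 1, while a system that always outputs p_c scores 0.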
OOV word Out Of Vocabulary word -- Each
OOV word causes more than one recognition error (usually between 1.5
and 2 errors). An obvious way to reduce the error rate due to OOVs is
to increase the size of the vocabulary.
%OOV Out Of Vocabulary word rate.
Percent of correct words The percentage of reference words that are correctly recognized. This measure
can be used to evaluate speech recognizers whenever
insertion errors can be ignored. It is defined as %WAcc + %Ins, where %Ins is 100 times the number of
inserted words divided by the number of reference words.
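With Sub substitutions, Del deletions, and Ins insertions against N reference words (standard definitions, see Word Error Rate below), %Corr = 100 * (N - Sub - Del) / N; since %WER = 100 * (Sub + Del + Ins) / N and %WAcc = 100 - %WER, it follows that %Corr = %WAcc + %Ins.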
Perplexity The relevance of a language model is often measured in terms of test set
perplexity, defined as pow(Prob(text|language-model), -1/n), where n
is the number of words in the test text. The test perplexity depends
on both the language being modeled and the model. It gives a combined
estimate of how good the model is and how complex the language is.
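A minimal sketch of the computation, assuming the model supplies per-word probabilities (the values below are illustrative stand-ins for model outputs):

    import math

    word_probs = [0.1, 0.25, 0.05, 0.2]              # P(w_i | history) per test word
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log Prob(text | model)
    perplexity = math.exp(-log_prob / n)             # pow(Prob, -1/n)
    print(perplexity)                                # ~8: average branching factor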
Phone Symbol used to represent the
pronunciations in the lexicon for a speech recognizer or a speech
synthesis system. The number of phones can be somewhat smaller or
larger than the number of phonemes in the language. The phone set is
chosen to optimize the system accuracy.
Phoneme An abstract representation
of the smallest phonetic unit in a language which conveys a
distinction in meaning. For example the sounds /d/ and /t/ are
separate phonemes in English because they distinguish words such as
do and to. To illustrate phoneme differences across
languages, the two /u/-like vowels in the French words tu and
tout are not distinct phonemes in English, whereas the two
/i/-like vowels in the English words seat and sit are
not distinct phonemes in French.
Pitch or F0 The pitch is the fundamental
frequency of a (periodic or nearly periodic) speech signal. In
practice, the pitch period can be obtained from the position of the
maximum of the autocorrelation function of the signal. See also
degree of voicing, periodicity and harmonicity.
(In psychoacoustics the pitch is a subjective auditory attribute).
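A minimal autocorrelation-based sketch on a synthetic tone (the 200 Hz signal and the 80-400 Hz search range are illustrative assumptions):

    import numpy as np

    sr = 16000                                  # sampling rate (Hz)
    t = np.arange(int(0.03 * sr)) / sr          # one 30 ms analysis window
    x = np.sin(2 * np.pi * 200 * t)             # periodic signal, F0 = 200 Hz

    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo, hi = sr // 400, sr // 80                # search lags for 80-400 Hz
    lag = lo + np.argmax(ac[lo:hi])             # position of the autocorrelation peak
    print(sr / lag)                             # estimated F0, close to 200 Hz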
PLP analysis Perceptual Linear
Prediction features are derived as follows: Compute the perceptual
power spectral density (Bark scale); perform equal loudness
preemphasis and take the cubic root of the intensity
(intensity-loudness power law); apply the IDFT to get the equivalent
of the autocorrelation function; fit a linear prediction (LP) model
and transform the result into cepstral coefficients (LPCC analysis).
Quinphone (or pentaphone) Phone in
context where the context usually includes the 2 left phones and the 2 right
phones.
Recording channel Means by which the
audio signal is recorded (direct microphone, telephone, radio, etc.)
RNN A Recurrent Neural Network is a type of neural network
with loops. RNNs are commonly used for language modeling. RNN layers
can also be included in a larger DNN for acoustic modeling.
RNN-T A recurrent neural network transducer converts an input sequence
into an output sequence. An RNN-T can be used to directly transcribe the speech input (acoustic frames)
using a single end-to-end model instead of a hybrid architecture with multiple components.
Sampling Rate Number of samples
per second used to code the speech signal (usually 16000, i.e. 16 kHz
for a bandwidth of 8 kHz). Telephone speech is sampled at 8 kHz. 16
kHz is generally regarded as sufficient for speech recognition and
synthesis. The audio standards use sample rates of 44.1 kHz (Compact
Disc) and 48 kHz (Digital Audio Tape). Note that signals must be
filtered prior to sampling, and the maximum frequency that can be
represented is half the sampling frequency. In practice a higher
sample rate is used to allow for non-ideal filters.
Sampling Resolution Number
of bits used to code each signal sample. Speech is normally stored in
16 bits. Telephony quality speech is sampled at 8 kHz with a 12 bit
dynamic range (stored in 8 bits with a non-linear function, i.e. A-law
or U-law). The dynamic range of the ear is about 20 bits.
Speaker diarization
Speaker diarization, also called speaker segmentation and clustering,
is the process of partitioning an input audio stream into homogeneous
segments according to speaker identity. Speaker partitioning is a
useful preprocessing step for an automatic speech transcription
system. By clustering segments from the same speaker, the amount of
data available for unsupervised speaker adaptation is increased, which
can significantly improve the transcription performance. One of the
major issues is that the number of speakers is unknown a priori and
needs to be automatically determined.
Spectrogram A spectrogram is
a plot of the short-term power of the signal in different frequency
bands as a function of time.
Speech Activity Detection (SAD)
Speech activity detection, also known as voice activity detection (VAD), is the process
of detecting the presence of speech in an audio signal.
Speech Analysis
Feature vector extraction from a windowed signal (20-30ms). It is
assumed that speech is short-time stationary and that a feature
vector representation captures the information needed (depending on
the task) for further processing. The most popular features are
cepstral coefficients obtained with a Mel Frequency Cepstral (MFC) analysis or with a Perceptual Linear Prediction
(PLP) analysis.
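A minimal sketch of the windowing step (25 ms frames every 10 ms are typical values; the random signal is a placeholder):

    import numpy as np

    sr = 16000
    x = np.random.randn(sr)                     # 1 s of signal (placeholder)
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    window = np.hamming(frame_len)

    frames = [window * x[s:s + frame_len]
              for s in range(0, len(x) - frame_len + 1, hop)]
    # Each windowed frame is then mapped to a feature vector, e.g. a power
    # spectrum followed by Mel filtering and a DCT for MFCC features.
    power_spectra = [np.abs(np.fft.rfft(f)) ** 2 for f in frames]
    print(len(frames))                          # ~100 frames per second (see Frame Rate)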
Speech-Text Alignment
Process of synchronizing a speech signal with a speech transcript or closely
related text, providing time codes for words and sentences.
Speech-to-Text Conversion
A synonym of Automatic Speech Recognition.
A Time Delay Neural Network (TDNN) is a DNN where the temporal context is modeled
in most layers of the network. Such a DNN architecture is very effective for estimating the HMM
state likelihoods in hybrid speech recognizers.
A transformer is a DNN based on a multi-head attention mechanism with no
recurrent units. This architecture is often used for NN-LMs.
Triphone (or Phone in context) A
context-dependent HMM phone model (the context usually includes the
left and right phones).
Voicing The degree of voicing is a
measure of the degree to which a signal is periodic (also called
periodicity, harmonicity or HNR). In practice, the degree of
periodicity can be obtained from the relative height of the maximum of
the autocorrelation function of the signal.
Word Accuracy The word accuracy (WAcc)
is a metric used to evaluate speech recognizers. The percent word
accuracy is defined as %WAcc = 100 - %WER. It should be noted that
the word accuracy can be negative. The Word Error Rate (WER,
see below) is a more commonly used metric and should be preferred to
the word accuracy.
Word Error Rate The word error rate (WER) is the
commonly used metric to evaluate speech recognizers. It is a
measure of the average number of word errors taking into account three
error types: substitution (the reference word is replaced by
another word), insertion (a word is hypothesized that was not in
the reference) and deletion (a word in the reference
transcription is missed). The word error rate is defined as the sum of
these errors divided by the number of reference words. Given this
definition the percent word error can be more than 100%. The WER is
roughly proportional to the correction cost.
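A minimal sketch of the standard computation via minimum edit distance (dynamic programming; the example strings are illustrative):

    def wer(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j]: minimum edit cost to turn ref[:i] into hyp[:j].
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                          # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                          # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 = 0.33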