Speech-to-text conversion is the process of converting spoken words
into written text. This process is also often called speech
recognition. Although these terms are almost synonymous, speech
recognition is sometimes used to describe the wider process of
extracting meaning from speech, i.e. speech understanding. The term
voice recognition should be avoided, as it is often associated with the
process of identifying a person from their voice, i.e. speaker
recognition.
Some speech-to-text systems rely on speech generation models, usually
an acoustic model, a language model, and a pronunciation model. Some
recent systems use a single end-to-end model that integrates all of the
knowledge sources extracted from large amounts of speech data. This
second type of system is simpler to build, but cannot be easily adapted
to specific conditions. It is important to understand that there is no
such thing as a universal speech recognizer. To get the best
transcription quality, all the models can be specialized for a
given language, dialect, application domain, type of speech, and
communication channel.
Like any other pattern recognition technology, speech recognition
cannot be error-free. Transcript accuracy is highly dependent on the
speaker, the style of speech, and the environmental conditions. Speech
recognition is harder than people commonly think, even for a human
being. Humans are used to understanding speech, not to transcribing it,
and only speech that is well formulated can be transcribed without
ambiguity.
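Transcript accuracy is commonly quantified with the word error rate: the number of word substitutions, deletions, and insertions needed to turn the system output into a reference transcript, divided by the number of reference words. The following is a minimal sketch of that computation, assuming whitespace tokenization and case-sensitive matching; the function name is illustrative and is not part of any product API.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Real evaluations usually normalize case, punctuation, and number formats before scoring, which this sketch deliberately leaves out.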
From the user's point of view, a speech-to-text system can be
categorized based on its use: command and control, dialog system, text
dictation, audio document transcription, etc. Each use has specific
requirements in terms of latency, memory constraints, vocabulary size,
and adaptive features.
The VoxSigma software suite offers large-vocabulary multilingual
speech-to-text capabilities with state-of-the-art accuracy. It has been
specifically designed for professional users who need to transcribe
large quantities of audio and video documents, such as broadcast data,
either in batch mode or in real time. It can also be used to analyze
call-center data.
The complete speech-to-text conversion process is done in three steps.
The software first identifies the audio segments containing speech,
then recognizes the language being spoken if it is not known a priori,
and finally converts the speech segments to text with time codes.
VoxSigma includes adaptive features allowing the transcription of noisy
speech, such as speech with background music. The result is a fully
annotated XML document including speech and non-speech segments,
speaker labels, words with time codes, high-quality confidence scores,
and punctuation. This XML file can be directly indexed by a search
engine, or alternatively can be converted into plain text.
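As an illustration of the last point, the sketch below converts an annotated XML transcript into plain text, one line per speech segment, optionally dropping low-confidence words. The element and attribute names (AudioDoc, SpeechSegment, Word, spkid, stime, dur, conf) and the sample document are assumptions made for this example; check them against the actual output format before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real schema may use different names.
SAMPLE_XML = """<AudioDoc>
  <SpeechSegment spkid="spk1" stime="0.00" etime="1.10">
    <Word stime="0.00" dur="0.40" conf="0.98">hello</Word>
    <Word stime="0.50" dur="0.60" conf="0.55">world</Word>
  </SpeechSegment>
  <SpeechSegment spkid="spk2" stime="1.50" etime="2.40">
    <Word stime="1.50" dur="0.40" conf="0.95">good</Word>
    <Word stime="1.95" dur="0.45" conf="0.97">morning</Word>
  </SpeechSegment>
</AudioDoc>"""


def transcript_to_text(xml_string: str, min_conf: float = 0.0) -> str:
    """Return one 'speaker: words' line per segment with surviving words."""
    root = ET.fromstring(xml_string)
    lines = []
    for segment in root.iter("SpeechSegment"):
        words = [
            word.text.strip()
            for word in segment.iter("Word")
            # Keep only non-empty words at or above the confidence threshold.
            if word.text and float(word.get("conf", "1.0")) >= min_conf
        ]
        if words:
            speaker = segment.get("spkid", "unknown")
            lines.append(f"{speaker}: {' '.join(words)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(transcript_to_text(SAMPLE_XML))
    print(transcript_to_text(SAMPLE_XML, min_conf=0.9))
```

Filtering on the confidence score is one possible design choice: it trades completeness of the text for fewer misrecognized words, which can be useful when the output feeds a search index.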
Vocapia Research also offers services to adapt, tune, or create
specific models or systems tailored to your needs. Tailoring models to
your application is the best way to ensure you get the best possible
results. High accuracy is essential to maximize your ROI because, to a
first approximation, the cost of using a speech-to-text system is
proportional to the system's error rate.
Therefore, using a system with 80% accuracy (i.e. 20% error) may cost
almost twice as much as using a system with 90% accuracy (i.e. 10%
error). This is also the case for systems with 90% and 95% accuracy:
although the difference in error rate is only 5 percentage points, the
first system makes twice as many errors as the second.
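The arithmetic behind this comparison can be sketched in a few lines. The function below assumes, as the text does to a first approximation, that cost is strictly proportional to the error rate; the function name and the use of percentages as inputs are choices made for this example.

```python
def relative_cost(accuracy_a_pct: float, accuracy_b_pct: float) -> float:
    """Cost of using system A relative to system B.

    Assumes cost is proportional to error rate, where
    error rate = 100 - accuracy (both in percent).
    """
    error_a = 100 - accuracy_a_pct
    error_b = 100 - accuracy_b_pct
    if error_b <= 0:
        raise ValueError("system B must have a non-zero error rate")
    return error_a / error_b


if __name__ == "__main__":
    print(relative_cost(80, 90))  # 20% vs 10% error -> 2.0
    print(relative_cost(90, 95))  # 10% vs 5% error  -> 2.0
```

Both pairs give the same ratio because what matters under this assumption is the ratio of the error rates, not the difference between the accuracies.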