Voice to Text transcription

Voice-to-text conversion is the process of converting spoken words into written text. This process is also often called speech recognition. Although the two terms are almost synonymous, speech recognition is sometimes used to describe the wider process of extracting meaning from speech, i.e. speech understanding.

Some voice-to-text systems rely on speech generation models, usually an acoustic model, a language model, and a pronunciation model. Some recent systems use a single end-to-end model that integrates all of the knowledge sources extracted from large amounts of speech data. This second type of system is simpler to build, but cannot be easily adapted to specific conditions. It is important to understand that there is no such thing as a universal speech recognizer. To get the best transcription quality, all the models can be specialized for a given language, dialect, application domain, type of speech, and communication channel.

Like any other pattern recognition technology, speech recognition cannot be error-free. Transcript accuracy is highly dependent on the speaker, the style of speech and the environmental conditions. Speech recognition is harder than people commonly think, even for a human being. Humans are used to understanding speech, not to transcribing it, and only speech that is well formulated can be transcribed without ambiguity.
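The text above does not say how transcript accuracy is measured. A common choice, assumed here for illustration, is the word error rate: the number of word substitutions, deletions and insertions needed to turn the system output into a reference transcript, divided by the number of words in the reference. A minimal sketch, with a function name and example sentences invented for this purpose:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Assumption: plain whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

Accuracy is then often reported as one minus this value, although insertions can push the error rate above 100%.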

From the user's point of view, a voice-to-text system can be categorized based on its use: command and control, dialog system, text dictation, audio document transcription, etc. Each use has specific requirements in terms of latency, memory constraints, vocabulary size, and adaptive features.

The VoxSigma software suite offers large vocabulary multilingual voice-to-text capabilities with state-of-the-art accuracy. It has been specifically designed for professional users who need to transcribe large quantities of audio and video documents, such as broadcast data, either in batch mode or in real time. It can also be used to analyze call-center data.

The complete voice-to-text conversion process is done in three steps. The software first identifies the audio segments containing speech, then it recognizes the language being spoken if it is not known a priori, and finally it converts the speech segments to text with time codes. VoxSigma includes adaptive features allowing the transcription of noisy speech, such as speech with background music. The result is a fully annotated XML document including speech and non-speech segments, speaker labels, words with time codes, high quality confidence scores, and punctuation. This XML file can be directly indexed by a search engine, or alternatively can be converted into plain text.
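As a rough illustration of converting such an annotated document into plain text, the sketch below reads a small transcript and keeps only words above a confidence threshold. The element and attribute names (Transcript, SpeechSegment, Word, spkid, stime, conf) are invented for this example and are not taken from the actual VoxSigma output format; the product documentation describes the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcript: tag and attribute names are assumptions made
# for illustration only, not the actual VoxSigma XML schema.
SAMPLE = """\
<Transcript>
  <SpeechSegment spkid="S1" stime="0.00" etime="1.90">
    <Word stime="0.10" dur="0.30" conf="0.98">good</Word>
    <Word stime="0.45" dur="0.40" conf="0.95">morning</Word>
    <Word stime="0.90" dur="0.20" conf="0.41">uh</Word>
  </SpeechSegment>
  <SpeechSegment spkid="S2" stime="2.10" etime="3.00">
    <Word stime="2.15" dur="0.35" conf="0.92">hello</Word>
  </SpeechSegment>
</Transcript>
"""


def to_plain_text(xml_string, min_conf=0.0):
    """Return one line per speech segment, prefixed by its speaker label."""
    root = ET.fromstring(xml_string)
    lines = []
    for segment in root.iter("SpeechSegment"):
        # Keep only words whose confidence score reaches the threshold.
        words = [
            (word.text or "").strip()
            for word in segment.iter("Word")
            if float(word.get("conf", "1.0")) >= min_conf
        ]
        if words:
            lines.append(f"{segment.get('spkid', '?')}: {' '.join(words)}")
    return "\n".join(lines)


print(to_plain_text(SAMPLE, min_conf=0.5))
```

Filtering on the confidence score is one possible design choice; a search index might instead keep every word and store the score alongside its time code.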

Vocapia Research also offers services to adapt, tune or create specific models or systems tailored to your needs. Tailoring models to your application is the best way to get the best possible results. High accuracy is essential to maximize your ROI because, to a first approximation, the cost of using a voice-to-text system is proportional to the system's error rate. Therefore using a system with 80% accuracy (i.e. 20% error) may cost almost twice as much as using a system with 90% accuracy (i.e. 10% error). The same holds for systems with 90% and 95% accuracy: although the difference in error rate is only 5 percentage points, the first system makes twice as many errors as the second.
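The arithmetic behind this comparison can be checked with a short sketch. It assumes, as the text does to a first approximation, that cost is strictly proportional to error rate; the function name is hypothetical.

```python
def relative_cost(accuracy_a, accuracy_b):
    """Cost of system A relative to system B, assuming cost ~ error rate."""
    error_a = 1.0 - accuracy_a
    error_b = 1.0 - accuracy_b
    if error_b <= 0:
        raise ValueError("system B must have a non-zero error rate")
    return error_a / error_b


# 80% vs 90% accuracy: 20% error against 10% error.
print(round(relative_cost(0.80, 0.90), 2))
# 90% vs 95% accuracy: 10% error against 5% error.
print(round(relative_cost(0.90, 0.95), 2))
```

Under this assumption both comparisons give a ratio of about 2, even though the accuracy gap is 10 points in the first case and only 5 points in the second.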
