Like any other pattern recognition technology, speech recognition cannot be error free. Transcript accuracy depends heavily on the speaker, the style of speech, and the environmental conditions. Speech recognition is harder than people commonly think, even for a human being. Humans are used to understanding speech, not to transcribing it, and only well-formulated speech can be transcribed without ambiguity.
From the user's point of view, a voice-to-text system can be categorized based on its use: command and control, dialog system, text dictation, audio document transcription, etc. Each use has specific requirements in terms of latency, memory constraints, vocabulary size, and adaptive features.
The complete voice-to-text conversion process is done in three steps. The software first identifies the audio segments containing speech, then recognizes the language being spoken if it is not known a priori, and finally converts the speech segments to text with time codes. VoxSigma includes adaptive features allowing the transcription of noisy speech, such as speech with background music. The result is a fully annotated XML document including speech and non-speech segments, speaker labels, words with time codes, high-quality confidence scores, and punctuation. This XML file can be indexed directly by a search engine, or alternatively converted into plain text.
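As an illustration of the last point, here is a minimal Python sketch that converts an annotated XML transcript into plain text, one line per speech segment. The element and attribute names used here (`AudioDoc`, `SpeechSegment`, `Word`, `spkid`, `conf`) and the sample document are assumptions made for the example, not the documented VoxSigma schema, so check the actual output format before adapting it. The `xml_to_text` helper and its `min_conf` parameter are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample transcript. Element and attribute names are assumed
# for illustration only and may differ from the real output format.
SAMPLE = """<AudioDoc>
  <SegmentList>
    <SpeechSegment spkid="S1" stime="0.00" etime="1.50">
      <Word stime="0.10" dur="0.30" conf="0.98">hello</Word>
      <Word stime="0.45" dur="0.40" conf="0.91">world</Word>
    </SpeechSegment>
    <SpeechSegment spkid="S2" stime="1.50" etime="2.40">
      <Word stime="1.60" dur="0.35" conf="0.42">goodbye</Word>
    </SpeechSegment>
  </SegmentList>
</AudioDoc>"""


def xml_to_text(xml_string, min_conf=0.0):
    """Return one 'speaker: words' line per speech segment.

    Words whose confidence score is below min_conf are dropped, and
    segments left with no words are skipped.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for segment in root.iter("SpeechSegment"):
        words = [
            word.text.strip()
            for word in segment.iter("Word")
            if word.text and float(word.get("conf", "1")) >= min_conf
        ]
        if words:
            speaker = segment.get("spkid", "?")
            lines.append(f"{speaker}: {' '.join(words)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(xml_to_text(SAMPLE))
    print(xml_to_text(SAMPLE, min_conf=0.5))
```

Filtering on the confidence score is one possible use of the per-word annotations. A search index might instead keep every word and store the score alongside it.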
Vocapia Research also offers services to adapt, tune, or create specific models or systems tailored to your needs. Tailoring models to your application is the best way to get the best possible results. High accuracy is essential to maximize your ROI because, to a first approximation, the cost of using a voice-to-text system is proportional to the system's error rate. Therefore, using a system with 80% accuracy (i.e. 20% error) may cost almost twice as much as using a system with 90% accuracy (i.e. 10% error). The same holds for systems with 90% and 95% accuracy: although the difference in error rate is only 5 percentage points, the first system makes twice as many errors as the second.
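The arithmetic behind this comparison can be sketched in a few lines of Python. The sketch takes the first-order approximation stated above at face value (cost proportional to error rate), and the function names `error_rate_pct` and `relative_cost` are illustrative only.

```python
def error_rate_pct(accuracy_pct):
    """Error rate in percent for a given accuracy in percent."""
    if not 0 <= accuracy_pct <= 100:
        raise ValueError("accuracy must be between 0 and 100")
    return 100 - accuracy_pct


def relative_cost(accuracy_a_pct, accuracy_b_pct):
    """Cost of system A relative to system B.

    Assumes cost is proportional to error rate, so the ratio of costs
    equals the ratio of error rates.
    """
    error_b = error_rate_pct(accuracy_b_pct)
    if error_b == 0:
        raise ValueError("system B has no errors, so the ratio is undefined")
    return error_rate_pct(accuracy_a_pct) / error_b


if __name__ == "__main__":
    print(relative_cost(80, 90))  # 20% error vs 10% error
    print(relative_cost(90, 95))  # 10% error vs 5% error
```

Both calls give a ratio of 2, which is why a 5-point gain from 90% to 95% matters as much, in relative terms, as a 10-point gain from 80% to 90%.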