Speech-to-Text Technology | VoxSigma Software

We develop speech processing technologies, including multilingual, unlimited-vocabulary speech recognizers for automatic speech transcription, audio indexing, and speech-text alignment.

The VoxSigma software suite provides audio segmentation and partitioning, speaker identification, language recognition, and large vocabulary speech recognition capabilities in many languages.
VoxSigma has been designed for professional users needing to process large quantities of audio and video documents such as broadcast data or call-center communications, either in batch mode or in real-time.

VoxSigma™ Software Suite

The Vocapia Research VoxSigma software suite for Linux offers state-of-the-art performance on broadcast and conversational data in many languages. The VoxSigma API includes Unix/Linux commands, C and C++ libraries, and a REST API, with XML import and export in XML or JSON formats. The VoxSigma software is available both via licensing and via our web service.
In addition, the Vocapia Scribe3 GUI offers a convenient way to upload, process, and post-edit audio and video documents.

Audio partitioning and speaker diarization

The first step in the VoxSigma processing chain is audio partitioning, which separates speech from non-speech audio (such as noise or music) and performs speaker diarization. Speaker diarization, also called speaker segmentation and clustering, is the process of partitioning an input audio stream into homogeneous segments according to speaker identity.
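
As a rough illustration of how diarization output can be consumed, the sketch below assumes each segment is a (start, end, speaker label) triple. This layout, the Segment type, and the function names are illustrative assumptions, not the VoxSigma output format. The example merges consecutive segments from the same speaker and totals the speaking time per speaker.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    """Hypothetical diarization segment (not the VoxSigma schema)."""
    start: float   # seconds from the beginning of the audio
    end: float     # seconds from the beginning of the audio
    speaker: str   # cluster label assigned by diarization


def merge_adjacent(segments, max_gap=0.5):
    """Merge consecutive same-speaker segments separated by at most max_gap seconds."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = merged[-1] if merged else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Extend the previous segment instead of starting a new one.
            merged[-1] = last._replace(end=max(last.end, seg.end))
        else:
            merged.append(seg)
    return merged


def speaking_time(segments):
    """Return the total speech duration in seconds for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)
```

The max_gap threshold is an arbitrary choice for this sketch; a suitable value depends on how pauses within a speaker turn should be treated.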

Spoken language identification

Spoken language identification (LID) is the process of recognizing the language spoken in an audio document (broadcast audio, podcast, or telephone). The standard VoxSigma language identification component can recognize one of 100 languages. By default, it is assumed that each document contains only one language. However, this can be adjusted by specifying the maximum number of languages and optionally providing a limited list of possible languages. The LID module includes a customization tool that allows users to tailor the models to their specific data, such as adding a missing language or dialect.
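
The effect of the two settings described above (a maximum number of languages and an optional restricted list) can be sketched as a selection over per-language scores. The score dictionary, the language codes, and the select_languages function are assumptions made for illustration; they do not describe how the VoxSigma LID module is configured or implemented.

```python
def select_languages(scores, max_languages=1, allowed=None):
    """Pick up to max_languages language codes, best score first.

    scores:  hypothetical mapping of language code -> score (higher is better)
    allowed: optional collection of codes to restrict the decision to
    """
    candidates = {
        lang: score
        for lang, score in scores.items()
        if allowed is None or lang in allowed
    }
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:max_languages]
```

With the defaults, a single language is returned, matching the one-language-per-document assumption; raising max_languages or passing allowed narrows or widens the decision.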

Speech Transcription

Speech transcription (also called speech-to-text, or STT) is the process of recognizing the words spoken in an audio document, while also providing a timecode, duration, and confidence score for each identified word. The speech transcription module can handle various audio types, including multilingual and dual-channel audio. It supports an unlimited vocabulary for multiple languages, including Arabic, Cantonese, Czech, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Italian, Latvian, Lithuanian, Mandarin, Pashto, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swahili, Swedish, Turkish and Urdu. The STT module includes a customization tool that allows users to tailor the models to their specific data or task.
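
To illustrate how per-word timecodes, durations, and confidence scores can be used downstream, the sketch below parses a small JSON word list and flags words whose confidence falls below a threshold. The JSON keys ("word", "start", "duration", "confidence") and the sample data are assumptions for this example, not the actual VoxSigma export schema.

```python
import json

# Hypothetical word-level transcript; the field names are illustrative only.
SAMPLE = """
[
  {"word": "hello", "start": 0.5,  "duration": 0.25, "confidence": 0.98},
  {"word": "wold",  "start": 0.75, "duration": 0.5,  "confidence": 0.41},
  {"word": "again", "start": 1.5,  "duration": 0.25, "confidence": 0.87}
]
"""


def load_words(payload):
    """Parse a JSON array of word entries into a list of dicts."""
    return json.loads(payload)


def end_time(word):
    """End timecode of a word: start time plus duration, in seconds."""
    return word["start"] + word["duration"]


def low_confidence(words, threshold=0.6):
    """Return the words a human reviewer may want to check first."""
    return [w for w in words if w["confidence"] < threshold]
```

A post-editing tool could use such a filter to highlight uncertain words; the 0.6 threshold is arbitrary and would need tuning on real data.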

Keyword spotting and audio indexing

Large vocabulary continuous speech recognition is a key technology for enabling content-based information access in audio and video documents. Most of the linguistic information is encoded in the audio channel of audiovisual data, which, once transcribed, can be accessed using text-based tools. Through language identification, speech recognition, and speaker recognition, spoken document retrieval supports random access to relevant portions of audio documents based on specific criteria, reducing the time needed to identify recordings in large audio/video databases. Applications include data mining, news on demand, media monitoring, and telephone speech analytics.
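
Once audio has been transcribed with word timecodes, text-based access can be as simple as an inverted index from words to the places where they were spoken. The sketch below is a generic illustration of that idea; the input layout (a document id mapped to a list of (word, start time) pairs) and the function names are assumptions, not part of VoxSigma.

```python
from collections import defaultdict


def build_index(documents):
    """Build an inverted index: word -> list of (doc_id, start_seconds).

    documents: hypothetical mapping doc_id -> [(word, start_seconds), ...]
    """
    index = defaultdict(list)
    for doc_id, words in documents.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index


def search(index, keyword):
    """Return every (doc_id, start_seconds) where the keyword was spoken."""
    return index.get(keyword.lower(), [])
```

Each hit gives a document and an offset, so a player can jump directly to the relevant portion of the recording instead of scanning it from the start.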

Speech-Text Alignment

Speech-text alignment is the process of synchronizing a speech signal with a speech transcript or closely related text, providing timecodes for words and sentences. The alignment process assigns a timecode to each word and punctuation mark in the transcript and provides confidence scores that identify regions where the alignment may be imperfect, in particular where the provided transcript differs from what was actually said. This technology has many uses, including audio books, language learning, and video subtitling.
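
For the subtitling use case, aligned word timecodes can be grouped into subtitle cues. The sketch below writes cues in the widely used SubRip (SRT) layout. The input format (a list of (text, start, end) triples in seconds) and the fixed words-per-cue grouping are simplifying assumptions for illustration, not a description of VoxSigma output.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group aligned words into SRT cues of at most max_words words each.

    words: hypothetical list of (text, start_seconds, end_seconds) triples
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w[0] for w in chunk)
        number = len(cues) + 1  # SRT cues are numbered from 1
        cues.append(f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

A production subtitler would typically also break cues at sentence boundaries and limit line length; this sketch only shows how word timecodes map onto cue start and end times.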

Translation

Vocapia Research also provides automatic translation technologies capable of processing both speech and text inputs. Combined with our speech recognition modules, these tools allow users to automatically transcribe and translate multilingual audio content with high accuracy. Whether translating a conversation, a broadcast segment, or written documents, the system ensures reliable results by leveraging advanced language models. This feature is particularly suited for applications such as international media monitoring or cross-lingual content analysis.

Summarization

In addition to transcription and translation, Vocapia Research offers automatic summarization technologies that can generate concise and coherent summaries and titles from either audio or text sources. By combining speech recognition, natural language processing, and semantic analysis, this module extracts the essential information from long documents, meetings, or interviews, allowing users to save time and focus on key insights. This functionality is ideal for news aggregation, corporate intelligence, and large-scale information indexing.